In the field of machine learning, we often hear the term “embeddings.” But what exactly does it mean? In this article, we will explore what embeddings are and why they are so important in machine learning applications.
An embedding is a vector representation of an abstract object in a multidimensional space. In other words, it is a technique that allows complex data, such as words, images or concepts, to be represented in the form of numerical vectors, so that they can be processed and interpreted by machine learning algorithms.
Embeddings are important because they allow machine learning models to work with complex data more efficiently and effectively. Transforming data into vector representations allows models to find hidden relationships and patterns in data, making it possible to analyze and predict complex phenomena.
A practical example of embeddings
In Natural Language Processing (NLP) applications, the words or sentences that make up a text are represented as numerical vectors in a multidimensional space. In this case, we speak of “word embeddings”.
One-hot encoding
A classic example is “one-hot” encoding, which represents each word with a numeric vector. The vector has a length equal to the size of the dictionary of the language you want to use, and each word in the dictionary is associated with the index of one column/cell: the word’s vector contains a 1 at that index and 0 everywhere else.
For example, suppose we want to represent the sentence “the cat sleeps on the carpet”:
Using the distinct words of the sentence as a mini-dictionary (note that “the” appears twice but is encoded only once), the encoding of the words is therefore:
- “the” -> [1, 0, 0, 0, 0]
- “cat” -> [0, 1, 0, 0, 0]
- “sleeps” -> [0, 0, 1, 0, 0]
- “on” -> [0, 0, 0, 1, 0]
- “carpet” -> [0, 0, 0, 0, 1]
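As a minimal sketch of this idea in plain Python (no libraries needed), the vectors above can be built by giving each distinct word of the sentence an index in a small vocabulary:

```python
# Build a toy vocabulary from the example sentence and one-hot encode each word.
sentence = "the cat sleeps on the carpet"

# Keep only distinct words, preserving their order of first appearance.
vocabulary = list(dict.fromkeys(sentence.split()))  # ['the', 'cat', 'sleeps', 'on', 'carpet']

def one_hot(word, vocabulary):
    """Return a vector with a single 1 at the word's position in the vocabulary."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

for word in vocabulary:
    print(word, "->", one_hot(word, vocabulary))
```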
Bag-of-Words (BoW) encoding
In the previous case we saw how to represent individual words as vectors, but if we had to represent entire sentences, that encoding would be ineffective. One solution is the BoW approach, which sets each cell of the vector to 1 if the corresponding word is present in the sentence and to 0 if it is not. For example:
- “the cat sleeps on the carpet” -> [1, 1, 1, 1, 1]
- “the cat sleeps” -> [1, 1, 1, 0, 0]
More generally, each cell can contain the number of occurrences of the word within the sentence, so that repeated words are also handled: in our example, “the cat sleeps on the carpet” would become [2, 1, 1, 1, 1] because “the” occurs twice.
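A hedged sketch of the same idea using scikit-learn’s CountVectorizer (a hand-rolled counter would work equally well); note that the vectorizer sorts the vocabulary alphabetically, so the column order differs from the example above:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the cat sleeps on the carpet", "the cat sleeps"]

# CountVectorizer builds the vocabulary and counts word occurrences per sentence.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # alphabetical vocabulary: ['carpet' 'cat' 'on' 'sleeps' 'the']
print(bow.toarray())
# [[1 1 1 1 2]   <- "the" occurs twice in the first sentence
#  [0 1 0 1 1]]
```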
Dimensionality reduction
As you can easily imagine, applying the previous example to the complete dictionary of any language would produce vectors with a very high dimensionality. There are several algorithms that address this problem, although we will not analyze them in depth in this article. Using techniques such as PCA, t-SNE, autoencoders and LDA, it is possible to reduce the dimensionality of embeddings while retaining most of the important information. This improves model performance, makes it easier to explore and interpret the data, and leads to a better understanding of the problem being addressed.
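For illustration, here is a minimal sketch using scikit-learn’s PCA to reduce some randomly generated stand-in “embeddings” from 300 dimensions down to 2 (in practice the input would be real embeddings produced by a model):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for real embeddings: 100 vectors with 300 dimensions each.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 300))

# Project onto the 2 directions that preserve the most variance.
pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)

print(reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)  # fraction of variance kept by each component
```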
Use cases of embeddings
Vector representations come in different forms and can be used in many ways. Below are some of the most common examples:
Word Embeddings:
- They represent words as numerical vectors in a multidimensional space (see the sketch after this list).
- They capture the semantic relationships between words.
- Used in NLP applications such as natural language recognition, machine translation and text generation.
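As a small illustration, dense word embeddings of this kind can be trained with a library such as Gensim (the Word2Vec API of Gensim 4.x is assumed here); the toy corpus below is far too small to learn meaningful semantics, but it shows the mechanics:

```python
from gensim.models import Word2Vec

# A toy corpus: each sentence is a list of tokens (a real corpus would be far larger).
corpus = [
    ["the", "cat", "sleeps", "on", "the", "carpet"],
    ["the", "dog", "sleeps", "on", "the", "sofa"],
    ["the", "cat", "chases", "the", "dog"],
]

# vector_size sets the embedding dimensionality; min_count=1 keeps every word.
model = Word2Vec(corpus, vector_size=50, min_count=1, epochs=50)

print(model.wv["cat"].shape)         # (50,) -> the embedding of "cat"
print(model.wv.most_similar("cat"))  # words whose embeddings are closest to "cat"
```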
Sentence Embeddings:
- They represent sentences or sequences of words as numerical vectors.
- They can be obtained by combining the embeddings of the words in the sentence, for example by averaging them (see the sketch after this list).
- Used in NLP applications such as sentiment analysis, text clustering and semantic search.
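A minimal sketch of the simplest combination strategy, averaging word vectors with NumPy; the word vectors here are random stand-ins, whereas in practice they would come from a pre-trained word-embedding model:

```python
import numpy as np

# Hypothetical pre-trained word embeddings (random stand-ins, 4 dimensions for readability).
rng = np.random.default_rng(42)
word_vectors = {w: rng.normal(size=4) for w in ["the", "cat", "sleeps", "on", "carpet"]}

def sentence_embedding(sentence, word_vectors):
    """Average the embeddings of the known words in the sentence."""
    vectors = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    return np.mean(vectors, axis=0)

print(sentence_embedding("the cat sleeps", word_vectors))
```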
Document Embeddings:
- They represent entire documents or long texts as numeric vectors.
- They can be obtained by combining sentence embeddings and using dimensionality reduction techniques.
- Used in NLP applications such as document clustering, information search and text classification.
Contextual Embeddings:
- They represent words or phrases considering their context in the text.
- They can be obtained using pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers) or GPT (Generative Pre-trained Transformer); see the sketch after this list.
- They provide richer, context-sensitive representations of words and sentences.
- Used in advanced NLP applications such as question answering, text generation and context-based natural language understanding.
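As a hedged sketch, contextual embeddings can be extracted with the Hugging Face transformers library and a pre-trained BERT checkpoint (the model name and the mean-pooling step are just one common choice, not the only way to do it):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load a pre-trained BERT model and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The bank raised interest rates.", "We sat on the river bank."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One vector per token; the vector for "bank" differs between the two sentences.
token_embeddings = outputs.last_hidden_state        # shape: (2, sequence_length, 768)

# A simple sentence-level embedding: mean over the token vectors
# (for brevity, padding tokens are included in the mean).
sentence_embeddings = token_embeddings.mean(dim=1)  # shape: (2, 768)
print(sentence_embeddings.shape)
```

Because the whole sentence is fed to the model, the vector produced for “bank” differs between the two sentences, which is exactly what distinguishes contextual embeddings from static word embeddings.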
Image Embeddings:
- They represent images as numerical vectors in a multidimensional space.
- They capture the visual and semantic characteristics of images.
- They can be obtained using pre-trained convolutional neural networks such as VGG, ResNet or Inception (see the sketch after this list).
- Used in computer vision applications such as object recognition, image classification and similar image search.
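A sketch of one common recipe for image embeddings with a pre-trained ResNet from torchvision (recent torchvision versions are assumed; a real pipeline would also resize and normalize the image):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a ResNet pre-trained on ImageNet and drop its classification head,
# so the network outputs a 512-dimensional feature vector instead of class scores.
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
resnet.fc = nn.Identity()
resnet.eval()

# Stand-in for a preprocessed image batch: 1 image, 3 channels, 224x224 pixels.
image = torch.rand(1, 3, 224, 224)

with torch.no_grad():
    embedding = resnet(image)

print(embedding.shape)  # torch.Size([1, 512])
```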
Audio Embeddings:
- They represent audio signals as numerical vectors in a multidimensional space.
- They capture the acoustic and semantic characteristics of sound.
- They can be obtained using convolutional or recurrent neural networks trained on spectrograms or other audio representations (see the sketch after this list).
- Used in sound processing applications such as speech recognition, audio transcription, and environmental sound analysis.
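As a rough illustration of the spectrogram-based pipeline such models start from, the sketch below uses librosa to turn a synthetic signal into a mel spectrogram and pools it into a fixed-size vector; this is not a learned embedding, just the kind of input representation an audio model would be trained on:

```python
import numpy as np
import librosa

# Stand-in audio signal: one second of a 440 Hz sine wave (a real application would load a file).
sr = 22050
t = np.linspace(0, 1, sr, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t)

# Mel spectrogram: the kind of representation a CNN-based audio model is typically trained on.
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=64)

# A crude fixed-size vector: average the (log-scaled) spectrogram over time.
embedding = librosa.power_to_db(mel).mean(axis=1)
print(embedding.shape)  # (64,)
```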
The concept of similarity in embeddings
The concept of similarity is fundamental in the analysis of embeddings, since it allows us to evaluate how similar or correlated two elements (words, images, objects, etc.) are in the vector space. Similarity is typically calculated using measures such as cosine similarity or Euclidean distance between embedding vectors; both are illustrated in the sketch after the following lists.
Cosine Similarity:
- Cosine similarity measures the cosine of the angle between two vectors in space.
- It is calculated by dividing the dot product of the two vectors by the product of their norms.
- Values range from -1 to 1: values close to 1 indicate greater similarity, values close to 0 indicate little or no relationship, and negative values indicate opposing directions.
Euclidean distance:
- Euclidean distance measures the length of the vector connecting two points in space.
- It is calculated as the square root of the sum of the squares of the differences between the coordinates of the two vectors.
- Values closer to 0 indicate greater similarity, while larger values indicate less similarity.
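Both measures are easy to compute with NumPy; here is a minimal sketch with toy three-dimensional “embeddings”:

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their norms."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    """Square root of the sum of squared differences between the coordinates."""
    return np.linalg.norm(a - b)

# Toy 3-dimensional "embeddings".
cat = np.array([1.0, 2.0, 0.5])
dog = np.array([0.9, 2.1, 0.4])
car = np.array([-1.0, 0.1, 3.0])

print(cosine_similarity(cat, dog))   # close to 1 -> very similar
print(cosine_similarity(cat, car))   # close to 0 -> not very similar
print(euclidean_distance(cat, dog))  # small -> similar
print(euclidean_distance(cat, car))  # larger -> less similar
```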
Graphical representation and clustering
Graphically, embeddings can be represented in two-dimensional or three-dimensional space using dimensionality reduction techniques such as those mentioned above. In this representation, the points corresponding to the vectors are positioned in space so that their distances reflect the similarities between the elements.
For example, it is possible to train AI models on a sufficiently large and varied corpus of text in order to calculate embeddings of words that represent their meaning based on context. By doing this you can ensure that words with similar meanings have similar embeddings (the model learns to capture semantic relationships during training).
By representing word embeddings in a two-dimensional space using the PCA algorithm, words with similar embeddings will be placed close to each other, while words with different embeddings will be placed far from each other. This makes clusters visible: for example, two “types” of similar words, animals and states, end up in two clearly separated groups.
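A hedged sketch of this kind of visualization, using randomly generated stand-in embeddings for the two groups (real word vectors from a trained model would be used in practice):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Stand-in embeddings: two groups of 50-dimensional vectors clustered around different centers,
# mimicking what "animal" words and "state" words might look like after training.
rng = np.random.default_rng(1)
animals = rng.normal(loc=0.0, scale=0.3, size=(5, 50))
states = rng.normal(loc=2.0, scale=0.3, size=(5, 50))
embeddings = np.vstack([animals, states])
labels = np.array(["animal"] * 5 + ["state"] * 5)

# Project to 2 dimensions for plotting.
points = PCA(n_components=2).fit_transform(embeddings)

for label, color in [("animal", "tab:blue"), ("state", "tab:orange")]:
    mask = labels == label
    plt.scatter(points[mask, 0], points[mask, 1], color=color, label=label)

plt.legend()
plt.title("Word embeddings projected to 2D with PCA")
plt.show()
```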
Embeddings are a powerful and versatile tool in the field of machine learning, used in a wide range of applications to represent complex data as vectors. Thanks to their ability to capture semantic meaning and relationships between data, embeddings allow machine learning models to understand and interpret the world around us more effectively and efficiently.