Covering the Basics of Word Embedding, One Hot Encoding, Text Vectorization, Embedding Layers, and an Example Neural Network Architecture for NLP
Figure 1. Photo by Nick Hillier on Unsplash
Listen to the Audio Version
Word embedding is one of the most important concepts in Natural Language Processing (NLP). It is an NLP technique where words or phrases (i.e., strings) from a vocabulary are mapped to vectors of real numbers. The need to map strings into vectors of real numbers originated from computers’ inability to do operations with strings.
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.
There are several NLP techniques to convert strings into representative numbers, such as:
- One Hot Encoding
- Encoding with a Unique Number
- Word Embedding
Before diving into word embedding, let’s compare these three options to see why Word embedding is the best.
One Hot Encoding
One-hot encoding can be achieved by creating a vector of zeros with the length of the entire vocabulary. Then, we only place “one” in the index where the word is. For each word, we create the same vector. Below you can see an example of one hot encoding where we encode the sentence, “His dog is two years old”.
As the vocabulary size increases, so does the size of the vector. One hot encoding is usually regarded as an inefficient method to vectorize strings. As you can see in the example above, most of the values are zero. This approach is considered computationally complex in an unnecessary way.
Encoding with a Unique ID Number
Instead of creating a vector of zeros for each word in a sentence, you may choose to assign each word a unique ID number. Therefore, instead of using strings, you may assign 1 to “dog”, 2 to “his”, 3 to “is”, 4 to “old”, 5 to “two”, 6 to “years”. This way, we can represent our string, “His dog is two years old”, as [2, 1, 3, 5, 6, 4].
As you can see, it is much easier to create. However, it comes with a covenant: These numbers do not have any relational representation since the encoding is arbitrary. There is nothing that stops us from changing the encoding differently. Therefore, these values do not capture a relationship between words. The lack of relational representation makes it difficult for models to interpret these values since the value of the feature (the number assigned) does not mean anything. Therefore, encoding with a unique ID number is also not a very good idea to capture patterns.
Since one-hot encoding is inefficient and encoding with unique IDs does not offer relational representation, we have to rely on another method. This method is word embedding. As we already covered above, word embedding is the task of converting words into vectors.
As you can see above, you can actually see a relational representation of the words. We didn’t have this feature in encoding with a unique number. Furthermore, we can also represent each word with a less complex vector compared to one-hot encoding. The size of our final array is much smaller than a one-hot encoded vocabulary.
Deep Learning for the Vectorization
So, in an ideal world, our goal is to find the perfect vector weights for each word so that their relations can be properly represented. But how are we going to know whether a word is closely related to another? Of course, we have an intuition since we have been learning languages all of our lives. But, our model must receive the training it requires to analyze these words. Therefore, we need a machine learning problem. We can adopt supervised, unsupervised, or semi-supervised learning approaches for word embedding. In this post, we will tackle a supervised learning problem to vectorize our strings.
A Neural Network Architecture for Word Embedding
We need to create a neural network to find ideal vector values for each word. At this stage, what we need is the following:
1 — A Text Vectorization layer for converting Text to 1D Tensor of Integers
2 — An Embedding layer to convert 1D Tensors of Integers into dense vectors of fixed size.
3 — A fully connected neural network for backpropagation and cost function and other deep learning tasks
Text Vectorization Layer
Text vectorization is an experimental layer that offers a lot of value for the text preprocessing automation. The layer does the following procedures:
- Standardize each sentence with lowercasing and punctuation stripping
- Split each sentence into words
- Recombine words into tokens
- Index these tokens
- Transform each sentence using the index into a 1D tensor of integer token indices or float values.
Here is a straightforward example of how the TextVectorization layer works.
We first create a vocabulary in line with our examples above.
Then, we create a TextVectorization layer. Then, we create a dummy model with Keras Sequential API. Then, we feed several sentences to create a list of 1D tensors of integers:
The length of the 1D output tensors is three because we set it to three. Feel free to try other values.
The embedding layer has a simple capability:
It turns positive integers (indexes) into dense vectors of fixed size.
Let’s see it with a basic example:
I passed the output from the TextVectorization example as input and set the output dimension to two. Therefore, each of our input integers is now represented with a 2-dims vector. While our input shape is (3, 3), our output shape is (3, 3, 2).
By changing the output dim, you can easily generate a more complex vector space, which would increase the computational complexity, but can potentially capture more pattern.
A Set of Fully Connected Layers
After TextVectorization and Embedding layers, we end up with the desired vector space. But, these values are still very random. Therefore, we need an optimizer that calculates the cost function and adjusts the values with backpropagation. This optimizer should be run on top of a set of fully connected layers. In some situations, you can also add other types of layers such as LSTM, GRU, or Convolution layers. To keep it simple, I will refer to them as a set of fully connected layers.
Note: If your embedding output has -by any chance- variable length, make sure to add a Global Pooling Layer before the set of fully connected layers.
Using a set of fully connected layers, we can feed our strings against labels to adjust the vector values. For example, assume that we have a dataset of sentences with their sentiment labels (e.g., positive or negative). We can then vectorize and embed these sentences using a vocabulary we created from our unique words. We can then create a set of fully connected layers to predict whether their sentiment is positive or negative. We can use the labels to backpropagate our model to optimize these vectors. Our result would look something similar to this:
Roughly speaking, a neural network architecture for Word embedding would look like this:
As I mentioned above, a complex word embedding model would require other layers such as Global Pooling, LSTM, GRU, Convolution, and others. However, the structure remains the same.
Now that you have an idea about Word embedding, in Part II of this series, we can actually create our own word embedding using the IMDB Reviews dataset and visualize it as in Figure 9. Check out Part 2: