Learn the Basics of Text Vectorization, Create a Word Embedding Model trained with a Neural Network on IMDB Reviews Dataset, and Visualize it with TensorBoard Embedding Projector
This is a follow-up to Part I of the tutorial, Mastering Word Embeddings in 10 Minutes with TensorFlow, where we introduced several word vectorization concepts such as One Hot Encoding and Encoding with a Unique ID Value. I highly recommend reading that tutorial first if you are new to natural language processing.
In Part II of the tutorial, we will vectorize our words and train their values using the IMDB Reviews dataset. This tutorial is our own take on TensorFlow’s tutorial on word embedding. We will train a word embedding using a simple Keras model, and then we will visualize it using Embedding Projector.
Create a New Google Colab Notebook
First of all, you need the environment to start coding. For the sake of simplicity, I recommend you work with Google Colab. It comes with all the libraries pre-installed, and you won’t have to worry about them. All you need is a Google account, and I am sure you have one. So, create a new Colab notebook (see Figure 2) and start coding.
We will start by importing the TensorFlow and os libraries. We will use the os library for some directory-level operations we will do below and the TensorFlow library for dataset loading, deep learning models, and text preprocessing.
Download the IMDB Reviews Dataset
The IMDB Reviews Dataset is a large movie review dataset collected and prepared by Andrew L. Maas from the popular movie rating service, IMDB. It is used for binary sentiment classification: deciding whether a review is positive or negative. It contains 25,000 movie reviews for training and 25,000 for testing. All 50,000 of these reviews are labeled and may be used for supervised deep learning. There are also 50,000 additional unlabeled reviews that we will not use in this case study. In this case study, we will only use the training dataset.
We can download the dataset from Stanford’s relevant directory with the tf.keras.utils.get_file function, as shown below:
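As a rough sketch of this step, the download could look like the block below. The archive name and the cache arguments are assumptions, and `find_dataset_dir` is a hypothetical helper added here because the location of the extracted folder can differ between Keras versions.

```python
import os
import tensorflow as tf

# Archive of the Large Movie Review Dataset hosted by Stanford.
IMDB_URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"


def find_dataset_dir(root, name="aclImdb"):
    """Hypothetical helper: walk `root` and return the first folder called `name`."""
    for current, dirs, _ in os.walk(root):
        if name in dirs:
            return os.path.join(current, name)
    return None


def download_imdb(cache_dir="."):
    """Download and extract the archive, then locate the extracted aclImdb folder."""
    tf.keras.utils.get_file(
        "aclImdb_v1.tar.gz",   # assumed local file name
        IMDB_URL,
        untar=True,            # newer Keras releases prefer extract=True
        cache_dir=cache_dir,
        cache_subdir="",
    )
    return find_dataset_dir(cache_dir)


# Usage (downloads roughly 80 MB, so it is left commented out here):
# dataset_dir = download_imdb()
# print(os.listdir(dataset_dir))
```

Searching for the folder instead of hard-coding its path is a design choice made only to keep the sketch tolerant of version differences.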
We need a little bit of housekeeping to create a proper dataset. Let’s start by listing the contents of our main directory.
As you can see below, we have our train and test folders. For this study, we will only use the train folder.
With the following lines, let’s view what’s under the train folder,
and here it is:
We have reviews with negative sentiments and positive sentiments. As the next step, we will remove the unsup folder, which contains unlabeled reviews. Since we are working on a supervised learning problem in this tutorial, we do not need it.
As you can see in Figure X, the unsup folder has been removed.
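The removal described above can be sketched with the standard library. This example assumes `shutil.rmtree` is used for the deletion and builds a throwaway folder layout instead of touching the real dataset.

```python
import os
import shutil
import tempfile

# Build a temporary folder that mimics the layout described above:
# train/pos, train/neg, and train/unsup.
root = tempfile.mkdtemp()
train_dir = os.path.join(root, "train")
for name in ("pos", "neg", "unsup"):
    os.makedirs(os.path.join(train_dir, name))

print(sorted(os.listdir(train_dir)))  # folders before the removal

# Delete the unlabeled reviews together with everything inside the folder.
shutil.rmtree(os.path.join(train_dir, "unsup"))

remaining = sorted(os.listdir(train_dir))
print(remaining)  # folders after the removal
```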
Create the Dataset
Now that we have cleaned our directory, we can create our Dataset object. For this, we can use the tf.keras.preprocessing.text_dataset_from_directory function. As the name suggests, the text_dataset_from_directory function allows us to create text datasets directly from a directory, using one subfolder per class. We selected an 80/20 train and validation split, but feel free to experiment with a different split ratio.
As you can see in Figure 5, we have 20,000 reviews for training and 5,000 for validation.
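To make the split concrete, here is a small sketch on a toy directory with ten files rather than the real 25,000 reviews. The batch size, the seed, and the helper names `make_split` and `count_examples` are illustrative assumptions.

```python
import os
import tempfile
import tensorflow as tf

# Toy stand-in for the train folder: 5 negative and 5 positive reviews.
train_dir = tempfile.mkdtemp()
for label in ("neg", "pos"):
    os.makedirs(os.path.join(train_dir, label))
    for i in range(5):
        with open(os.path.join(train_dir, label, f"{i}.txt"), "w") as f:
            f.write(f"a {label} review number {i}")


def make_split(subset):
    # Both subsets must share the same seed so that they do not overlap.
    return tf.keras.preprocessing.text_dataset_from_directory(
        train_dir, batch_size=4, validation_split=0.2, subset=subset, seed=123
    )


def count_examples(ds):
    # Sum the batch sizes to get the number of reviews in the dataset.
    return sum(int(texts.shape[0]) for texts, _ in ds)


train_ds = make_split("training")
val_ds = make_split("validation")

n_train = count_examples(train_ds)
n_val = count_examples(val_ds)
print(n_train, n_val)
```

With the real dataset, the same 0.2 split is what yields the 20,000 and 5,000 reviews mentioned above.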
Let’s check how our dataset looks by using the .take() function and running a for-loop. Note that our dataset is a TensorFlow Dataset object, so it requires a little more effort to print out its elements. The following line does that:
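A minimal sketch of this kind of loop, using a tiny in-memory dataset as a stand-in for the real one (the sample texts and labels are made up for illustration):

```python
import tensorflow as tf

# Tiny stand-in for the batched (text, label) dataset built earlier.
texts = ["great movie", "terrible plot", "loved it", "fell asleep"]
labels = [1, 0, 1, 0]
ds = tf.data.Dataset.from_tensor_slices((texts, labels)).batch(2)

# take(1) yields only the first batch; .numpy() converts tensors to NumPy values.
shown = []
for text_batch, label_batch in ds.take(1):
    for text, label in zip(text_batch.numpy(), label_batch.numpy()):
        # Strings come back as bytes, so decode them before printing.
        shown.append((int(label), text.decode("utf-8")))
        print(int(label), text.decode("utf-8"))
```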
And here are the results in Figure 6:
Configure the Dataset
Now, since we are in the realm of deep learning, optimization is essential for a bearable training experience. TensorFlow has an experimental tool that we can use to optimize the workload and shorten the time needed for preprocessing, training, and other parallel operations. We can optimize our pipeline with the following lines:
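One common way to do this, shown here as a sketch on a toy dataset, is to combine `cache()` and `prefetch()` with the AUTOTUNE setting. Whether the original pipeline uses exactly these two calls is an assumption.

```python
import tensorflow as tf

# Older releases expose this value as tf.data.experimental.AUTOTUNE.
AUTOTUNE = tf.data.AUTOTUNE

ds = tf.data.Dataset.range(10).batch(4)

# cache() keeps elements in memory after the first pass over the data;
# prefetch() prepares the next batch while the current one is being consumed.
ds = ds.cache().prefetch(buffer_size=AUTOTUNE)

# The optimizations change how the data is served, not what it contains.
batches = [batch.numpy().tolist() for batch in ds]
print(batches)
```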
Now that we created our dataset, it is time to process its elements so that our model can understand them.
We will create a custom string standardization function so that we control exactly how the text is cleaned. Standardization can be described as a set of preprocessing operations for NLP studies, including lowercasing, tag removal, and punctuation stripping. The code below does exactly that:
Now our data will be more standardized with our custom function.
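A standardization function along these lines could be sketched as follows. It assumes the only HTML tag that needs removing is the `<br />` line break, and the sample sentence is made up.

```python
import re
import string
import tensorflow as tf


def custom_standardization(input_data):
    # 1) Lowercase everything.
    lowercase = tf.strings.lower(input_data)
    # 2) Replace HTML line breaks with a space (assumed to be the only tag present).
    no_breaks = tf.strings.regex_replace(lowercase, "<br />", " ")
    # 3) Strip every punctuation character.
    return tf.strings.regex_replace(
        no_breaks, "[%s]" % re.escape(string.punctuation), ""
    )


sample = tf.constant(["Great movie!<br />Loved it."])
cleaned = custom_standardization(sample).numpy()[0].decode("utf-8")
print(cleaned)
```

The function works on string tensors rather than Python strings so that it can run inside the preprocessing layer.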
Now that we have created our custom standardization function, we can pass it to the TextVectorization layer we import from TensorFlow.
TextVectorization is a layer that we use to map our strings to integers. We will pass in our custom standardization function, use up to 10,000 unique words (vocabulary), and keep a maximum of 100 words for each review. See the lines below:
We will remove the labels from the train dataset and call the .adapt() function to build the vocabulary to use later on. Note that we haven’t vectorized our dataset yet. We have only created the vocabulary with the lines below:
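The layer setup and the vocabulary-building step can be sketched on a two-sentence corpus. To stay self-contained, this sketch relies on the layer's default standardization instead of the custom function, and the variable names are assumptions.

```python
import tensorflow as tf

vocab_size = 10000      # maximum number of distinct tokens to keep
sequence_length = 100   # every review is padded or truncated to this many tokens

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)

# Stand-in for the text-only dataset obtained by dropping the labels.
text_ds = tf.data.Dataset.from_tensor_slices(
    ["the movie was great", "the plot was terrible"]
).batch(2)

# adapt() only builds the vocabulary; it does not transform text_ds.
vectorize_layer.adapt(text_ds)
vocab = vectorize_layer.get_vocabulary()
print(vocab[:2])  # index 0 is the padding token, index 1 the out-of-vocabulary token

# Calling the layer is what actually turns strings into integer ids.
encoded = vectorize_layer(tf.constant(["the movie was unwatchable"]))
print(encoded.shape)
```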
Model Building and Training
Our preprocessing is in place, and it is time to create our model.
We will make the initial imports, which include the Sequential API for model building and the Dense layers we will use in the model.
We set the embedding dimension to 16, so each word will be represented by 16 values. We limit the vocabulary size to 10,000 to match the vectorization code above.
We add the following layers to our Keras model:
1. A TextVectorization layer for converting strings to integers;
2. An Embedding layer to convert integer values into 16-dimensional vectors;
3. A Global Average Pooling 1D layer to resolve the issue of having reviews with different lengths;
4. A Dense layer with 16 neurons and a ReLU activation;
5. A final Dense layer with 1 neuron to classify if the review has a positive or negative sentiment.
The following lines do all of this:
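The layer stack listed above can be sketched as below. As a simplifying assumption, the TextVectorization layer is left out and the model is fed token ids that are already vectorized, which keeps the example independent of any dataset. The layer name "embedding" is also an assumption.

```python
import numpy as np
import tensorflow as tf

vocab_size = 10000
embedding_dim = 16

model = tf.keras.Sequential([
    # Looks up a 16-dimensional vector for every token id.
    tf.keras.layers.Embedding(vocab_size, embedding_dim, name="embedding"),
    # Averages over the sequence axis, giving one fixed-size vector per review.
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(16, activation="relu"),
    # One output unit; without an activation it produces a raw score (logit).
    tf.keras.layers.Dense(1),
])

# Two fake reviews, already vectorized and padded to 100 token ids each.
token_ids = np.random.randint(0, vocab_size, size=(2, 100))
scores = model(token_ids)
print(scores.shape)
```

Each review ends up as a single score, regardless of how many tokens it originally contained.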
Set Up Callbacks for TensorBoard
Since we want to see how our model evolves and performs over time, we will configure our callback settings with the following lines:
We will use these callbacks to visualize our model performance at each epoch using TensorBoard.
Configure the Model
Then, we will configure our model with Adam as the optimizer and Binary Crossentropy as the loss function, because this is a binary classification task, and select accuracy as our performance metric.
Start the Training
Now that our model is configured, we can use the .fit() function to start the training. We will run for 15 epochs and pass in the TensorBoard callback so that each epoch is logged.
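The callback, configuration, and training steps can be sketched together on random data. This is an illustration under stated assumptions: the tiny model, the temporary log directory, and `from_logits=True` (which fits a final Dense layer without an activation) are not taken from the original code. Only 2 epochs are run to keep it quick.

```python
import tempfile
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(1000, 16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1),
])

model.compile(
    optimizer="adam",
    # from_logits=True because the last layer outputs raw scores (an assumption).
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# The callback writes one set of summaries per epoch into log_dir.
log_dir = tempfile.mkdtemp()
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir)

# Random token ids and labels stand in for the vectorized reviews.
x = np.random.randint(0, 1000, size=(64, 20))
y = np.random.randint(0, 2, size=(64, 1)).astype("float32")

history = model.fit(
    x, y, epochs=2, batch_size=16, callbacks=[tensorboard_callback], verbose=0
)
print(sorted(history.history.keys()))
```

The returned history object holds the same per-epoch loss and accuracy values that TensorBoard plots.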
Here is the screenshot of the training process, as shown in Figure :
Visualize the Results
Now that we have concluded our model training, let’s do some visualization to better understand what we built and trained.
The Summary of the Model
We can easily see the summary of our model with the .summary() function, as shown below:
Figure 8 shows how our model looks and lists the number of parameters and output shape for each layer:
Training Performance on TensorBoard
Let’s see how our model evolved as it trained on the IMDB reviews dataset. We can use TensorFlow’s visualization kit, TensorBoard. TensorBoard can be used for several machine learning visualization tasks such as:
- Tracking and visualizing loss and accuracy measures
- Visualizing the model graph
- Viewing the evolution of weights, biases, and other tensor values
- Displaying images, text, and audio data
- and more…
In this tutorial, we will use %load_ext to load TensorBoard and view the logs. This runs a small server within our cell to visualize our metric values over time.
As you can see on the left, our accuracy increases over time while our loss values decrease. Figure 9 shows that our model does what it is supposed to do: a steadily decreasing loss means the model is making smaller mistakes on the data, which is to say it is learning.
Visualization with Embedding Projector
Our model looks nice, and it learned a lot in just 15 epochs. But the main goal of this tutorial is to create a word embedding. We will not predict review sentiments in this tutorial. Instead, we will visualize our word embedding cloud using Embedding Projector.
Embedding Projector is a tool built on top of TensorBoard. It is useful for analyzing data and visualizing the position of embedding values relative to one another. Using Embedding Projector, we can graphically represent high-dimensional embeddings by reducing their dimensionality with algorithms like PCA.
Get the Vector Values and Vocabulary Data
We will start by getting our 16-dimensional embedding values for each word. Also, we will get a list of all these words we embedded. We can achieve these two tasks with the following code:
Let’s see how a word and its vector values look with a random example. We select the word with index no. 500 and print it with the following code:
The vector values and the corresponding word for index no. 500 are shown in Figure 10:
Feel free to change the index value to view other words with their vector values.
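The retrieval of vectors and vocabulary can be sketched on a toy setup. The layers here are untrained stand-ins, the layer name "embedding" is an assumption, and index 2 replaces index 500 because the toy vocabulary is tiny.

```python
import tensorflow as tf

# Build a small vocabulary from a two-sentence corpus.
vectorize_layer = tf.keras.layers.TextVectorization(max_tokens=50, output_mode="int")
text_ds = tf.data.Dataset.from_tensor_slices(
    ["the movie was great", "the plot was terrible"]
).batch(2)
vectorize_layer.adapt(text_ds)
vocab = vectorize_layer.get_vocabulary()

# One 16-dimensional row per vocabulary entry.
embedding = tf.keras.layers.Embedding(len(vocab), 16, name="embedding")
embedding(tf.constant([[0]]))          # calling the layer once creates its weights
weights = embedding.get_weights()[0]   # shape: (vocabulary size, 16)

index = 2  # any valid row works
print(vocab[index], weights[index])
```

In a trained model, the same weight matrix would typically be fetched with `model.get_layer("embedding").get_weights()[0]`, assuming the layer was given that name.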
Save the Data to New Files
Now we have the entire list of words (vocabulary) with their corresponding 16-dimensional vector values. We will save word names to the metadata.tsv file and vector values to the vectors.tsv file. The following lines create new files, write our data to these new files, save the data, close the files, and download them to your local machine:
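The file-writing step can be sketched with plain Python. The vocabulary and vectors below are made-up stand-ins, and skipping index 0 (the padding token) is an assumption about how the files are prepared.

```python
import io
import os
import tempfile

# Stand-ins for the real vocabulary and its 16-dimensional vectors.
vocab = ["", "[UNK]", "the", "movie"]  # index 0 is the padding token
weights = [[0.0] * 16, [0.1] * 16, [0.2] * 16, [0.3] * 16]

out_dir = tempfile.mkdtemp()
vectors_path = os.path.join(out_dir, "vectors.tsv")
metadata_path = os.path.join(out_dir, "metadata.tsv")

with io.open(vectors_path, "w", encoding="utf-8") as out_v, \
        io.open(metadata_path, "w", encoding="utf-8") as out_m:
    for index, word in enumerate(vocab):
        if index == 0:
            continue  # skip the padding token, which has no visible label
        # One tab-separated row of values per word, in the same order as the labels.
        out_v.write("\t".join(str(x) for x in weights[index]) + "\n")
        out_m.write(word + "\n")

# In Colab only, the files can then be fetched with:
# from google.colab import files
# files.download(vectors_path)
# files.download(metadata_path)
```

Both files must have the same number of rows, since each row of metadata.tsv labels the matching row of vectors.tsv.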
Load to Embedding Projector
Now we visit the Embedding Projector website at projector.tensorflow.org.
Then, we click the “Load” button on the left to load our vectors.tsv and metadata.tsv files. Then, we can click anywhere outside of the popup window.
Note that Embedding Projector runs a PCA algorithm to reduce the 16-dimensional vector space to three dimensions, since we cannot plot 16 dimensions directly.
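To give a feel for what such a reduction does, here is a sketch of PCA on random 16-dimensional vectors using NumPy. It illustrates the general technique and is not a description of Embedding Projector's own implementation. The random data stands in for the trained embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 16))  # stand-in for the 16-dimensional embeddings

# PCA operates on mean-centered data.
centered = vectors - vectors.mean(axis=0)

# The rows of `components` are the principal axes, ordered by importance.
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Coordinates of every vector along the top 3 principal axes.
projected = centered @ components[:3].T

# Share of the total variance that survives the reduction to 3 dimensions.
explained = (singular_values[:3] ** 2).sum() / (singular_values ** 2).sum()
print(projected.shape, round(float(explained), 3))
```

With trained embeddings, the first few axes usually capture much more of the variance than they do for random data like this.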
You have successfully built a neural network to train a word embedding model, and it takes a lot of effort to achieve this. Pat yourself on the back and keep improving yourself in the field of natural language processing, as there are many unsolved problems.