
Learn the Basics of Text Vectorization, Create a Word Embedding Model trained with a Neural Network on IMDB Reviews Dataset, and Visualize it with TensorBoard Embedding Projector

Figure 1. Photo by Raphael Schaller on Unsplash

This is a follow-up to Part I of the tutorial, Mastering Word Embeddings in 10 Minutes with TensorFlow, where we introduced several word vectorization concepts such as One Hot Encoding and Encoding with a Unique ID Value. I highly recommend you check that tutorial if you are new to natural language processing.

In Part II of the tutorial, we will vectorize our words and train their values using the IMDB Reviews dataset. This tutorial is our own take on TensorFlow’s tutorial on word embeddings. We will train a word embedding using a simple Keras model and the IMDB Reviews dataset. Then, we will visualize the result using the Embedding Projector.

Let’s start:

Create a New Google Colab Notebook

First of all, you need the environment to start coding. For the sake of simplicity, I recommend you work with Google Colab. It comes with all the libraries pre-installed, and you won’t have to worry about them. All you need is a Google account, and I am sure you have one. So, create a new Colab notebook (see Figure 2) and start coding.

  Figure 2: Create a New Google Colab Notebook

Initial Imports

We will start by importing TensorFlow and os libraries. We will use the os library for some directory level operations we will do below and the TensorFlow library for dataset loading, deep learning models, and text preprocessing.
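The imports are minimal at this stage; a sketch might look like this:

```python
# TensorFlow for dataset loading, the model, and text preprocessing;
# os for the directory-level operations we will do below.
import os
import tensorflow as tf
```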

Download the IMDB Reviews Dataset

The IMDB Reviews dataset is a large movie review dataset collected and prepared by Andrew L. Maas from the popular movie rating service, IMDB. It is used for binary sentiment classification: whether a review is positive or negative. It contains 25,000 movie reviews for training and 25,000 for testing. All of these 50,000 reviews are labeled and may be used for supervised deep learning. There are also an additional 50,000 unlabeled reviews that we will not use. In this case study, we will only use the training dataset.

We can download the dataset from Stanford’s relevant directory with tf.keras.utils.get_file function, as shown below:
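A sketch of the download step, assuming the archive is still hosted under Stanford’s directory and extracting it into the current working directory:

```python
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

# Download and untar the archive into the cache directory (here: ".").
dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                  untar=True, cache_dir='.',
                                  cache_subdir='')

# Path to the extracted "aclImdb" folder, used throughout the tutorial.
dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
```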

Dataset Creation

We need a little bit of housekeeping to create a proper dataset. Let’s start by viewing our main directory with the os.listdir function:
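Something along these lines, reusing the dataset_dir path from the download step:

```python
# List the contents of the extracted "aclImdb" folder.
os.listdir(dataset_dir)
```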

As you can see below, we have our train and test folders. For this study, we will only use the /train folder.

Figure 3. The Content of Main Directory, “aclImdb”

With the following lines, let’s view what’s under the /train subdirectory:
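A minimal sketch, again assuming the dataset_dir path from above:

```python
# Build the path to the training split and list its contents.
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)
```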

and here it is:

Figure 4.a. The Content of Sub-Directory, “aclImdb/train”

We have reviews with negative sentiments and positive sentiments. In the next step, we will remove the unsup folder, which contains unlabeled reviews. Since we are working on a supervised learning problem in this tutorial, we do not need it. The lines below take care of that:
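A sketch of the removal, using Python’s standard shutil library:

```python
import shutil

# Delete the folder of unlabeled reviews; we only need labeled data here.
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)
```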

As you can see in Figure 4.b, we removed the unsup folder thanks to the shutil library:

Figure 4.b. The Content of Sub-Directory, “aclImdb/train”, after We Remove the “unsup” Folder

Create the Dataset

Now that we cleaned our directory, we can create our Dataset object. For this, we can use the tf.keras.preprocessing.text_dataset_from_directory function. As the name suggests, the text_dataset_from_directory function allows us to create text datasets directly from a directory. We selected an 80/20 train and validation split, but feel free to play around by adjusting the validation_split argument.
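A sketch of the dataset creation; the batch size and seed values here are assumptions you can adjust:

```python
batch_size = 1024  # assumed batch size
seed = 123         # fixed seed so the train/validation split is reproducible

train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size,
    validation_split=0.2, subset='training', seed=seed)

val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size,
    validation_split=0.2, subset='validation', seed=seed)
```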

As you can see in Figure 5, we have 20,000 reviews for training and 5,000 for validation.

 Figure 5. The volume of Our Train and Validation Dataset

Let’s check how our dataset looks by using the .take() function and running a for loop. Note that our dataset is a TensorFlow Dataset object, so it requires a little more effort to print out its elements. The following lines do that:
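A minimal sketch that prints the first five reviews of a single batch along with their labels:

```python
# Take one batch from the dataset and print five (label, review) pairs.
for text_batch, label_batch in train_ds.take(1):
  for i in range(5):
    print(label_batch[i].numpy(), text_batch.numpy()[i])
```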

And here are the results, shown in Figure 6:

 Figure 6. The First Five Reviews from the Training Dataset with Their Sentiment Info in the Beginning

Configure the Dataset

Now, since we are in the realms of deep learning, optimization is essential for a bearable training experience. TensorFlow has an experimental tool that we can use to optimize the workload and shorten the time needed for preprocessing, training, and other parallel operations. We can optimize our pipeline with the following lines:
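A sketch of this optimization, using the AUTOTUNE setting from tf.data’s experimental module to cache and prefetch batches:

```python
AUTOTUNE = tf.data.experimental.AUTOTUNE

# Cache the datasets in memory and prefetch batches while the model trains.
train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
```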

Text Preprocessing

Now that we created our dataset, it is time to process its elements so that our model can understand them.

Custom Standardization

We will create a custom string standardization function to make the best of standardization. Standardization can be described as a set of preprocessing operations for NLP studies, including lowercasing, tag removal, and punctuation stripping. In the code below, we achieve exactly that:
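A sketch of such a standardizer, along the lines of TensorFlow’s word embedding tutorial: it lowercases the text, strips the HTML line-break tags left in the IMDB reviews, and removes punctuation.

```python
import re
import string

def custom_standardization(input_data):
  # Lowercase the raw strings.
  lowercase = tf.strings.lower(input_data)
  # Remove the "<br />" HTML tags found in the IMDB reviews.
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  # Strip all punctuation characters.
  return tf.strings.regex_replace(
      stripped_html, '[%s]' % re.escape(string.punctuation), '')
```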

Now our data will be more standardized with our custom function.

TextVectorization

Since we created our custom standardization function, we can pass it into the TextVectorization layer we import from TensorFlow. TextVectorization is a layer that we use to map our strings to integers. We will pass in our custom standardization function, use up to 10,000 unique words (the vocabulary), and keep a maximum of 100 words for each review. Check the lines below:
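A sketch of the layer configuration; note that at the time of writing TextVectorization lived under the experimental preprocessing module (in newer TensorFlow versions it is available directly as tf.keras.layers.TextVectorization):

```python
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

vocab_size = 10000      # keep the 10,000 most frequent words
sequence_length = 100   # truncate or pad each review to 100 tokens

vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)
```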

We will remove the labels from the train dataset and call the .adapt() function to build the vocabulary we will use later on. Note that we haven’t vectorized our dataset yet; we just create the vocabulary with the lines below:
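A minimal sketch of that step:

```python
# Drop the labels, keeping only the raw review text.
text_ds = train_ds.map(lambda x, y: x)

# Build the vocabulary from the training text (no vectorization yet).
vectorize_layer.adapt(text_ds)
```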

Model Building and Training

We already processed our reviews, and it is time to create our model.

Model Creation

We will make the initial imports, which include the Sequential API for model building and the Embedding, GlobalAveragePooling1D, and Dense layers we will use in the model.
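These imports might look like this:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense
```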

We set the embedding dimension to 16, so each word will have 16 representative values. We limit the vocabulary size to 10,000 in parallel with the code above.

We add the following layers to our Keras model:

1 — A TextVectorization layer for converting strings to integers;

2 — An Embedding layer to convert integer values into 16-dimensional vectors;

3 — A Global Average Pooling 1D layer to resolve the issue of having reviews with different lengths;

4 — A Dense layer with 16 neurons and a ReLU activation function;

5 — A final Dense layer with 1 neuron to classify whether the review has a positive or negative sentiment.

The following lines do all these:
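A sketch of the model, reusing the vectorize_layer and vocab_size defined above; naming the Embedding layer "embedding" makes it easy to retrieve its weights later:

```python
embedding_dim = 16  # each word is represented by 16 values

model = Sequential([
  vectorize_layer,                                        # strings -> integer sequences
  Embedding(vocab_size, embedding_dim, name='embedding'), # integers -> 16-dim vectors
  GlobalAveragePooling1D(),                               # average over variable-length reviews
  Dense(16, activation='relu'),
  Dense(1)                                                # single logit for the sentiment
])
```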

Set Up Callbacks for TensorBoard

Since we want to see how our model evolves and performs over time, we will configure our callback settings with the following lines:
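A minimal sketch, assuming we log to a "logs" directory:

```python
# Write training metrics to the "logs" directory for TensorBoard.
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")
```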

We will use these callbacks to visualize our model performance at each epoch using TensorBoard.

Configure the Model

Then, we will configure our model with Adam as the optimizer and Binary Crossentropy as the loss function, since this is a binary classification task, and select accuracy as our performance metric.
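A sketch of the compile step; from_logits=True matches the final Dense layer above, which has no activation:

```python
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
```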

Start the Training

Now that our model is configured, we can call the .fit() function to start the training. We will run it for 15 epochs and record the callbacks for TensorBoard.
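A sketch of the training call:

```python
model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15,
    callbacks=[tensorboard_callback])
```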

Here is a screenshot of the training process, as shown in Figure 7:

Figure 7. Model Performance at Each Epoch

Visualize the Results

Now that we concluded our model training, let’s do some visualization to better understand what we built and trained.

The Summary of the Model

We can easily see the summary of our model with the .summary() function, as shown below:
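```python
model.summary()
```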

Figure 8 shows how our model looks and lists the number of parameters and output shape for each layer:

 Figure 8. Our Model Summary

Training Performance on TensorBoard

Let’s see how our model evolved as it trained on the IMDB Reviews dataset. We can use TensorFlow’s visualization kit, TensorBoard. TensorBoard can be used for several machine learning visualization tasks, such as:

  • Tracking and visualizing loss and accuracy metrics
  • Visualizing the model graph
  • Viewing the evolution of weights, biases, and other tensor values
  • Displaying images, text, and audio data
  • and more…
 Figure 9. A Screenshot of Our TensorBoard Instance

In this tutorial, we will use the %load_ext magic command to load TensorBoard and view the logs. The lines below will run a small TensorBoard server within our cell to visualize our metric values over time:
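A sketch of those commands, assuming the callback above wrote its logs to the "logs" directory:

```python
# Colab/Jupyter magic commands: load the TensorBoard extension
# and point it at the log directory used by the callback.
%load_ext tensorboard
%tensorboard --logdir logs
```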

As you can see on the left, our accuracy increases over time while our loss values decrease. Figure 9 shows that our model does what it is supposed to do because decreasing loss value means that our model is doing something to lower its mistakes: learning.

Visualization with Embedding Projector

Our model looks nice, and it learned a lot in just 15 epochs. But the main goal of this tutorial is to create a word embedding, so we will not predict review sentiments here. Instead, we will visualize our word embedding cloud using the Embedding Projector.

Embedding Projector is a tool built on top of TensorBoard. It is useful for analyzing data and visualizing the position of embedding values relative to one another. Using Embedding Projector, we can graphically represent high-dimensional embeddings by reducing them with algorithms like PCA.

Get the Vector Values and Vocabulary Data

We will start by getting our 16-dimensional embedding values for each word. Also, we will get a list of all these words we embedded. We can achieve these two tasks with the following code:
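A minimal sketch, relying on the layer name "embedding" given to the Embedding layer above:

```python
# The trained 16-dimensional vectors, one row per word in the vocabulary.
weights = model.get_layer('embedding').get_weights()[0]

# The list of words, in the same order as the rows of "weights".
vocab = vectorize_layer.get_vocabulary()
```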

Let’s see how a word and its vector values look with a random example. We select the word with index 500 and view it with the following code:
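```python
# Print the word at index 500 and its 16 embedding values.
print(vocab[500])
print(weights[500])
```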

The vector values and the corresponding word for index 500 are shown in Figure 10:

 Figure 10. A Random Example of Word-Vector Pair

Feel free to change the index value to view other words with their vector values.

Save the Data to New Files

Now we have the entire list of words (vocabulary) with their corresponding 16-dimensional vector values. We will save word names to the metadata.tsv file and vector values to the vectors.tsv file. The following lines create new files, write our data to these new files, save the data, close the files, and download them to your local machine:
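A sketch of these steps; the Colab download at the end is optional and will simply be skipped outside Colab:

```python
import io

out_v = io.open('vectors.tsv', 'w', encoding='utf-8')
out_m = io.open('metadata.tsv', 'w', encoding='utf-8')

for index, word in enumerate(vocab):
  if index == 0:
    continue  # skip the padding token at index 0
  vec = weights[index]
  out_v.write('\t'.join([str(x) for x in vec]) + '\n')  # vector values
  out_m.write(word + '\n')                              # corresponding word

out_v.close()
out_m.close()

# Download the files to the local machine when running in Google Colab.
try:
  from google.colab import files
  files.download('vectors.tsv')
  files.download('metadata.tsv')
except ImportError:
  pass
```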

Load to Embedding Projector

Now we visit the Embedding Projector website at https://projector.tensorflow.org:

Then, we click the “Load” button on the left to load our vectors.tsv and metadata.tsv files, and click anywhere outside of the popup window to close it.

and, voilà!

 Figure 11. Our Word Embedding Trained on IMDB Reviews Dataset

Note that the Embedding Projector runs a PCA algorithm to reduce the 16-dimensional vector space to three dimensions, since this is the only way to visualize it.

Congratulations

You have successfully built a neural network to train a word embedding model, and it takes a lot of effort to achieve this. Pat yourself on the back and keep improving yourself in the field of natural language processing, as there are many unsolved problems.
