Subscribe to Get All the Blog Posts and Colab Notebooks 

Sentiment Analysis in 10 Minutes with Rule-Based VADER and NLTK

Sentiment Analysis in 10 Minutes with Rule-Based VADER and NLTK

Figure 1. Photo by Marian Kroell on Unsplash

Using the “Valence Aware Dictionary and sEntiment Reasoner” on the IMDB Reviews Dataset for Rule-based Sentiment Analysis

For a long time, I have been writing on statistical NLP topics and sharing tutorials. The sub-field of statistical NLP is responsible for several impressive advancements in the field of natural language processing, and it has the highest potential among competing approaches. However, in some cases, the contribution of rule-based classical natural language processing might be sought.

When to Use Rule-Based Approach Instead of Statistical NLP

In cases where researchers have deep pockets with lots of talented researchers and dealing with general problems, statistical NLP is usually the preferred way to tackle the NLP problem. But, in the following cases, the rule-based approach might be fruitful:

1 — Domain-Specific Problem:

We have great pre-trained models such as GPT-3, BERT, ELMo, which do wonders on generic language problems. However, when we try to use them in domain-specific problems such as financial news sentiment analysis or legal text classification, the specificity required for such tasks may not be satisfied by these state-of-the-art models. Therefore, we either have to fine-tune these models with additional labeled data or rely on rule-based models.

2 — Lack of Labeled Data:

Even though we might want to fine-tune a model, it may not always be possible. Especially, if you are with a small team or don’t have the funds to hire people via freelancing platforms such as Amazon Mechanical Turk, you cannot generate labeled data to fine-tune a pre-trained model, not to mention build your own deep learning model. Lastly, it may not be possible to collect a meaningful amount of data to train a deep learning model. In the end, statistical NLP models are very data-hungry.

3 — Limited Available Funding for Training:

Even though you have some available labeled specific data, training a dedicated model has its own cost. Not only that, you would need a group of star data scientists, but you also need distributed-servers to train your model, and your pockets may not be that deep.

If you have one of these issues, your best bet might be rule-based NLP, and the accuracy levels of rule-based NLPs are not as bad as you might think. In this post, we will build a simple Lexicon-based Sentiment Classifier without much tuning, and we will achieve an acceptable accuracy performance, which may be increased even further.

Before starting, though, let’s cover some basics:

What is a Lexicon?

Lexicon sounds like a fancy technical term, but it means a dictionary, usually in a particular domain. In other words:

A lexicon is the vocabulary of a person, language, or branch of knowledge.

In a rule-based NLP study for sentiment analysis, we need a lexicon that serves as a reference manual to measure the sentiment of a chunk of text (e.g., word, phrase, sentence, paragraph, full text). Lexicon-based sentiment analysis can be as simple as positive-labeled words minus negative-labeled words to see if a text has a positive sentiment. It can also be very complex with negation rules, distance calculations, added-variance, and several additional rules. One of the main differences between rule-based NLP and statistical NLP is that in rule-based NLP, the researcher is completely free to add any rule they deem useful. Therefore, in rule-based NLP, what we usually see is that highly trained experts develop theory-based rules in a particular domain and apply them to a particular problem in this particular domain.

What is VADER?

One of the most popular rule-based sentiment analysis models is VADER. VADER, or Valence Aware Dictionary and sEntiment Reasoner, is a lexicon and rule-based sentiment analysis tool specifically attuned to sentiments expressed in social media.

VADER is like the GPT-3 of Rule-Based NLP Models.

Since it is tuned for social media content, it performs best on the content you can find on social media. However, it still offers acceptable F1 Scores on other test sets, and provides a comparable performance compared to complex statistical models such as Support Vector Machines, as you can see below:

Figure 2. Three-class Accuracy (F1 scores) for Each Machine Trained Model (Figure from the paper)

Note that there are several alternative lexicons that you can use for your project, such as Harvard’s General Inquirer, Loughran McDonald, Hu & Liu. In this tutorial, we will adopt the VADER’s lexicon along with its methodology.

Now that you have a basic understanding of rule-based NLP models, we can proceed with our tutorial. This tutorial will approach a classic sentiment analysis problem from a rule-based NLP perspective: A Lexicon-based sentiment analysis on the IMDB Reviews Dataset.

Let’s start:

Sentiment Analysis on IMDB Dataset

What is IMDB Reviews Dataset?

IMDB Reviews Dataset is a large movie review dataset collected and prepared by Andrew L. Maas from the popular movie rating service, IMDB. The IMDB Reviews dataset is used for binary sentiment classification, whether a review is positive or negative. It contains 25,000 movie reviews for training and 25,000 for testing. All these 50,000 reviews are labeled data that may be used for supervised deep learning. Besides, there is an additional 50,000 unlabeled reviews that we will not use in this case study. In this case study, we will only use the training dataset.

Loading and Processing the Dataset

We will start by loading the IMDB dataset by using Keras’s Data API. However, Keras provides the dataset in the encoded version. Luckily we can also load the index dictionary to decode it to original reviews. The following lines will load the encoded reviews along with the index. We will also create the reverse index for decoding:

Before decoding the entire dataset, let’s see the operation with an example:

Output: this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came...

As you can see, we can decode our encoded reviews using the reversed index. If we can decode one review, for all the reviews, all we need is a for loop. With the code below, we will create a nested list in which we place the sentiment label and the decoded review text. We also need to do error handling due to a typo in the dataset (apparently Keras team encoded one of the words wrong :/). But, the code below handles this error as well:

Finally, we will create a pandas DataFrame from the nested list we created above:


Figure 3. Summary Info on Our IMDB Reviews Dataset | Figure 4. The first 10 Rows of our IMDB Reviews Dataset (Figures by Author) (Note that we skipped the single review with the incorrect encoding)

Now that our data is ready, we can load VADER.

Loading VADER Sentiment Intensity Analyzer

Python’s most advanced NLP tool, NLTK, provides a module for VADER, and we can easily import a SentimentIntensityAnalzer with the following code:

To test how our model works, let’s feed a simple sentence with neutral and negative words:

We use the polarity_score calculator of our SentimentIntensityAnalyzer model. This model gives us four scores: (i) Negativity, (ii) Positivity, (iii) Neutrality score of the sentence, and finally, (iv) Compound sentiment score of the sentence. The compound score is basically an aggregated version of the first three scores, and we will be using this score to measure the sentiment of our reviews. Here is the output of our dummy sentence: “Hello, world. I am terrible”.


Figure 5. VADER Polarity Scores of Our Dummy Sentence (Figure by Author)

Calculating Polarity Scores and Predicting:

Since we successfully calculated the sentiment score of a single sentence, all we have to do is run a loop for the entire dataset. Before calculating the score, although not necessary, I will shuffle the dataset first:

Instead of for loop, we will use a more efficient alternative: apply a lambda function on our dataset column Text and create a new column to save the results with the name Prediction. The following single line does these:

Editing Labels and Creating Accuracy Column:

The code above will simply convert negative compound scores to -1 and positive compound scores to 1. Since our IMDB Reviews dataset has 1 for positive sentiments and 0 for negative sentiments, we will change 0s with -1s so that we can calculate the accuracy of our predictions, which will be in a new column, called Accuracy:

Create a Column for Confusion Matrix

Finally, I want to create a confusion matrix to properly measure our rule-based NLP sentiment classifier’s success. A confusion matrix shows True Positives, True Negatives, False Positives, and False Negatives, which we can use to calculate Accuracy, Recall, Precision, and F1 Scores. With the lines below, I will create a custom function to generate confusion matrix tags and apply them as a lambda function, as I did above:

The code above will create a column, Conf_Matrix, and tag each prediction with abbreviations of True Positives, True Negatives, False Positives, and False Negatives.

Let’s see how the tail of our final DataFrame looks:



Figure 6. The Last 10 Rows of Our Final DataFrame (Figures by Author)

Calculating Accuracy, Recall, Precision, and F1 Score:

To see how we did with our VADER model, I will use several custom formulas to calculate Accuracy, Recall, Precision, and F1 Score. Although there are several API solutions for confusion matrix calculations, I decided to use custom calculations:

Thanks to the Conf_Matrix column, these calculations are straightforward to handle, and here are the results:


Figure 7. Performance Indicators of Lexicon-Based VADER Sentiment Classifier (Figures by Author)

As you can see, without a single second of training or customization, we achieved a 70% accuracy on the movie reviews dataset. Don’t forget that VADER is a social-media lexicon. Therefore, using a movie-reviews-based Lexicon would give us even higher performances.


You have successfully built a sentiment classifier that is based on rule-based NLP. One of the biggest advantages of rule-based NLP methods is that they are fully explainable as opposed to glamorous transformer-based NLP models such as BERT and GPT-3. Therefore, apart from its budget-friendly nature, the model’s explainability is another great reason to rely on rule-based NLP models, especially in sensitive areas.

Sentiment Analysis in 10 Minutes with BERT and TensorFlow

Sentiment Analysis in 10 Minutes with BERT and TensorFlow

Learn the basics of the pre-trained NLP model, BERT, and build a sentiment classifier using the IMDB movie reviews dataset, TensorFlow, and Hugging Face transformers

I prepared this tutorial because it is somehow very difficult to find a blog post with actual working BERT code from the beginning till the end. They are always full of bugs. So, I have dug into several articles, put together their codes, edited them, and finally have a working BERT model. So, just by running the code in this tutorial, you can actually create a BERT model and fine-tune it for sentiment analysis.


Figure 1. Photo by Lukas on Unsplash

Natural language processing (NLP) is one of the most cumbersome areas of artificial intelligence when it comes to data preprocessing. Apart from the preprocessing and tokenizing text datasets, it takes a lot of time to train successful NLP models. But today is your lucky day! We will build a sentiment classifier with a pre-trained NLP model: BERT.

What is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers and it is a state-of-the-art machine learning model used for NLP tasks. Jacob Devlin and his colleagues developed BERT at Google in 2018. Devlin and his colleagues trained the BERT on English Wikipedia (2,500M words) and BooksCorpus (800M words) and achieved the best accuracies for some of the NLP tasks in 2018. There are two pre-trained general BERT variations: The base model is a 12-layer, 768-hidden, 12-heads, 110M parameter neural network architecture, whereas the large model is a 24-layer, 1024-hidden, 16-heads, 340M parameter neural network architecture. Figure 2 shows the visualization of the BERT network created by Devlin et al.



Figure 2. Overall pre-training and fine-tuning procedures for BERT (Figure from the BERT paper)

So, I don’t want to dive deep into BERT since we need a whole different post for that. In fact, I already scheduled a post aimed at comparing rival pre-trained NLP models. But, you will have to wait for a bit.

Additionally, I believe I should mention that although Open AI’s GPT3 outperforms BERT, the limited access to GPT3 forces us to use BERT. But rest assured, BERT is also an excellent NLP model. Here is a basic visual network comparison among rival NLP models: BERT, GPT, and ELMo:


Figure 3. Differences in pre-training model architectures of BERT, GPT, and ELMo (Figure from the BERT paper)

Installing Hugging Face Transformers Library

One of the questions that I had the most difficulty resolving was to figure out where to find the BERT model that I can use with TensorFlow. Finally, I discovered Hugging Face’s Transformers library.

Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, etc in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.

We can easily load a pre-trained BERT from the Transformers library. But, make sure you install it since it is not pre-installed in the Google Colab notebook.

Sentiment Analysis with BERT

Now that we covered the basics of BERT and Hugging Face, we can dive into our tutorial. We will do the following operations to train a sentiment analysis model:

  • Install Transformers library;
  • Load the BERT Classifier and Tokenizer alıng with Input modules;
  • Download the IMDB Reviews Data and create a processed dataset (this will take several operations;
  • Configure the Loaded BERT model and Train for Fine-tuning
  • Make Predictions with the Fine-tuned Model

Let’s get started!

Note that I strongly recommend you to use a Google Colab notebook. If you want to learn more about how you will create a Google Colab notebook, check out this article:

Installing Transformers

Installing the Transformers library is fairly easy. Just run the following pip line on a Google Colab cell:

After the installation is completed, we will load the pre-trained BERT Tokenizer and Sequence Classifier as well as InputExample and InputFeatures. Then, we will build our model with the Sequence Classifier and our tokenizer with BERT’s Tokenizer.

Let’s see the summary of our BERT model:

Here are the results. We have the main BERT model, a dropout layer to prevent overfitting, and finally a dense layer for classification task:



Figure 4. Summary of BERT Model for Sentiment Classification

Now that we have our model, let’s create our input sequences from the IMDB reviews dataset:

IMDB Dataset

IMDB Reviews Dataset is a large movie review dataset collected and prepared by Andrew L. Maas from the popular movie rating service, IMDB. The IMDB Reviews dataset is used for binary sentiment classification, whether a review is positive or negative. It contains 25,000 movie reviews for training and 25,000 for testing. All these 50,000 reviews are labeled data that may be used for supervised deep learning. Besides, there is an additional 50,000 unlabeled reviews that we will not use in this case study. In this case study, we will only use the training dataset.

Initial Imports

We will first have two imports: TensorFlow and Pandas.

Get the Data from the Stanford Repo

Then, we can download the dataset from Stanford’s relevant directory with tf.keras.utils.get_file function, as shown below:

Remove Unlabeled Reviews

To remove the unlabeled reviews, we need the following operations. The comments below explain each operation:

Train and Test Split

Now that we have our data cleaned and prepared, we can create text_dataset_from_directory with the following lines. I want to process the entire data in a single batch. That’s why I selected a very large batch size:

Convert to Pandas to View and Process

Now we have our basic train and test datasets, I want to prepare them for our BERT model. To make it more comprehensible, I will create a pandas dataframe from our TensorFlow dataset object. The following code converts our train Dataset object to train pandas dataframe:

Here is the first 5 row of our dataset:



Figure 5. First 5 Row of Our Dataset

I will do the same operations for the test dataset with the following lines:

Creating Input Sequences

We have two pandas Dataframe objects waiting for us to convert them into suitable objects for the BERT model. We will take advantage of the InputExample function that helps us to create sequences from our dataset. The InputExample function can be called as follows:

Now we will create two main functions:

1 — convert_data_to_examples: This will accept our train and test datasets and convert each row into an InputExample object.

2 — convert_examples_to_tf_dataset: This function will tokenize the InputExample objects, then create the required input format with the tokenized objects, finally, create an input dataset that we can feed to the model.

We can call the functions we created above with the following lines:

Our dataset containing processed input sequences are ready to be fed to the model.

Configuring the BERT model and Fine-tuning

We will use Adam as our optimizer, CategoricalCrossentropy as our loss function, and SparseCategoricalAccuracy as our accuracy metric. Fine-tuning the model for 2 epochs will give us around 95% accuracy, which is great.

Training the model might take a while, so ensure you enabled the GPU acceleration from the Notebook Settings. After our training is completed, we can move onto making sentiment predictions.

Making Predictions

I created a list of two reviews I created. The first one is a positive review, while the second one is clearly negative.

We need to tokenize our reviews with our pre-trained BERT tokenizer. We will then feed these tokenized sequences to our model and run a final softmax layer to get the predictions. We can then use the argmax function to determine whether our sentiment prediction for the review is positive or negative. Finally, we will print out the results with a simple for loop. The following lines do all of these said operations:


Figure 6. Our Dummy Reviews with Their Predictions

Also, with the code above, you can predict as many reviews as possible.


You have successfully built a transformers network with a pre-trained BERT model and achieved ~95% accuracy on the sentiment analysis of the IMDB reviews dataset! If you are curious about saving your model, I would like to direct you to the Keras Documentation. After all, to efficiently use an API, one must learn how to read and use the documentation.

Mastering Word Embeddings in 10 Minutes with TensorFlow

Mastering Word Embeddings in 10 Minutes with TensorFlow

Covering the Basics of Word Embedding, One Hot Encoding, Text Vectorization, Embedding Layers, and an Example Neural Network Architecture for NLP

Figure 1. Photo by Nick Hillier on Unsplash

Listen to the Audio Version

Word embedding is one of the most important concepts in Natural Language Processing (NLP). It is an NLP technique where words or phrases (i.e., strings) from a vocabulary are mapped to vectors of real numbers. The need to map strings into vectors of real numbers originated from computers’ inability to do operations with strings.

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

There are several NLP techniques to convert strings into representative numbers, such as:

  • One Hot Encoding
  • Encoding with a Unique Number
  • Word Embedding

Before diving into word embedding, let’s compare these three options to see why Word embedding is the best.

One Hot Encoding

One-hot encoding can be achieved by creating a vector of zeros with the length of the entire vocabulary. Then, we only place “one” in the index where the word is. For each word, we create the same vector. Below you can see an example of one hot encoding where we encode the sentence, “His dog is two years old”.

 Figure 2. One Hot Encoding for “His dog is two years old” (Figure by Author)

As the vocabulary size increases, so does the size of the vector. One hot encoding is usually regarded as an inefficient method to vectorize strings. As you can see in the example above, most of the values are zero. This approach is considered computationally complex in an unnecessary way.

Encoding with a Unique ID Number

Instead of creating a vector of zeros for each word in a sentence, you may choose to assign each word a unique ID number. Therefore, instead of using strings, you may assign 1 to “dog”, 2 to “his”, 3 to “is”, 4 to “old”, 5 to “two”, 6 to “years”. This way, we can represent our string, “His dog is two years old”, as [2, 1, 3, 5, 6, 4].

 Figure 3. Encoding with a Unique ID Number for “His dog is two years old” (Figure by Author)

As you can see, it is much easier to create. However, it comes with a covenant: These numbers do not have any relational representation since the encoding is arbitrary. There is nothing that stops us from changing the encoding differently. Therefore, these values do not capture a relationship between words. The lack of relational representation makes it difficult for models to interpret these values since the value of the feature (the number assigned) does not mean anything. Therefore, encoding with a unique ID number is also not a very good idea to capture patterns.

Word Embedding

Since one-hot encoding is inefficient and encoding with unique IDs does not offer relational representation, we have to rely on another method. This method is word embedding. As we already covered above, word embedding is the task of converting words into vectors.

 Figure 4. A Basic 3-Dimensional Word Embedding for “His dog is” (Figure by Author)

As you can see above, you can actually see a relational representation of the words. We didn’t have this feature in encoding with a unique number. Furthermore, we can also represent each word with a less complex vector compared to one-hot encoding. The size of our final array is much smaller than a one-hot encoded vocabulary.

Deep Learning for the Vectorization

So, in an ideal world, our goal is to find the perfect vector weights for each word so that their relations can be properly represented. But how are we going to know whether a word is closely related to another? Of course, we have an intuition since we have been learning languages all of our lives. But, our model must receive the training it requires to analyze these words. Therefore, we need a machine learning problem. We can adopt supervised, unsupervised, or semi-supervised learning approaches for word embedding. In this post, we will tackle a supervised learning problem to vectorize our strings.

A Neural Network Architecture for Word Embedding

We need to create a neural network to find ideal vector values for each word. At this stage, what we need is the following:

1 — A Text Vectorization layer for converting Text to 1D Tensor of Integers

2 — An Embedding layer to convert 1D Tensors of Integers into dense vectors of fixed size.

3 — A fully connected neural network for backpropagation and cost function and other deep learning tasks

Text Vectorization Layer

Text vectorization is an experimental layer that offers a lot of value for the text preprocessing automation. The layer does the following procedures:

  • Standardize each sentence with lowercasing and punctuation stripping
  • Split each sentence into words
  • Recombine words into tokens
  • Index these tokens
  • Transform each sentence using the index into a 1D tensor of integer token indices or float values.

Here is a straightforward example of how the TextVectorization layer works.

We first create a vocabulary in line with our examples above.

 Figure 5. The Vocabulary of TextVectorization Layer Example

Then, we create a TextVectorization layer. Then, we create a dummy model with Keras Sequential API. Then, we feed several sentences to create a list of 1D tensors of integers:

 Figure 6. The I/O of the Operations with TextVectorization Layer

The length of the 1D output tensors is three because we set it to three. Feel free to try other values.

Embedding Layer

The embedding layer has a simple capability:

It turns positive integers (indexes) into dense vectors of fixed size.

Let’s see it with a basic example:

I passed the output from the TextVectorization example as input and set the output dimension to two. Therefore, each of our input integers is now represented with a 2-dims vector. While our input shape is (3, 3), our output shape is (3, 3, 2).

 Figure 7. The I/O of the Operations with Embedding Layer

By changing the output dim, you can easily generate a more complex vector space, which would increase the computational complexity, but can potentially capture more pattern.

A Set of Fully Connected Layers

 Figure 8. A Basic Example of A Set of Fully Connected Layers (Figure by Author)

After TextVectorization and Embedding layers, we end up with the desired vector space. But, these values are still very random. Therefore, we need an optimizer that calculates the cost function and adjusts the values with backpropagation. This optimizer should be run on top of a set of fully connected layers. In some situations, you can also add other types of layers such as LSTM, GRU, or Convolution layers. To keep it simple, I will refer to them as a set of fully connected layers.

Note: If your embedding output has -by any chance- variable length, make sure to add a Global Pooling Layer before the set of fully connected layers.

Using a set of fully connected layers, we can feed our strings against labels to adjust the vector values. For example, assume that we have a dataset of sentences with their sentiment labels (e.g., positive or negative). We can then vectorize and embed these sentences using a vocabulary we created from our unique words. We can then create a set of fully connected layers to predict whether their sentiment is positive or negative. We can use the labels to backpropagate our model to optimize these vectors. Our result would look something similar to this:

 Figure 9. Word Embedding created using Word2Vec (Figure by Author on Embedding Projector)

Final Notes

Roughly speaking, a neural network architecture for Word embedding would look like this:

 Figure 10. A Basic Artificial Neural Network Architecture for Word embedding (Figure by Author)

As I mentioned above, a complex word embedding model would require other layers such as Global Pooling, LSTM, GRU, Convolution, and others. However, the structure remains the same.

Now that you have an idea about Word embedding, in Part II of this series, we can actually create our own word embedding using the IMDB Reviews dataset and visualize it as in Figure 9. Check out Part 2:

Mastering Word Embeddings in 10 Minutes with IMDB Reviews

Mastering Word Embeddings in 10 Minutes with IMDB Reviews

Learn the Basics of Text Vectorization, Create a Word Embedding Model trained with a Neural Network on IMDB Reviews Dataset, and Visualize it with TensorBoard Embedding Projector

Figure 1. Photo by Raphael Schaller on Unsplash

This is a follow-up tutorial prepared after Part I of the tutorial, Mastering Word Embeddings in 10 Minutes with TensorFlow, where we introduce several word vectorization concepts such as One Hot Encoding and Encoding with a Unique ID Value. I would highly recommend you to check this tutorial if you are new to natural language processing.

In Part II of the tutorial, we will vectorize our words and trained their values using the IMDB Reviews dataset. This tutorial is our own take on TensorFlow’s tutorial on word embedding. We will train a word embedding using a simple Keras model and the IMDB Reviews dataset. Then, we will visualize them using Embedding Projector.

Let’s start:

Create a New Google Colab Notebook

First of all, you need the environment to start coding. For the sake of simplicity, I recommend you work with Google Colab. It comes with all the libraries pre-installed, and you won’t have to worry about them. All you need is a Google account, and I am sure you have one. So, create a new Colab notebook (see Figure 2) and start coding.

  Figure 2: Create a New Google Colab Notebook

Initial Imports

We will start by importing TensorFlow and os libraries. We will use the os library for some directory level operations we will do below and the TensorFlow library for dataset loading, deep learning models, and text preprocessing.

Download the IMDB Reviews Dataset

IMDB Reviews Dataset is a large movie review dataset collected and prepared by Andrew L. Maas from the popular movie rating service, IMDB. The IMDB Reviews dataset is used for binary sentiment classification, whether a review is positive or negative. It contains 25,000 movie reviews for training and 25,000 for testing. All these 50,000 reviews are labeled data that may be used for supervised deep learning. Besides, there is an additional 50,000 unlabeled reviews that we will not use in this case study. In this case study, we will only use the training dataset.

We can download the dataset from Stanford’s relevant directory with tf.keras.utils.get_file function, as shown below:

Dataset Creation

We need a little bit of housekeeping to create a proper dataset. Let’s start with viewing our main directory with the

As you can see below, we have our train and test folders. For this study, we will only use the /train folder

Figure 3. The Content of Main Directory, “aclImdb”

With the following lines, let’s view what’s under the /train subdirectory:

and here it is:

Figure 4.a. The Content of Sub-Directory, “aclImdb/train”

We have reviews with negative sentiments and positive sentiments. Next step, we will remove theunsup folder, which contains unlabeled reviews. Since we are working on a supervised learning problem in this tutorial, we do not need it.

As you can see in Figure X, we removed the unsup folder thanks to theshutil library:

 Figure 4. b. The Content of Sub-Directory, “aclImdb/train” after we remove the “unsup” folder

Create the Dataset

Now that we cleaned our directory, we can create our Dataset object. For this, we can use thetf.keras.preprocessing.text_dataset_from_directory function. As the name suggests, the text_dataset_from_directory function allows us to create text datasets directly from a directory. We selected an 80/20 train and validation split, but feel free to play around by adjusting the validation_split argument.

As you can see in Figure 5, we have 20,000 reviews for training and 5,000 for validation.

 Figure 5. The volume of Our Train and Validation Dataset

Let’s check how our dataset looks by using the .take() function and run a for-loop. Note that our dataset is a TensorFlow Dataset object. It requires a little more effort to print out its elements. The following line does that:

And here is the results in Figure 6:

 Figure 6. The First Five Reviews from the Training Dataset with Their Sentiment Info in the Beginning

Configure the Dataset

Now, since we are in the realms of deep learning, optimization is essential for a bearable training experience. TensorFlow has an experimental tool that we can use to optimize the workload and shorten the time needed for preprocessing, training, and other parallel operations. We can optimize our pipeline with the following lines:

Text Preprocessing

Now that we created our dataset, it is time to process its elements so that our model can understand them.

Custom Standardization

We will create a custom string standardization function to make the best of standardization. Standardization can be described as a set of preprocessing operations for NLP studies, including lowercasing, tag removal, and punctuation stripping. In the below code, we are achieving exactly these:

Now our data will be more standardized with our custom function.


Since we created our custom standardization function, we can pass it in the TextVectorization layer we import from TensorFlow. TextVectorization is a layer that we use to map our strings to integers. We will pass in our custom standardization function, we will use up to 10,000 unique words (vocabulary), and we will keep a maximum of 100 words for each review. Check the below lines:

We will remove the labels from the train dataset and call the .adapt() function to build the vocabulary to use later on. Note that we haven’t vectorized our dataset yet. Just created the vocabulary with the lines below:

Model Building and Training

We already processed our reviews, and it is time to create our model.

Model Creation

We will make the inial imports, which include Sequential API for model building and Embedding, GlobalAveragePooling, and Dense layers we will use in the model.

We set the embedding dimension to 16, so each word will have 16 representative values. We limit the vocabulary size to 10,000 in parallel with the code above.

We add the following layers to our Keras model:

1 — A TextVectorization layer for converting strings to integers;

2 — AEmbedding layer to convert integer values with 16-dimensional vectors;

3 — A Global Average Pooling 1D layer to resolve the issue of having reviews with different lengths;

4 — A Dense layer with 16 neurons with a relu activation layer

5 — A final Dense layer with 1 neuron to classify if the review has a positive or negative sentiment.

The following lines do all these:

Set Up Callbacks for TensorBoard

Since we want to see how our model evolves and performs over time, we will configure our callback settings with the following lines:

We will use these callbacks to visualize our model performance at each epoch using TensorBoard

Configure the Model

Then, we will configure our model with Adam as optimizer and Binary Crossentropy as loss function because it is a binary classification task and select accuracy as our performance metric.

Start the Training

Now that our model is configured, we can use .fit() function to start the training. We will run for 15 epochs and record the callbacks for TensorBoard.

Here is the screenshot of the training process, as shown in Figure :

Figure 7. Model Performance at Each Epoch

Visualize the Results

Now that we concluded our model training let’s do some visualization to understand better what we built and trained.

The Summary of the Model

We can easily see the summary of our model with the .summary() function, as shown below:

Figure 8 shows how our model looks and lists the number of parameters and output shape for each layer:

 Figure 8. Our Model Summary

Training Performance on TensorBoard

Let’s see how our mode evolved as it trained on the IMDB reviews dataset. We can use TensorFlow’s visualization kit, TensorBoard. TensorBoard can be used for several machine learning visualization tasks such as:

  • Tracking and visualization loss and accuracy measures
  • Visualizing the model graph
  • Viewing the evolution of weights, biases, and other tensor values
  • Displaying images, text, and audio data
  • and more…
 Figure 9. A Screenshot of Our Tensorboard Instance

In this tutorial, we will use %load_ext to load TensorBoard and view the logs. The lines above will run a small server within our cell to visualize our metric values over time.

As you can see on the left, our accuracy increases over time while our loss values decrease. Figure 9 shows that our model does what it is supposed to do because decreasing loss value means that our model is doing something to lower its mistakes: learning.

Visualization with Embedding Projector

Our model looks nice and it learned a lot in just 15 epochs. But, the main goal of this tutorial to create a word embedding. We will not predict review sentiments in this tutorial. Instead, we will visualize our word embedding cloud using Embedding Projector.

Embedding Projector is a tool built on top of TensorBoard. It is a useful tool to analyze data and visualize the position of embedding values relative to one another. Using Embedding Projector, we can graphically represent high dimensional embedding by simplifying them using algorithms like PCA.

Get the Vector Values and Vocabulary Data

We will start by getting our 16-dimensional embedding values for each word. Also, we will get a list of all these words we embedded. We can achieve these two tasks with the following code:

Let’s see how our word and its vector values look with a random example. We selected the word with index no. 500 and visualize it with the following code:

The vector values and the corresponding word for index no. 500 is shown in Figure 10:

 Figure 10. A Random Example of Word-Vector Pair

Feel free to change the index value to view other words with their vector values.

Save the Data to New Files

Now we have the entire list of words (vocabulary) with their corresponding 16-dimensional vector values. We will save word names to the metadata.tsv file and vector values to the vectors.tsv file. The following lines create new files, write our data to these new files, save the data, close the files, and download them to your local machine:

Load to Embedding Projector

Now we visit the Embedding Projector website:

Then, we click the “Load” button on the left to load our vectors.tsv and metadata.tsv files. Then, we can click anywhere outside of the popup window.

and, voilà!

 Figure 11. Our Word Embedding Trained on IMDB Reviews Dataset

Note that Embedding Projectors runs a PCA algorithm to reduce the 16-dimensional vector space into 3-dimensional since this is the only way to visualize it.


You have successfully built a neural network to train a word embedding model, and it takes a lot of effort to achieve this. Pat yourself on the back and keep improving yourself in the field of natural language processing, as there are many unsolved problems.

Kaggle’s Titanic Competition in 10 Minutes | Part-III

Kaggle’s Titanic Competition in 10 Minutes | Part-III

Using Natural Language Processing (NLP), Deep Learning, and GridSearchCV in Kaggle’s Titanic Competition | Machine Learning Tutorials

Figure 1. Titanic Under Construction on Unsplash

If you follow my tutorial series on Kaggle’s Titanic Competition (Part-I and Part-II) or have already participated in the Competition, you are familiar with the whole story. If you are not familiar with it, since this is a follow-up tutorial, I strongly recommend you to check out the Competition Page or Part-I and Part-II of this tutorial series. In Part-III (Final) of the series, (i) we will use natural language processing (NLP) techniques to obtain the titles of the passengers, (ii) create an Artificial Neural Network (ANN or RegularNet) to train the model, and (iii) use Grid Search Cross-Validation to tune the ANN so that we get the best results.

Let’s start!


Part-I of the Tutorial

Throughout this tutorial series, we try to keep things simple and develop the story slowly and clearly. In Part-I of the tutorial, we learned to write a python program with less than 20 lines to enter the Kaggle’s Competition. Things were kept as simple as possible. We cleaned the non-numerical parts, took care of the null values, trained our model using the train.csv file, predicted the passenger’s survival in the test.csv file, and saved it as a CSV file for submission.

Part-II of the Tutorial

Since we did not explore the dataset properly in Part-I, we focus on data exploration in Part-II using Matplotlib and Seaborn. We impute the null values instead of dropping them by using aggregated functions, better cleaned the data, and finally generated the dummy variables from the categorical variables. Then, we use a RandomForestClassifier model instead of LogisticRegression, which also improves precision. We achieve an approximately 20% increase in precision compared to the model in Part-I.

Part-III of the Tutorial

Figure 2. A Diagram of an Artificial Neural Network with one Hidden Layer (Figure by Author)

We will now use the Name column to derive the passengers’ titles, which played a significant role in their survival chances. We will also create an Artificial Neural Network (ANN or RegularNets) with Keras to obtain better results. Finally, to tune the ANN model, we will use GridSearchCV to detect the best parameters. Finally, we will generate a new CSV file for submission.

Preparing the Dataset

Like what we have done in Part-I and Part-II, I will start cleaning data and imputing the null values. This time, we will adopt a different approach and combine the two datasets for cleaning and imputing. We already covered why we impute the null values the way we do in Part-II; therefore, we will give you the code straight away. If you feel that some operations do not make sense, you may refer to Part-II or comment below. However, since we saw in Part-II that people younger than 18 had a greater chance of survival, we should add a new feature to measure this effect.

Data Cleaning and Null Value Imputation

Deriving Passenger Titles with NLP

We will drop the unnecessary columns and generate the dummy variables from the categorical variables above. But first, we need to extract the titles from the ‘Name’ column. To understand what we are doing, we will start by running the following code to get the first 10 rows Name column values.

Name Column Values of the First 10 Rows

And here is what we get:

Figure 3. Name Column Values of the First 10 Row (Figure by Author)

The structure of the name column value is as follows:


Therefore, we need to split these String based on the dot and comma and extract the title. We can accomplish this with the following code:

Splitting the Name Values to Extract Titles

Once we run this code, we will have a Title column with titles in it. To be able to see what kind of titles do we have, we will run this:

Grouping Titles and Get the Counts

Figure 4. Unique Title Counts (Figure by Author)

It seems that we have four major groups: ‘Mr’, ‘Mrs’, ‘Miss’, ‘Master’, and others. However, before grouping all the other titles as Others, we need to take care of the French titles. We need to convert them to their corresponding English titles with the following code:

French to English Title Converter

Now, we only have officers and royal titles. It makes sense to combine them as Others. We can achieve this with the following code:

Combining All the Non-Major Titles as Others (Contains Officer and Royal Titles)

Figure 5. Final Unique Title Counts (Figure by Author)

Final Touch on Data Preparation

Now that our Titles are more manageable, we can create dummies and drop the unnecessary columns with the following code:

Final Touch on Data Preparation

Creating an Artificial Neural Network for Training

Figure 6. A Diagram of an Artificial Neural Network with Two Hidden Layers (Figure by Author)

Standardizing Our Data with Standard Scaler

To get a good result, we must scale our data by using Scikit Learn’s Standard Scaler. Standard Scaler standardizes features by removing the mean and scaling to unit variance (i.e., standardization), which is different than MinMaxScaler. The mathematical difference between Standardization and Normalizer is as follows:

Figure 7. Standardization vs. Normalization (Figure by Author)

We will choose StandardScaler() for scaling our dataset and run the following code:

Scaling Train and Test Datasets

Building the ANN Model

After standardizing our data, we can start building our artificial neural network. We will create one Input Layer (Dense), one Output Layer (Dense), and one Hidden Layer (Dense). After each layer until the Output Layer, we will apply 0.2 Dropout for regularization to fight over-fitting. Finally, we will build the model with Keras Classifier to apply GridSearchCV on this neural network. As we have 14 explanatory variables, our input_dimension must be equal to 14. Since we will make binary classification, our final output layer must output a single value for Survived or Not-Survived classification. The other units in between are “try-and-see” values, and we selected 128 neurons.

Building an ANN with Keras Classifier

Grid Search Cross-Validation

After building the ANN, we will use scikit-learn GridSearchCV to find the best parameters and tune our ANN to get the best results. We will try different optimizers, epochs, and batch_sizes with the following code.

Grid Search with Keras Classifier

After running this code and printing out the best parameters, we get the following output:

Figure 8. Best Parameters and the Accuracy

Please note that we did not activate the Cross-Validation in the GridSearchCV. If you would like to add cross-validation functionality to GridSearchCV, select a cv value inside the GridSearch (e.g., cv=5).

Fitting the Model with Best Parameters

Now that we found the best parameters, we can re-create our classifier with the best parameter values and fit our training dataset with the following code:

Fitting with the Best Parameters

Since we obtain the prediction, we may conduct the final operations to make it ready for submission. One thing to note is that our ANN gives us the probabilities of survival, which is a continuous numerical variable. However, we need a binary categorical variable. Therefore, we are also making the necessary operation with the lambda function below to convert the continuous values to binary values (0 or 1) and writing the results to a CSV file.

Creating the Submission File


Figure 9. Deep Learning vs. Older Algorithms (Figure by Author)

You have created an artificial neural network to classify the Survivals of titanic passengers. Neural Networks are proved to outperform all the other machine learning algorithms as long as there is a large volume of data. Since our dataset only consists of 1309 lines, some machine learning algorithms such as Gradient Boosting Tree or Random Forest with good tuning may outperform neural networks. However, for datasets with large volumes, this will not be the case, as you may see on the chart below:

I would say that Titanic Dataset may be on the left side of the intersection of where older algorithms outperform deep learning algorithms. However, we will still achieve an accuracy rate higher than 80%, around the natural accuracy level.

3 Pre-Trained Model Series to Use for NLP with Transfer Learning

3 Pre-Trained Model Series to Use for NLP with Transfer Learning

Using State-of-the-Art Pre-trained Neural Network Models (OpenAI’s GPTs, BERTs, ELMos) to Tackle Natural Language Processing Problems with Transfer Learning


Figure 1. Photo by Safar Safarov on Unsplash

Before we start, if you are reading this article, I am sure that we share similar interests and are/will be in similar industries. So let’s connect via Linkedin! Please do not hesitate to send a contact request! Orhan G. Yalçın — Linkedin

If you have been trying to build machine learning models with high accuracy; but never tried Transfer Learning, this article will change your life. At least, it did mine!


Figure 2. A Depiction of Transfer Learning Logic (Figure by Author)

Note that this post is also a follow-up post of a post on Transfer Learning for Computer vision tasks. It has started to gain popularity, and now I wanted to share the NLP version of that with you. But, just in case, check it out:

Most of us have already tried several machine learning tutorials to grasp the basics of neural networks. These tutorials helped us understand the basics of artificial neural networks such as Recurrent Neural Networks, Convolutional Neural Networks, GANs, and Autoencoders. But, their main functionality was to prepare you for real-world implementations.

Now, if you are planning to build an AI system that utilizes deep learning, you have to either

  • have deep pockets for training and excellent AI researchers at your disposal*; or
  • benefit from transfer learning.
* According to BD Tech Talks, the training cost of OpenAI's GPT3 exceeded US $4.6 million dollars.

What is Transfer Learning?

Transfer learning is a subfield of machine learning and artificial intelligence, which aims to apply the knowledge gained from one task (source task) to a different but similar task (target task). In other words:

Transfer learning is the improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned.

For example, the knowledge gained while learning to classify Wikipedia texts can help tackle legal text classification problems. Another example would be using the knowledge gained while learning to classify cars to recognize the birds in the sky. As you can see, there is a relation between these examples. We are not using a text classification model on bird detection.

In summary, transfer learning saves us from reinventing the wheel, meaning we don’t waste time doing the things that have already been done by a major company. Thanks to transfer learning, we can build AI applications in a very short amount of time.

History of Transfer Learning

The history of Transfer Learning dates back to 1993. With her paper, Discriminability-Based Transfer between Neural Networks, Lorien Pratt opened the pandora’s box and introduced the world to the potential of transfer learning. In July 1997, the journal Machine Learning published a special issue for transfer learning papers. As the field advanced, adjacent topics such as multi-task learning were also included under the field of transfer learning. Learning to Learn is one of the pioneer books in this field. Today, transfer learning is a powerful source for tech entrepreneurs to build new AI solutions and researchers to push machine learning frontiers.

To show the power of transfer learning, we can quote from Andrew Ng:

Transfer learning will be the next driver of machine learning’s commercial success after supervised learning.


Figure 3. A Depiction of Commercial Potential of Learning Approaches (Figure by Author)

There are three requirements to achieve transfer learning:

  • Development of an Open Source Pre-trained Model by a Third Party
  • Repurposing the Model
  • Fine Tuning for the Problem

Development of an Open Source Pre-trained Model

A pre-trained model is a model created and trained by someone else to solve a similar problem. In practice, someone is almost always a tech giant or a group of star researchers. They usually choose a very large dataset as their base datasets, such as ImageNet or the Wikipedia Corpus. Then, they create a large neural network (e.g., VGG19 has 143,667,240 parameters) to solve a particular problem (e.g., this problem is image classification for VGG19). Of course, this pre-trained model must be made public so that we can take it and repurpose it.

Repurposing the Model

After getting our hands on these pre-trained models, we repurpose the learned knowledge, which includes the layers, features, weights, and biases. There are several ways to load a pre-trained model into our environment. In the end, it is just a file/folder which contains the relevant information. Deep learning libraries already host many of these pre-trained models, which makes them more accessible and convenient:

You can use one of the sources above to load a trained model. It will usually come with all the layers and weights, and you can edit the network as you wish. Additionally, some research labs maintain their own repos, as you will see for ELMo later in this post.

Fine-Tuning for the Problem

Well, while the current model may work for our problem. It is often better to fine-tune the pre-trained model for two reasons:

  • So that we can achieve even higher accuracy;
  • Our fine-tuned model can generate the output in the correct format.

Generally speaking, in a neural network, while the bottom and mid-level layers usually represent general features, the top layers represent the problem-specific features. Since our new problem is different than the original problem, we tend to drop the top layers. By adding layers specific to our problems, we can achieve higher accuracy.

After dropping the top layers, we need to place our own layers so that we can get the output we want. For example, a model trained with English Wikipedia such as BERT can be customized by adding additional layers and further trained with the IMDB Reviews dataset to predict movie reviews sentiments.

After adding our custom layers to the pre-trained model, we can configure it with special loss functions and optimizers and fine-tune it with extra training.

For a quick Transfer Learning tutorial, you may visit the post below:

3 Popular Pre-Trained Model Series for Natural Language Processing

Here are the three pre-trained network series you can use for natural language processing tasks ranging from text classification, sentiment analysis, text generation, word embedding, machine translation, and so on:


Figure 4. Overall Network Comparison for BERT, OpenAI GPT, ELMo (Figure from the BERT paper)

While BERT and OpenAI GPT are based on transformers network, ELMo takes advantage of bidirectional LSTM network.

Ok, let’s dive into them one-by-one.

Open AI GPT Series (GPT-1, GPT-2, and GPT-3)

There are three generations of GPT models created by OpenAI. GPT, which stands for Generative Pre-trained Transformers, is an autoregressive language model that uses deep learning to produce human-like text. Currently, the most advanced GPT available is GPT-3; and the most complex version of GPT-3 has over 175 billion parameters. Before the release of GPT-3 in May 2020, the most complex pre-trained NLP model was Microsoft’s Turing NLG.

GPT-3 can create very realistic text, which is sometimes difficult to distinguish from the human-generated text. That’s why the engineers warned of the GPT-3’s potential dangers and called for risk mitigation research. Here is a video about 14 cool apps built on GPT-3:

As opposed to most other pre-trained NLP models, OpenAI chose not to share the GPT-3’s source code. Instead, they allowed invitation-based API access, and you can apply for a license by visiting their website. Check it out:

On September 22, 2020, Microsoft announced it had licensed “exclusive” use of GPT-3. Therefore, while others have to rely on the API to receive output, Microsoft has control of the source code. Here is brief info about its size and performance:

  • Year Published: 2020 (GPT-3)
  • Size: Unknown
  • Q&A: F1-Scores of 81.5 in zero-shot, 84.0 in one-shot, 85.0 in few-shot learning
  • TriviaAQ: Accuracy of 64.3%
  • LAMBADA: Accuracy of 76.2%
  • Number of Parameters: 175,000,000,000

BERTs (BERT, RoBERTa (by Facebook), DistilBERT, and XLNet)

BERT stands for Bidirectional Encoder Representations from Transformers, and it is a state-of-the-art machine learning model used for NLP tasks. Jacob Devlin and his colleagues developed BERT at Google in 2018. Devlin and his colleagues trained the BERT on English Wikipedia (2.5B words) and BooksCorpus (0.8B words) and achieved the best accuracies for some of the NLP tasks in 2018. There are two pre-trained general BERT variations: The base model is a 12-layer, 768-hidden, 12-heads, 110M parameter neural network architecture, whereas the large model is a 24-layer, 1024-hidden, 16-heads, 340M parameter neural network architecture. Figure 2 shows the visualization of the BERT network created by Devlin et al.


Figure 5. Overall pre-training and fine-tuning procedures for BERT (Figure from the BERT paper)

Even though BERT seems more inferior to GPT-3, the availability of source code to the public makes the model much more popular among developers. You can easily load a BERT variation for your NLP task using the Hugging Face’s Transformers library. Besides, there are several BERT variations, such as original BERT, RoBERTa (by Facebook), DistilBERT, and XLNet. Here is a helpful TDS post on their comparison:

Here is brief info about BERT’s size and performance:

  • Year Published: 2018
  • Size: 440 MB (BERT Baseline)
  • GLUE Benchmark: Average accuracy of 82.1%
  • SQuAD v2.0: Accuracy of 86.3%
  • Number of Parameters: 110,000,000–340,000,000

ELMo Variations

ELMo, short for Embeddings from Language Models, is a word embedding system for representing words and phrases as vectors. ELMo models the syntax and semantic of words as well as their linguistic context, and it was developed by the Allen Institute for Brain Science. There several variations of ELMo, and the most complex ELMo model (ELMo 5.5B) was trained on a dataset of 5.5B tokens consisting of Wikipedia (1.9B) and all of the monolingual news crawl data from WMT 2008–2012 (3.6B). While both BERT and GPT models are based on transformation networks, ELMo models are based on bi-directional LSTM networks.

Here is brief info about ELMo’s size and performance:

  • Year Published: 2018
  • Size: 357 MB (ELMo 5.5B)
  • SQuAD: Accuracy of 85.8%
  • NER: Accuracy of 92.2%
  • Number of Parameters: 93,600,000

Just like BERT models, we also have access to ELMo source code. You can download the different variations of ELMos from Allen NLP’s Website:

Other Pre-Trained Models for Computer Vision Problems

Although there are several other pre-trained NLP models available in the market (e.g., GloVe), GPT, BERT, and ELMo are currently the best pre-trained models out there. Since this post aims to introduce these models, we will not have a code-along tutorial. But, I will share several tutorials where we exploit these very advanced pre-trained NLP models.


In a world where we have easy access to state-of-the-art neural network models, trying to build your own model with limited resources is like trying to reinvent the wheel. It is pointless.

Instead, try to work with these train models, add a couple of new layers on top considering your particular natural language processing task, and train. The results will be much more successful than a model you build from scratch.