Subscribe to Get All the Blog Posts and Colab Notebooks 

Sentiment Analysis in 10 Minutes with Rule-Based VADER and NLTK

Sentiment Analysis in 10 Minutes with Rule-Based VADER and NLTK

Figure 1. Photo by Marian Kroell on Unsplash

Using the “Valence Aware Dictionary and sEntiment Reasoner” on the IMDB Reviews Dataset for Rule-based Sentiment Analysis

For a long time, I have been writing on statistical NLP topics and sharing tutorials. The sub-field of statistical NLP is responsible for several impressive advancements in the field of natural language processing, and it has the highest potential among competing approaches. However, in some cases, the contribution of rule-based classical natural language processing might be sought.

When to Use Rule-Based Approach Instead of Statistical NLP

In cases where researchers have deep pockets with lots of talented researchers and dealing with general problems, statistical NLP is usually the preferred way to tackle the NLP problem. But, in the following cases, the rule-based approach might be fruitful:

1 — Domain-Specific Problem:

We have great pre-trained models such as GPT-3, BERT, ELMo, which do wonders on generic language problems. However, when we try to use them in domain-specific problems such as financial news sentiment analysis or legal text classification, the specificity required for such tasks may not be satisfied by these state-of-the-art models. Therefore, we either have to fine-tune these models with additional labeled data or rely on rule-based models.

2 — Lack of Labeled Data:

Even though we might want to fine-tune a model, it may not always be possible. Especially, if you are with a small team or don’t have the funds to hire people via freelancing platforms such as Amazon Mechanical Turk, you cannot generate labeled data to fine-tune a pre-trained model, not to mention build your own deep learning model. Lastly, it may not be possible to collect a meaningful amount of data to train a deep learning model. In the end, statistical NLP models are very data-hungry.

3 — Limited Available Funding for Training:

Even though you have some available labeled specific data, training a dedicated model has its own cost. Not only that, you would need a group of star data scientists, but you also need distributed-servers to train your model, and your pockets may not be that deep.

If you have one of these issues, your best bet might be rule-based NLP, and the accuracy levels of rule-based NLPs are not as bad as you might think. In this post, we will build a simple Lexicon-based Sentiment Classifier without much tuning, and we will achieve an acceptable accuracy performance, which may be increased even further.

Before starting, though, let’s cover some basics:

What is a Lexicon?

Lexicon sounds like a fancy technical term, but it means a dictionary, usually in a particular domain. In other words:

A lexicon is the vocabulary of a person, language, or branch of knowledge.

In a rule-based NLP study for sentiment analysis, we need a lexicon that serves as a reference manual to measure the sentiment of a chunk of text (e.g., word, phrase, sentence, paragraph, full text). Lexicon-based sentiment analysis can be as simple as positive-labeled words minus negative-labeled words to see if a text has a positive sentiment. It can also be very complex with negation rules, distance calculations, added-variance, and several additional rules. One of the main differences between rule-based NLP and statistical NLP is that in rule-based NLP, the researcher is completely free to add any rule they deem useful. Therefore, in rule-based NLP, what we usually see is that highly trained experts develop theory-based rules in a particular domain and apply them to a particular problem in this particular domain.

What is VADER?

One of the most popular rule-based sentiment analysis models is VADER. VADER, or Valence Aware Dictionary and sEntiment Reasoner, is a lexicon and rule-based sentiment analysis tool specifically attuned to sentiments expressed in social media.

VADER is like the GPT-3 of Rule-Based NLP Models.

Since it is tuned for social media content, it performs best on the content you can find on social media. However, it still offers acceptable F1 Scores on other test sets, and provides a comparable performance compared to complex statistical models such as Support Vector Machines, as you can see below:

Figure 2. Three-class Accuracy (F1 scores) for Each Machine Trained Model (Figure from the paper)

Note that there are several alternative lexicons that you can use for your project, such as Harvard’s General Inquirer, Loughran McDonald, Hu & Liu. In this tutorial, we will adopt the VADER’s lexicon along with its methodology.

Now that you have a basic understanding of rule-based NLP models, we can proceed with our tutorial. This tutorial will approach a classic sentiment analysis problem from a rule-based NLP perspective: A Lexicon-based sentiment analysis on the IMDB Reviews Dataset.

Let’s start:

Sentiment Analysis on IMDB Dataset

What is IMDB Reviews Dataset?

IMDB Reviews Dataset is a large movie review dataset collected and prepared by Andrew L. Maas from the popular movie rating service, IMDB. The IMDB Reviews dataset is used for binary sentiment classification, whether a review is positive or negative. It contains 25,000 movie reviews for training and 25,000 for testing. All these 50,000 reviews are labeled data that may be used for supervised deep learning. Besides, there is an additional 50,000 unlabeled reviews that we will not use in this case study. In this case study, we will only use the training dataset.

Loading and Processing the Dataset

We will start by loading the IMDB dataset by using Keras’s Data API. However, Keras provides the dataset in the encoded version. Luckily we can also load the index dictionary to decode it to original reviews. The following lines will load the encoded reviews along with the index. We will also create the reverse index for decoding:

Before decoding the entire dataset, let’s see the operation with an example:

Output: this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came...

As you can see, we can decode our encoded reviews using the reversed index. If we can decode one review, for all the reviews, all we need is a for loop. With the code below, we will create a nested list in which we place the sentiment label and the decoded review text. We also need to do error handling due to a typo in the dataset (apparently Keras team encoded one of the words wrong :/). But, the code below handles this error as well:

Finally, we will create a pandas DataFrame from the nested list we created above:


Figure 3. Summary Info on Our IMDB Reviews Dataset | Figure 4. The first 10 Rows of our IMDB Reviews Dataset (Figures by Author) (Note that we skipped the single review with the incorrect encoding)

Now that our data is ready, we can load VADER.

Loading VADER Sentiment Intensity Analyzer

Python’s most advanced NLP tool, NLTK, provides a module for VADER, and we can easily import a SentimentIntensityAnalzer with the following code:

To test how our model works, let’s feed a simple sentence with neutral and negative words:

We use the polarity_score calculator of our SentimentIntensityAnalyzer model. This model gives us four scores: (i) Negativity, (ii) Positivity, (iii) Neutrality score of the sentence, and finally, (iv) Compound sentiment score of the sentence. The compound score is basically an aggregated version of the first three scores, and we will be using this score to measure the sentiment of our reviews. Here is the output of our dummy sentence: “Hello, world. I am terrible”.


Figure 5. VADER Polarity Scores of Our Dummy Sentence (Figure by Author)

Calculating Polarity Scores and Predicting:

Since we successfully calculated the sentiment score of a single sentence, all we have to do is run a loop for the entire dataset. Before calculating the score, although not necessary, I will shuffle the dataset first:

Instead of for loop, we will use a more efficient alternative: apply a lambda function on our dataset column Text and create a new column to save the results with the name Prediction. The following single line does these:

Editing Labels and Creating Accuracy Column:

The code above will simply convert negative compound scores to -1 and positive compound scores to 1. Since our IMDB Reviews dataset has 1 for positive sentiments and 0 for negative sentiments, we will change 0s with -1s so that we can calculate the accuracy of our predictions, which will be in a new column, called Accuracy:

Create a Column for Confusion Matrix

Finally, I want to create a confusion matrix to properly measure our rule-based NLP sentiment classifier’s success. A confusion matrix shows True Positives, True Negatives, False Positives, and False Negatives, which we can use to calculate Accuracy, Recall, Precision, and F1 Scores. With the lines below, I will create a custom function to generate confusion matrix tags and apply them as a lambda function, as I did above:

The code above will create a column, Conf_Matrix, and tag each prediction with abbreviations of True Positives, True Negatives, False Positives, and False Negatives.

Let’s see how the tail of our final DataFrame looks:



Figure 6. The Last 10 Rows of Our Final DataFrame (Figures by Author)

Calculating Accuracy, Recall, Precision, and F1 Score:

To see how we did with our VADER model, I will use several custom formulas to calculate Accuracy, Recall, Precision, and F1 Score. Although there are several API solutions for confusion matrix calculations, I decided to use custom calculations:

Thanks to the Conf_Matrix column, these calculations are straightforward to handle, and here are the results:


Figure 7. Performance Indicators of Lexicon-Based VADER Sentiment Classifier (Figures by Author)

As you can see, without a single second of training or customization, we achieved a 70% accuracy on the movie reviews dataset. Don’t forget that VADER is a social-media lexicon. Therefore, using a movie-reviews-based Lexicon would give us even higher performances.


You have successfully built a sentiment classifier that is based on rule-based NLP. One of the biggest advantages of rule-based NLP methods is that they are fully explainable as opposed to glamorous transformer-based NLP models such as BERT and GPT-3. Therefore, apart from its budget-friendly nature, the model’s explainability is another great reason to rely on rule-based NLP models, especially in sensitive areas.

Sentiment Analysis in 10 Minutes with BERT and TensorFlow

Sentiment Analysis in 10 Minutes with BERT and TensorFlow

Learn the basics of the pre-trained NLP model, BERT, and build a sentiment classifier using the IMDB movie reviews dataset, TensorFlow, and Hugging Face transformers

I prepared this tutorial because it is somehow very difficult to find a blog post with actual working BERT code from the beginning till the end. They are always full of bugs. So, I have dug into several articles, put together their codes, edited them, and finally have a working BERT model. So, just by running the code in this tutorial, you can actually create a BERT model and fine-tune it for sentiment analysis.


Figure 1. Photo by Lukas on Unsplash

Natural language processing (NLP) is one of the most cumbersome areas of artificial intelligence when it comes to data preprocessing. Apart from the preprocessing and tokenizing text datasets, it takes a lot of time to train successful NLP models. But today is your lucky day! We will build a sentiment classifier with a pre-trained NLP model: BERT.

What is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers and it is a state-of-the-art machine learning model used for NLP tasks. Jacob Devlin and his colleagues developed BERT at Google in 2018. Devlin and his colleagues trained the BERT on English Wikipedia (2,500M words) and BooksCorpus (800M words) and achieved the best accuracies for some of the NLP tasks in 2018. There are two pre-trained general BERT variations: The base model is a 12-layer, 768-hidden, 12-heads, 110M parameter neural network architecture, whereas the large model is a 24-layer, 1024-hidden, 16-heads, 340M parameter neural network architecture. Figure 2 shows the visualization of the BERT network created by Devlin et al.



Figure 2. Overall pre-training and fine-tuning procedures for BERT (Figure from the BERT paper)

So, I don’t want to dive deep into BERT since we need a whole different post for that. In fact, I already scheduled a post aimed at comparing rival pre-trained NLP models. But, you will have to wait for a bit.

Additionally, I believe I should mention that although Open AI’s GPT3 outperforms BERT, the limited access to GPT3 forces us to use BERT. But rest assured, BERT is also an excellent NLP model. Here is a basic visual network comparison among rival NLP models: BERT, GPT, and ELMo:


Figure 3. Differences in pre-training model architectures of BERT, GPT, and ELMo (Figure from the BERT paper)

Installing Hugging Face Transformers Library

One of the questions that I had the most difficulty resolving was to figure out where to find the BERT model that I can use with TensorFlow. Finally, I discovered Hugging Face’s Transformers library.

Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, etc in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.

We can easily load a pre-trained BERT from the Transformers library. But, make sure you install it since it is not pre-installed in the Google Colab notebook.

Sentiment Analysis with BERT

Now that we covered the basics of BERT and Hugging Face, we can dive into our tutorial. We will do the following operations to train a sentiment analysis model:

  • Install Transformers library;
  • Load the BERT Classifier and Tokenizer alıng with Input modules;
  • Download the IMDB Reviews Data and create a processed dataset (this will take several operations;
  • Configure the Loaded BERT model and Train for Fine-tuning
  • Make Predictions with the Fine-tuned Model

Let’s get started!

Note that I strongly recommend you to use a Google Colab notebook. If you want to learn more about how you will create a Google Colab notebook, check out this article:

Installing Transformers

Installing the Transformers library is fairly easy. Just run the following pip line on a Google Colab cell:

After the installation is completed, we will load the pre-trained BERT Tokenizer and Sequence Classifier as well as InputExample and InputFeatures. Then, we will build our model with the Sequence Classifier and our tokenizer with BERT’s Tokenizer.

Let’s see the summary of our BERT model:

Here are the results. We have the main BERT model, a dropout layer to prevent overfitting, and finally a dense layer for classification task:



Figure 4. Summary of BERT Model for Sentiment Classification

Now that we have our model, let’s create our input sequences from the IMDB reviews dataset:

IMDB Dataset

IMDB Reviews Dataset is a large movie review dataset collected and prepared by Andrew L. Maas from the popular movie rating service, IMDB. The IMDB Reviews dataset is used for binary sentiment classification, whether a review is positive or negative. It contains 25,000 movie reviews for training and 25,000 for testing. All these 50,000 reviews are labeled data that may be used for supervised deep learning. Besides, there is an additional 50,000 unlabeled reviews that we will not use in this case study. In this case study, we will only use the training dataset.

Initial Imports

We will first have two imports: TensorFlow and Pandas.

Get the Data from the Stanford Repo

Then, we can download the dataset from Stanford’s relevant directory with tf.keras.utils.get_file function, as shown below:

Remove Unlabeled Reviews

To remove the unlabeled reviews, we need the following operations. The comments below explain each operation:

Train and Test Split

Now that we have our data cleaned and prepared, we can create text_dataset_from_directory with the following lines. I want to process the entire data in a single batch. That’s why I selected a very large batch size:

Convert to Pandas to View and Process

Now we have our basic train and test datasets, I want to prepare them for our BERT model. To make it more comprehensible, I will create a pandas dataframe from our TensorFlow dataset object. The following code converts our train Dataset object to train pandas dataframe:

Here is the first 5 row of our dataset:



Figure 5. First 5 Row of Our Dataset

I will do the same operations for the test dataset with the following lines:

Creating Input Sequences

We have two pandas Dataframe objects waiting for us to convert them into suitable objects for the BERT model. We will take advantage of the InputExample function that helps us to create sequences from our dataset. The InputExample function can be called as follows:

Now we will create two main functions:

1 — convert_data_to_examples: This will accept our train and test datasets and convert each row into an InputExample object.

2 — convert_examples_to_tf_dataset: This function will tokenize the InputExample objects, then create the required input format with the tokenized objects, finally, create an input dataset that we can feed to the model.

We can call the functions we created above with the following lines:

Our dataset containing processed input sequences are ready to be fed to the model.

Configuring the BERT model and Fine-tuning

We will use Adam as our optimizer, CategoricalCrossentropy as our loss function, and SparseCategoricalAccuracy as our accuracy metric. Fine-tuning the model for 2 epochs will give us around 95% accuracy, which is great.

Training the model might take a while, so ensure you enabled the GPU acceleration from the Notebook Settings. After our training is completed, we can move onto making sentiment predictions.

Making Predictions

I created a list of two reviews I created. The first one is a positive review, while the second one is clearly negative.

We need to tokenize our reviews with our pre-trained BERT tokenizer. We will then feed these tokenized sequences to our model and run a final softmax layer to get the predictions. We can then use the argmax function to determine whether our sentiment prediction for the review is positive or negative. Finally, we will print out the results with a simple for loop. The following lines do all of these said operations:


Figure 6. Our Dummy Reviews with Their Predictions

Also, with the code above, you can predict as many reviews as possible.


You have successfully built a transformers network with a pre-trained BERT model and achieved ~95% accuracy on the sentiment analysis of the IMDB reviews dataset! If you are curious about saving your model, I would like to direct you to the Keras Documentation. After all, to efficiently use an API, one must learn how to read and use the documentation.

Predict Tomorrow’s Bitcoin (BTC) Price with Recurrent Neural Networks

Predict Tomorrow’s Bitcoin (BTC) Price with Recurrent Neural Networks

Using Recurrent Neural Networks to Predict Next Days Cryptocurrency Prices with TensorFlow and Keras | Supervised Deep Learning

If you are reading this article, I am sure that we share similar interests and are/will be in similar industries. So let’s connect via Linkedin! Please do not hesitate to send a contact request! Orhan G. Yalçın — Linkedin


Photo by Andre Francois on Unsplash

Wouldn’t it be awesome if you were, somehow, able to predict tomorrow’s Bitcoin (BTC) price? As you all know, the cryptocurrency market has experienced tremendous volatility over the last year. The value of Bitcoin reached its peak on December 16, 2017, by climbing to nearly $20,000, and then it has seen a steep decline at the beginning of 2018. Not long ago, though, a year ago, to be precise, its value was almost half of what it is today. Therefore, if we look at the yearly BTC price chart, we may easily see that the price is still high. The fact that only two years ago, BTC’s value was only the one-tenth of its current value is even more shocking. You may personally explore the historical BTC prices using this plot below:

Historical Bitcoin (BTC) Prices by CoinDesk

There are several conspiracies regarding the precise reasons behind this volatility, and these theories are also used to support the prediction reasoning of crypto prices, particularly of BTC. These subjective arguments are valuable to predict the future of cryptocurrencies. On the other hand, our methodology evaluates historical data to predict the cryptocurrency prices from an algorithmic trading perspective. We plan to use numerical historical data to train a recurrent neural network (RNN) to predict BTC prices.

Obtaining the Historical Bitcoin Prices

There are quite a few resources we may use to obtain historical Bitcoin price data. While some of these resources allow the users to download CSV files manually, others provide an API that one can hook up to his code. Since when we train a model using time series data, we would like it to make up-to-date predictions, I prefer to use an API to obtain the latest figures whenever we run our program. After a quick search, I have decided to use’s API, which provides up-to-date coin prices that we can use in any platform.

Recurrent Neural Networks

Since we are using a time series dataset, it is not viable to use a feedforward neural network as tomorrow’s BTC price is most correlated with today’s, not a month ago’s.

A recurrent neural network (RNN) is a class of artificial neural network where connections between nodes form a directed graph along a sequence. — Wikipedia

An RNN shows temporal dynamic behavior for a time sequence, and it can use its internal state to process sequences. In practice, this can be achieved with LSTMs and GRUs layers.

Here you can see the difference between a regular feedforward-only neural network and a recurrent neural network (RNN):

 RNN vs. Regular Nets by Niklas Donges on TowardsDataScience

Our Roadmap

To be able to create a program that trains on the historical BTC prices and predict tomorrow’s BTC price, we need to complete several tasks as follows:

1 — Obtaining, Cleaning, and Normalizing the Historical BTC Prices

2 — Building an RNN with LSTM

3 — Training the RNN and Saving The Trained Model

4 — Predicting Tomorrow’s BTC Price and “Deserialize” It

BONUS: Deserializing the X_Test Predictions and Creating a Chart

Obtaining, Cleaning, and Normalizing the Historical BTC Prices

Obtaining the BTC Data

As I mentioned above, we will use’s API for the BTC dataset and convert it into a Pandas dataframe with the following code:

Obtaining the BTC Prices with CoinRanking API

This function is adjusted for 5-years BTC/USD prices by default. However, you may always change these values by passing in different parameter values.

Cleaning the Data with Custom Functions

After obtaining the data and converting it to a pandas dataframe, we may define custom functions to clean our data, normalize it for a neural network as it is a must for accurate results, and apply a custom train-test split. We created a custom train-test split function (not the scikit-learn’s) because we need to keep the time-series in order for training our RNN properly. We may achieve this with the following code, and you may find further function explanations in the code snippet below:

Defining custom functions for matrix creation, normalizing, and train-test split

After defining these functions, we may call them with the following code:

Calling the defined functions for data cleaning, preparation, and splitting

Building an RNN with LSTM

After preparing our data, it is time for building the model that we will later train by using the cleaned&normalized data. We will start by importing our Keras components and setting some parameters with the following code:

Setting the RNN Parameters in Advance

Then, we will create our Sequential model with two LSTM and two Dense layers with the following code:

Creating a Sequential Model and Filling it with LSTM and Dense Layers

Training the RNN and Saving The Trained Model

Now it is time to train our model with the cleaned data. You can also measure the time spent during the training. Follow these codes:

Training the RNN Model using the Prepared Data

Don’t forget to save it:

Saving the Trained Model

I am keen to save the model and load it later because it is quite satisfying to know that you can actually save a trained model and re-load it to use it next time. This is basically the first step for web or mobile integrated machine learning applications.

Predicting Tomorrow’s BTC Price and “Deserialize” It

After we train the model, we need to obtain the current data for predictions, and since we normalize our data, predictions will also be normalized. Therefore, we need to de-normalize back to their original values. Firstly, we will obtain the data in a similar, partially different, manner with the following code:

Loading the last 30 days’ BTC Prices

We will only have the normalized data for prediction: No train-test split. We will also reshape the data manually to be able to use it in our saved model.

After cleaning and preparing our data, we will load the trained RNN model for prediction and predict tomorrow’s price.

Loading the Trained Model and Making the Prediction

However, our results will vary between -1 and 1, which will not make a lot of sense. Therefore, we need to de-normalize them back to their original values. We can achieve this with a custom function:

We need a deserializer for the Original BTC Prediction Value in USD

After defining the custom function, we will call these function and extract tomorrow’s BTC prices with the following code:

Calling the deserializer and extracting the Price in USD

With the code above, you can actually get the model’s prediction for tomorrow’s BTC prices.

Deserializing the X_Test Predictions and Creating a Chart

You may also be interested in the overall result of the RNN model and prefer to see it as a chart. We can also achieve these by using our X_test data from the training part of the tutorial.

We will start by loading our model (consider this as an alternative to the single prediction case) and making the prediction on X_test data so that we can make predictions for a proper number of days for plotting with the following code:

Loading the Trained Model and Making Prediction Using the X_test Values

Next, we will import Plotly and set the properties for a good plotting experience. We will achieve this with the following code:

Importing Plotly and Setting the Parameters

After setting all the properties, we can finally plot our predictions and observation values with the following code:

Creating a Dataframe and Using it in Plotly’s iPlot

When you run this code, you will come up with the up-to-date version of the following plot: Chart for BTC Price Predictions

How Reliable Are These Results?

As you can see, it does not look bad at all. However, you need to know that even though the patterns match pretty closely, the results are still dangerously apart from each other if you inspect the results on a day-to-day basis. Therefore, the code must be further developed to get better results.


You have successfully created and trained an RNN model that can predict BTC prices, and you even saved the trained model for later use. You may use this trained model on a web or mobile application by switching to Object-Oriented Programming. Pat yourself on the back for successfully developing a model relevant to artificial intelligence, blockchain, and finance. I think it sounds pretty cool to touch these areas all at once with this simple project.

Mastering Word Embeddings in 10 Minutes with TensorFlow

Mastering Word Embeddings in 10 Minutes with TensorFlow

Covering the Basics of Word Embedding, One Hot Encoding, Text Vectorization, Embedding Layers, and an Example Neural Network Architecture for NLP

Figure 1. Photo by Nick Hillier on Unsplash

Listen to the Audio Version

Word embedding is one of the most important concepts in Natural Language Processing (NLP). It is an NLP technique where words or phrases (i.e., strings) from a vocabulary are mapped to vectors of real numbers. The need to map strings into vectors of real numbers originated from computers’ inability to do operations with strings.

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

There are several NLP techniques to convert strings into representative numbers, such as:

  • One Hot Encoding
  • Encoding with a Unique Number
  • Word Embedding

Before diving into word embedding, let’s compare these three options to see why Word embedding is the best.

One Hot Encoding

One-hot encoding can be achieved by creating a vector of zeros with the length of the entire vocabulary. Then, we only place “one” in the index where the word is. For each word, we create the same vector. Below you can see an example of one hot encoding where we encode the sentence, “His dog is two years old”.

 Figure 2. One Hot Encoding for “His dog is two years old” (Figure by Author)

As the vocabulary size increases, so does the size of the vector. One hot encoding is usually regarded as an inefficient method to vectorize strings. As you can see in the example above, most of the values are zero. This approach is considered computationally complex in an unnecessary way.

Encoding with a Unique ID Number

Instead of creating a vector of zeros for each word in a sentence, you may choose to assign each word a unique ID number. Therefore, instead of using strings, you may assign 1 to “dog”, 2 to “his”, 3 to “is”, 4 to “old”, 5 to “two”, 6 to “years”. This way, we can represent our string, “His dog is two years old”, as [2, 1, 3, 5, 6, 4].

 Figure 3. Encoding with a Unique ID Number for “His dog is two years old” (Figure by Author)

As you can see, it is much easier to create. However, it comes with a covenant: These numbers do not have any relational representation since the encoding is arbitrary. There is nothing that stops us from changing the encoding differently. Therefore, these values do not capture a relationship between words. The lack of relational representation makes it difficult for models to interpret these values since the value of the feature (the number assigned) does not mean anything. Therefore, encoding with a unique ID number is also not a very good idea to capture patterns.

Word Embedding

Since one-hot encoding is inefficient and encoding with unique IDs does not offer relational representation, we have to rely on another method. This method is word embedding. As we already covered above, word embedding is the task of converting words into vectors.

 Figure 4. A Basic 3-Dimensional Word Embedding for “His dog is” (Figure by Author)

As you can see above, you can actually see a relational representation of the words. We didn’t have this feature in encoding with a unique number. Furthermore, we can also represent each word with a less complex vector compared to one-hot encoding. The size of our final array is much smaller than a one-hot encoded vocabulary.

Deep Learning for the Vectorization

So, in an ideal world, our goal is to find the perfect vector weights for each word so that their relations can be properly represented. But how are we going to know whether a word is closely related to another? Of course, we have an intuition since we have been learning languages all of our lives. But, our model must receive the training it requires to analyze these words. Therefore, we need a machine learning problem. We can adopt supervised, unsupervised, or semi-supervised learning approaches for word embedding. In this post, we will tackle a supervised learning problem to vectorize our strings.

A Neural Network Architecture for Word Embedding

We need to create a neural network to find ideal vector values for each word. At this stage, what we need is the following:

1 — A Text Vectorization layer for converting Text to 1D Tensor of Integers

2 — An Embedding layer to convert 1D Tensors of Integers into dense vectors of fixed size.

3 — A fully connected neural network for backpropagation and cost function and other deep learning tasks

Text Vectorization Layer

Text vectorization is an experimental layer that offers a lot of value for the text preprocessing automation. The layer does the following procedures:

  • Standardize each sentence with lowercasing and punctuation stripping
  • Split each sentence into words
  • Recombine words into tokens
  • Index these tokens
  • Transform each sentence using the index into a 1D tensor of integer token indices or float values.

Here is a straightforward example of how the TextVectorization layer works.

We first create a vocabulary in line with our examples above.

 Figure 5. The Vocabulary of TextVectorization Layer Example

Then, we create a TextVectorization layer. Then, we create a dummy model with Keras Sequential API. Then, we feed several sentences to create a list of 1D tensors of integers:

 Figure 6. The I/O of the Operations with TextVectorization Layer

The length of the 1D output tensors is three because we set it to three. Feel free to try other values.

Embedding Layer

The embedding layer has a simple capability:

It turns positive integers (indexes) into dense vectors of fixed size.

Let’s see it with a basic example:

I passed the output from the TextVectorization example as input and set the output dimension to two. Therefore, each of our input integers is now represented with a 2-dims vector. While our input shape is (3, 3), our output shape is (3, 3, 2).

 Figure 7. The I/O of the Operations with Embedding Layer

By changing the output dim, you can easily generate a more complex vector space, which would increase the computational complexity, but can potentially capture more pattern.

A Set of Fully Connected Layers

 Figure 8. A Basic Example of A Set of Fully Connected Layers (Figure by Author)

After TextVectorization and Embedding layers, we end up with the desired vector space. But, these values are still very random. Therefore, we need an optimizer that calculates the cost function and adjusts the values with backpropagation. This optimizer should be run on top of a set of fully connected layers. In some situations, you can also add other types of layers such as LSTM, GRU, or Convolution layers. To keep it simple, I will refer to them as a set of fully connected layers.

Note: If your embedding output has -by any chance- variable length, make sure to add a Global Pooling Layer before the set of fully connected layers.

Using a set of fully connected layers, we can feed our strings against labels to adjust the vector values. For example, assume that we have a dataset of sentences with their sentiment labels (e.g., positive or negative). We can then vectorize and embed these sentences using a vocabulary we created from our unique words. We can then create a set of fully connected layers to predict whether their sentiment is positive or negative. We can use the labels to backpropagate our model to optimize these vectors. Our result would look something similar to this:

 Figure 9. Word Embedding created using Word2Vec (Figure by Author on Embedding Projector)

Final Notes

Roughly speaking, a neural network architecture for Word embedding would look like this:

 Figure 10. A Basic Artificial Neural Network Architecture for Word embedding (Figure by Author)

As I mentioned above, a complex word embedding model would require other layers such as Global Pooling, LSTM, GRU, Convolution, and others. However, the structure remains the same.

Now that you have an idea about Word embedding, in Part II of this series, we can actually create our own word embedding using the IMDB Reviews dataset and visualize it as in Figure 9. Check out Part 2:

Mastering Word Embeddings in 10 Minutes with IMDB Reviews

Mastering Word Embeddings in 10 Minutes with IMDB Reviews

Learn the Basics of Text Vectorization, Create a Word Embedding Model trained with a Neural Network on IMDB Reviews Dataset, and Visualize it with TensorBoard Embedding Projector

Figure 1. Photo by Raphael Schaller on Unsplash

This is a follow-up tutorial prepared after Part I of the tutorial, Mastering Word Embeddings in 10 Minutes with TensorFlow, where we introduce several word vectorization concepts such as One Hot Encoding and Encoding with a Unique ID Value. I would highly recommend you to check this tutorial if you are new to natural language processing.

In Part II of the tutorial, we will vectorize our words and trained their values using the IMDB Reviews dataset. This tutorial is our own take on TensorFlow’s tutorial on word embedding. We will train a word embedding using a simple Keras model and the IMDB Reviews dataset. Then, we will visualize them using Embedding Projector.

Let’s start:

Create a New Google Colab Notebook

First of all, you need the environment to start coding. For the sake of simplicity, I recommend you work with Google Colab. It comes with all the libraries pre-installed, and you won’t have to worry about them. All you need is a Google account, and I am sure you have one. So, create a new Colab notebook (see Figure 2) and start coding.

  Figure 2: Create a New Google Colab Notebook

Initial Imports

We will start by importing TensorFlow and os libraries. We will use the os library for some directory level operations we will do below and the TensorFlow library for dataset loading, deep learning models, and text preprocessing.

Download the IMDB Reviews Dataset

IMDB Reviews Dataset is a large movie review dataset collected and prepared by Andrew L. Maas from the popular movie rating service, IMDB. The IMDB Reviews dataset is used for binary sentiment classification, whether a review is positive or negative. It contains 25,000 movie reviews for training and 25,000 for testing. All these 50,000 reviews are labeled data that may be used for supervised deep learning. Besides, there is an additional 50,000 unlabeled reviews that we will not use in this case study. In this case study, we will only use the training dataset.

We can download the dataset from Stanford’s relevant directory with tf.keras.utils.get_file function, as shown below:

Dataset Creation

We need a little bit of housekeeping to create a proper dataset. Let’s start with viewing our main directory with the

As you can see below, we have our train and test folders. For this study, we will only use the /train folder

Figure 3. The Content of Main Directory, “aclImdb”

With the following lines, let’s view what’s under the /train subdirectory:

and here it is:

Figure 4.a. The Content of Sub-Directory, “aclImdb/train”

We have reviews with negative sentiments and positive sentiments. Next step, we will remove theunsup folder, which contains unlabeled reviews. Since we are working on a supervised learning problem in this tutorial, we do not need it.

As you can see in Figure X, we removed the unsup folder thanks to theshutil library:

 Figure 4. b. The Content of Sub-Directory, “aclImdb/train” after we remove the “unsup” folder

Create the Dataset

Now that we cleaned our directory, we can create our Dataset object. For this, we can use thetf.keras.preprocessing.text_dataset_from_directory function. As the name suggests, the text_dataset_from_directory function allows us to create text datasets directly from a directory. We selected an 80/20 train and validation split, but feel free to play around by adjusting the validation_split argument.

As you can see in Figure 5, we have 20,000 reviews for training and 5,000 for validation.

 Figure 5. The volume of Our Train and Validation Dataset

Let’s check how our dataset looks by using the .take() function and run a for-loop. Note that our dataset is a TensorFlow Dataset object. It requires a little more effort to print out its elements. The following line does that:

And here is the results in Figure 6:

 Figure 6. The First Five Reviews from the Training Dataset with Their Sentiment Info in the Beginning

Configure the Dataset

Now, since we are in the realms of deep learning, optimization is essential for a bearable training experience. TensorFlow has an experimental tool that we can use to optimize the workload and shorten the time needed for preprocessing, training, and other parallel operations. We can optimize our pipeline with the following lines:

Text Preprocessing

Now that we created our dataset, it is time to process its elements so that our model can understand them.

Custom Standardization

We will create a custom string standardization function to make the best of standardization. Standardization can be described as a set of preprocessing operations for NLP studies, including lowercasing, tag removal, and punctuation stripping. In the below code, we are achieving exactly these:

Now our data will be more standardized with our custom function.


Since we created our custom standardization function, we can pass it in the TextVectorization layer we import from TensorFlow. TextVectorization is a layer that we use to map our strings to integers. We will pass in our custom standardization function, we will use up to 10,000 unique words (vocabulary), and we will keep a maximum of 100 words for each review. Check the below lines:

We will remove the labels from the train dataset and call the .adapt() function to build the vocabulary to use later on. Note that we haven’t vectorized our dataset yet. Just created the vocabulary with the lines below:

Model Building and Training

We already processed our reviews, and it is time to create our model.

Model Creation

We will make the inial imports, which include Sequential API for model building and Embedding, GlobalAveragePooling, and Dense layers we will use in the model.

We set the embedding dimension to 16, so each word will have 16 representative values. We limit the vocabulary size to 10,000 in parallel with the code above.

We add the following layers to our Keras model:

1 — A TextVectorization layer for converting strings to integers;

2 — AEmbedding layer to convert integer values with 16-dimensional vectors;

3 — A Global Average Pooling 1D layer to resolve the issue of having reviews with different lengths;

4 — A Dense layer with 16 neurons with a relu activation layer

5 — A final Dense layer with 1 neuron to classify if the review has a positive or negative sentiment.

The following lines do all these:

Set Up Callbacks for TensorBoard

Since we want to see how our model evolves and performs over time, we will configure our callback settings with the following lines:

We will use these callbacks to visualize our model performance at each epoch using TensorBoard

Configure the Model

Then, we will configure our model with Adam as optimizer and Binary Crossentropy as loss function because it is a binary classification task and select accuracy as our performance metric.

Start the Training

Now that our model is configured, we can use .fit() function to start the training. We will run for 15 epochs and record the callbacks for TensorBoard.

Here is the screenshot of the training process, as shown in Figure :

Figure 7. Model Performance at Each Epoch

Visualize the Results

Now that we concluded our model training let’s do some visualization to understand better what we built and trained.

The Summary of the Model

We can easily see the summary of our model with the .summary() function, as shown below:

Figure 8 shows how our model looks and lists the number of parameters and output shape for each layer:

 Figure 8. Our Model Summary

Training Performance on TensorBoard

Let’s see how our mode evolved as it trained on the IMDB reviews dataset. We can use TensorFlow’s visualization kit, TensorBoard. TensorBoard can be used for several machine learning visualization tasks such as:

  • Tracking and visualization loss and accuracy measures
  • Visualizing the model graph
  • Viewing the evolution of weights, biases, and other tensor values
  • Displaying images, text, and audio data
  • and more…
 Figure 9. A Screenshot of Our Tensorboard Instance

In this tutorial, we will use %load_ext to load TensorBoard and view the logs. The lines above will run a small server within our cell to visualize our metric values over time.

As you can see on the left, our accuracy increases over time while our loss values decrease. Figure 9 shows that our model does what it is supposed to do because decreasing loss value means that our model is doing something to lower its mistakes: learning.

Visualization with Embedding Projector

Our model looks nice and it learned a lot in just 15 epochs. But, the main goal of this tutorial to create a word embedding. We will not predict review sentiments in this tutorial. Instead, we will visualize our word embedding cloud using Embedding Projector.

Embedding Projector is a tool built on top of TensorBoard. It is a useful tool to analyze data and visualize the position of embedding values relative to one another. Using Embedding Projector, we can graphically represent high dimensional embedding by simplifying them using algorithms like PCA.

Get the Vector Values and Vocabulary Data

We will start by getting our 16-dimensional embedding values for each word. Also, we will get a list of all these words we embedded. We can achieve these two tasks with the following code:

Let’s see how our word and its vector values look with a random example. We selected the word with index no. 500 and visualize it with the following code:

The vector values and the corresponding word for index no. 500 is shown in Figure 10:

 Figure 10. A Random Example of Word-Vector Pair

Feel free to change the index value to view other words with their vector values.

Save the Data to New Files

Now we have the entire list of words (vocabulary) with their corresponding 16-dimensional vector values. We will save word names to the metadata.tsv file and vector values to the vectors.tsv file. The following lines create new files, write our data to these new files, save the data, close the files, and download them to your local machine:

Load to Embedding Projector

Now we visit the Embedding Projector website:

Then, we click the “Load” button on the left to load our vectors.tsv and metadata.tsv files. Then, we can click anywhere outside of the popup window.

and, voilà!

 Figure 11. Our Word Embedding Trained on IMDB Reviews Dataset

Note that Embedding Projectors runs a PCA algorithm to reduce the 16-dimensional vector space into 3-dimensional since this is the only way to visualize it.


You have successfully built a neural network to train a word embedding model, and it takes a lot of effort to achieve this. Pat yourself on the back and keep improving yourself in the field of natural language processing, as there are many unsolved problems.

Kaggle’s Titanic Competition in 10 Minutes | Part-III

Kaggle’s Titanic Competition in 10 Minutes | Part-III

Using Natural Language Processing (NLP), Deep Learning, and GridSearchCV in Kaggle’s Titanic Competition | Machine Learning Tutorials

Figure 1. Titanic Under Construction on Unsplash

If you follow my tutorial series on Kaggle’s Titanic Competition (Part-I and Part-II) or have already participated in the Competition, you are familiar with the whole story. If you are not familiar with it, since this is a follow-up tutorial, I strongly recommend you to check out the Competition Page or Part-I and Part-II of this tutorial series. In Part-III (Final) of the series, (i) we will use natural language processing (NLP) techniques to obtain the titles of the passengers, (ii) create an Artificial Neural Network (ANN or RegularNet) to train the model, and (iii) use Grid Search Cross-Validation to tune the ANN so that we get the best results.

Let’s start!


Part-I of the Tutorial

Throughout this tutorial series, we try to keep things simple and develop the story slowly and clearly. In Part-I of the tutorial, we learned to write a python program with less than 20 lines to enter the Kaggle’s Competition. Things were kept as simple as possible. We cleaned the non-numerical parts, took care of the null values, trained our model using the train.csv file, predicted the passenger’s survival in the test.csv file, and saved it as a CSV file for submission.

Part-II of the Tutorial

Since we did not explore the dataset properly in Part-I, we focus on data exploration in Part-II using Matplotlib and Seaborn. We impute the null values instead of dropping them by using aggregated functions, better cleaned the data, and finally generated the dummy variables from the categorical variables. Then, we use a RandomForestClassifier model instead of LogisticRegression, which also improves precision. We achieve an approximately 20% increase in precision compared to the model in Part-I.

Part-III of the Tutorial

Figure 2. A Diagram of an Artificial Neural Network with one Hidden Layer (Figure by Author)

We will now use the Name column to derive the passengers’ titles, which played a significant role in their survival chances. We will also create an Artificial Neural Network (ANN or RegularNets) with Keras to obtain better results. Finally, to tune the ANN model, we will use GridSearchCV to detect the best parameters. Finally, we will generate a new CSV file for submission.

Preparing the Dataset

Like what we have done in Part-I and Part-II, I will start cleaning data and imputing the null values. This time, we will adopt a different approach and combine the two datasets for cleaning and imputing. We already covered why we impute the null values the way we do in Part-II; therefore, we will give you the code straight away. If you feel that some operations do not make sense, you may refer to Part-II or comment below. However, since we saw in Part-II that people younger than 18 had a greater chance of survival, we should add a new feature to measure this effect.

Data Cleaning and Null Value Imputation

Deriving Passenger Titles with NLP

We will drop the unnecessary columns and generate the dummy variables from the categorical variables above. But first, we need to extract the titles from the ‘Name’ column. To understand what we are doing, we will start by running the following code to get the first 10 rows Name column values.

Name Column Values of the First 10 Rows

And here is what we get:

Figure 3. Name Column Values of the First 10 Row (Figure by Author)

The structure of the name column value is as follows:


Therefore, we need to split these String based on the dot and comma and extract the title. We can accomplish this with the following code:

Splitting the Name Values to Extract Titles

Once we run this code, we will have a Title column with titles in it. To be able to see what kind of titles do we have, we will run this:

Grouping Titles and Get the Counts

Figure 4. Unique Title Counts (Figure by Author)

It seems that we have four major groups: ‘Mr’, ‘Mrs’, ‘Miss’, ‘Master’, and others. However, before grouping all the other titles as Others, we need to take care of the French titles. We need to convert them to their corresponding English titles with the following code:

French to English Title Converter

Now, we only have officers and royal titles. It makes sense to combine them as Others. We can achieve this with the following code:

Combining All the Non-Major Titles as Others (Contains Officer and Royal Titles)

Figure 5. Final Unique Title Counts (Figure by Author)

Final Touch on Data Preparation

Now that our Titles are more manageable, we can create dummies and drop the unnecessary columns with the following code:

Final Touch on Data Preparation

Creating an Artificial Neural Network for Training

Figure 6. A Diagram of an Artificial Neural Network with Two Hidden Layers (Figure by Author)

Standardizing Our Data with Standard Scaler

To get a good result, we must scale our data by using Scikit Learn’s Standard Scaler. Standard Scaler standardizes features by removing the mean and scaling to unit variance (i.e., standardization), which is different than MinMaxScaler. The mathematical difference between Standardization and Normalizer is as follows:

Figure 7. Standardization vs. Normalization (Figure by Author)

We will choose StandardScaler() for scaling our dataset and run the following code:

Scaling Train and Test Datasets

Building the ANN Model

After standardizing our data, we can start building our artificial neural network. We will create one Input Layer (Dense), one Output Layer (Dense), and one Hidden Layer (Dense). After each layer until the Output Layer, we will apply 0.2 Dropout for regularization to fight over-fitting. Finally, we will build the model with Keras Classifier to apply GridSearchCV on this neural network. As we have 14 explanatory variables, our input_dimension must be equal to 14. Since we will make binary classification, our final output layer must output a single value for Survived or Not-Survived classification. The other units in between are “try-and-see” values, and we selected 128 neurons.

Building an ANN with Keras Classifier

Grid Search Cross-Validation

After building the ANN, we will use scikit-learn GridSearchCV to find the best parameters and tune our ANN to get the best results. We will try different optimizers, epochs, and batch_sizes with the following code.

Grid Search with Keras Classifier

After running this code and printing out the best parameters, we get the following output:

Figure 8. Best Parameters and the Accuracy

Please note that we did not activate the Cross-Validation in the GridSearchCV. If you would like to add cross-validation functionality to GridSearchCV, select a cv value inside the GridSearch (e.g., cv=5).

Fitting the Model with Best Parameters

Now that we found the best parameters, we can re-create our classifier with the best parameter values and fit our training dataset with the following code:

Fitting with the Best Parameters

Since we obtain the prediction, we may conduct the final operations to make it ready for submission. One thing to note is that our ANN gives us the probabilities of survival, which is a continuous numerical variable. However, we need a binary categorical variable. Therefore, we are also making the necessary operation with the lambda function below to convert the continuous values to binary values (0 or 1) and writing the results to a CSV file.

Creating the Submission File


Figure 9. Deep Learning vs. Older Algorithms (Figure by Author)

You have created an artificial neural network to classify the Survivals of titanic passengers. Neural Networks are proved to outperform all the other machine learning algorithms as long as there is a large volume of data. Since our dataset only consists of 1309 lines, some machine learning algorithms such as Gradient Boosting Tree or Random Forest with good tuning may outperform neural networks. However, for datasets with large volumes, this will not be the case, as you may see on the chart below:

I would say that Titanic Dataset may be on the left side of the intersection of where older algorithms outperform deep learning algorithms. However, we will still achieve an accuracy rate higher than 80%, around the natural accuracy level.