Subscribe to Get All the Blog Posts and Colab Notebooks 

Sentiment Analysis in 10 Minutes with BERT and TensorFlow

Sentiment Analysis in 10 Minutes with BERT and TensorFlow

Learn the basics of the pre-trained NLP model, BERT, and build a sentiment classifier using the IMDB movie reviews dataset, TensorFlow, and Hugging Face transformers

I prepared this tutorial because it is somehow very difficult to find a blog post with actual working BERT code from the beginning till the end. They are always full of bugs. So, I have dug into several articles, put together their codes, edited them, and finally have a working BERT model. So, just by running the code in this tutorial, you can actually create a BERT model and fine-tune it for sentiment analysis.


Figure 1. Photo by Lukas on Unsplash

Natural language processing (NLP) is one of the most cumbersome areas of artificial intelligence when it comes to data preprocessing. Apart from the preprocessing and tokenizing text datasets, it takes a lot of time to train successful NLP models. But today is your lucky day! We will build a sentiment classifier with a pre-trained NLP model: BERT.

What is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers and it is a state-of-the-art machine learning model used for NLP tasks. Jacob Devlin and his colleagues developed BERT at Google in 2018. Devlin and his colleagues trained the BERT on English Wikipedia (2,500M words) and BooksCorpus (800M words) and achieved the best accuracies for some of the NLP tasks in 2018. There are two pre-trained general BERT variations: The base model is a 12-layer, 768-hidden, 12-heads, 110M parameter neural network architecture, whereas the large model is a 24-layer, 1024-hidden, 16-heads, 340M parameter neural network architecture. Figure 2 shows the visualization of the BERT network created by Devlin et al.



Figure 2. Overall pre-training and fine-tuning procedures for BERT (Figure from the BERT paper)

So, I don’t want to dive deep into BERT since we need a whole different post for that. In fact, I already scheduled a post aimed at comparing rival pre-trained NLP models. But, you will have to wait for a bit.

Additionally, I believe I should mention that although Open AI’s GPT3 outperforms BERT, the limited access to GPT3 forces us to use BERT. But rest assured, BERT is also an excellent NLP model. Here is a basic visual network comparison among rival NLP models: BERT, GPT, and ELMo:


Figure 3. Differences in pre-training model architectures of BERT, GPT, and ELMo (Figure from the BERT paper)

Installing Hugging Face Transformers Library

One of the questions that I had the most difficulty resolving was to figure out where to find the BERT model that I can use with TensorFlow. Finally, I discovered Hugging Face’s Transformers library.

Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, etc in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.

We can easily load a pre-trained BERT from the Transformers library. But, make sure you install it since it is not pre-installed in the Google Colab notebook.

Sentiment Analysis with BERT

Now that we covered the basics of BERT and Hugging Face, we can dive into our tutorial. We will do the following operations to train a sentiment analysis model:

  • Install Transformers library;
  • Load the BERT Classifier and Tokenizer alıng with Input modules;
  • Download the IMDB Reviews Data and create a processed dataset (this will take several operations;
  • Configure the Loaded BERT model and Train for Fine-tuning
  • Make Predictions with the Fine-tuned Model

Let’s get started!

Note that I strongly recommend you to use a Google Colab notebook. If you want to learn more about how you will create a Google Colab notebook, check out this article:

Installing Transformers

Installing the Transformers library is fairly easy. Just run the following pip line on a Google Colab cell:

After the installation is completed, we will load the pre-trained BERT Tokenizer and Sequence Classifier as well as InputExample and InputFeatures. Then, we will build our model with the Sequence Classifier and our tokenizer with BERT’s Tokenizer.

Let’s see the summary of our BERT model:

Here are the results. We have the main BERT model, a dropout layer to prevent overfitting, and finally a dense layer for classification task:



Figure 4. Summary of BERT Model for Sentiment Classification

Now that we have our model, let’s create our input sequences from the IMDB reviews dataset:

IMDB Dataset

IMDB Reviews Dataset is a large movie review dataset collected and prepared by Andrew L. Maas from the popular movie rating service, IMDB. The IMDB Reviews dataset is used for binary sentiment classification, whether a review is positive or negative. It contains 25,000 movie reviews for training and 25,000 for testing. All these 50,000 reviews are labeled data that may be used for supervised deep learning. Besides, there is an additional 50,000 unlabeled reviews that we will not use in this case study. In this case study, we will only use the training dataset.

Initial Imports

We will first have two imports: TensorFlow and Pandas.

Get the Data from the Stanford Repo

Then, we can download the dataset from Stanford’s relevant directory with tf.keras.utils.get_file function, as shown below:

Remove Unlabeled Reviews

To remove the unlabeled reviews, we need the following operations. The comments below explain each operation:

Train and Test Split

Now that we have our data cleaned and prepared, we can create text_dataset_from_directory with the following lines. I want to process the entire data in a single batch. That’s why I selected a very large batch size:

Convert to Pandas to View and Process

Now we have our basic train and test datasets, I want to prepare them for our BERT model. To make it more comprehensible, I will create a pandas dataframe from our TensorFlow dataset object. The following code converts our train Dataset object to train pandas dataframe:

Here is the first 5 row of our dataset:



Figure 5. First 5 Row of Our Dataset

I will do the same operations for the test dataset with the following lines:

Creating Input Sequences

We have two pandas Dataframe objects waiting for us to convert them into suitable objects for the BERT model. We will take advantage of the InputExample function that helps us to create sequences from our dataset. The InputExample function can be called as follows:

Now we will create two main functions:

1 — convert_data_to_examples: This will accept our train and test datasets and convert each row into an InputExample object.

2 — convert_examples_to_tf_dataset: This function will tokenize the InputExample objects, then create the required input format with the tokenized objects, finally, create an input dataset that we can feed to the model.

We can call the functions we created above with the following lines:

Our dataset containing processed input sequences are ready to be fed to the model.

Configuring the BERT model and Fine-tuning

We will use Adam as our optimizer, CategoricalCrossentropy as our loss function, and SparseCategoricalAccuracy as our accuracy metric. Fine-tuning the model for 2 epochs will give us around 95% accuracy, which is great.

Training the model might take a while, so ensure you enabled the GPU acceleration from the Notebook Settings. After our training is completed, we can move onto making sentiment predictions.

Making Predictions

I created a list of two reviews I created. The first one is a positive review, while the second one is clearly negative.

We need to tokenize our reviews with our pre-trained BERT tokenizer. We will then feed these tokenized sequences to our model and run a final softmax layer to get the predictions. We can then use the argmax function to determine whether our sentiment prediction for the review is positive or negative. Finally, we will print out the results with a simple for loop. The following lines do all of these said operations:


Figure 6. Our Dummy Reviews with Their Predictions

Also, with the code above, you can predict as many reviews as possible.


You have successfully built a transformers network with a pre-trained BERT model and achieved ~95% accuracy on the sentiment analysis of the IMDB reviews dataset! If you are curious about saving your model, I would like to direct you to the Keras Documentation. After all, to efficiently use an API, one must learn how to read and use the documentation.

Fast Neural Style Transfer in 5 Minutes with TensorFlow Hub & Magenta

Fast Neural Style Transfer in 5 Minutes with TensorFlow Hub & Magenta

Transferring van Gogh’s Unique Style to Photos with Magenta’s Arbitrary Image Stylization Network and Deep Learning

Before we start the tutorial: If you are reading this article, we probably share similar interests and are/will be in similar industries. So let’s connect via Linkedin! Please do not hesitate to send a contact request! Orhan G. Yalçın — Linkedin

 Figure 1. A Neural Style Transfer Example made with Arbitrary Image Stylization Network

I am sure you have come across to deep learning projects on transferring styles of famous painters to new photos. Well, I have been thinking about working on a similar project, but I realized that you can make neural style transfer within minutes, like the one in Figure 1. I will show you how in a second. But, let’s cover some basics first:

Neural Style Transfer (NST)

Neural style transfer is a method to blend two images and create a new image from a content image by copying the style of another image, called style image. This newly created image is often referred to as the stylized image.

History of NST

Image stylization is a two-decade-old problem in the field of non-photorealistic rendering. Non-photorealistic rendering is the opposite of photorealism, which is the study of reproducing an image as realistically as possible. The output of a neural style transfer model is an image that looks similar to the content image but in painting form in the style of the style image.

 Figure 2. Original Work of Leon Gatys on CV-Foundation

Neural style transfer (NST) was first published in the paper “A Neural Algorithm of Artistic Style” by Gatys et al., originally released in 2015. The novelty of the NST method was the use of deep learning to separate the representation of the content of an image from its style of depiction. To achieve this, Gatys et al. used VGG-19 architecture, which was pre-trained on the ImageNet dataset. Even though we can build a custom model following the same methodology, for this tutorial, we will benefit from the models provided in TensorFlow Hub.

Image Analogy

Before the introduction of NST, the most prominent solution to image stylization was the image analogy method. Image Analogy is a method of creating a non-photorealistic rendering filter automatically from training data. In this process, the transformation between photos (A) and non-photorealistic copies (A’) are learned. After this learning process, the model can produce a non-photorealistic copy (B’) from another photo (B). However, NST methods usually outperform image analogy due to the difficulty of finding training data for the image analogy models. Therefore, we can talk about the superiority of NST over image analogy in real-world applications, and that’s why we will focus on the application of an NST model.

 Figure 3. Photo by Jonathan Cosens on Unsplash

Is it Art?

Well, once we build the model, you will see that creating non-photorealistic images with Neural Style Transfer is a very easy task. You can create a lot of samples by blending beautiful photos with the paintings of talented artists. There has been a discussion about whether these outputs are regarded as art because of the little work the creator needs to add to the end product. Feel free to build the model, generate your samples, and share your thoughts in the comments section.

Now that you know the basics of Neural Style Transfer, we can move on to TensorFlow Hub, the repository that we use for our NST work.

TensorFlow Hub

TensorFlow Hub is a collection of trained machine learning models that you can use with ease. TensorFlow’s official description for the Hub is as follows:

TensorFlow Hub is a repository of trained machine learning models ready for fine-tuning and deployable anywhere. Reuse trained models like BERT and Faster R-CNN with just a few lines of code.

Apart from pre-trained models such as BERT or Faster R-CNN, there are a good amount of pre-trained models. The one we will use is Magenta’s Arbitrary Image Stylization network. Let’s take a look at what Magenta is.

Magenta and Arbitrary Image Stylization

What is Magenta?

 Figure 4. Magenta Logo on Magenta

Magenta is an open-source research project, backed by Google, which aims to provide machine learning solutions to musicians and artists. Magenta has support in both Python and Javascript. Using Magenta, you can create songs, paintings, sounds, and more. For this tutorial, we will use a network trained and maintained by the Magenta team for Arbitrary Image Stylization.

Arbitrary Image Stylization

After observing that the original work for NST proposes a slow optimization for style transfer, the Magenta team developed a fast artistic style transfer method, which can work in real-time. Even though the customizability of the model is limited, it is satisfactory enough to perform a non-photorealistic rendering work with NST. Arbitrary Image Stylization under TensorFlow Hub is a module that can perform fast artistic style transfer that may work on arbitrary painting styles.

By now, you already know what Neural Style Transfer is. You also know that we will benefit from the Arbitrary Image Stylization module developed by the Magenta team, which is maintained in TensorFlow Hub.

Now it is time to code!

Get the Image Paths

 Figure 5. Photo by Paul Hanaoka on Unsplash

We will start by selecting two image files. I will directly load these image files from URLs. You are free to choose any photo you want. Just change the filename and URL in the code below. The content image I selected for this tutorial is the photo of a cat staring at the camera, as you can see in Figure 5.

 Figure 6. Bedroom in Arles by Vincent van Gogh

I would like to transfer the style of van Gogh. So, I chose one of his famous paintings: Bedroom in Arles, which he painted in 1889 while staying in Arles, Bouches-du-Rhône, France. Again, you are free to choose any painting of any artist you want. You can even use your own drawings.

The below code sets the path to get the image files shown in Figure 5 and Figure 6.


Custom Function for Image Scaling

  One thing I noticed that, even though we are very limited with model customization, by rescaling the images, we can change the style transferred to the photo. In fact, I found out that the smaller the images, the better the model transfers the style. Just play with the max_dim parameter if you would like to experiment. Just note that a larger max_dim means, it will take slightly longer to generate the stylized image.  


We will call the img_scaler function below, inside the load_img function.

Custom Function for Preprocessing the Image

Now that we set our image paths to load and img_scaler function to scale the loaded image, we can actually load our image files with the custom function below.

Every line in the Gist below is explained with comments. Please read carefully.


    Now our custom image loading function, load_img, is also created. All we have to do is to call it.  

Load the Content and Style Images

  For content image and style image, we need to call the load_img function once and the result will be a 4-dimensional Tensor, which is what will be required by our model below. The below lines is for this operation.  


Now that we successfully loaded our images, we can plot them with matplotlib, as shown below:


and here is the output:


Figure 7. Content Image on the Left (Photo by Paul Hanaoka on Unsplash) | Style Image on the Right (Bedroom in Arles by Vincent van Gogh)

You are not gonna believe this, but the difficult part is over. Now we can create our network and pass these image Tensors as arguments for NST operation.

Load the Arbitrary Image Stylization Network

We need to import the tensorflow_hub library so that we can use the modules containing the pre-trained models. After importing tensorflow_hub, we can use the load function to load the Arbitrary Image Stylization module as shown below. Finally, as shown in the documentation, we can pass the content and style images as arguments in tf.constant object format. The module returns our stylized image in an array format.

All we have to do is to use this array and plot it with matplotlib. The below lines create a plot free from all the axis and large enough for you to review the image.

… And here is our stylized image:


Figure 8. Paul Hanaoka’s Photo after Neural Style Transfer

Figure 9 summarizes what we have done in this tutorial:

  Figure 9. A Neural Style Transfer Example made with Arbitrary Image Stylization Network


As you can see, with a minimal amount of code (we did not even train a model), we did a pretty good Neural Style Transfer on a random image we took from Unsplash using a painting from Vincent van Gogh. Try different photos and paintings to discover the capabilities of the Arbitrary Image Stylization network. Also, play around with max_dim size, you will see that the style transfer changes to a great extent.

3 Pre-Trained Model Series to Use for NLP with Transfer Learning

3 Pre-Trained Model Series to Use for NLP with Transfer Learning

Using State-of-the-Art Pre-trained Neural Network Models (OpenAI’s GPTs, BERTs, ELMos) to Tackle Natural Language Processing Problems with Transfer Learning


Figure 1. Photo by Safar Safarov on Unsplash

Before we start, if you are reading this article, I am sure that we share similar interests and are/will be in similar industries. So let’s connect via Linkedin! Please do not hesitate to send a contact request! Orhan G. Yalçın — Linkedin

If you have been trying to build machine learning models with high accuracy; but never tried Transfer Learning, this article will change your life. At least, it did mine!


Figure 2. A Depiction of Transfer Learning Logic (Figure by Author)

Note that this post is also a follow-up post of a post on Transfer Learning for Computer vision tasks. It has started to gain popularity, and now I wanted to share the NLP version of that with you. But, just in case, check it out:

Most of us have already tried several machine learning tutorials to grasp the basics of neural networks. These tutorials helped us understand the basics of artificial neural networks such as Recurrent Neural Networks, Convolutional Neural Networks, GANs, and Autoencoders. But, their main functionality was to prepare you for real-world implementations.

Now, if you are planning to build an AI system that utilizes deep learning, you have to either

  • have deep pockets for training and excellent AI researchers at your disposal*; or
  • benefit from transfer learning.
* According to BD Tech Talks, the training cost of OpenAI's GPT3 exceeded US $4.6 million dollars.

What is Transfer Learning?

Transfer learning is a subfield of machine learning and artificial intelligence, which aims to apply the knowledge gained from one task (source task) to a different but similar task (target task). In other words:

Transfer learning is the improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned.

For example, the knowledge gained while learning to classify Wikipedia texts can help tackle legal text classification problems. Another example would be using the knowledge gained while learning to classify cars to recognize the birds in the sky. As you can see, there is a relation between these examples. We are not using a text classification model on bird detection.

In summary, transfer learning saves us from reinventing the wheel, meaning we don’t waste time doing the things that have already been done by a major company. Thanks to transfer learning, we can build AI applications in a very short amount of time.

History of Transfer Learning

The history of Transfer Learning dates back to 1993. With her paper, Discriminability-Based Transfer between Neural Networks, Lorien Pratt opened the pandora’s box and introduced the world to the potential of transfer learning. In July 1997, the journal Machine Learning published a special issue for transfer learning papers. As the field advanced, adjacent topics such as multi-task learning were also included under the field of transfer learning. Learning to Learn is one of the pioneer books in this field. Today, transfer learning is a powerful source for tech entrepreneurs to build new AI solutions and researchers to push machine learning frontiers.

To show the power of transfer learning, we can quote from Andrew Ng:

Transfer learning will be the next driver of machine learning’s commercial success after supervised learning.


Figure 3. A Depiction of Commercial Potential of Learning Approaches (Figure by Author)

There are three requirements to achieve transfer learning:

  • Development of an Open Source Pre-trained Model by a Third Party
  • Repurposing the Model
  • Fine Tuning for the Problem

Development of an Open Source Pre-trained Model

A pre-trained model is a model created and trained by someone else to solve a similar problem. In practice, someone is almost always a tech giant or a group of star researchers. They usually choose a very large dataset as their base datasets, such as ImageNet or the Wikipedia Corpus. Then, they create a large neural network (e.g., VGG19 has 143,667,240 parameters) to solve a particular problem (e.g., this problem is image classification for VGG19). Of course, this pre-trained model must be made public so that we can take it and repurpose it.

Repurposing the Model

After getting our hands on these pre-trained models, we repurpose the learned knowledge, which includes the layers, features, weights, and biases. There are several ways to load a pre-trained model into our environment. In the end, it is just a file/folder which contains the relevant information. Deep learning libraries already host many of these pre-trained models, which makes them more accessible and convenient:

You can use one of the sources above to load a trained model. It will usually come with all the layers and weights, and you can edit the network as you wish. Additionally, some research labs maintain their own repos, as you will see for ELMo later in this post.

Fine-Tuning for the Problem

Well, while the current model may work for our problem. It is often better to fine-tune the pre-trained model for two reasons:

  • So that we can achieve even higher accuracy;
  • Our fine-tuned model can generate the output in the correct format.

Generally speaking, in a neural network, while the bottom and mid-level layers usually represent general features, the top layers represent the problem-specific features. Since our new problem is different than the original problem, we tend to drop the top layers. By adding layers specific to our problems, we can achieve higher accuracy.

After dropping the top layers, we need to place our own layers so that we can get the output we want. For example, a model trained with English Wikipedia such as BERT can be customized by adding additional layers and further trained with the IMDB Reviews dataset to predict movie reviews sentiments.

After adding our custom layers to the pre-trained model, we can configure it with special loss functions and optimizers and fine-tune it with extra training.

For a quick Transfer Learning tutorial, you may visit the post below:

3 Popular Pre-Trained Model Series for Natural Language Processing

Here are the three pre-trained network series you can use for natural language processing tasks ranging from text classification, sentiment analysis, text generation, word embedding, machine translation, and so on:


Figure 4. Overall Network Comparison for BERT, OpenAI GPT, ELMo (Figure from the BERT paper)

While BERT and OpenAI GPT are based on transformers network, ELMo takes advantage of bidirectional LSTM network.

Ok, let’s dive into them one-by-one.

Open AI GPT Series (GPT-1, GPT-2, and GPT-3)

There are three generations of GPT models created by OpenAI. GPT, which stands for Generative Pre-trained Transformers, is an autoregressive language model that uses deep learning to produce human-like text. Currently, the most advanced GPT available is GPT-3; and the most complex version of GPT-3 has over 175 billion parameters. Before the release of GPT-3 in May 2020, the most complex pre-trained NLP model was Microsoft’s Turing NLG.

GPT-3 can create very realistic text, which is sometimes difficult to distinguish from the human-generated text. That’s why the engineers warned of the GPT-3’s potential dangers and called for risk mitigation research. Here is a video about 14 cool apps built on GPT-3:

As opposed to most other pre-trained NLP models, OpenAI chose not to share the GPT-3’s source code. Instead, they allowed invitation-based API access, and you can apply for a license by visiting their website. Check it out:

On September 22, 2020, Microsoft announced it had licensed “exclusive” use of GPT-3. Therefore, while others have to rely on the API to receive output, Microsoft has control of the source code. Here is brief info about its size and performance:

  • Year Published: 2020 (GPT-3)
  • Size: Unknown
  • Q&A: F1-Scores of 81.5 in zero-shot, 84.0 in one-shot, 85.0 in few-shot learning
  • TriviaAQ: Accuracy of 64.3%
  • LAMBADA: Accuracy of 76.2%
  • Number of Parameters: 175,000,000,000

BERTs (BERT, RoBERTa (by Facebook), DistilBERT, and XLNet)

BERT stands for Bidirectional Encoder Representations from Transformers, and it is a state-of-the-art machine learning model used for NLP tasks. Jacob Devlin and his colleagues developed BERT at Google in 2018. Devlin and his colleagues trained the BERT on English Wikipedia (2.5B words) and BooksCorpus (0.8B words) and achieved the best accuracies for some of the NLP tasks in 2018. There are two pre-trained general BERT variations: The base model is a 12-layer, 768-hidden, 12-heads, 110M parameter neural network architecture, whereas the large model is a 24-layer, 1024-hidden, 16-heads, 340M parameter neural network architecture. Figure 2 shows the visualization of the BERT network created by Devlin et al.


Figure 5. Overall pre-training and fine-tuning procedures for BERT (Figure from the BERT paper)

Even though BERT seems more inferior to GPT-3, the availability of source code to the public makes the model much more popular among developers. You can easily load a BERT variation for your NLP task using the Hugging Face’s Transformers library. Besides, there are several BERT variations, such as original BERT, RoBERTa (by Facebook), DistilBERT, and XLNet. Here is a helpful TDS post on their comparison:

Here is brief info about BERT’s size and performance:

  • Year Published: 2018
  • Size: 440 MB (BERT Baseline)
  • GLUE Benchmark: Average accuracy of 82.1%
  • SQuAD v2.0: Accuracy of 86.3%
  • Number of Parameters: 110,000,000–340,000,000

ELMo Variations

ELMo, short for Embeddings from Language Models, is a word embedding system for representing words and phrases as vectors. ELMo models the syntax and semantic of words as well as their linguistic context, and it was developed by the Allen Institute for Brain Science. There several variations of ELMo, and the most complex ELMo model (ELMo 5.5B) was trained on a dataset of 5.5B tokens consisting of Wikipedia (1.9B) and all of the monolingual news crawl data from WMT 2008–2012 (3.6B). While both BERT and GPT models are based on transformation networks, ELMo models are based on bi-directional LSTM networks.

Here is brief info about ELMo’s size and performance:

  • Year Published: 2018
  • Size: 357 MB (ELMo 5.5B)
  • SQuAD: Accuracy of 85.8%
  • NER: Accuracy of 92.2%
  • Number of Parameters: 93,600,000

Just like BERT models, we also have access to ELMo source code. You can download the different variations of ELMos from Allen NLP’s Website:

Other Pre-Trained Models for Computer Vision Problems

Although there are several other pre-trained NLP models available in the market (e.g., GloVe), GPT, BERT, and ELMo are currently the best pre-trained models out there. Since this post aims to introduce these models, we will not have a code-along tutorial. But, I will share several tutorials where we exploit these very advanced pre-trained NLP models.


In a world where we have easy access to state-of-the-art neural network models, trying to build your own model with limited resources is like trying to reinvent the wheel. It is pointless.

Instead, try to work with these train models, add a couple of new layers on top considering your particular natural language processing task, and train. The results will be much more successful than a model you build from scratch.