Using State-of-the-Art Pre-trained Neural Network Models (OpenAI’s GPTs, BERTs, ELMos) to Tackle Natural Language Processing Problems with Transfer Learning


Figure 1. Photo by Safar Safarov on Unsplash

Before we start, if you are reading this article, I am sure that we share similar interests and are or will be in similar industries. So let’s connect via LinkedIn! Please do not hesitate to send a contact request! Orhan G. Yalçın (LinkedIn)

If you have been trying to build machine learning models with high accuracy but have never tried transfer learning, this article will change your life. At least, it did mine!


Figure 2. A Depiction of Transfer Learning Logic (Figure by Author)

Note that this post is a follow-up to an earlier post on transfer learning for computer vision tasks. That post has started to gain popularity, so I wanted to share the NLP version with you. But, just in case, check it out:

Most of us have already tried several machine learning tutorials to grasp the basics of neural networks. These tutorials helped us understand the basics of artificial neural networks such as Recurrent Neural Networks, Convolutional Neural Networks, GANs, and Autoencoders. But their main purpose was to prepare us for real-world implementations.

Now, if you are planning to build an AI system that utilizes deep learning, you have to either

  • have deep pockets for training and excellent AI researchers at your disposal*; or
  • benefit from transfer learning.
* According to BD Tech Talks, the training cost of OpenAI's GPT-3 exceeded US $4.6 million.

What is Transfer Learning?

Transfer learning is a subfield of machine learning and artificial intelligence, which aims to apply the knowledge gained from one task (source task) to a different but similar task (target task). In other words:

Transfer learning is the improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned.

For example, the knowledge gained while learning to classify Wikipedia texts can help tackle legal text classification problems. Another example would be using the knowledge gained while learning to classify cars to recognize birds in the sky. As you can see, in each example the source task and the target task are related. We are not using a text classification model for bird detection.

In summary, transfer learning saves us from reinventing the wheel, meaning we don’t waste time doing the things that have already been done by a major company. Thanks to transfer learning, we can build AI applications in a very short amount of time.

History of Transfer Learning

The history of transfer learning dates back to 1993. With her paper, Discriminability-Based Transfer between Neural Networks, Lorien Pratt opened Pandora’s box and introduced the world to the potential of transfer learning. In July 1997, the journal Machine Learning published a special issue devoted to transfer learning papers. As the field advanced, adjacent topics such as multi-task learning were also included under the field of transfer learning. Learning to Learn is one of the pioneering books in this field. Today, transfer learning is a powerful resource for tech entrepreneurs building new AI solutions and for researchers pushing the frontiers of machine learning.

To show the power of transfer learning, we can quote from Andrew Ng:

Transfer learning will be the next driver of machine learning’s commercial success after supervised learning.


Figure 3. A Depiction of Commercial Potential of Learning Approaches (Figure by Author)

There are three requirements to achieve transfer learning:

  • Development of an Open Source Pre-trained Model by a Third Party
  • Repurposing the Model
  • Fine Tuning for the Problem

Development of an Open Source Pre-trained Model

A pre-trained model is a model created and trained by someone else to solve a similar problem. In practice, that someone is almost always a tech giant or a group of star researchers. They usually choose a very large dataset as their base dataset, such as ImageNet or the Wikipedia Corpus. Then, they create a large neural network (e.g., VGG19 has 143,667,240 parameters) to solve a particular problem (e.g., image classification for VGG19). Of course, this pre-trained model must be made public so that we can take it and repurpose it.

Repurposing the Model

After getting our hands on these pre-trained models, we repurpose the learned knowledge, which includes the layers, features, weights, and biases. There are several ways to load a pre-trained model into our environment. In the end, it is just a file/folder which contains the relevant information. Deep learning libraries already host many of these pre-trained models, which makes them more accessible and convenient:

You can use one of the sources above to load a trained model. It will usually come with all the layers and weights, and you can edit the network as you wish. Additionally, some research labs maintain their own repos, as you will see for ELMo later in this post.

Fine-Tuning for the Problem

While the pre-trained model may work for our problem as is, it is often better to fine-tune it for two reasons:

  • So that we can achieve even higher accuracy;
  • So that our fine-tuned model can generate the output in the correct format.

Generally speaking, in a neural network, the bottom and mid-level layers usually represent general features, while the top layers represent problem-specific features. Since our new problem is different from the original problem, we tend to drop the top layers. By adding layers specific to our problem, we can achieve higher accuracy.

After dropping the top layers, we need to place our own layers so that we can get the output we want. For example, a model trained on English Wikipedia, such as BERT, can be customized by adding additional layers and then trained further on the IMDB Reviews dataset to predict the sentiment of movie reviews.

After adding our custom layers to the pre-trained model, we can configure it with special loss functions and optimizers and fine-tune it with extra training.
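The drop-the-top, add-your-own-layer, then train recipe can be sketched in a few lines of plain Python. This is only a toy under loud assumptions: `PRETRAINED_VECTORS`, `frozen_base`, and `train_head` are hypothetical names invented for this illustration, the "pre-trained" word vectors are made up, and the new head is a simple logistic regression rather than a real neural network layer. The point is the shape of the workflow: the base stays frozen and only the new head is trained.

```python
import math

# Hypothetical stand-in for a pre-trained model: fixed word vectors that we
# pretend were learned on a large corpus. They are never updated below.
PRETRAINED_VECTORS = {
    "great": [1.0, 0.2],
    "good": [0.8, 0.1],
    "loved": [0.9, 0.3],
    "bad": [-0.9, 0.2],
    "boring": [-0.7, 0.1],
    "awful": [-1.0, 0.3],
}


def frozen_base(text):
    """Average the pre-trained vectors of the known words in the text."""
    vectors = [PRETRAINED_VECTORS[word] for word in text.lower().split()
               if word in PRETRAINED_VECTORS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(examples, epochs=200, lr=0.5):
    """Train only the new task-specific head (a logistic regression)."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in examples:
            features = frozen_base(text)
            pred = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
            error = pred - label  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias


def predict(text, weights, bias):
    """Probability that the text is positive, according to the new head."""
    features = frozen_base(text)
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


# Tiny made-up sentiment dataset (1 = positive, 0 = negative).
train_data = [
    ("great movie loved it", 1),
    ("good film", 1),
    ("bad and boring", 0),
    ("awful movie", 0),
]
head_weights, head_bias = train_head(train_data)
print("positive?", predict("loved it great", head_weights, head_bias) > 0.5)
print("positive?", predict("boring awful", head_weights, head_bias) > 0.5)
```

In a real project the frozen base would be a downloaded network such as BERT and the head would be one or more trainable layers, but the division of labor is the same.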

For a quick Transfer Learning tutorial, you may visit the post below:

3 Popular Pre-Trained Model Series for Natural Language Processing

Here are the three pre-trained network series you can use for natural language processing tasks such as text classification, sentiment analysis, text generation, word embedding, and machine translation:


Figure 4. Overall Network Comparison for BERT, OpenAI GPT, ELMo (Figure from the BERT paper)

While BERT and OpenAI GPT are based on the Transformer architecture, ELMo relies on a bidirectional LSTM network.
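One way to read the comparison in Figure 4 is to ask which other tokens each position is allowed to look at. The sketch below is a simplified illustration, not any library's API: `visible_positions` is a hypothetical helper, and the mode names are chosen for this example. A GPT-style model reads left to right, a BERT-style model sees the whole sentence at once, and ELMo, as described in the BERT paper, combines a separate left-to-right pass with a right-to-left pass.

```python
def visible_positions(seq_len, mode):
    """For each position, list the positions it is allowed to look at."""
    if mode == "left_to_right":   # GPT-style: only the current and earlier tokens
        return [list(range(i + 1)) for i in range(seq_len)]
    if mode == "right_to_left":   # the backward pass of an ELMo-style biLSTM
        return [list(range(i, seq_len)) for i in range(seq_len)]
    if mode == "bidirectional":   # BERT-style: every token in the sentence
        return [list(range(seq_len)) for _ in range(seq_len)]
    raise ValueError(f"unknown mode: {mode}")


tokens = ["the", "movie", "was", "great"]
for mode in ("left_to_right", "right_to_left", "bidirectional"):
    print(mode)
    for token, seen in zip(tokens, visible_positions(len(tokens), mode)):
        print(f"  {token:>6} sees {[tokens[j] for j in seen]}")
```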

OK, let’s dive into them one by one.

OpenAI GPT Series (GPT-1, GPT-2, and GPT-3)

There are three generations of GPT models created by OpenAI. GPT, which stands for Generative Pre-trained Transformer, is an autoregressive language model that uses deep learning to produce human-like text. At the time of writing, the most advanced GPT available is GPT-3, and the largest version of GPT-3 has 175 billion parameters. Before the release of GPT-3 in May 2020, the largest pre-trained NLP model was Microsoft’s Turing NLG.
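"Autoregressive" means the model produces text one token at a time, feeding each new token back in as context for the next one. The toy below illustrates only that loop, under heavy assumptions: it is a word-bigram counter, not a Transformer, it conditions on just the previous word whereas GPT conditions on the whole preceding text, and the helper names and the three-sentence corpus are invented for this example.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count which word follows which in a list of sentences."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, start, max_new_tokens=5):
    """Greedy autoregressive decoding: always append the likeliest next word."""
    tokens = [start]
    for _ in range(max_new_tokens):
        followers = counts.get(tokens[-1])
        if not followers:
            break  # nothing ever followed this word in the corpus
        # Sorting first makes ties resolve alphabetically, so runs are repeatable.
        tokens.append(max(sorted(followers), key=followers.get))
    return " ".join(tokens)


corpus = [
    "transfer learning saves time",
    "transfer learning saves time and money",
    "transfer learning works",
]
model = train_bigram_model(corpus)
print(generate(model, "transfer"))
```

Each pass through the loop looks at what has been generated so far and extends it by one word, which is the same generate-then-feed-back pattern a GPT model follows at a vastly larger scale.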

GPT-3 can create very realistic text, which is sometimes difficult to distinguish from human-written text. That’s why its engineers warned of GPT-3’s potential dangers and called for risk mitigation research. Here is a video about 14 cool apps built on GPT-3:

Unlike the creators of most other pre-trained NLP models, OpenAI chose not to share GPT-3’s source code. Instead, they allowed invitation-based API access, and you can apply for a license by visiting their website. Check it out:

On September 22, 2020, Microsoft announced it had licensed “exclusive” use of GPT-3. Therefore, while others have to rely on the API to receive output, Microsoft has control of the source code. Here is brief info about its size and performance:

  • Year Published: 2020 (GPT-3)
  • Size: Unknown
  • Q&A (CoQA): F1 scores of 81.5 in zero-shot, 84.0 in one-shot, and 85.0 in few-shot learning
  • TriviaQA: Accuracy of 64.3%
  • LAMBADA: Accuracy of 76.2%
  • Number of Parameters: 175,000,000,000

BERTs (BERT, RoBERTa (by Facebook), DistilBERT, and XLNet)

BERT stands for Bidirectional Encoder Representations from Transformers, and it is a state-of-the-art machine learning model used for NLP tasks. Jacob Devlin and his colleagues developed BERT at Google in 2018. They trained BERT on English Wikipedia (2.5B words) and BooksCorpus (0.8B words) and achieved the best accuracies on several NLP tasks in 2018. There are two pre-trained general BERT variations: the base model is a 12-layer, 768-hidden, 12-head, 110M-parameter neural network architecture, whereas the large model is a 24-layer, 1024-hidden, 16-head, 340M-parameter neural network architecture. Figure 5 shows the visualization of the BERT network created by Devlin et al.


Figure 5. Overall pre-training and fine-tuning procedures for BERT (Figure from the BERT paper)
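The layer and hidden-size figures for the two BERT variations can be turned into a rough parameter count with simple arithmetic. The sketch below is an estimate under stated assumptions that do not come from this post: a vocabulary of 30,522 word pieces, 512 position embeddings, 2 segment embeddings, a feed-forward size of four times the hidden size, and a final pooling layer, which are the values commonly reported for the uncased English BERT checkpoints. The function name is hypothetical. Note that the number of attention heads does not appear in the formula, because the heads split the same projection matrices rather than adding new ones.

```python
def transformer_encoder_params(layers, hidden, vocab_size=30522,
                               max_positions=512, type_vocab=2):
    """Estimate the parameter count of a BERT-style encoder (assumed layout)."""
    ffn = 4 * hidden                 # assumed feed-forward width
    layer_norm = 2 * hidden          # scale and shift vectors
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, each with a bias, plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> ffn -> hidden) with biases, plus LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


bert_base = transformer_encoder_params(layers=12, hidden=768)
bert_large = transformer_encoder_params(layers=24, hidden=1024)
print(f"BERT base:  about {bert_base / 1e6:.0f}M parameters")
print(f"BERT large: about {bert_large / 1e6:.0f}M parameters")
```

Under these assumptions, both estimates land within a few million of the rounded 110M and 340M figures quoted for the base and large models.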

Even though BERT seems inferior to GPT-3, the public availability of its source code makes the model much more popular among developers. You can easily load a BERT variation for your NLP task using Hugging Face’s Transformers library. Besides, there are several BERT variations, such as original BERT, RoBERTa (by Facebook), DistilBERT, and XLNet. Here is a helpful TDS post on their comparison:

Here is brief info about BERT’s size and performance:

  • Year Published: 2018
  • Size: 440 MB (BERT Base)
  • GLUE Benchmark: Average accuracy of 82.1%
  • SQuAD v2.0: Accuracy of 86.3%
  • Number of Parameters: 110,000,000–340,000,000

ELMo Variations

ELMo, short for Embeddings from Language Models, is a word embedding system for representing words and phrases as vectors. ELMo models the syntax and semantics of words as well as their linguistic context, and it was developed by the Allen Institute for Artificial Intelligence. There are several variations of ELMo, and the largest ELMo model (ELMo 5.5B) was trained on a dataset of 5.5B tokens consisting of Wikipedia (1.9B) and all of the monolingual news crawl data from WMT 2008–2012 (3.6B). While both BERT and GPT models are based on Transformer networks, ELMo models are based on bidirectional LSTM networks.

Here is brief info about ELMo’s size and performance:

  • Year Published: 2018
  • Size: 357 MB (ELMo 5.5B)
  • SQuAD: Accuracy of 85.8%
  • NER: Accuracy of 92.2%
  • Number of Parameters: 93,600,000
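The size and parameter figures listed for these models are linked by simple arithmetic: a checkpoint is roughly the number of parameters times the bytes used to store each one. The sketch below assumes 32-bit floats (4 bytes per parameter) and decimal megabytes; both are assumptions made for illustration, and `checkpoint_size_mb` is a hypothetical helper. Real files also carry some metadata and may use other precisions.

```python
def checkpoint_size_mb(num_parameters, bytes_per_parameter=4):
    """Rough on-disk size in decimal megabytes, assuming dense float storage."""
    return num_parameters * bytes_per_parameter / 1_000_000


models = [
    ("BERT base", 110_000_000),
    ("BERT large", 340_000_000),
    ("ELMo 5.5B", 93_600_000),
    ("GPT-3", 175_000_000_000),
]
for name, params in models:
    size_mb = checkpoint_size_mb(params)
    print(f"{name:<10} {params:>15,} parameters  ~{size_mb:,.0f} MB")
```

For BERT base, this reproduces the 440 MB listed earlier. The ELMo estimate comes out somewhat above the listed 357 MB, which may simply reflect a different megabyte convention. For GPT-3, whose size is listed as unknown, the same assumption would suggest hundreds of gigabytes, which gives a sense of why it is served through an API rather than downloaded.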

Just as with BERT, we also have access to ELMo’s source code. You can download the different ELMo variations from AllenNLP’s website:

Other Pre-Trained Models for Natural Language Processing Problems

Although several other pre-trained NLP models are available (e.g., GloVe), GPT, BERT, and ELMo are currently among the best pre-trained models out there. Since this post aims to introduce these models, we will not have a code-along tutorial. But I will share several tutorials where we exploit these very advanced pre-trained NLP models.


In a world where we have easy access to state-of-the-art neural network models, trying to build your own model with limited resources is like trying to reinvent the wheel. It is pointless.

Instead, try to work with these pre-trained models, add a couple of new layers on top for your particular natural language processing task, and train. The results will be much more successful than those of a model you build from scratch.
