Figure 1. Photo by Marian Kroell on Unsplash
Using the “Valence Aware Dictionary and sEntiment Reasoner” on the IMDB Reviews Dataset for Rule-based Sentiment Analysis
For a long time, I have been writing about statistical NLP topics and sharing tutorials. The sub-field of statistical NLP is responsible for several impressive advancements in natural language processing, and it arguably has the highest potential among competing approaches. However, in some cases, classical rule-based natural language processing is worth considering.
When to Use Rule-Based Approach Instead of Statistical NLP
When a team has deep pockets, plenty of talented researchers, and a general problem to solve, statistical NLP is usually the preferred way to tackle it. But in the following cases, the rule-based approach might be fruitful:
1 — Domain-Specific Problem:
We have great pre-trained models such as GPT-3, BERT, ELMo, which do wonders on generic language problems. However, when we try to use them in domain-specific problems such as financial news sentiment analysis or legal text classification, the specificity required for such tasks may not be satisfied by these state-of-the-art models. Therefore, we either have to fine-tune these models with additional labeled data or rely on rule-based models.
2 — Lack of Labeled Data:
Even though we might want to fine-tune a model, it may not always be possible. If you are on a small team or don’t have the funds to hire annotators via crowdsourcing platforms such as Amazon Mechanical Turk, you cannot generate the labeled data needed to fine-tune a pre-trained model, not to mention build your own deep learning model. It may also be impossible to collect a meaningful amount of data in the first place. After all, statistical NLP models are very data-hungry.
3 — Limited Available Funding for Training:
Even if you have some labeled domain-specific data, training a dedicated model has its own cost. Not only would you need a group of star data scientists, but you would also need distributed servers to train your model, and your pockets may not be that deep.
If you have one of these issues, your best bet might be rule-based NLP, and the accuracy of rule-based models is not as bad as you might think. In this post, we will build a simple lexicon-based sentiment classifier without much tuning, and we will achieve acceptable accuracy, which could be increased even further.
Before starting, though, let’s cover some basics:
What is a Lexicon?
Lexicon sounds like a fancy technical term, but it means a dictionary, usually in a particular domain. In other words:
A lexicon is the vocabulary of a person, language, or branch of knowledge.
In a rule-based NLP study for sentiment analysis, we need a lexicon that serves as a reference manual to measure the sentiment of a chunk of text (e.g., word, phrase, sentence, paragraph, full text). Lexicon-based sentiment analysis can be as simple as positive-labeled words minus negative-labeled words to see if a text has a positive sentiment. It can also be very complex with negation rules, distance calculations, added-variance, and several additional rules. One of the main differences between rule-based NLP and statistical NLP is that in rule-based NLP, the researcher is completely free to add any rule they deem useful. Therefore, in rule-based NLP, what we usually see is that highly trained experts develop theory-based rules in a particular domain and apply them to a particular problem in this particular domain.
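To make the simplest variant concrete, here is a minimal sketch of positive-minus-negative counting. The tiny `POSITIVE` and `NEGATIVE` word sets and the `simple_sentiment` helper are made up for illustration; they are not taken from any published lexicon:

```python
import re

# Hypothetical toy lexicon; a real project would load a published word list.
POSITIVE = {"good", "great", "brilliant", "amazing"}
NEGATIVE = {"bad", "terrible", "boring", "awful"}


def simple_sentiment(text: str) -> int:
    """Return (number of positive words) - (number of negative words)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives
```

A score above zero would be read as positive and a score below zero as negative. Note that "not bad" scores -1 with this approach, which is exactly the kind of case that negation rules are meant to handle.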
What is VADER?
One of the most popular rule-based sentiment analysis models is VADER. VADER, or Valence Aware Dictionary and sEntiment Reasoner, is a lexicon and rule-based sentiment analysis tool specifically attuned to sentiments expressed in social media.
VADER is like the GPT-3 of Rule-Based NLP Models.
Since it is tuned for social media content, it performs best on the content you can find on social media. However, it still offers acceptable F1 scores on other test sets, and its authors report performance comparable to statistical models such as Support Vector Machines.
Note that there are several alternative lexicons that you can use for your project, such as Harvard’s General Inquirer, Loughran-McDonald, and Hu & Liu. In this tutorial, we will adopt VADER’s lexicon along with its methodology.
Now that you have a basic understanding of rule-based NLP models, we can proceed with our tutorial. This tutorial will approach a classic sentiment analysis problem from a rule-based NLP perspective: A Lexicon-based sentiment analysis on the IMDB Reviews Dataset.
Sentiment Analysis on IMDB Dataset
What is IMDB Reviews Dataset?
The IMDB Reviews Dataset is a large movie review dataset collected and prepared by Andrew L. Maas from the popular movie rating service, IMDB. It is used for binary sentiment classification: deciding whether a review is positive or negative. It contains 25,000 movie reviews for training and 25,000 for testing. All of these 50,000 reviews are labeled and may be used for supervised deep learning. Besides, there are an additional 50,000 unlabeled reviews that we will not use in this case study. In this case study, we will only use the training dataset.
Loading and Processing the Dataset
We will start by loading the IMDB dataset with the Keras datasets API. However, Keras provides the reviews in encoded form: each word is replaced by an integer index. Luckily, we can also load the word index dictionary to decode them back to the original text. The following lines will load the encoded reviews along with the index. We will also create the reverse index for decoding:
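A minimal sketch of this step, assuming TensorFlow's bundled Keras is installed. The helper names `load_imdb_training_data` and `build_reverse_index` are my own, and the import sits inside the function only so that nothing is downloaded until you call it:

```python
def build_reverse_index(word_index):
    """Invert a word -> integer mapping into an integer -> word mapping."""
    return {index: word for word, index in word_index.items()}


def load_imdb_training_data():
    """Return the encoded training reviews, their labels, and the word index."""
    # Imported lazily: TensorFlow is heavy and the calls below download data
    # the first time they run.
    from tensorflow.keras.datasets import imdb

    (x_train, y_train), _ = imdb.load_data()
    word_index = imdb.get_word_index()
    return x_train, y_train, word_index
```

Typical usage would be `x_train, y_train, word_index = load_imdb_training_data()` followed by `reverse_index = build_reverse_index(word_index)`.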
Before decoding the entire dataset, let’s see the operation with an example:
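Here is a sketch of decoding a single review, using a toy index so that it runs without downloading anything. It assumes Keras's default encoding, where ids 0 to 2 are reserved and real word ids are shifted up by 3. The `decode_review` helper and the `?` placeholder are illustrative choices:

```python
def decode_review(encoded, reverse_index, index_from=3, unknown="?"):
    """Turn a list of integer word ids back into a space-separated string."""
    # With Keras's default settings, ids 0-2 are reserved (padding, start,
    # unknown), so every real word id is offset by `index_from`.
    words = [reverse_index.get(i - index_from, unknown) for i in encoded]
    return " ".join(words)


# Toy data standing in for the real reverse index and the first review.
toy_reverse_index = {1: "the", 2: "film", 3: "was", 4: "brilliant"}
toy_review = [1, 4, 5, 6, 7]  # the leading 1 is the start-of-sequence marker

print(decode_review(toy_review, toy_reverse_index))  # prints "? the film was brilliant"
```

With the real data, the same call would be `decode_review(x_train[0], reverse_index)`.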
Output: this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came...
As you can see, we can decode our encoded reviews using the reversed index. Since we can decode one review, all we need for the rest is a for loop. With the code below, we will create a nested list in which we place the sentiment label and the decoded review text. We also need to do some error handling due to a typo in the dataset (apparently the Keras team encoded one of the words wrong :/). But the code below handles this error as well:
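One way to write that loop is sketched below. I am assuming the error shows up as a `KeyError` when an id is missing from the reverse index, and that simply skipping such ids is acceptable. The `decode_all` helper is hypothetical:

```python
def decode_all(encoded_reviews, labels, reverse_index, index_from=3):
    """Return [[label, text], ...], skipping any id that cannot be decoded."""
    decoded = []
    for encoded, label in zip(encoded_reviews, labels):
        words = []
        for i in encoded:
            try:
                words.append(reverse_index[i - index_from])
            except KeyError:
                # Reserved markers and ids missing from the index land here.
                continue
        decoded.append([int(label), " ".join(words)])
    return decoded


# Toy data: two short "reviews" and their labels (1 = positive, 0 = negative).
toy_reverse_index = {1: "the", 2: "film", 3: "was", 4: "brilliant"}
toy_reviews = [[1, 4, 5], [1, 6, 7, 99]]
print(decode_all(toy_reviews, [1, 0], toy_reverse_index))
# prints [[1, 'the film'], [0, 'was brilliant']]
```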
Finally, we will create a pandas DataFrame from the nested list we created above:
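A short sketch with two toy rows in place of the real decoded reviews. `Text` is the column name used later in this post; `Label` is a name I picked for the sentiment column:

```python
import pandas as pd

# Toy rows standing in for the decoded [label, text] pairs.
rows = [
    [1, "this film was just brilliant"],
    [0, "a dull and terrible movie"],
]

# "Label" is a hypothetical column name; "Text" is referred to later on.
df = pd.DataFrame(rows, columns=["Label", "Text"])
print(df.shape)  # prints (2, 2)
```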
Now that our data is ready, we can load VADER.
Loading VADER Sentiment Intensity Analyzer
NLTK, one of Python’s most widely used NLP libraries, provides a module for VADER, and we can easily import a SentimentIntensityAnalyzer with the following code:
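A minimal sketch, assuming NLTK is installed. The analyzer needs the `vader_lexicon` resource, so the hypothetical `load_vader` helper below downloads it first (this does nothing once the file is cached):

```python
def load_vader():
    """Return a ready-to-use VADER SentimentIntensityAnalyzer."""
    # Imported lazily so that defining this function has no side effects.
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    # The analyzer reads its word list from the "vader_lexicon" resource.
    nltk.download("vader_lexicon", quiet=True)
    return SentimentIntensityAnalyzer()
```

Calling `sentiment = load_vader()` once at the top of a notebook is enough; the same object can score every review.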
To test how our model works, let’s feed a simple sentence with neutral and negative words:
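The call itself is a one-liner. The sketch below wraps it in a hypothetical `score_sentence` helper that takes an already-created analyzer and returns the four scores in a fixed order:

```python
def score_sentence(analyzer, text):
    """Return the four VADER scores for `text` as a dict."""
    scores = analyzer.polarity_scores(text)
    # Keep the four documented keys in a fixed order for easy printing.
    return {key: scores[key] for key in ("neg", "neu", "pos", "compound")}
```

With a loaded analyzer, `score_sentence(analyzer, "Hello, world. I am terrible")` returns the scores for our dummy sentence.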
We use the polarity_scores method of our SentimentIntensityAnalyzer model. This model gives us four scores: (i) Negativity, (ii) Positivity, (iii) Neutrality score of the sentence, and finally, (iv) Compound sentiment score of the sentence. The compound score sums the rule-adjusted valence scores of the words in the text and normalizes the result to a range between -1 (most negative) and +1 (most positive), and we will be using this score to measure the sentiment of our reviews. Here is the output of our dummy sentence: “Hello, world. I am terrible”.
Calculating Polarity Scores and Predicting:
Since we successfully calculated the sentiment score of a single sentence, all we have to do is run a loop for the entire dataset. Before calculating the score, although not necessary, I will shuffle the dataset first:
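One common way to shuffle a DataFrame is `sample(frac=1)`. The sketch below uses a toy frame, and the `random_state` value is arbitrary:

```python
import pandas as pd

df = pd.DataFrame({"Label": [1, 0, 1, 0], "Text": ["a", "b", "c", "d"]})

# sample(frac=1) returns every row in random order; random_state makes the
# shuffle repeatable, and reset_index gives the rows a clean 0..n-1 index.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
```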
Instead of a for loop, we will use a more concise alternative: apply a lambda function to the Text column of our dataset and save the results in a new column named Prediction. The following single line does both:
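Here is a sketch of that step. To keep it runnable without the lexicon, the scoring function is passed in as `compound_fn`. Both helper names are hypothetical, and treating a compound score of exactly 0 as positive is my assumption:

```python
import pandas as pd


def compound_to_label(compound):
    """Map a compound score to 1 (positive) or -1 (negative)."""
    # Assumption: a score of exactly 0 is counted as positive.
    return 1 if compound >= 0 else -1


def add_predictions(df, compound_fn):
    """Return a copy of df with a Prediction column derived from Text."""
    out = df.copy()
    out["Prediction"] = out["Text"].apply(
        lambda text: compound_to_label(compound_fn(text))
    )
    return out
```

With a loaded VADER analyzer, the call would look like `add_predictions(df, lambda text: analyzer.polarity_scores(text)["compound"])`.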
Editing Labels and Creating Accuracy Column:
The code above will simply convert negative compound scores to -1 and positive compound scores to 1. Since our IMDB Reviews dataset has 1 for positive sentiments and 0 for negative sentiments, we will replace the 0s with -1s so that we can compare labels with predictions and calculate accuracy, which will be stored in a new column called Accuracy:
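A sketch on toy data. The `Label` column name and the 1/0 encoding of the Accuracy column are my assumptions:

```python
import pandas as pd

df = pd.DataFrame({"Label": [1, 0, 1, 0], "Prediction": [1, -1, -1, 1]})

# Map the dataset's 0/1 labels onto the -1/1 scale used by the predictions.
df["Label"] = df["Label"].replace(0, -1)

# 1 where the prediction matches the label, 0 otherwise.
df["Accuracy"] = (df["Label"] == df["Prediction"]).astype(int)

print(df["Accuracy"].mean())  # prints 0.5 for this toy data
```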
Create a Column for Confusion Matrix
Finally, I want to create a confusion matrix to properly measure our rule-based NLP sentiment classifier’s success. A confusion matrix shows True Positives, True Negatives, False Positives, and False Negatives, which we can use to calculate Accuracy, Recall, Precision, and F1 Scores. With the lines below, I will create a custom function to generate confusion matrix tags and apply them as a lambda function, as I did above:
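The tagging logic can be sketched as a small function over (label, prediction) pairs on the -1/1 scale. The name `conf_matrix_tag` and the `Label` column in the usage note are illustrative:

```python
def conf_matrix_tag(label, prediction):
    """Return 'TP', 'TN', 'FP', or 'FN' for labels/predictions in {-1, 1}."""
    if prediction == 1:
        return "TP" if label == 1 else "FP"
    return "TN" if label == -1 else "FN"


# One pair per outcome: (label, prediction).
pairs = [(1, 1), (-1, -1), (-1, 1), (1, -1)]
tags = [conf_matrix_tag(label, prediction) for label, prediction in pairs]
print(tags)  # prints ['TP', 'TN', 'FP', 'FN']
```

On a DataFrame, the same function can be applied row by row: `df["Conf_Matrix"] = df.apply(lambda row: conf_matrix_tag(row["Label"], row["Prediction"]), axis=1)`.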
The code above will create a column, Conf_Matrix, and tag each prediction with abbreviations of True Positives, True Negatives, False Positives, and False Negatives.
Let’s see how the tail of our final DataFrame looks:
Calculating Accuracy, Recall, Precision, and F1 Score:
To see how we did with our VADER model, I will use several custom formulas to calculate Accuracy, Recall, Precision, and F1 Score. Although there are several ready-made library functions for confusion matrix calculations, I decided to use custom calculations:
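These formulas can be sketched directly from the tag counts. The `classification_metrics` helper is hypothetical and assumes there is at least one predicted positive and one actual positive, so that no denominator is zero:

```python
from collections import Counter


def classification_metrics(tags):
    """Compute accuracy, precision, recall, and F1 from TP/TN/FP/FN tags."""
    counts = Counter(tags)
    tp, tn, fp, fn = (counts[key] for key in ("TP", "TN", "FP", "FN"))

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)  # share of predicted positives that are right
    recall = tp / (tp + fn)     # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy tags: 3 TP, 1 TN, 1 FP, 3 FN.
toy_tags = ["TP"] * 3 + ["TN"] + ["FP"] + ["FN"] * 3
print(classification_metrics(toy_tags))
```

With the real data, the input would be the Conf_Matrix column, for example `classification_metrics(df["Conf_Matrix"])`.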
Thanks to the Conf_Matrix column, these calculations are straightforward to handle, and here are the results:
As you can see, without a single second of training or customization, we achieved 70% accuracy on the movie reviews dataset. Don’t forget that VADER is a social-media lexicon. Therefore, a lexicon built from movie reviews would likely give us even better performance.
You have successfully built a sentiment classifier that is based on rule-based NLP. One of the biggest advantages of rule-based NLP methods is that they are fully explainable as opposed to glamorous transformer-based NLP models such as BERT and GPT-3. Therefore, apart from its budget-friendly nature, the model’s explainability is another great reason to rely on rule-based NLP models, especially in sensitive areas.