
Using Natural Language Processing (NLP), Deep Learning, and GridSearchCV in Kaggle’s Titanic Competition | Machine Learning Tutorials

Figure 1. Titanic Under Construction on Unsplash

If you follow my tutorial series on Kaggle’s Titanic Competition (Part-I and Part-II) or have already participated in the Competition, you are familiar with the whole story. If you are not familiar with it, since this is a follow-up tutorial, I strongly recommend checking out the Competition Page or Part-I and Part-II of this tutorial series. In Part-III (Final) of the series, we will (i) use natural language processing (NLP) techniques to obtain the titles of the passengers, (ii) create an Artificial Neural Network (ANN or RegularNet) to train the model, and (iii) use Grid Search Cross-Validation to tune the ANN so that we get the best results.

Let’s start!

Background

Part-I of the Tutorial

Throughout this tutorial series, we try to keep things simple and develop the story slowly and clearly. In Part-I of the tutorial, we learned how to write a Python program of fewer than 20 lines to enter Kaggle’s competition. Things were kept as simple as possible: we cleaned the non-numerical parts, took care of the null values, trained our model using the train.csv file, predicted the passengers’ survival in the test.csv file, and saved the predictions as a CSV file for submission.

Part-II of the Tutorial

Since we did not explore the dataset properly in Part-I, we focused on data exploration in Part-II, using Matplotlib and Seaborn. We imputed the null values instead of dropping them by using aggregated functions, cleaned the data more thoroughly, and generated dummy variables from the categorical variables. Then, we trained a RandomForestClassifier model instead of LogisticRegression, which improved the results: an increase in accuracy of approximately 20% compared to the model in Part-I.

Part-III of the Tutorial

Figure 2. A Diagram of an Artificial Neural Network with one Hidden Layer (Figure by Author)

We will now use the Name column to derive the passengers’ titles, which played a significant role in their survival chances. We will also create an Artificial Neural Network (ANN or RegularNet) with Keras to obtain better results. Then, to tune the ANN model, we will use GridSearchCV to detect the best parameters. Finally, we will generate a new CSV file for submission.

Preparing the Dataset

As in Part-I and Part-II, we will start by cleaning the data and imputing the null values. This time, however, we will adopt a different approach and combine the two datasets for cleaning and imputation. We already covered in Part-II why we impute the null values the way we do; therefore, the code is given straight away. If some operations do not make sense, you may refer to Part-II or comment below. In addition, since we saw in Part-II that people younger than 18 had a greater chance of survival, we will add a new feature to measure this effect.

Data Cleaning and Null Value Imputation
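A sketch of what this step might look like, assuming the two CSV files from the competition and the Part-II imputation strategy (grouped medians for Age, the mode for Embarked, the median for Fare); the exact grouping columns are an assumption here:

```python
import pandas as pd

# Load and combine both datasets so cleaning and imputation happen once.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
all_data = pd.concat([train, test], sort=False).reset_index(drop=True)

# Impute Age with the median of each Pclass/Sex group (aggregated
# imputation as in Part-II; the grouping columns are an assumption).
all_data['Age'] = all_data.groupby(['Pclass', 'Sex'])['Age'] \
                          .transform(lambda s: s.fillna(s.median()))

# Impute the remaining null values.
all_data['Embarked'] = all_data['Embarked'].fillna(all_data['Embarked'].mode()[0])
all_data['Fare'] = all_data['Fare'].fillna(all_data['Fare'].median())

# New feature: passengers younger than 18 had a better chance of survival.
all_data['IsChild'] = (all_data['Age'] < 18).astype(int)
```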

Deriving Passenger Titles with NLP

We will drop the unnecessary columns and generate dummy variables from the categorical variables above. But first, we need to extract the titles from the Name column. To understand what we are doing, we will start by running the following code to get the Name column values of the first 10 rows:

Name Column Values of the First 10 Rows
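Assuming the combined DataFrame from the cleaning step is named all_data, this is a one-liner:

```python
all_data['Name'].head(10)
```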

And here is what we get:

Figure 3. Name Column Values of the First 10 Rows (Figure by Author)

The structure of the Name column values is as follows:

<Last-Name>, <Title>. <First-Name>

Therefore, we need to split these strings on the comma and the dot to extract the title. We can accomplish this with the following code:

Splitting the Name Values to Extract Titles
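One way to do the split (an assumption about the original gist), relying only on pandas string methods:

```python
# "Braund, Mr. Owen Harris" -> split on ',' then on '.' -> " Mr" -> "Mr"
all_data['Title'] = (all_data['Name']
                     .str.split(',').str[1]
                     .str.split('.').str[0]
                     .str.strip())
```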

Once we run this code, we will have a Title column with the titles in it. To see what kinds of titles we have, we will run this:

Grouping Titles and Getting the Counts
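A direct way to do this, assuming the Title column created above:

```python
# Count how many passengers carry each title.
all_data.groupby('Title')['Title'].count()
# or, equivalently:
all_data['Title'].value_counts()
```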

Figure 4. Unique Title Counts (Figure by Author)

It seems that we have four major groups: ‘Mr’, ‘Mrs’, ‘Miss’, and ‘Master’, along with a number of less common titles. However, before grouping all the other titles as Others, we need to take care of the French titles and convert them to their corresponding English titles with the following code:

French to English Title Converter
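A simple mapping does the job; the exact pairs below are the usual ones (Mlle and Mme are French, and Ms is included for completeness) and are an assumption about the original gist:

```python
# Map French (and non-standard) titles to their English equivalents.
french_to_english = {'Mlle': 'Miss', 'Mme': 'Mrs', 'Ms': 'Miss'}
all_data['Title'] = all_data['Title'].replace(french_to_english)
```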

Now, apart from the four major groups, only officer and royalty titles remain. It makes sense to combine them as Others, which we can achieve with the following code:

Combining All the Non-Major Titles as Others (Contains Officer and Royal Titles)
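Something along these lines, keeping the four major groups and folding everything else into Others:

```python
major_titles = ['Mr', 'Mrs', 'Miss', 'Master']
all_data['Title'] = all_data['Title'].apply(
    lambda t: t if t in major_titles else 'Others')
```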

Figure 5. Final Unique Title Counts (Figure by Author)

Final Touch on Data Preparation

Now that our Titles are more manageable, we can create dummies and drop the unnecessary columns with the following code:

Final Touch on Data Preparation
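A plausible version of this step; the exact list of dropped columns is an assumption, but with it the feature matrix ends up with the 14 explanatory variables mentioned below:

```python
# Drop columns that carry no further information for the model.
all_data = all_data.drop(columns=['Name', 'Ticket', 'Cabin', 'PassengerId'])

# One-hot encode the categorical variables.
all_data = pd.get_dummies(
    all_data, columns=['Pclass', 'Sex', 'Embarked', 'Title'], drop_first=True)

# Split back into train and test parts (train.csv has 891 rows).
X_train = all_data[:len(train)].drop(columns=['Survived'])
y_train = all_data[:len(train)]['Survived']
X_test = all_data[len(train):].drop(columns=['Survived'])
```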

Creating an Artificial Neural Network for Training

Figure 6. A Diagram of an Artificial Neural Network with Two Hidden Layers (Figure by Author)

Standardizing Our Data with Standard Scaler

To get a good result, we must scale our data using scikit-learn’s StandardScaler. StandardScaler standardizes features by removing the mean and scaling to unit variance (i.e., standardization), which is different from the normalization performed by MinMaxScaler. The mathematical difference between standardization and normalization is as follows:

Figure 7. Standardization vs. Normalization (Figure by Author)
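In symbols, with mean μ and standard deviation σ:

```latex
z = \frac{x - \mu}{\sigma}
\quad \text{(standardization)}
\qquad
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
\quad \text{(min-max normalization)}
```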

We will choose StandardScaler() for scaling our dataset and run the following code:

Scaling Train and Test Datasets
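A sketch, assuming the X_train/X_test matrices built above: the scaler is fit on the training features only and then applied to both sets.

```python
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit on train, then transform
X_test = sc.transform(X_test)        # reuse the train statistics
```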

Building the ANN Model

After standardizing our data, we can start building our artificial neural network. We will create one Input Layer (Dense), one Hidden Layer (Dense), and one Output Layer (Dense). After each layer before the Output Layer, we will apply a 0.2 Dropout for regularization to fight overfitting. Finally, we will wrap the model with KerasClassifier so that we can apply GridSearchCV to this neural network. As we have 14 explanatory variables, our input_dim must be equal to 14. Since we are making a binary classification, our final Output Layer must output a single value for the Survived or Not-Survived classification. The unit counts of the layers in between are “try-and-see” values; we selected 128 neurons.

Building an ANN with Keras Classifier
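A minimal sketch of such a model, assuming an older TensorFlow release that still ships the scikit-learn wrapper (on recent versions, use scikeras.wrappers.KerasClassifier instead):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

def build_classifier(optimizer='adam'):
    model = Sequential([
        Dense(128, input_dim=14, activation='relu'),  # input layer
        Dropout(0.2),
        Dense(128, activation='relu'),                # hidden layer
        Dropout(0.2),
        Dense(1, activation='sigmoid'),               # P(Survived)
    ])
    model.compile(optimizer=optimizer,
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

# Wrap the Keras model so scikit-learn tools can use it.
classifier = KerasClassifier(build_fn=build_classifier)
```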

Grid Search Cross-Validation

After building the ANN, we will use scikit-learn’s GridSearchCV to find the best parameters and tune our ANN for the best results. We will try different optimizers, epochs, and batch sizes with the following code:

Grid Search with Keras Classifier
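A sketch of the search; the candidate values in the grid are illustrative, not the ones from the original gist:

```python
from sklearn.model_selection import GridSearchCV

parameters = {
    'batch_size': [16, 32],
    'epochs': [100, 200],
    'optimizer': ['adam', 'rmsprop'],
}
grid_search = GridSearchCV(estimator=classifier,
                           param_grid=parameters,
                           scoring='accuracy')
grid_search = grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(grid_search.best_score_)
```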

After running this code and printing out the best parameters, we get the following output:

Figure 8. Best Parameters and the Accuracy

Please note that we did not pass a cv value to GridSearchCV, so it falls back to its default number of cross-validation folds. If you would like to control the cross-validation explicitly, set a cv value inside GridSearchCV (e.g., cv=5).

Fitting the Model with Best Parameters

Now that we have found the best parameters, we can re-create our classifier with the best parameter values and fit our training dataset with the following code:

Fitting with the Best Parameters
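Reading the winning values back from the search object keeps the code free of hard-coded numbers:

```python
best = grid_search.best_params_

model = build_classifier(optimizer=best['optimizer'])
model.fit(X_train, y_train,
          batch_size=best['batch_size'],
          epochs=best['epochs'])

# Survival probabilities for the test set
y_pred = model.predict(X_test)
```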

Once we obtain the predictions, we can run the final operations to make them ready for submission. Note that our ANN outputs probabilities of survival, which are continuous numerical values; Kaggle, however, expects a binary categorical variable. Therefore, we use the lambda function below to convert the continuous values to binary values (0 or 1) and write the results to a CSV file.

Creating the Submission File
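A sketch of the final step, assuming the names used in the snippets above; the 0.5 threshold turns probabilities into 0/1 labels:

```python
submission = pd.DataFrame({
    'PassengerId': test['PassengerId'],
    'Survived': y_pred.ravel(),
})

# Convert continuous survival probabilities to binary predictions.
submission['Survived'] = submission['Survived'].apply(
    lambda p: 1 if p >= 0.5 else 0)

submission.to_csv('submission.csv', index=False)
```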

Congratulations

Figure 9. Deep Learning vs. Older Algorithms (Figure by Author)

You have created an artificial neural network to predict the survival of Titanic passengers. Neural networks have been shown to outperform all other machine learning algorithms as long as there is a large volume of data. Since our dataset consists of only 1,309 rows, some machine learning algorithms, such as a well-tuned Gradient Boosting Tree or Random Forest, may outperform neural networks. However, for datasets with large volumes, this is not the case, as you can see in the chart above (Figure 9).

I would say that the Titanic dataset sits on the left side of that intersection, where the older algorithms outperform deep learning algorithms. Even so, we can still achieve an accuracy rate higher than 80%, around the natural accuracy ceiling of this dataset.
