Using Natural Language Processing (NLP), Deep Learning, and GridSearchCV in Kaggle’s Titanic Competition | Machine Learning Tutorials
If you follow my tutorial series on Kaggle’s Titanic Competition (Part-I and Part-II) or have already participated in the Competition, you are familiar with the whole story. If you are not, since this is a follow-up tutorial, I strongly recommend checking out the Competition Page or Part-I and Part-II of this tutorial series. In Part-III (Final) of the series, (i) we will use natural language processing (NLP) techniques to obtain the titles of the passengers, (ii) create an Artificial Neural Network (ANN or RegularNet) to train the model, and (iii) use Grid Search Cross-Validation to tune the ANN so that we get the best results.
Throughout this tutorial series, we try to keep things simple and develop the story slowly and clearly. In Part-I of the tutorial, we learned to write a Python program with fewer than 20 lines to enter Kaggle’s Titanic Competition. Things were kept as simple as possible. We cleaned the non-numerical parts, took care of the null values, trained our model using the train.csv file, predicted the passengers’ survival in the test.csv file, and saved it as a CSV file for submission.
Complete Your First Kaggle Competition in Less Than 20 Lines of Code with Decision Tree Classifier | Machine Learning…towardsdatascience.com
Since we did not explore the dataset properly in Part-I, we focused on data exploration in Part-II using Matplotlib and Seaborn. We imputed the null values using aggregate functions instead of dropping them, cleaned the data more thoroughly, and finally generated dummy variables from the categorical variables. Then, we used a RandomForestClassifier model instead of LogisticRegression, which also improved precision. We achieved approximately a 20% increase in precision compared to the model in Part-I.
Improving Our Code to Obtain Better Results for Kaggle’s Titanic Competition with Data Analysis & Visualization and…towardsdatascience.com
Part-III of the Tutorial
We will now use the Name column to derive the passengers’ titles, which played a significant role in their survival chances. We will also create an Artificial Neural Network (ANN or RegularNet) with Keras to obtain better results, and use GridSearchCV to detect the best parameters for tuning the ANN model. Finally, we will generate a new CSV file for submission.
Preparing the Dataset
As we did in Part-I and Part-II, I will start by cleaning the data and imputing the null values. This time, we will adopt a different approach and combine the two datasets for cleaning and imputing. We already covered why we impute the null values the way we do in Part-II; therefore, we will give you the code straight away. If you feel that some operations do not make sense, you may refer to Part-II or comment below. However, since we saw in Part-II that people younger than 18 had a greater chance of survival, we should add a new feature to measure this effect.
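A minimal sketch of this step, using a tiny stand-in DataFrame (in the real notebook you would `pd.concat` the two CSV files first). The column names follow the Kaggle Titanic dataset; the exact imputation rules (group medians for Age, mode for Embarked, median for Fare) are the Part-II strategy, and the `IsMinor` feature name is an assumption for illustration:

```python
import pandas as pd
import numpy as np

# Tiny stand-in for the combined train + test data; in practice you would
# load train.csv and test.csv and pd.concat them before this step.
df = pd.DataFrame({
    'Pclass':   [1, 1, 3, 3, 3],
    'Sex':      ['female', 'female', 'male', 'male', 'male'],
    'Age':      [38.0, np.nan, 22.0, np.nan, 9.0],
    'Fare':     [71.28, 53.10, 7.25, np.nan, 21.07],
    'Embarked': ['C', 'S', 'S', None, 'S'],
})

# Impute Age with the median of each Pclass/Sex group, Embarked with the
# mode, and Fare with the overall median (the Part-II strategy).
df['Age'] = df.groupby(['Pclass', 'Sex'])['Age'].transform(
    lambda s: s.fillna(s.median()))
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

# New feature: flag passengers younger than 18, who had better survival odds
df['IsMinor'] = (df['Age'] < 18).astype(int)
print(df)
```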
Deriving Passenger Titles with NLP
We will drop the unnecessary columns and generate the dummy variables from the categorical variables above. But first, we need to extract the titles from the Name column. To understand what we are doing, we will start by running the following code to get the first 10 values of the Name column.
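In the notebook this would be `all_data['Name'].head(10)` on the combined dataset; here are a few actual values from the dataset so the pattern is visible:

```python
import pandas as pd

# The first few Name values from the Kaggle Titanic train.csv
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
    'Allen, Mr. William Henry',
])
print(names.head())
```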
Each value in the Name column follows the same structure: the last name, then a comma, then the title ending with a dot, then the first name(s). For example: 'Braund, Mr. Owen Harris'.
Therefore, we need to split these strings on the comma and the dot and extract the title. We can accomplish this with the following code:
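A sketch of the extraction: take the part after the comma, then the part before the dot, and strip the whitespace. In the notebook you would assign the result to a new column, e.g. `all_data['Title'] = ...`:

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
])

# Split on the comma to drop the last name, then on the dot to keep only
# the title, and strip the surrounding whitespace.
titles = names.str.split(',').str[1].str.split('.').str[0].str.strip()
print(titles.tolist())  # ['Mr', 'Mrs', 'Miss']
```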
Once we run this code, we will have a Title column with titles in it. To see what kinds of titles we have, we will run this:
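In the notebook this would be `all_data['Title'].value_counts()`; here is the same call on a small illustrative Series:

```python
import pandas as pd

# Counting how many passengers carry each title
titles = pd.Series(['Mr', 'Mr', 'Miss', 'Mrs', 'Master', 'Dr', 'Mlle'])
print(titles.value_counts())
```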
It seems that we have four major groups, ‘Mr’, ‘Mrs’, ‘Miss’, and ‘Master’, plus a long tail of rarer titles. However, before grouping all the remaining titles as Others, we need to take care of the French titles by converting them to their corresponding English titles with the following code:
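A minimal sketch of the conversion with a mapping dictionary (the mapping of ‘Mlle’, ‘Ms’, and ‘Mme’ shown here is an assumption about which titles appear in your Title column; extend it as needed):

```python
import pandas as pd

# Map French (and variant) titles to their English equivalents;
# titles not in the mapping are left unchanged.
french_to_english = {'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'}
titles = pd.Series(['Mr', 'Mlle', 'Mme', 'Miss', 'Ms'])
titles = titles.replace(french_to_english)
print(titles.tolist())  # ['Mr', 'Miss', 'Mrs', 'Miss', 'Miss']
```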
Now, apart from the four major groups, we only have officer and royalty titles left. It makes sense to combine them as Others. We can achieve this with the following code:
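One way to sketch the grouping: keep any title in the four major groups and replace everything else with 'Others' (the example titles below are illustrative):

```python
import pandas as pd

major_titles = ['Mr', 'Mrs', 'Miss', 'Master']
titles = pd.Series(['Mr', 'Dr', 'Col', 'Miss', 'the Countess'])

# Replace every title outside the four major groups with 'Others'
titles = titles.where(titles.isin(major_titles), 'Others')
print(titles.tolist())  # ['Mr', 'Others', 'Others', 'Miss', 'Others']
```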
Final Touch on Data Preparation
Now that our Titles are more manageable, we can create dummies and drop the unnecessary columns with the following code:
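A sketch of this step on a two-row stand-in frame; which columns you drop depends on your Part-II pipeline, so the `['Name', 'Ticket', 'Cabin']` list here is an assumption:

```python
import pandas as pd

df = pd.DataFrame({
    'Pclass': [3, 1], 'Age': [22.0, 38.0], 'Sex': ['male', 'female'],
    'Title': ['Mr', 'Mrs'], 'Embarked': ['S', 'C'],
    'Name': ['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley'],
    'Ticket': ['A/5 21171', 'PC 17599'], 'Cabin': [None, 'C85'],
})

# Drop columns that carry no extra predictive value now that the Title is
# extracted, then one-hot encode the categorical variables.
df = df.drop(columns=['Name', 'Ticket', 'Cabin'])
df = pd.get_dummies(df, columns=['Sex', 'Embarked', 'Title'])
print(df.columns.tolist())
```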
Creating an Artificial Neural Network for Training
Standardizing Our Data with Standard Scaler
To get a good result, we must scale our data by using Scikit-learn’s StandardScaler. StandardScaler standardizes features by removing the mean and scaling to unit variance (i.e., standardization), which is different from the min-max normalization that MinMaxScaler performs. The mathematical difference between standardization and normalization is as follows:
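In formulas, where $\mu$ is the feature mean, $\sigma$ its standard deviation, and $x_{\min}$, $x_{\max}$ its extremes:

$$z = \frac{x - \mu}{\sigma} \quad \text{(standardization, StandardScaler)}$$

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \quad \text{(min-max normalization, MinMaxScaler)}$$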
We will choose StandardScaler() for scaling our dataset and run the following code:
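A minimal sketch of the scaling step: fit the scaler on the training features and apply the same fitted transformation to the test features (the two-column matrices below are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.92]])
X_test = np.array([[35.0, 8.05]])

# Fit on the training set only, then reuse the same mean/variance
# for the test set to avoid leakage.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Each training column now has zero mean and unit variance
print(X_train_scaled.mean(axis=0))
```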
Building the ANN Model
After standardizing our data, we can start building our artificial neural network. We will create one Input Layer (Dense), one Hidden Layer (Dense), and one Output Layer (Dense). After each layer until the Output Layer, we will apply a 0.2 Dropout for regularization to fight over-fitting. Finally, we will wrap the model with KerasClassifier so that we can apply GridSearchCV to this neural network. As we have 14 explanatory variables, our input_dim must be equal to 14. Since we are doing binary classification, our final output layer must output a single value for the Survived or Not-Survived classification. The other unit counts in between are “try-and-see” values, and we selected 128 neurons.
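A sketch of this architecture in Keras, under the assumptions stated above (14 inputs, 128-unit Dense layers with 0.2 Dropout after each, a single sigmoid output); the `build_ann` function name is illustrative, and building the model inside a function is what lets a KerasClassifier wrapper re-create it during the grid search:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_ann(optimizer='adam'):
    model = keras.Sequential([
        keras.Input(shape=(14,)),               # 14 explanatory variables
        layers.Dense(128, activation='relu'),   # input layer
        layers.Dropout(0.2),                    # regularization
        layers.Dense(128, activation='relu'),   # hidden layer
        layers.Dropout(0.2),
        layers.Dense(1, activation='sigmoid'),  # Survived / Not-Survived
    ])
    model.compile(optimizer=optimizer,
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_ann()
```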
Grid Search Cross-Validation
After building the ANN, we will use scikit-learn’s GridSearchCV to find the best parameters and tune our ANN to get the best results. We will try different optimizers, epochs, and batch sizes with the following code.
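The original approach wraps the Keras model in KerasClassifier and searches it directly; since that wrapper’s import path varies across Keras/TensorFlow versions, here is the same GridSearchCV pattern illustrated with scikit-learn’s own MLPClassifier as a plainly labeled stand-in (the parameter names on the left play the roles of `optimizer`, `epochs`, and `batch_size`, and the synthetic data is only for demonstration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the scaled training data (14 features, like ours)
X, y = make_classification(n_samples=120, n_features=14, random_state=42)

# With a wrapped Keras model the grid would be
# {'optimizer': [...], 'epochs': [...], 'batch_size': [...]}
param_grid = {
    'solver': ['adam', 'sgd'],   # plays the role of 'optimizer'
    'max_iter': [30, 60],        # plays the role of 'epochs'
    'batch_size': [16, 32],
}
grid = GridSearchCV(
    MLPClassifier(hidden_layer_sizes=(128,), random_state=42),
    param_grid)
grid.fit(X, y)
print(grid.best_params_)
```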
After running this code and printing out the best parameters, we obtain the best-performing combination of optimizer, epochs, and batch_size for our network.
Please note that we did not set an explicit cv value in the GridSearchCV call. If you would like to control how many cross-validation folds GridSearchCV uses, select a cv value inside the GridSearchCV call (e.g., cv=5).
Fitting the Model with Best Parameters
Now that we have found the best parameters, we can re-create our classifier with the best parameter values and fit our training dataset with the following code:
Once we obtain the predictions, we can conduct the final operations to make them ready for submission. One thing to note is that our ANN gives us the probabilities of survival, which form a continuous numerical variable, whereas the submission requires a binary categorical variable. Therefore, we apply the lambda function below to convert the continuous values to binary values (0 or 1) and write the results to a CSV file.
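A minimal sketch of that conversion, using a small stand-in Series in place of the real `model.predict(X_test)` output (the probability values are illustrative; 892 is the first PassengerId in test.csv, and the 0.5 threshold is the usual choice):

```python
import pandas as pd

# Stand-in for the ANN's predicted survival probabilities
probabilities = pd.Series([0.91, 0.08, 0.53, 0.44])
passenger_ids = pd.Series([892, 893, 894, 895])

# Convert continuous probabilities to binary Survived / Not-Survived labels
predictions = probabilities.apply(lambda p: 1 if p >= 0.5 else 0)

# Write the submission file in the format Kaggle expects
submission = pd.DataFrame({'PassengerId': passenger_ids,
                           'Survived': predictions})
submission.to_csv('submission.csv', index=False)
print(submission['Survived'].tolist())  # [1, 0, 1, 0]
```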
You have created an artificial neural network to classify the survival of Titanic passengers. Neural networks have been shown to outperform other machine learning algorithms as long as there is a large volume of data. Since our dataset consists of only 1,309 rows, some machine learning algorithms, such as a well-tuned Gradient Boosting Tree or Random Forest, may outperform neural networks. For large datasets, however, this is no longer the case, as the chart below illustrates:
I would say that the Titanic dataset sits on the left side of the intersection point, where traditional algorithms still outperform deep learning algorithms. Even so, we can still achieve an accuracy rate higher than 80%, close to the best accuracy realistically achievable on this dataset.