Improving Our Code to Obtain Better Results for Kaggle’s Titanic Competition with Data Analysis & Visualization and Gradient Boosting Algorithm
In Part-I of this tutorial, we developed a small Python program of fewer than 20 lines that allowed us to enter our first Kaggle competition.
Complete Your First Kaggle Competition in Less Than 20 Lines of Code with Decision Tree Classifier | Machine Learning…towardsdatascience.com
However, this model did not perform very well because we did not explore and prepare the data thoroughly enough to understand it and structure the model better. In Part-II of the tutorial, we will explore the dataset using Seaborn and Matplotlib. We will also introduce and apply new concepts to build a better-performing model. Finally, we will improve our ranking with a second submission.
Using Jupyter or Google Colab Notebook
As mentioned in Part-I, you need to install Python on your system to run any Python code, along with libraries such as NumPy, Pandas, Matplotlib, and Seaborn. You also need an IDE or a text editor to write your code. You may use the editor of your choice, of course. However, I strongly recommend installing Jupyter Notebook with the Anaconda Distribution. Jupyter Notebook utilizes IPython, which provides an interactive shell that makes testing your code much more convenient. So, you should definitely check it out if you are not already using it.
Google Colab Notebook
Google Colab is built on top of the Jupyter Notebook and gives you cloud computing capabilities. Instead of completing all the steps above, you can create a Google Colab notebook, which comes with the libraries pre-installed. So, it is much more streamlined. I recommend Google Colab over Jupyter, but in the end, it is up to you.
Exploring Our Data
To create a good model, we first need to explore our data. Seaborn, a statistical data visualization library, comes in pretty handy. First, let’s remember what our dataset looks like:
and this is the explanation of the variables you see above:
So, now it is time to explore some of these variables’ effects on survival probability!
Our first suspicion is that there is a correlation between a person’s gender (male-female) and his/her survival probability. To be able to understand this relationship, we create a bar plot of the males & females categories against survived & not-survived labels:
As you can see in the plot, females had a greater chance of survival compared to males. Therefore, gender must be an explanatory variable in our model.
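The original plotting code is not reproduced here, so the following is only a minimal sketch of how such a bar plot could be drawn with Seaborn. The `toy` DataFrame is a hypothetical stand-in for train.csv, and choosing `sns.barplot` (rather than, say, a count plot) is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical toy rows that mimic the Sex and Survived columns of train.csv.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Survived": [0, 1, 0, 1, 1, 0],
})

# With a 0/1 target, each bar's height is the mean of Survived,
# which is the survival rate of that group.
ax = sns.barplot(x="Sex", y="Survived", data=toy)
ax.set_ylabel("Survival rate")
```

On the real data, you would pass the train DataFrame instead of `toy`; swapping `x="Sex"` for another categorical column gives the corresponding plot for that column.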
Secondly, we suspect that there is a correlation between the passenger class and survival rate as well. When we plot Pclass against Survival, we obtain the plot below:
Just as we suspected, passenger class has a significant influence on one’s survival chance. It seems that someone traveling in third class had a much lower chance of survival. Therefore, Pclass definitely helps explain survival probability.
Thirdly, we also suspect that the number of siblings and spouses aboard (SibSp) and the number of parents and children aboard (Parch) are significant in explaining the survival chance. Therefore, we plot the SibSp and Parch variables against Survival, and we obtain this:
So, we reach this conclusion: passengers who had siblings, spouses, parents, or children on board generally had a higher chance of survival than those traveling alone. In other words, people traveling with their families had a higher chance of survival.
Another potential explanatory variable (feature) of our model is the Embarked variable. When we plot Embarked against the Survival, we obtain this outcome:
It is clearly visible that people who embarked at Southampton were less fortunate than the others. Therefore, we will also include this variable in our model.
So far, we checked 5 categorical variables (Sex, Pclass, SibSp, Parch, Embarked), and it seems that they all played a role in a person’s survival chance.
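If you prefer numbers to plots, a quick way to double-check these five variables is to compute the survival rate per category with a Pandas groupby. This is only a sketch: the `toy` rows below are made up for illustration, and the `survival_rates` name is our own.

```python
import pandas as pd

# Hypothetical toy rows; on the real data, use the train DataFrame instead.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "SibSp": [1, 1, 0, 0, 0, 0],
    "Parch": [0, 0, 0, 0, 0, 2],
    "Embarked": ["S", "C", "S", "S", "Q", "C"],
    "Survived": [0, 1, 1, 0, 0, 1],
})

categorical = ["Sex", "Pclass", "SibSp", "Parch", "Embarked"]

# The mean of a 0/1 column is the share of survivors in each category.
survival_rates = {col: toy.groupby(col)["Survived"].mean() for col in categorical}

for col, rates in survival_rates.items():
    print(rates, end="\n\n")
```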
Now it is time to work on our numerical variables, Fare and Age. First of all, we would like to see the effect of Age on Survival chance. Therefore, we plot the Age variable (seaborn.distplot, which newer Seaborn releases deprecate in favor of histplot and displot):
We can see that the survival rate is higher for children below 18, while for people above 18 and below 35, this rate is low. Age plays a role in Survival.
Finally, we need to see whether the Fare helps explain the Survival probability. Therefore, we plot the Fare variable (seaborn.distplot):
In general, we can see that as the Fare paid by the passenger increases, the chance of survival increases, as we expected.
We will ignore three columns (Name, Cabin, and Ticket) since we need more advanced techniques to include these variables in our model. To give an idea of how to extract features from these variables: you can tokenize the passengers’ names and derive their titles. Apart from titles like Mr. and Mrs., you will find other titles such as Master or Lady. Surely, this played a role in who was saved that night. Therefore, you can take advantage of the Name column as well as the Cabin and Ticket columns.
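To make the title idea concrete, here is a small sketch that pulls the title out of names written in the "Surname, Title. Given names" format used by the dataset. The regular expression and the `titles` variable are our own choices, not something this tutorial uses later on.

```python
import pandas as pd

# A few names in the "Surname, Title. Given names" format of the Name column.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
], name="Name")

# Capture everything between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(titles.tolist())  # ['Mr', 'Mrs', 'Miss', 'Master']
```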
Checking the Data for Null Values
Null values are our enemies! In the Titanic dataset, we have some missing values. First of all, we will combine the two datasets after dropping the training dataset’s Survived column.
We need to get information about the null values! There are two ways to accomplish this: the .info() method and heatmaps (way cooler!). To detect the nulls, we can use Seaborn’s heatmap with the following code:
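The original snippet is not available here, so the following is a sketch of what it may have looked like. The two small DataFrames are hypothetical stand-ins for train.csv and test.csv, and the `viridis` colormap is an assumption (it draws missing cells in yellow).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-ins for the train and test DataFrames.
train = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", np.nan],
})
test = pd.DataFrame({
    "Age": [34.5, np.nan],
    "Cabin": ["B45", np.nan],
    "Embarked": ["Q", "S"],
})

# Drop the label from train, then stack both sets into one DataFrame.
combined = pd.concat([train.drop(columns="Survived"), test], ignore_index=True)

# isnull() gives a True/False grid; the heatmap colors the True (missing) cells.
ax = sns.heatmap(combined.isnull(), yticklabels=False, cbar=False, cmap="viridis")
```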
Here is the outcome. Yellow lines are the missing values.
There are a lot of missing Age and Cabin values. Two values are missing in the Embarked column while one is missing in the Fare column. Let’s take care of these first. Alternatively, we can use the .info() function to receive the same information in text form:
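As a sketch of the text-based alternative, `.info()` prints the non-null count of every column, and `isnull().sum()` gives the number of missing values directly. The `combined` DataFrame below is a made-up miniature, not the real combined dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the combined dataset.
combined = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Embarked": ["S", "C", "S", np.nan],
})

# Prints column names, non-null counts, and dtypes.
combined.info()

# Counts the missing values per column.
missing_counts = combined.isnull().sum()
print(missing_counts)
```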
Reading the Datasets
We will not get into the details of the dataset since it was covered in Part-I. Using the code below, we can import Pandas & Numpy libraries and read the train & test CSV files.
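A minimal sketch of this step is shown below. The `load_titanic` helper is a hypothetical name of ours, and the file names assume the Kaggle CSV files sit next to your notebook. So that the sketch runs anywhere, it reads two tiny in-memory CSV snippets instead of the real files.

```python
import io

import pandas as pd


def load_titanic(train_source="train.csv", test_source="test.csv"):
    """Read the train and test CSV files (paths or file-like objects)."""
    train = pd.read_csv(train_source)
    test = pd.read_csv(test_source)
    return train, test


# Stand-in CSV text so the sketch runs without the Kaggle files.
sample_train = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22\n"
    "2,1,1,female,38\n"
)
sample_test = io.StringIO(
    "PassengerId,Pclass,Sex,Age\n"
    "892,3,male,34.5\n"
)

train, test = load_titanic(sample_train, sample_test)
print(train.shape, test.shape)
```

With the real files in place, calling `load_titanic()` without arguments would read train.csv and test.csv from the working directory.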
As we know from the above, we have null values in both train and test sets. We need to impute these null values and prepare the datasets for the model fitting and prediction separately.
Imputing Null Values
There are two main approaches to solving the missing values problem in datasets: drop or fill. Dropping is the easy and naive way out, although it might sometimes actually perform better. In our case, we will fill the missing values unless we have decided to drop a whole column altogether.
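The exact cleaning code is not reproduced here, so treat the following as one possible sketch of the fill approach. The specific choices (median for Age and Fare, most frequent value for Embarked, dropping Name, Ticket, and Cabin, and one-hot encoding Sex and Embarked) are assumptions, as are the `clean_data` name and the toy rows.

```python
import numpy as np
import pandas as pd


def clean_data(df):
    """Fill missing values, drop unused text columns, and encode categories."""
    df = df.copy()
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Fare"] = df["Fare"].fillna(df["Fare"].median())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
    df = df.drop(columns=["Name", "Ticket", "Cabin"])
    # Turn the remaining string columns into 0/1 indicator columns.
    return pd.get_dummies(df, columns=["Sex", "Embarked"], dtype=int)


# Hypothetical toy rows with the same column names as the Kaggle files.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 54.0],
    "Ticket": ["t1", "t2", "t3", "t4"],
    "Fare": [7.25, 71.28, np.nan, 51.86],
    "Cabin": [np.nan, "C85", np.nan, "E46"],
    "Embarked": ["S", "C", np.nan, "S"],
})

cleaned = clean_data(toy)
print(cleaned)
```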
The initial look of our dataset is as follows:
After running this code on the train dataset, we get this:
There are no null values, strings, or categories left that would get in our way. Now, we can split the data into two parts, Features (X, or explanatory variables) and Label (y, or response variable), and then use sklearn’s train_test_split() function to make the train and test splits inside the train dataset.
Note: We have another dataset called test. This can be confusing because of Kaggle’s naming. We train and test our model on the train dataset by splitting it into X_train, X_test, y_train, and y_test, and then apply the trained model to the test dataset to generate a predictions file.
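The split described above can be sketched as follows. The random numbers only stand in for the cleaned train dataset, and the `test_size` and `random_state` values are assumptions, not necessarily the ones used for the results in this tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in for the cleaned train dataset (100 rows).
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=100),
    "Age": rng.uniform(1, 80, size=100),
    "Fare": rng.uniform(5, 100, size=100),
    "Survived": rng.integers(0, 2, size=100),
})

X = train.drop(columns="Survived")  # features
y = train["Survived"]               # label

# Hold out 30% of the rows to evaluate the model later.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)
print(X_train.shape, X_test.shape)
```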
Creating and Training a Gradient Boosting Model
In Part-I, we used a basic Decision Tree model as our machine learning algorithm. Another well-known machine learning algorithm is the Gradient Boosting Classifier, and since it usually outperforms a single Decision Tree, we will use it in this tutorial. The code shared below allows us to import the Gradient Boosting Classifier algorithm, create a model based on it, train the model using X_train and y_train, and finally make predictions on X_test.
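Here is a minimal sketch of these steps with scikit-learn. The data comes from `make_classification`, which is only a synthetic placeholder for the cleaned Titanic features, and the default hyperparameters are an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic placeholder data; use the cleaned Titanic features in practice.
X, y = make_classification(n_samples=200, n_features=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# Create the model, train it, and predict the held-out rows.
model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```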
Now, we have the predictions, and we also know the answers since X_test is split from the train dataframe. To be able to measure our success, we can use the confusion matrix and classification report. You can achieve this by running the code below:
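As a sketch, both reports come from `sklearn.metrics`. The six labels below are hand-made purely to show the calls, so their scores say nothing about the model in this tutorial.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hand-made labels, only to demonstrate the two functions.
y_test = [0, 0, 1, 1, 1, 0]
predictions = [0, 1, 1, 1, 0, 0]

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_test, predictions)
report = classification_report(y_test, predictions)

print(cm)
print(report)
```

In the confusion matrix, the diagonal holds the correct predictions, so the accuracy is the diagonal sum divided by the total number of rows.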
And this is the output:
We obtain about 82% accuracy, which may be considered pretty good, although there is still room for improvement.
Create the Prediction File for the Kaggle Competition
Now, we have a trained and working model that we can use to predict the passengers’ survival in the test.csv file.
First, we will clean and prepare the data with the following code (quite similar to how we cleaned the training dataset). Just note that, before removing the PassengerId column, we save it separately under the name ‘ids’.
Finally, we can predict the Survival values of the test dataframe and write to a CSV file as required with the following code.
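The final step could look like the sketch below. The `build_submission` helper, the toy training data, and the three test rows are all hypothetical; the PassengerId and Survived column names follow the submission format Kaggle expects for this competition.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier


def build_submission(model, test_features, ids):
    """Predict Survived for the test rows and pair each value with its id."""
    predictions = model.predict(test_features)
    return pd.DataFrame({"PassengerId": ids, "Survived": predictions})


# Toy model trained on random numbers, standing in for the real trained model.
rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=50),
    "Age": rng.uniform(1, 80, size=50),
})
y_train = np.array([0, 1] * 25)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Hypothetical cleaned test rows; keep the ids before dropping the column.
test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Pclass": [3, 3, 2],
    "Age": [34.5, 47.0, 62.0],
})
ids = test["PassengerId"]
test_features = test.drop(columns="PassengerId")

submission = build_submission(model, test_features, ids)

# Without a path, to_csv returns the CSV text; pass a file name such as
# "submission.csv" to write the file you upload to Kaggle.
csv_text = submission.to_csv(index=False)
print(csv_text)
```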
There you have a new and better model for the Kaggle competition. We made several improvements to our code, which increased the accuracy by around 15–20%. As I mentioned above, there is still some room for improvement, and the accuracy can increase to around 85–86%. However, the leaderboard scores are not very reliable, in my opinion, since many people have used dishonest techniques to increase their ranking.
Part III of This Mini-Series
In Part III, we will use more advanced techniques such as Natural Language Processing (NLP), Deep Learning, and GridSearchCV to increase our accuracy in Kaggle’s Titanic Competition.
Using Natural Language Processing (NLP), Deep Learning, and GridSearchCV in Kaggle’s Titanic Competition | Machine…towardsdatascience.com
Since you are reading this article, I am sure that we share similar interests and are/will be in similar industries. So let’s connect via Linkedin! Please do not hesitate to send a contact request! Orhan G. Yalçın — Linkedin