Complete Your First Kaggle Competition in Less Than 20 Lines of Code with Decision Tree Classifier | Machine Learning Tutorials
Since you are reading this article, I am sure that we share similar interests and are/will be in similar industries. So let’s connect via Linkedin! Please do not hesitate to send a contact request! Orhan G. Yalçın — Linkedin
If you are interested in machine learning, you have probably heard of Kaggle. Kaggle is a platform where you can learn a lot about machine learning with Python and R, do data science projects, and (this is the most fun part) join machine learning competitions. Competitions are changed and updated over time. Currently, “Titanic: Machine Learning from Disaster” is “the beginner’s competition” on the platform. In this post, we will create a ready-to-upload submission file with fewer than 20 lines of Python code. To do this, we will use the Pandas and Scikit-Learn libraries.
RMS Titanic and the Infamous Accident
RMS Titanic was the largest ship afloat when it entered service, and it sank after colliding with an iceberg during its first voyage to the United States on 15 April 1912. There were 2,224 passengers and crew aboard during the voyage, and unfortunately, 1,502 of them died. It was one of the deadliest commercial peacetime maritime disasters in the 20th century.
One of the main reasons for such a high number of casualties was the lack of sufficient lifeboats for the passengers and the crew. Although luck played a part in surviving the accident, some people, such as women, children, and upper-class passengers, were more likely to survive than the rest. We will train a model that learns how features like these affect the likelihood of survival. And we will accomplish this in fewer than 20 lines of code and have a file ready for submission. … Let’s Get Started!
Download the Data
The Titanic dataset is an open dataset that you can get from many different repositories and GitHub accounts. However, downloading it from Kaggle is the best choice, as other sources may have slightly different versions and may not offer separate train and test files. So, please visit this link to download the datasets (Train.csv and Test.csv) to get started.
Normally our Train.csv file looks like this in Excel:
After converting it to the table in Excel (Data->Text to Columns), we get this view:
Way nicer, right! Now, we can clearly see that we have 12 variables. While the “Survived” variable represents whether a particular passenger survived the accident, the rest hold the essential information about this passenger. Here is a brief explanation of the variables:
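If you prefer to stay out of Excel, pandas can show the same structure. The sketch below is only an illustration: it reads two invented rows laid out like Kaggle's training file (they are not real passenger records), so it runs without the download. Going by Kaggle's data dictionary, Survived is 0 or 1, Pclass is the ticket class (1, 2, or 3), SibSp and Parch count siblings/spouses and parents/children aboard, and Embarked is the port of embarkation (C, Q, or S).

```python
import io

import pandas as pd

# Two invented rows in the column layout of Kaggle's training file.
# They are placeholders for illustration, not real passenger records.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,30,0,0,T1,8.0,,S
2,1,1,"Roe, Mrs. Jane",female,28,1,0,T2,70.0,C10,C
"""

# With the downloaded file you would pass its path instead of the StringIO object.
train = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(train.shape)             # (number of rows, number of columns)
print(train.columns.tolist())  # the 12 variable names
print(train.isna().sum())      # count of empty values per column
```

The last line is worth a look on the real file, because the empty values it reports are what the cleaning step has to deal with.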
Load and Process The Training Data
I assume that you have your Python environment installed. However, if you don’t have Python on your computer, you may refer to this link for Windows and this link for macOS. After making sure that you have Python installed on your system, open your favorite IDE, and start coding!
Note that using a Google Colab Notebook is another option, which does not require local Python3 installation.
First, we will load the training data and clean it to get it ready for training our model. We will (i) load the data, (ii) delete the rows with empty values, (iii) select the “Survived” column as our response variable, (iv) drop the for-now irrelevant explanatory variables, and (v) convert categorical variables to dummy variables. We will accomplish all this with 7 lines of code:
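Here is a minimal sketch of steps (i) to (v), not necessarily the exact original listing. Two things are assumptions on my part: the helper name prepare_training_data is made up for this example, and the list of "for-now irrelevant" columns (PassengerId, Name, Ticket, Cabin) is one plausible choice. The four sample rows are invented so the block runs on its own.

```python
import io

import pandas as pd

# Invented rows in the layout of the training file (not real passenger records).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,30,0,0,T1,8.0,E1,S
2,1,1,"Roe, Mrs. Jane",female,28,1,0,T2,70.0,C10,C
3,1,3,"Poe, Miss. Ann",female,,0,0,T3,7.9,,S
4,0,2,"Low, Mr. Sam",male,45,0,0,T4,13.0,,S
"""

# Assumption: these are the columns treated as irrelevant for now.
DROPPED_COLUMNS = ["PassengerId", "Name", "Ticket", "Cabin"]


def prepare_training_data(train):
    """Hypothetical helper covering steps (ii) to (v)."""
    cleaned = train.dropna()                                   # (ii) drop rows with empty values
    y = cleaned["Survived"]                                    # (iii) response variable
    X = cleaned.drop(columns=["Survived"] + DROPPED_COLUMNS)   # (iv) drop unused columns
    X = pd.get_dummies(X)                                      # (v) categorical -> dummy variables
    return X, y


# (i) load the data; with the real download, pass the file path instead.
train = pd.read_csv(io.StringIO(SAMPLE_CSV))
X, y = prepare_training_data(train)
print(X.columns.tolist())
print(len(X), "rows kept out of", len(train))
```

Keep in mind that dropna() removes a row if any column is empty. Cabin is empty for most passengers in the real file, so this simple approach throws away a large share of the training rows.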
Create the Model and Train
To uncover the relationship between the Survival variable and other variables (or features if you will), you need to select a statistical machine learning model and train your model with the processed data.
Scikit-learn provides several algorithms for this. We will select the DecisionTreeClassifier, which is a basic but powerful algorithm for machine learning. And get this: we will only need 3 lines of code to reveal the hidden relationship between survival (denoted as y) and the selected explanatory variables (denoted as X).
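The three lines boil down to import, create, and fit. The sketch below shows them on a tiny, invented stand-in for X and y so that it runs by itself. The max_depth=3 and random_state=0 settings are my assumptions, since the text does not name any hyperparameters.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Invented, already-preprocessed rows standing in for X and y.
X = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 3],
    "Age": [29.0, 40.0, 8.0, 35.0, 50.0, 22.0],
    "Fare": [80.0, 8.0, 20.0, 7.5, 60.0, 7.9],
    "Sex_female": [1, 0, 1, 0, 1, 0],
    "Sex_male": [0, 1, 0, 1, 0, 1],
})
y = pd.Series([1, 0, 1, 0, 1, 0], name="Survived")

# Assumption: a shallow tree with a fixed seed; both values are illustrative.
dt = DecisionTreeClassifier(max_depth=3, random_state=0)
dt.fit(X, y)

# Accuracy on the training rows themselves (optimistic by construction).
print(dt.score(X, y))
```

A limited depth keeps the tree from memorizing the training rows, which usually matters more on the real data than on a toy sample like this one.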
Make Predictions and Save Your Results
Having revealed the hidden relationship between survival and the selected explanatory variables, we can prepare our testing data for the prediction phase. The Test.csv file is slightly different from the Train.csv file: it does not contain the “Survived” column. This makes sense because if we knew all the answers, we could have just faked our algorithm and submitted the correct answers after writing them by hand (wait! some people somehow have already done that?). Anyway, our testing data needs almost the same kind of cleaning, massaging, prepping, and preprocessing for the prediction phase. We will accomplish this with 5 lines of code:
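One way to do that preparation is sketched below. The "almost" matters: Kaggle expects a prediction for every test passenger, so rows with empty values cannot simply be dropped. Filling them with the column median is my assumption here, as are the helper name prepare_test_data, the dropped columns, and the final reindex step that lines the columns up with the training features. The sample rows are invented.

```python
import io

import pandas as pd

# Invented rows in the layout of the test file (note: no Survived column).
SAMPLE_TEST_CSV = """PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Doe, Mr. John",male,34.0,0,0,T1,7.8,,S
893,1,"Roe, Mrs. Jane",female,,1,0,T2,82.0,C10,S
894,2,"Poe, Mr. Sam",male,62.0,0,0,T3,9.7,,C
"""

# Assumption: the feature columns produced when the training data was prepared.
TRAIN_COLUMNS = ["Pclass", "Age", "SibSp", "Parch", "Fare",
                 "Sex_female", "Sex_male",
                 "Embarked_C", "Embarked_Q", "Embarked_S"]


def prepare_test_data(test, train_columns):
    """Hypothetical helper: returns the saved ids and the model-ready features."""
    ids = test[["PassengerId"]]                                  # keep ids for the submission
    features = test.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
    # Fill empty numeric cells with the column median instead of dropping rows.
    features = features.fillna(features.median(numeric_only=True))
    features = pd.get_dummies(features)
    # Same columns, same order as in training; absent dummies become 0.
    features = features.reindex(columns=train_columns, fill_value=0)
    return ids, features


test = pd.read_csv(io.StringIO(SAMPLE_TEST_CSV))
ids, features = prepare_test_data(test, TRAIN_COLUMNS)
print(features)
```

The reindex call is a safeguard: if a category never shows up in the test rows (Embarked "Q" in this sample), its dummy column is still created, so the model sees the layout it was trained on.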
Now our test data is clean and prepared for prediction. Finally, make the predictions for the given test file and keep them in memory:
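The prediction itself is a single predict call on the trained tree. To keep the sketch below runnable on its own, it first trains a tree on a few invented rows. The variable names and hyperparameters are assumptions for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Invented training rows, only here so the example is self-contained.
train_X = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Sex_female": [1, 0, 1, 0],
    "Sex_male": [0, 1, 0, 1],
})
train_y = pd.Series([1, 0, 1, 0], name="Survived")
dt = DecisionTreeClassifier(max_depth=3, random_state=0).fit(train_X, train_y)

# Invented test rows with the same columns, in the same order, as train_X.
test_X = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex_female": [0, 1],
    "Sex_male": [1, 0],
})

# One 0/1 prediction per test row, returned as a NumPy array.
predictions = dt.predict(test_X)
print(predictions)
```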
So easy, right! Before saving these predictions, we need to put them into the proper structure so that Kaggle can automatically score them. Remember that we saved the PassengerId column to memory as a separate dataset (DataFrame, if you will)? Now we will assign (or attach) the predictions to the PassengerIds (note that they are both single-column datasets). Finally, we will take the data from memory and save it in the CSV (comma-separated values) format required by Kaggle.
Now you can visit Kaggle’s Titanic competition page, and after logging in, you can upload your submission file.
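A sketch of that last step, using invented ids and predictions: the submission needs exactly two columns, PassengerId and Survived, and no index column. The example writes to an in-memory buffer so you can see the result. With real data you would pass a file name of your choice to to_csv instead.

```python
import io

import numpy as np
import pandas as pd

# Invented stand-ins for the saved PassengerId column and the model output.
ids = pd.DataFrame({"PassengerId": [892, 893, 894]})
predictions = np.array([0, 1, 0])

# Attach the predictions to the ids as a new "Survived" column.
results = ids.assign(Survived=predictions)

# index=False keeps pandas from writing its row index as an extra column.
buffer = io.StringIO()
results.to_csv(buffer, index=False)
print(buffer.getvalue())
```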
Will You Make It to the Top?
Definitely not! We tried to implement a simple machine learning algorithm enabling you to enter a Kaggle competition. As you improve this basic code, you will be able to rank better in the following submissions.