
Predict Tomorrow’s Bitcoin (BTC) Price with Recurrent Neural Networks


Using Recurrent Neural Networks to Predict the Next Day's Cryptocurrency Prices with TensorFlow and Keras | Supervised Deep Learning

If you are reading this article, I am sure that we share similar interests and are/will be in similar industries. So let’s connect via Linkedin! Please do not hesitate to send a contact request! Orhan G. Yalçın — Linkedin

  

Photo by Andre Francois on Unsplash

Wouldn't it be awesome if you could, somehow, predict tomorrow's Bitcoin (BTC) price? As you all know, the cryptocurrency market has experienced tremendous volatility over the last year. The value of Bitcoin reached its peak on December 16, 2017, climbing to nearly $20,000, and then saw a steep decline at the beginning of 2018. Just a year ago, though, its value was almost half of what it is today, so if we look at the yearly BTC price chart, the price is still high. Even more striking, only two years ago BTC was worth just one-tenth of its current value. You may personally explore the historical BTC prices using the plot below:

Historical Bitcoin (BTC) Prices by CoinDesk

There are several theories about the precise reasons behind this volatility, and they are often used to support predictions of crypto prices, particularly of BTC. These subjective arguments can be valuable for predicting the future of cryptocurrencies. Our methodology, on the other hand, evaluates historical data from an algorithmic-trading perspective: we plan to use numerical historical data to train a recurrent neural network (RNN) to predict BTC prices.

Obtaining the Historical Bitcoin Prices

There are quite a few resources we may use to obtain historical Bitcoin price data. While some of these resources let users download CSV files manually, others provide an API that can be hooked up to your code. Since we would like a model trained on time series data to make up-to-date predictions, I prefer to use an API so that we obtain the latest figures whenever we run the program. After a quick search, I decided to use CoinRanking.com's API, which provides up-to-date coin prices that we can use on any platform.

Recurrent Neural Networks

Since we are working with a time series dataset, a plain feedforward neural network is not a good fit: tomorrow's BTC price is most strongly correlated with today's price, not with the price from a month ago, and we want a model that can exploit this ordering.

A recurrent neural network (RNN) is a class of artificial neural network where connections between nodes form a directed graph along a sequence. — Wikipedia

An RNN shows temporal dynamic behavior for a time sequence, and it can use its internal state to process sequences. In practice, this is achieved with LSTM and GRU layers.

Here you can see the difference between a regular feedforward-only neural network and a recurrent neural network (RNN):

 RNN vs. Regular Nets by Niklas Donges on TowardsDataScience
 

Our Roadmap

To be able to create a program that trains on historical BTC prices and predicts tomorrow's BTC price, we need to complete the following tasks:

1 — Obtaining, Cleaning, and Normalizing the Historical BTC Prices

2 — Building an RNN with LSTM

3 — Training the RNN and Saving The Trained Model

4 — Predicting Tomorrow’s BTC Price and “Deserialize” It

BONUS: Deserializing the X_Test Predictions and Creating a Plot.ly Chart

Obtaining, Cleaning, and Normalizing the Historical BTC Prices

Obtaining the BTC Data

As I mentioned above, we will use CoinRanking.com’s API for the BTC dataset and convert it into a Pandas dataframe with the following code:

Obtaining the BTC Prices with CoinRanking API
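In case you want to reproduce this step, here is a minimal sketch of such a download function. The helper name hist_price_dl, the endpoint path, and the JSON field names are assumptions, so check the current CoinRanking API documentation and adjust them as needed:

```python
import requests
import pandas as pd

def hist_price_dl(coin_id=1, timeframe="5y", currency="USD"):
    """Download historical prices from the CoinRanking API and return a DataFrame.

    The endpoint path and response fields below are assumptions; adjust them to
    the current CoinRanking API documentation.
    """
    url = f"https://api.coinranking.com/v1/public/coin/{coin_id}/history/{timeframe}"
    raw = requests.get(url, params={"base": currency}).json()
    history = raw["data"]["history"]               # assumed: list of {"price", "timestamp"}
    df = pd.DataFrame(history)
    df["price"] = df["price"].astype(float)
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")  # assumed: ms timestamps
    return df

df = hist_price_dl()  # defaults to five years of BTC/USD
```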

This function defaults to five years of BTC/USD prices. However, you may always change these values by passing in different parameters.

Cleaning the Data with Custom Functions

After obtaining the data and converting it to a pandas dataframe, we may define custom functions to clean our data, normalize it for the neural network (a must for accurate results), and apply a custom train-test split. We use a custom split function instead of scikit-learn's because we need to keep the time series in order to train our RNN properly. We may achieve this with the following code, and you may find further explanations in the code snippet below:

Defining custom functions for matrix creation, normalizing, and train-test split
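A minimal sketch of such helpers is shown below; the function and parameter names are placeholders, and each price window is normalized relative to its first value so that every window starts at zero:

```python
import numpy as np

def price_matrix_creator(data, seq_len=30):
    """Slice the price series into overlapping windows of seq_len + 1 values."""
    return [data[i: i + seq_len + 1] for i in range(len(data) - seq_len - 1)]

def normalize_windows(window_data):
    """Normalize each window relative to its first value: p -> p / p0 - 1."""
    return [[(float(p) / float(window[0])) - 1 for p in window] for window in window_data]

def train_test_split_(price_matrix, train_size=0.9, shuffle=False):
    """Chronological train-test split: the most recent windows become the test set."""
    price_matrix = np.array(price_matrix)
    row = int(round(train_size * len(price_matrix)))
    train, test = price_matrix[:row, :], price_matrix[row:, :]
    if shuffle:
        np.random.shuffle(train)               # never shuffle the test set
    X_train, y_train = train[:, :-1], train[:, -1]
    X_test, y_test = test[:, :-1], test[:, -1]
    # LSTM layers expect 3D input: (samples, timesteps, features)
    X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
    X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)
    return row, X_train, y_train, X_test, y_test
```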

After defining these functions, we may call them with the following code:

Calling the defined functions for data cleaning, preparation, and splitting
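Assuming the dataframe and helper names from the sketches above, the calls could look like this:

```python
ser = df['price'].values.tolist()                 # raw price series from the API dataframe
price_matrix = price_matrix_creator(ser)          # 31-value windows (30 inputs + 1 target)
price_matrix = normalize_windows(price_matrix)    # normalize each window
row, X_train, y_train, X_test, y_test = train_test_split_(price_matrix)
```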

Building an RNN with LSTM

After preparing our data, it is time to build the model that we will later train using the cleaned and normalized data. We will start by importing our Keras components and setting some parameters with the following code:

Setting the RNN Parameters in Advance
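For example (the exact hyperparameter values in the original post may differ):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Activation

batch_size = 2      # small dataset, so small batches
epochs = 15
units = 50          # LSTM units per layer
output_size = 1     # we predict a single value: the next day's price
```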

Then, we will create our Sequential model with two LSTM and two Dense layers with the following code:

Creating a Sequential Model and Filling it with LSTM and Dense Layers
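A sketch of such a model, with two LSTM layers followed by two Dense layers, could look like this:

```python
model = Sequential()
model.add(LSTM(units, return_sequences=True, input_shape=(None, 1)))
model.add(Dropout(0.2))
model.add(LSTM(units, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(16, activation='relu'))      # small intermediate Dense layer
model.add(Dense(output_size))
model.add(Activation('linear'))              # regression output, no squashing
model.compile(optimizer='adam', loss='mean_squared_error')
model.summary()
```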

Training the RNN and Saving The Trained Model

Now it is time to train our model with the cleaned data. You can also measure the time spent during training. Follow this code:

Training the RNN Model using the Prepared Data
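For instance:

```python
import time

start_time = time.time()
model.fit(X_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.05)
print(f"Training took {time.time() - start_time:.1f} seconds")
```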

Don’t forget to save it:

Saving the Trained Model
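Saving is a one-liner; the file name is arbitrary:

```python
# Stores the architecture, weights, and optimizer state in a single HDF5 file
model.save('coin_predictor.h5')
```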

I am keen to save the model and load it later because it is quite satisfying to know that you can actually save a trained model and re-load it to use it next time. This is basically the first step for web or mobile integrated machine learning applications.

Predicting Tomorrow’s BTC Price and “Deserialize” It

After we train the model, we need to obtain the current data for predictions, and since we normalized our data, the predictions will also be normalized. Therefore, we need to de-normalize them back to their original values. First, we will obtain the data in a similar, though slightly different, manner with the following code:

Loading the last 30 days’ BTC Prices

We will only have the normalized data for prediction: No train-test split. We will also reshape the data manually to be able to use it in our saved model.
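A sketch of this step, reusing the hypothetical hist_price_dl helper from above (the "30d" timeframe value is an assumption about the API):

```python
import numpy as np

df_recent = hist_price_dl(timeframe="30d")        # assumed timeframe parameter
ser = df_recent['price'].values.tolist()[-30:]    # the last 30 prices

# Normalize the window exactly as in training and keep its first value around
# so we can de-normalize the prediction later
first_price = ser[0]
X_recent = np.array([[(float(p) / first_price) - 1 for p in ser]])
X_recent = X_recent.reshape(X_recent.shape[0], X_recent.shape[1], 1)
```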

After cleaning and preparing our data, we will load the trained RNN model for prediction and predict tomorrow’s price.

Loading the Trained Model and Making the Prediction
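For example:

```python
from tensorflow.keras.models import load_model

model = load_model('coin_predictor.h5')
normalized_prediction = model.predict(X_recent)   # shape (1, 1), still normalized
```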

However, our results will vary between -1 and 1, which will not make a lot of sense. Therefore, we need to de-normalize them back to their original values. We can achieve this with a custom function:

We need a deserializer for the Original BTC Prediction Value in USD

After defining the custom function, we will call it and extract tomorrow's BTC price with the following code:

Calling the deserializer and extracting the Price in USD
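Given the window normalization used above (p / p0 − 1), the "deserializer" simply inverts that transformation; a sketch of the function and the call:

```python
def deserializer(normalized_value, first_price):
    """Invert the window normalization: map (p / p0 - 1) back to a USD price."""
    return (float(normalized_value) + 1) * float(first_price)

predicted_price = deserializer(normalized_prediction[0][0], first_price)
print(f"Predicted BTC price for tomorrow: ${predicted_price:,.2f}")
```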

With the code above, you can actually get the model’s prediction for tomorrow’s BTC prices.

Deserializing the X_Test Predictions and Creating a Plot.ly Chart

You may also be interested in the overall performance of the RNN model and prefer to see it as a chart. We can achieve this by using the X_test data from the training part of the tutorial.

We will start by loading our model (consider this an alternative to the single-prediction case) and making predictions on the X_test data, so that we have a proper number of days to plot, with the following code:

Loading the Trained Model and Making Prediction Using the X_test Values
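Assuming the X_test array created earlier, this step is just a reload and a batch prediction:

```python
from tensorflow.keras.models import load_model

model = load_model('coin_predictor.h5')
preds = model.predict(X_test, batch_size=2)   # normalized predictions for the test windows
```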

Next, we will import Plotly and set the properties for a good plotting experience. We will achieve this with the following code:

Importing Plotly and Setting the Parameters
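A minimal setup for plotting inside a notebook could be:

```python
import plotly.offline as pyo
import plotly.graph_objs as go

pyo.init_notebook_mode(connected=True)   # render charts inline in the notebook
```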

After setting all the properties, we can finally plot our predictions and observation values with the following code:

Creating a Dataframe and Using it in Plotly’s iPlot
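A sketch of this final step, reusing the names from the earlier snippets (ser, row, preds, y_test); each test prediction and observation is de-normalized with the first price of its own window:

```python
import pandas as pd

pred_prices = [(float(p) + 1) * ser[row + i] for i, p in enumerate(preds.flatten())]
real_prices = [(float(y) + 1) * ser[row + i] for i, y in enumerate(y_test)]
plot_df = pd.DataFrame({'Observed': real_prices, 'Predicted': pred_prices})

traces = [go.Scatter(y=plot_df['Observed'], name='Observed BTC Price'),
          go.Scatter(y=plot_df['Predicted'], name='Predicted BTC Price')]
layout = go.Layout(title='BTC Price: Predictions vs. Observations',
                   xaxis=dict(title='Test-set day'), yaxis=dict(title='USD'))
pyo.iplot(go.Figure(data=traces, layout=layout))
```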

When you run this code, you will come up with the up-to-date version of the following plot:

 

 

Plot.ly Chart for BTC Price Predictions

How Reliable Are These Results?

As you can see, it does not look bad at all. However, you need to know that even though the patterns match pretty closely, the results are still dangerously far apart if you inspect them on a day-to-day basis. Therefore, the code must be developed further to get better results.

Congratulations

You have successfully created and trained an RNN model that can predict BTC prices, and you even saved the trained model for later use. You may use this trained model on a web or mobile application by switching to Object-Oriented Programming. Pat yourself on the back for successfully developing a model relevant to artificial intelligence, blockchain, and finance. I think it sounds pretty cool to touch these areas all at once with this simple project.

Kaggle’s Titanic Competition in 10 Minutes | Part-III


Using Natural Language Processing (NLP), Deep Learning, and GridSearchCV in Kaggle’s Titanic Competition | Machine Learning Tutorials

Figure 1. Titanic Under Construction on Unsplash

If you followed my tutorial series on Kaggle's Titanic Competition (Part-I and Part-II) or have already participated in the Competition, you are familiar with the whole story. If you are not familiar with it, since this is a follow-up tutorial, I strongly recommend checking out the Competition Page or Part-I and Part-II of this tutorial series. In Part-III (Final) of the series, (i) we will use natural language processing (NLP) techniques to obtain the titles of the passengers, (ii) create an Artificial Neural Network (ANN or RegularNet) to train the model, and (iii) use Grid Search Cross-Validation to tune the ANN so that we get the best results.

Let’s start!

Background

Part-I of the Tutorial

Throughout this tutorial series, we try to keep things simple and develop the story slowly and clearly. In Part-I of the tutorial, we wrote a Python program with fewer than 20 lines to enter Kaggle's competition. Things were kept as simple as possible: we cleaned the non-numerical parts, took care of the null values, trained our model using the train.csv file, predicted the passengers' survival in the test.csv file, and saved the predictions as a CSV file for submission.

Part-II of the Tutorial

Since we did not explore the dataset properly in Part-I, we focused on data exploration in Part-II using Matplotlib and Seaborn. We imputed the null values instead of dropping them by using aggregation functions, cleaned the data more thoroughly, and generated dummy variables from the categorical variables. Then, we used a GradientBoostingClassifier model instead of the Part-I DecisionTreeClassifier, which improved accuracy by roughly 15–20% compared to the model in Part-I.

Part-III of the Tutorial

Figure 2. A Diagram of an Artificial Neural Network with one Hidden Layer (Figure by Author)

We will now use the Name column to derive the passengers' titles, which played a significant role in their survival chances. We will also create an Artificial Neural Network (ANN or RegularNet) with Keras to obtain better results, and then tune the ANN with GridSearchCV to detect the best parameters. Finally, we will generate a new CSV file for submission.

Preparing the Dataset

As in Part-I and Part-II, we will start by cleaning the data and imputing the null values. This time, we will adopt a different approach and combine the two datasets for cleaning and imputing. We already covered why we impute the null values the way we do in Part-II; therefore, we will give you the code straight away. If some operations do not make sense, you may refer to Part-II or comment below. In addition, since we saw in Part-II that people younger than 18 had a greater chance of survival, we will add a new feature to capture this effect.

Data Cleaning and Null Value Imputation
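A simplified sketch of this step is given below; the exact imputation rules follow the spirit of Part-II (medians and modes), and the variable names are assumptions:

```python
import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Combine the two datasets so cleaning and imputation are done only once
df = pd.concat([train.drop('Survived', axis=1), test], ignore_index=True)
survived = train['Survived']

# Impute the nulls (a simplified version of the Part-II approach)
df['Age'] = df['Age'].fillna(df.groupby('Pclass')['Age'].transform('median'))
df['Fare'] = df['Fare'].fillna(df['Fare'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df = df.drop('Cabin', axis=1)                 # too many missing values to impute

# New feature: passengers younger than 18 had a better chance of survival
df['Minor'] = (df['Age'] < 18).astype(int)
```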

Deriving Passenger Titles with NLP

We will drop the unnecessary columns and generate the dummy variables from the categorical variables above. But first, we need to extract the titles from the Name column. To understand what we are doing, we will start by running the following code to get the Name column values of the first 10 rows.

Name Column Values of the First 10 Rows
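Continuing with the combined dataframe df from above:

```python
print(df['Name'].head(10))
```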

And here is what we get:

Figure 3. Name Column Values of the First 10 Row (Figure by Author)

The structure of the name column value is as follows:

<Last-Name>,<Title>.<First-Name>

Therefore, we need to split these strings on the comma and the dot and extract the title. We can accomplish this with the following code:

Splitting the Name Values to Extract Titles
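For instance:

```python
# <Last-Name>, <Title>. <First-Name>  ->  keep the part between the comma and the dot
df['Title'] = df['Name'].str.split(',').str[1].str.split('.').str[0].str.strip()
```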

Once we run this code, we will have a Title column with the titles in it. To see what kind of titles we have, we will run this:

Grouping Titles and Get the Counts
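For example:

```python
print(df['Title'].value_counts())
```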

Figure 4. Unique Title Counts (Figure by Author)

It seems that we have four major groups, 'Mr', 'Mrs', 'Miss', and 'Master', plus a number of rarer titles. However, before grouping all the other titles as Others, we need to take care of the French titles. We will convert them to their corresponding English titles with the following code:

French to English Title Converter
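A sketch of the mapping (the original post may map a few more titles):

```python
# Map French titles to their English equivalents
french_to_english = {'Mlle': 'Miss', 'Mme': 'Mrs', 'Ms': 'Miss'}
df['Title'] = df['Title'].replace(french_to_english)
```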

Now, apart from the major groups, we are mostly left with officer and royal titles. It makes sense to combine them as Others. We can achieve this with the following code:

Combining All the Non-Major Titles as Others (Contains Officer and Royal Titles)
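For example:

```python
major_titles = ['Mr', 'Mrs', 'Miss', 'Master']
df['Title'] = df['Title'].apply(lambda t: t if t in major_titles else 'Others')
print(df['Title'].value_counts())
```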

Figure 5. Final Unique Title Counts (Figure by Author)

Final Touch on Data Preparation

Now that our Titles are more manageable, we can create dummies and drop the unnecessary columns with the following code:

Final Touch on Data Preparation
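A sketch of this step; with the columns used in this sketch, one-hot encoding with drop_first happens to yield 14 feature columns, matching the input dimension used below:

```python
# Drop the columns we will not feed into the network and one-hot encode the rest
df = df.drop(['Name', 'Ticket', 'PassengerId'], axis=1)
df = pd.get_dummies(df, columns=['Sex', 'Embarked', 'Pclass', 'Title'], drop_first=True)

# Split back into the original train and test parts
X = df.iloc[:len(survived)]
X_submission = df.iloc[len(survived):]
y = survived
```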

Creating an Artificial Neural Network for Training

Figure 6. A Diagram of an Artificial Neural Network with Two Hidden Layers (Figure by Author)

Standardizing Our Data with Standard Scaler

To get a good result, we must scale our data by using scikit-learn's StandardScaler. StandardScaler standardizes features by removing the mean and scaling to unit variance (i.e., standardization), which is different from min-max normalization. The mathematical difference between standardization and normalization is as follows:

Figure 7. Standardization vs. Normalization (Figure by Author)
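For reference, these are the standard definitions: standardization maps x to (x − mean) / standard deviation, while min-max normalization maps x to (x − min) / (max − min).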

We will choose StandardScaler() for scaling our dataset and run the following code:

Scaling Train and Test Datasets
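For example, scaling both the training features and the submission features with the same fitted scaler:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)                     # features from train.csv
X_submission_scaled = scaler.transform(X_submission)   # features from test.csv
```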

Building the ANN Model

After standardizing our data, we can start building our artificial neural network. We will create one Input Layer (Dense), one Hidden Layer (Dense), and one Output Layer (Dense). After each layer until the Output Layer, we will apply a Dropout of 0.2 for regularization to fight overfitting. Finally, we will wrap the model with KerasClassifier so that we can apply GridSearchCV to this neural network. As we have 14 explanatory variables, our input dimension must be 14. Since we are performing binary classification, the final output layer must output a single value for the Survived or Not-Survived classification. The other unit counts in between are "try-and-see" values, and we selected 128 neurons.

Building an ANN with Keras Classifier
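A sketch of such a wrapped model (the KerasClassifier wrapper shown here ships with older TensorFlow versions; newer setups use the scikeras package instead):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier  # scikeras in newer setups

def build_ann(optimizer='adam'):
    model = Sequential()
    model.add(Dense(128, activation='relu', input_dim=14))   # input layer
    model.add(Dropout(0.2))
    model.add(Dense(128, activation='relu'))                 # hidden layer
    model.add(Dropout(0.2))
    model.add(Dense(1, activation='sigmoid'))                # output: survival probability
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    return model

classifier = KerasClassifier(build_fn=build_ann)
```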

Grid Search Cross-Validation

After building the ANN, we will use scikit-learn GridSearchCV to find the best parameters and tune our ANN to get the best results. We will try different optimizers, epochs, and batch_sizes with the following code.

Grid Search with Keras Classifier
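The parameter grid below is illustrative; the values searched in the original post may differ:

```python
from sklearn.model_selection import GridSearchCV

param_grid = {'optimizer': ['adam', 'rmsprop'],
              'epochs': [50, 100],
              'batch_size': [16, 32]}

grid = GridSearchCV(estimator=classifier, param_grid=param_grid)
grid_result = grid.fit(X_scaled, y)

print('Best parameters:', grid_result.best_params_)
print('Best accuracy:', grid_result.best_score_)
```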

After running this code and printing out the best parameters, we get the following output:

Figure 8. Best Parameters and the Accuracy

Please note that we did not tune the cross-validation setting in GridSearchCV. If you would like to control the cross-validation behavior, set a cv value inside GridSearchCV (e.g., cv=5).

Fitting the Model with Best Parameters

Now that we found the best parameters, we can re-create our classifier with the best parameter values and fit our training dataset with the following code:

Fitting with the Best Parameters
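For example, reusing the build_ann function and the grid results from above:

```python
best = grid_result.best_params_
classifier = KerasClassifier(build_fn=build_ann,
                             optimizer=best['optimizer'],
                             epochs=best['epochs'],
                             batch_size=best['batch_size'])
classifier.fit(X_scaled, y)

# Survival probabilities for the submission set (second column of predict_proba)
probabilities = classifier.predict_proba(X_submission_scaled)[:, 1]
```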

Once we obtain the predictions, we can conduct the final operations to make them ready for submission. One thing to note is that our ANN gives us probabilities of survival, which are continuous numerical values, whereas we need a binary categorical variable. Therefore, we use the lambda function below to convert the continuous values to binary values (0 or 1) and write the results to a CSV file.

Creating the Submission File
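A sketch of the final step:

```python
import pandas as pd

submission = pd.read_csv('test.csv')[['PassengerId']].copy()
submission['Survived'] = pd.Series(probabilities).apply(lambda p: 1 if p >= 0.5 else 0)
submission.to_csv('submission.csv', index=False)
```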

Congratulations

Figure 9. Deep Learning vs. Older Algorithms (Figure by Author)

You have created an artificial neural network to classify the survival of Titanic passengers. Neural networks tend to outperform other machine learning algorithms as long as there is a large volume of data. Since our dataset consists of only 1,309 rows, some machine learning algorithms, such as a well-tuned Gradient Boosting Tree or Random Forest, may outperform neural networks. For datasets with large volumes, however, this will not be the case, as you may see in the chart above (Figure 9).

I would say that the Titanic dataset may sit on the left side of the intersection point, where the older algorithms outperform deep learning algorithms. However, we still achieve an accuracy rate higher than 80%, which is around the natural accuracy level of this dataset.

Kaggle’s Titanic Competition in 10 Minutes | Part-II


Improving Our Code to Obtain Better Results for Kaggle’s Titanic Competition with Data Analysis & Visualization and Gradient Boosting Algorithm

In Part-I of this tutorial, we developed a small python program with less than 20 lines that allowed us to enter the first Kaggle competition.

However, this model did not perform very well since we did not do much data exploration and preparation to understand the data and structure the model better. In Part-II of the tutorial, we will explore the dataset using Seaborn and Matplotlib. Besides, new concepts will be introduced and applied for a better-performing model. Finally, we will increase our ranking in the second submission.

 Figure 1. Sea Trials of RMS Titanic on Wikipedia

Using Jupyter or Google Colab Notebook

For your programming environment, you may choose one of these two options: Jupyter Notebook and Google Colab Notebook:

Jupyter Notebook

As mentioned in Part-I, you need to install Python on your system to run any Python code. Also, you need to install libraries such as Numpy, Pandas, Matplotlib, and Seaborn, and you need an IDE (text editor) to write your code. You may use your IDE of choice, of course; however, I strongly recommend installing Jupyter Notebook with the Anaconda Distribution. Jupyter Notebook utilizes IPython, which provides an interactive shell and a lot of convenience for testing your code. You should definitely check it out if you are not already using it.

Google Colab Notebook

Google Colab is built on top of the Jupyter Notebook and gives you cloud computing capabilities. Instead of completing all the steps above, you can create a Google Colab notebook, which comes with the libraries pre-installed. So, it is much more streamlined. I recommend Google Colab over Jupyter, but in the end, it is up to you.


Exploring Our Data

To be able to create a good model, we first need to explore our data. Seaborn, a statistical data visualization library, comes in pretty handy. First, let's remember what our dataset looks like:

 Table 1. Top 5 Rows of our Training Data (Table by Author)

and this is the explanation of the variables you see above:

 Table 2. Explanation of the Variables (Table by Author)

So, now it is time to explore some of these variables’ effects on survival probability!

Our first suspicion is that there is a correlation between a person’s gender (male-female) and his/her survival probability. To be able to understand this relationship, we create a bar plot of the males & females categories against survived & not-survived labels:

Figure 2. Survival Counts of Males and Females (Figure by Author)

As you can see in the plot, females had a greater chance of survival compared to males. Therefore, gender must be an explanatory variable in our model.

Secondly, we suspect that there is a correlation between the passenger class and survival rate as well. When we plot Pclass against Survival, we obtain the plot below:

Figure 3. Survival Counts of Different Passenger Classes (Figure by Author)

Just as we suspected, passenger class has a significant influence on one's survival chance. It seems that passengers traveling in third class had a much lower chance of survival. Therefore, Pclass is definitely explanatory for survival probability.

Thirdly, we also suspect that the number of siblings aboard (SibSp) and the number of parents aboard (Parch) are also significant in explaining the survival chance. Therefore, we need to plot SibSp and Parch variables against Survival, and we obtain this:

Figure 4. Survival Counts Based on Siblings and Parents on Board (Figure by Author)

So, we reach this conclusion: As the number of siblings on board or number of parents on board increases, the chances of survival increase. In other words, people traveling with their families had a higher chance of survival.

Another potential explanatory variable (feature) of our model is the Embarked variable. When we plot Embarked against the Survival, we obtain this outcome:

 Figure 5. Survival Counts Based on the Port of Embarkation (Figure by Author)

It is clearly visible that people who embarked at Southampton were less fortunate compared to the others. Therefore, we will also include this variable in our model.

So far, we checked 5 categorical variables (Sex, Pclass, SibSp, Parch, Embarked), and it seems that they all played a role in a person's survival chance.


Now it is time to work on our numerical variables Fare and Age. First of all, we would like to see the effect of Age on Survival chance. Therefore, we plot the Age variable (seaborn.distplot):

Figure 6. Survivals Plotted Against Age (Figure by Author)

We can see that the survival rate is higher for children below 18, while for people above 18 and below 35, this rate is low. Age plays a role in Survival.

Finally, we need to see whether the Fare helps explain the Survival probability. Therefore, we plot the Fare variable (seaborn.distplot):

 Figure 7. Survivals Plotted Against Fare (Figure by Author)

In general, we can see that as the Fare paid by the passenger increases, the chance of survival increases, as we expected.


We will ignore three columns, Name, Cabin, and Ticket, since we would need more advanced techniques to include these variables in our model. To give an idea of how to extract features from them: you can tokenize the passengers' names and derive their titles. Apart from titles like Mr. and Mrs., you will find other titles such as Master or Lady, and this surely played a role in who was saved that night. Therefore, you can take advantage of the Name column as well as the Cabin and Ticket columns.

Checking the Data for Null Values

Null values are our enemies! In the Titanic dataset, we have some missing values. First of all, we will combine the two datasets after dropping the training dataset’s Survived column.

We need to get information about the null values! There are two ways to accomplish this: .info() function and heatmaps (way cooler!). To be able to detect the nulls, we can use seaborn’s heatmap with the following code:
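A sketch of the heatmap call (assuming the train and test dataframes are read as shown in the next section):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Train set without the Survived column, stacked on top of the test set
combined = pd.concat([train.drop('Survived', axis=1), test], ignore_index=True)

# Cells with missing values light up in a contrasting color
sns.heatmap(combined.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.show()
```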

Here is the outcome. Yellow lines are the missing values.

 Figure 8. Heatmap of the Null Values (Figure by Author)

There are a lot of missing Age and Cabin values. Two values are missing in the Embarked column while one is missing in the Fare column. Let’s take care of these first. Alternatively, we can use the .info() function to receive the same information in text form:

 

 Figure 9. Null Value Information on Combined Titanic Data (Figure by Author)

Reading the Datasets

We will not get into the details of the dataset since it was covered in Part-I. Using the code below, we can import Pandas & Numpy libraries and read the train & test CSV files.
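For example:

```python
import numpy as np
import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
```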

As we know from the above, we have null values in both train and test sets. We need to impute these null values and prepare the datasets for the model fitting and prediction separately.

Imputing Null Values

There are two main approaches to solve the missing value problem in datasets: drop or fill. Dropping is the easy and naive way out, although it sometimes actually performs better. In our case, we will fill the missing values unless we decide to drop a whole column altogether.

The initial look of our dataset is as follows:

 Table 3. Initial Look of the Train Dataset (Table by Author)

We will make several imputations and transformations to get a fully numerical, clean dataset that we can fit the machine learning model on, with the following code:

Python Code to Clean Train Dataset
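A simplified sketch of such a cleaning function; the exact imputation choices (medians, modes, dropped columns) are assumptions in the spirit of the text above:

```python
def clean(df):
    df = df.copy()
    # Impute: Age by class median, Fare by overall median, Embarked by mode
    df['Age'] = df['Age'].fillna(df.groupby('Pclass')['Age'].transform('median'))
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    # Drop the columns we are not using in this part
    df = df.drop(['Cabin', 'Name', 'Ticket'], axis=1)
    # Dummy variables for the categorical columns
    return pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)

train_clean = clean(train)
```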

After running this code on the train dataset, we get this:

 Table 4. Clean Version of the Train Dataset (Table by Author)

There are no null values, no strings, or categories that would get in our way. Now, we can split the data into two, Features (X or explanatory variables) and Label (Y or response variable), and then we can use the sklearn’s train_test_split() function to make the train test splits inside the train dataset.
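For example (the split ratio is illustrative):

```python
from sklearn.model_selection import train_test_split

X = train_clean.drop(['Survived', 'PassengerId'], axis=1)
y = train_clean['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```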

Note: We have another dataset called test. This isn’t very clear due to the naming made by Kaggle. We are training and testing our model using the train dataset by splitting it into X_train, X_test, y_train, y_test DataFrames, and then applying the trained model on our test dataset to generate a predictions file.

Creating a Gradient Boosting Model and Train

 Figure 10. A Visualization of Gradient Boosting Algorithm (Figure by Author)

In Part-I, we used a basic Decision Tree model as our machine learning algorithm. Another well-known machine learning algorithm is Gradient Boosting Classifier, and since it usually outperforms Decision Tree, we will use Gradient Boosting Classifier in this tutorial. The code shared below allows us to import the Gradient Boosting Classifier algorithm, create a model based on it, fit and train the model using X_train and y_train DataFrames, and finally make predictions on X_test.
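A sketch of this step with default hyperparameters:

```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```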

Now, we have the predictions, and we also know the answers since X_test is split from the train dataframe. To be able to measure our success, we can use the confusion matrix and classification report. You can achieve this by running the code below:
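For example:

```python
from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
```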

And this is the output:

 

Figure 11. Confusion Matrix and Classification Report on Our Results (Figure by Author)

We obtain about 82% accuracy, which may be considered pretty good, although there is still room for improvement.

Create the Prediction File for the Kaggle Competition

Now, we have a trained and working model that we can use to predict the passenger’s survival probabilities in the test.csv file.

First, we will clean and prepare the data with the following code (quite similar to how we cleaned the training dataset). Just note that we save the PassengerId column as a separate dataframe, named 'ids', before removing it.
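A sketch of this step, reusing the clean() helper from above:

```python
ids = test[['PassengerId']]                 # kept aside for the submission file
test_clean = clean(test)                    # same cleaning as for the train set
X_submission = test_clean.drop('PassengerId', axis=1)
```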

Finally, we can predict the Survival values of the test dataframe and write to a CSV file as required with the following code.
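For example:

```python
results = ids.copy()
results['Survived'] = model.predict(X_submission)
results.to_csv('submission.csv', index=False)
```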

There you have a new and better model for the Kaggle competition. We made several improvements in our code, which increased the accuracy by around 15–20%, a good improvement. As I mentioned above, there is still some room for improvement, and the accuracy could increase to around 85–86%. However, the leaderboard scores are not very reliable, in my opinion, since many people use dishonest techniques to increase their ranking.

Part III of This Mini-Series

In Part III, we will use more advanced techniques such as Natural Language Processing (NLP), Deep Learning, and GridSearchCV to increase our accuracy in Kaggle’s Titanic Competition.


Kaggle’s Titanic Competition in 10 Minutes | Part-I


Complete Your First Kaggle Competition in Less Than 20 Lines of Code with Decision Tree Classifier | Machine Learning Tutorials


  Photo by Markus Spiske on Unsplash

If you are interested in machine learning, you have probably heard of Kaggle. Kaggle is a platform where you can learn a lot about machine learning with Python and R, do data science projects, and (this is the most fun part) join machine learning competitions. Competitions are changed and updated over time. Currently, "Titanic: Machine Learning from Disaster" is "the beginner's competition" on the platform. In this post, we will create a ready-to-upload submission file with less than 20 lines of Python code. To be able to do this, we will use the Pandas and Scikit-Learn libraries.

The RMS Titanic and the Infamous Accident

RMS Titanic was the largest ship afloat when it entered service, and it sank after colliding with an iceberg during its maiden voyage to the United States on 15 April 1912. There were 2,224 passengers and crew aboard, and unfortunately, 1,502 of them died. It was one of the deadliest commercial peacetime maritime disasters of the 20th century.

 Figure 1. A Greyscale Photo of Titanic RMS on Wikipedia

One of the main reasons for such a high number of casualties was the lack of sufficient lifeboats for the passengers and the crew. Although luck played a part in surviving the accident, some people, such as women, children, and upper-class passengers, were more likely to survive than the rest. We will calculate the effect of having particular features on the likelihood of surviving, and we will accomplish this in less than 20 lines of code and have a file ready for submission. … Let's Get Started!

Download the Data

The Titanic dataset is an open dataset that you can obtain from many different repositories and GitHub accounts. However, downloading it from Kaggle is definitely the best choice, as the other sources may have slightly different versions and may not offer separate train and test files. So, please visit this link to download the datasets (Train.csv and Test.csv) to get started.

Normally our Train.csv file looks like this in Excel:

Table 1. Train Dataset in CSV Format

After converting it to the table in Excel (Data->Text to Columns), we get this view:

 Table 2. Train Dataset after Text to Column Operation

Way nicer, right! Now, we can clearly see that we have 12 variables. While the "Survived" variable represents whether a particular passenger survived the accident, the rest are essential pieces of information about that passenger. Here is a brief explanation of the variables:

 Table 3. The Information on the Train Dataset Features

Load and Process The Training Data

 Figure 2. Photo by UX Indonesia on Unsplash

I assume that you have your Python environment installed. However, if you don’t have Python on your computer, you may refer to this link for Windows and this link for macOS. After making sure that you have Python installed on your system, open your favorite IDE, and start coding!

Note that using a Google Colab Notebook is another option, which does not require a local Python 3 installation.

First, we will load the training data for cleaning and getting it ready for training our model. We will (i) load the data, (ii) delete the rows with empty values, (iii) select the "Survived" column as the response variable, (iv) drop the for-now irrelevant explanatory variables, and (v) convert categorical variables to dummy variables, and we will accomplish all this with 7 lines of code:
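A sketch of those steps is shown below; the set of dropped columns is an assumption, and the mostly empty Cabin column is dropped before removing rows so that step (ii) does not discard most of the data:

```python
import pandas as pd

train = pd.read_csv('train.csv')                                  # (i) load the data
train = train.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'],    # (iv) drop the for-now
                   axis=1)                                        #      irrelevant columns
train = train.dropna()                                            # (ii) delete rows with empty values
y = train['Survived']                                             # (iii) the response variable
X = pd.get_dummies(train.drop('Survived', axis=1),                # (v) categorical -> dummy variables
                   columns=['Sex', 'Embarked'], drop_first=True)
```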

Create the Model and Train

To uncover the relationship between the Survival variable and other variables (or features if you will), you need to select a statistical machine learning model and train your model with the processed data.

 Figure 4. A Simplified Decision Tree Schema for Titanic Case (Figure by Author)

Scikit-learn provides several algorithms for this. We will select the DecisionTreeClassifier, which is a basic but powerful algorithm for machine learning. And get this: we will only need 3 lines of code to reveal the hidden relationship between Survival (denoted as y) and the selected explanatory variables (denoted as X).
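For example:

```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X, y)
```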

Make Predictions and Save Your Results

We may prepare our testing data for the prediction phase after revealing the hidden relationship between Survival and the selected explanatory variables. The Test.csv file is slightly different from the Train.csv file: it does not contain the "Survived" column. This makes sense because if we knew all the answers, we could just fake our algorithm and submit hand-written correct answers (wait, some people have apparently already done that?). Anyway, our testing data needs almost the same cleaning, massaging, prepping, and preprocessing for the prediction phase. We will accomplish this with 5 lines of code:
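A sketch of the test-set preparation (the fill strategy for the few missing numeric values is an assumption, since test rows cannot simply be dropped):

```python
test = pd.read_csv('test.csv')
ids = test[['PassengerId']]                                   # kept for the submission file
X_test = test.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
X_test = X_test.fillna(X_test.mean(numeric_only=True))        # fill missing Age/Fare values
X_test = pd.get_dummies(X_test, columns=['Sex', 'Embarked'], drop_first=True)
```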

Now our test data is clean and prepared for prediction. Finally, make the predictions for the given test file and save it to memory:
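For example:

```python
predictions = model.predict(X_test)
```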

So easy, right! Before saving these predictions, we need to put them in the proper structure so that Kaggle can automatically score them. Remember that we saved the PassengerId column to memory as a separate dataset (DataFrame, if you will)? Now we will attach the predictions to the PassengerIds (note that they are both single-column datasets) and save the result in the CSV (comma-separated values) format required by Kaggle.
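For example:

```python
submission = ids.copy()
submission['Survived'] = predictions
submission.to_csv('submission.csv', index=False)
```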

 
Figure 5. Photo by Pietro Mattia on Unsplash

Now you can visit Kaggle’s Titanic competition page, and after login, you can upload your submission file.

Will You Make It to the Top?

Definitely not! We tried to implement a simple machine learning algorithm enabling you to enter a Kaggle competition. As you improve this basic code, you will be able to rank better in the following submissions.

How I Built a Dashboard with Dash and Plotly after being stuck in Europe’s Worst Coronavirus Outbreak



Also, many questions led us to create an Instagram Page for coronavirus updates. You can follow us for daily updates: OUTBREAKSTATS

Coronavirus (COVID-19) outbreak has affected everyone’s life one way or another. The number of confirmed cases is around 100,000 and the total number of deaths is over 3,000. The number of countries with confirmed cases is almost 70. On the other hand, more than the virus, the fear is even more disastrous. The cost of the outbreak has reached trillions of dollars and economists are talking about a global recession. This is certainly not a good start for 2020.

 

Before and After the Outbreak in Duomo, Milan (source: Unsplash)

Corona Outbreak in Northern Italy

So, we see the effect of the coronavirus everywhere, from China to the U.S., from Norway to Australia. Statistically, though, if you live in certain countries, you are more at risk, and unfortunately, I have been experiencing it more severely since the city that I live in, Bologna, is located in Northern Italy, where the outbreak is the fiercest in Europe. Unfortunately, the total number of confirmed cases in Italy has already surpassed 2,000.

 

The Current Situation in Europe with LN transformation (source: OutbreakStats)

As a person who had to isolate himself from the public, I had some free time and believe me, I used most of this time searching for the latest developments regarding the coronavirus outbreak. This dire situation gave me the idea of making the best out of the worst situation.

The Difficulty of Finding Up-to-date Coronavirus Outbreak Data

Searching for coronavirus outbreak stats is cumbersome work since I had to visit multiple sources every time to see what was happening. After some intense research, I found out that the Center for Systems Science and Engineering at Johns Hopkins University publishes the outbreak data with daily updates. However, the .csv format makes it challenging to derive meaningful insights every time. That's why we have the discipline of Data Visualization.

 

A Table of the Confirmed Coronavirus Cases (accessed at CSSE)

Data Visualization with Plotly and Dash

One of the other things I did during my self-isolation period was to deepen my knowledge of Plotly and Dash.

Plotly is an interactive Python data visualization library. You may use your IPython shell to generate beautiful interactive graphs in a matter of seconds, and it is excellent for extracting insights from CSV files. One might say that Plotly is a fancy Matplotlib. One downside of using Plotly, as opposed to pure JavaScript data visualization libraries, is that you cannot easily share your work with others, since it runs on Python, a backend language (p.s. you may learn more about JavaScript data visualization libraries in this article).

Thinking that I could not be the only one seeing this disadvantage, I found out about Dash. Plotly, the company behind these libraries, explains Dash as follows:

“The Dash platform empowers Data Science teams to focus on the data and models while producing and sharing enterprise-ready analytic apps that sit on top of Python and R models. What would typically require a team of backend developers, frontend developers, and IT can all be done with Dash.” [Plotly]

I could not agree more with this statement. Thanks to Dash, you don't need in-depth knowledge of either the backend or the frontend. To build and publish a dashboard, knowledge of Dash and Plotly is enough.

Over-Qualifying Myself for the Task

On the other hand, in the last three months, I completed three Udemy courses:

Funnily enough, Dash is built on top of Flask as the backend and ReactJS as the frontend, and it uses Plotly as the data visualization tool. Even though I did not have to know React and Flask, that knowledge came in handy at times when I tried to figure out the logic behind Dash. Consequently, I thought: as a person stuck in a virus outbreak and armed with new technical knowledge, why not create a dashboard about my daily life these days? The dashboard idea was the only 'making the best out of the worst situation' scenario for me. Therefore, I built and published the OutbreakStats dashboard with the help of the following sources:

The Sources I Used for the Dashboard

Database

News Updates

Choropleth Map

  • Map Box (you may use the Mapbox maps inside Plotly and Dash)

Technical Knowledge:

Server

Theme

IDE

My Thoughts on Plotly and Dash

I worked with Plotly before and did some freelance work on Upwork and Fiverr platforms. My clients loved the results as compared to Matplotlib and Seaborn since Plotly also provided interactivity. Being capable of publishing the work that I have done with Plotly library was the main motivation for me to learn Dash.

My experience with Dash, so far, is as follows:

 

Flask is a light-weight Python Web Framework

Dash relies heavily on Flask and almost works as a Flask app; it is hard to miss the similarities between the two. In a Dash app, there are two component types for building a dashboard: Core Components and HTML Components.

Core Components

‘Dash ships with supercharged components for interactive user interfaces. A core set of components, written and maintained by the Dash team, is available in the dash-core-components library.’ [Dash Documentation]

With core components, you can create (i) Plotly graphs and (ii) React components such as Dropdown, Slider, and Input. Core components are like the relatively complex React components that you may find in Semantic-UI-React or Material-UI. The difference is that you use Python to add these components to your dashboard. No need for frontend skills 🥳.

HTML Components

‘Dash is a web application framework that provides pure Python abstraction around HTML, CSS, and JavaScript.’ [Dash Documentation]

These components render basic HTML tags such as <p>, <div>, and <img>. Once more, the difference is that you use Python to add these components. Therefore, you don't even have to know HTML to build a Dash app. No need for markup language skills 🎉🍾.
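To illustrate, here is a minimal, self-contained Dash app sketch that mixes HTML components with two core components (the import style below is for recent Dash versions; older releases used the separate dash_core_components and dash_html_components packages):

```python
from dash import Dash, dcc, html
import plotly.graph_objs as go

app = Dash(__name__)

figure = go.Figure(data=[go.Bar(x=['Mon', 'Tue', 'Wed'], y=[3, 7, 5])])

app.layout = html.Div([
    html.H1('My First Dash App'),                      # HTML component
    html.P('This whole layout is written in pure Python.'),
    dcc.Dropdown(options=['Option A', 'Option B'],     # core component (React under the hood)
                 value='Option A'),
    dcc.Graph(figure=figure),                          # core component rendering a Plotly graph
])

if __name__ == '__main__':
    app.run(debug=True)       # older Dash versions: app.run_server(debug=True)
```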

Other Observations

To add custom JavaScript, CSS, or image files, all you have to do is create a folder (it must be named assets) and store all the files in it. They just work, without any import or config setting, since Dash is already configured this way.

Final Thoughts

Dash is a Flask app that is purposefully customized for dashboard creation. Therefore, you don't have to think about every little detail, as these details are already handled by the Dash team. As long as you stick to the documentation to create your dashboard, the development process is very easy and fast. However, if you would like to customize the app and add new features to make it more than a dashboard, there may be shortcomings.

Conclusion

I hope you enjoyed this blog post. In my previous articles, I usually shared my source code, as they were more tutorials than blog posts. In this article, I tried to show you a way of building a dashboard and serving it to the public. You can visit the one I created for coronavirus outbreak data at OutbreakStats, and here is a preview:

 

A preview of the OutbreakStats Dashboard

So, generating the dashboard code is up to your imagination. If you carefully follow The Sources I Used for the Dashboard section of this post, you may easily build your own dashboard even within a day.