Titanic: Predicting The Survivors



Titanic: First Kaggle Submission

Titanic, starring Leonardo DiCaprio and Kate Winslet, is one of the first movies I remember watching. I was about 6 years old, and my aunt left it on for the children. It was probably not a good movie for children, but hey, we were occupied for the full duration of the movie. As kids we just watched whatever and did not really care about the content. Well, at least I didn't.


I remember the ship sinking and Leonardo tragically freezing and dying so that his lover could live. A few years after watching Titanic, I found out the movie was based on true events. I was like 'oh cool wow lmao'. Anyway, fast-forward to now: I found data on Kaggle about the Titanic and decided to enter my first competition.


As you may know, besides inspiring a great movie, the Titanic was known as the most luxurious passenger steamship in the world. On top of being a marvelous ship, she was also deemed 'practically unsinkable'. The unsinkable Titanic definitely did not live up to her name: she sank on her maiden voyage and took more than 1,500 lives down with her. The data from Kaggle will be used to predict whether or not a passenger survived. My question is: if I had boarded the Titanic, would I have lived or would I have died?

The Data

This is the data provided by Kaggle.com. We will need to manipulate this data so that we can predict the likelihood of survival on the Titanic. What does each column represent? To answer that, I provided a table with each variable and its definition or key.

Data Wrangling


The original data from Kaggle had a lot of missing and unnecessary data. I decided that the passenger names, port of embarkation, ticket number, and cabin location would not help me determine survival on the Titanic. Cabin location could have been helpful, but it was dropped because it had too many missing values in the original data. Before dropping the name column, I decided to make a graph using the title of each passenger.

Graph 1: Survival Rate vs. Title
The graph shows the survival percentage of everyone on the Titanic, grouped by title. Many titles had a 100% survival rate. However, a closer look at the data showed that the titles with 100% survival covered far fewer passengers than the rest, so I did not treat the breakdown by title as meaningful.
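Here is a minimal sketch of how the titles could be pulled out of the names and compared, assuming pandas. `Name` and `Survived` are Kaggle's column names, but the five passengers below are made up, and this is not the exact code behind Graph 1. Printing the group size next to the rate makes the small-sample problem easy to spot.

```python
import pandas as pd

# Made-up rows in the shape of Kaggle's train.csv (not real passengers).
df = pd.DataFrame({
    "Name": [
        "Smith, Mr. John",
        "Smith, Mrs. Jane",
        "Brown, Miss. Ann",
        "Jones, Mr. Paul",
        "Gray, Dr. Alan",
    ],
    "Survived": [0, 1, 1, 0, 1],
})

# Names look like "Last, Title. First", so the title is the text
# between the comma and the first period.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Survival rate per title, with the number of passengers behind each rate.
by_title = df.groupby("Title")["Survived"].agg(rate="mean", count="size")
print(by_title)
```

In this toy sample, "Dr" shows a 100% survival rate, but it rests on a single passenger, which is the same pattern I saw in the real data.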


The original data was wrangled into a data table that looked like the one directly above.

Data Graphs

Graph 2: Survival Rate vs. Sex
Graph 3: Survival Rate vs. Age-Category
Graph 4: Survival Rate vs. Ticket Prices

After comparing the original ticket prices with their approximate value today, I sorted the ticket fares into three categories: cheap tickets ($0 to $12), average tickets ($12 to $30), and expensive tickets ($30 and up). Notice the trend between ticket price and survival rate.
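The fare categories can be sketched with `pandas.cut`. The fares below are made up, and the handling of the edges is an assumption: the text does not say which category a fare of exactly $12 or $30 falls into, so here each boundary value goes into the higher category.

```python
import pandas as pd

# Made-up fares; the real values come from Kaggle's Fare column.
fares = pd.Series([0.0, 7.25, 12.0, 26.55, 30.0, 512.33], name="Fare")

# right=False makes each bin include its lower edge: [0, 12), [12, 30), [30, inf).
# Sending boundary fares to the higher category is an assumption.
fare_category = pd.cut(
    fares,
    bins=[0, 12, 30, float("inf")],
    right=False,
    labels=["Cheap", "Average", "Expensive"],
)
print(pd.concat([fares, fare_category.rename("FareCategory")], axis=1))
```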

Graph 5: Survival Rate vs. Class

Graph 6: Parch vs. Survival Rate
Graph 7: SibSp vs. Survival Rate

Machine Learning


After munging the data, I separated it into a training group and a test group. Various machine learning models were fit on the training data and then applied to the test data, where each model predicted which passengers lived and which died.
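The fit-and-score loop can be sketched roughly like this, assuming scikit-learn. Everything specific here is an assumption for illustration: the feature rows and labels are made up, only two of the models are shown, and accuracy is measured on a held-out slice of the labeled data (Kaggle's own test file has no survival labels to score against). It is not the exact setup behind my scores.

```python
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Made-up rows: [is_female, passenger_class, fare_category]; labels are invented.
X = [
    [1, 1, 2], [1, 2, 1], [1, 3, 0], [0, 1, 2], [0, 2, 1], [0, 3, 0],
    [1, 1, 2], [0, 3, 0], [1, 2, 1], [0, 3, 0], [1, 3, 0], [0, 1, 2],
]
y = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1]

# Hold out a quarter of the labeled rows to measure accuracy on unseen data.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    "Support Vector Machines": SVC(),
    "Perceptron": Perceptron(random_state=0),
}

# Fit each model on the training rows and record its held-out accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_valid, y_valid)

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2%}")
```

With only twelve made-up rows the printed accuracies mean nothing; the point is the shape of the loop.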

The table above shows the ten models and their respective accuracy scores.
The table above shows the predicted survival values from the Support Vector Machines algorithm.

What is all this?

Results


The results show that women aboard the Titanic had a 74% chance of survival. This means that about 3 out of 4 women survived and only about 1 out of 4 died. Men, on the other hand, had a 19% chance of survival: only about 1 out of 5 men survived, and the rest died.
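The "3 out of 4" and "1 out of 5" figures are roundings of the percentages to the nearest simple fraction. A small sketch of that arithmetic, using only Python's standard library (the helper name `as_simple_odds` and the cap of 5 on the denominator are my own choices for illustration):

```python
from fractions import Fraction

def as_simple_odds(rate, max_denominator=5):
    """Return the fraction closest to `rate` whose denominator is at most `max_denominator`."""
    return Fraction(str(rate)).limit_denominator(max_denominator)

# Survival rates by sex, as reported above.
for label, rate in [("female", 0.74), ("male", 0.19)]:
    odds = as_simple_odds(rate)
    print(f"{label}: {rate:.0%} is about {odds.numerator} out of {odds.denominator}")
```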

Graph 3 shows each age category and its likelihood of survival. Kids ages 0-3 had the greatest chance of survival, while seniors, ages 65+, had the least. Judging by age category alone, if I had been aboard the Titanic I would have had just a 33% chance of survival.

Graph 4 shows the survival rate by fare category. Those with average-priced tickets had double the chance of survival of those with cheap tickets. Those with expensive tickets had a 58% chance of survival, as opposed to just 22% for the cheap tickets. Graph 5 shows a similar trend, with first-class passengers having a 63% survival rate as opposed to just 24% for third class.

Graphs 6 and 7 show the survival rates by the number of parents/children and siblings/spouses aboard, respectively. Those with 3 parents/children aboard had the greatest chance of survival, at 60%. Those with 1 sibling/spouse aboard had the greatest chance, at 54%.

Out of the ten machine learning models used, Support Vector Machines had the highest accuracy at 83.76%, and Perceptron had the lowest at 69.54%.

Discussion


From the data, it's clear that if you were female during the sinking of the Titanic, you had the highest chance of surviving. However, other factors could also have determined your survival, such as your age, financial situation, and the number of people you boarded with. To get better results in the future, the groupings of the classes, fares, and ages can be changed. The name/title column can also be added to the model. I could also remove the ticket price variable, because some people may have had access to cheaper tickets than others; the class variable seems to suffice for determining the financial background of the passengers. To get a higher accuracy score, I also need to study the different machine learning algorithms in more detail.


Did I die or did I live?


Hypothetically, if I had boarded the Titanic, I would have boarded with my mom and my brother. Because I currently do not make enough money, I would have resorted to buying cheap tickets and staying in the third-class cabins. And so... I would be dead.





