Problem Statement: Use the Titanic passenger data (name, age, ticket price, etc.) to predict which passengers survived and which did not.
Data: The data is divided into three files as listed below:
Train.csv : Contains information about the passengers aboard the Titanic (passenger ID, age, fare, name, etc.), along with whether each passenger survived.
Test.csv : Contains passenger information without the survival outcome. The trained model is applied to this dataset to predict whether each passenger survived.
gender_submission.csv : Sample file for submission
Mounting Google Drive and reading the files:
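A minimal sketch of this step, assuming pandas is used and the three files sit in one folder. The function name `load_titanic`, the lowercase file names, and the Drive path in the comments are assumptions for illustration, not taken from the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_titanic(data_dir, train_name="train.csv", test_name="test.csv",
                 sample_name="gender_submission.csv"):
    """Read the three competition files from data_dir into data frames.

    The file names are assumptions; adjust the capitalization to match
    the files actually stored in the folder.
    """
    data_dir = Path(data_dir)
    train = pd.read_csv(data_dir / train_name)
    test = pd.read_csv(data_dir / test_name)
    sample = pd.read_csv(data_dir / sample_name)
    return train, test, sample


# In Google Colab the drive is usually mounted before reading:
#   from google.colab import drive
#   drive.mount("/content/drive")
#   train, test, sample = load_titanic("/content/drive/MyDrive/titanic")  # hypothetical path
```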
Data Preprocessing and Heat Map for data correlation:
Data preprocessing step: the training dataset was checked for NA and null values, and a correlation heat map was generated to find the features with a positive correlation value.
A new data frame was created from the existing one, and the NA and null values in the test dataset were replaced with the mean value of their column. The Fare attribute has a positive correlation with survival.
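The preprocessing steps described above could be sketched as follows. This is an illustration under assumptions, not the original notebook code: the helper names are hypothetical, and the target column is assumed to be called `Survived` as in the Kaggle training file.

```python
import pandas as pd


def count_missing(df):
    """Number of NA/null values per column."""
    return df.isna().sum()


def fill_numeric_with_mean(df):
    """Return a new data frame with numeric NA values replaced by the column mean.

    Non-numeric columns (for example names) are left untouched.
    """
    filled = df.copy()
    numeric_cols = filled.select_dtypes(include="number").columns
    filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())
    return filled


def correlation_with_target(df, target="Survived"):
    """Correlation of every numeric feature with the target, highest first."""
    corr = df.select_dtypes(include="number").corr()
    return corr[target].drop(target).sort_values(ascending=False)
```

The full matrix from `df.select_dtypes(include="number").corr()` is what a heat map plotting function would take as input.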
Pattern Analysis:
Based on gender_submission.csv, the majority of female passengers survived and only a small fraction of male passengers survived, as depicted below.
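One way to check this pattern numerically is to group a labeled data frame by sex and average the survival flag. The sketch below assumes columns named `Sex` and `Survived` (with 1 meaning survived); the function name is hypothetical.

```python
import pandas as pd


def survival_rate_by_sex(df):
    """Fraction of passengers marked as survived, per value of the Sex column."""
    return df.groupby("Sex")["Survived"].mean()
```

Since gender_submission.csv holds only passenger IDs and the survival flag, it would first need to be joined with Test.csv on the passenger ID to obtain the `Sex` column.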
Models Used:
The models listed below were used to predict the survival of Titanic passengers. Different features and hyperparameters were tried to increase the accuracy of the predictions.
Random Forest
K Neighbors (k-nearest neighbors)
SVM (support vector machine)
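Models and their accuracy: the base accuracy is 0.77511 (first row). Three configurations score higher: RandomForest with n_estimators=100, max_depth=16, random_state=2 on four features (0.78229), and the default SVM and K Neighbors on the same four features (0.77751 each).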
| Model | Features | Hyperparameters | Score |
|---|---|---|---|
| RandomForest | ["Pclass", "Sex", "SibSp", "Parch"] | n_estimators=100, max_depth=5, random_state=1 | 0.77511 |
| RandomForest | ["Pclass", "Sex", "SibSp", "Parch"] | n_estimators=1, max_depth=1, random_state=1 | 0.76555 |
| RandomForest | ["Pclass", "Sex", "SibSp", "Parch", "Fare"] | n_estimators=20, max_depth=5, random_state=1 | 0.66746 |
| RandomForest | ["Pclass", "Sex", "SibSp", "Parch", "Fare"] | n_estimators=100, max_depth=16, random_state=2 | 0.77272 |
| RandomForest | ["Pclass", "Sex", "SibSp", "Parch"] | n_estimators=100, max_depth=16, random_state=2 | 0.78229 |
| SVM | ["Pclass", "Sex", "SibSp", "Parch"] | default | 0.77751 |
| SVM | ["Pclass", "Sex", "SibSp", "Parch", "Fare"] | default | 0.66746 |
| K Neighbors | ["Pclass", "Sex", "SibSp", "Parch"] | default | 0.77751 |
| K Neighbors | ["Pclass", "Sex", "SibSp", "Parch", "Fare"] | n_neighbors=10, algorithm='kd_tree' | 0.71052 |
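A configuration from the table could be trained roughly as sketched below, assuming scikit-learn estimators. The hyperparameters are those of the base row and the "default" rows. One-hot encoding `Sex` with `pd.get_dummies` is an assumption about how the string column was handled, and the helper names are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

FEATURES = ["Pclass", "Sex", "SibSp", "Parch"]


def make_models():
    """The three model families, with the base and default settings from the table."""
    return {
        "RandomForest": RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1),
        "SVM": SVC(),
        "K Neighbors": KNeighborsClassifier(),
    }


def fit_and_predict(model, train, test, features=FEATURES):
    """Fit model on the training frame and predict survival for the test frame."""
    # One-hot encode string columns such as Sex (assumed encoding).
    X_train = pd.get_dummies(train[features])
    X_test = pd.get_dummies(test[features])
    # Give the test matrix the same columns, in the same order, as the training matrix.
    X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
    model.fit(X_train, train["Survived"])
    return pd.DataFrame({
        "PassengerId": test["PassengerId"],
        "Survived": model.predict(X_test),
    })
```

Swapping `FEATURES` for the five-feature list with `"Fare"` reproduces the other rows, provided missing `Fare` values in the test data have been filled first.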
Kaggle Submission:
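A submission file in the same two-column layout as gender_submission.csv could be written as sketched here. The function name and output path are hypothetical, and the column names `PassengerId` and `Survived` are assumed to match the sample file.

```python
import pandas as pd


def write_submission(passenger_ids, predictions, path):
    """Save predictions as a CSV with PassengerId and Survived columns."""
    submission = pd.DataFrame({
        "PassengerId": list(passenger_ids),
        "Survived": [int(p) for p in predictions],
    })
    # index=False keeps the pandas row index out of the file.
    submission.to_csv(path, index=False)
    return submission
```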