Kaggle Titanic Challenge

NIRAJ KUMAR

Updated: Oct 7, 2022

Problem Statement: Use the Titanic passenger data (name, age, ticket price, etc.) to predict which passengers survived and which did not.


Data: The data is divided into three files as listed below:

  • Train.csv : Contains information about the passengers on board the Titanic (such as passenger ID, age, fare, and name). It also records whether each passenger survived.

  • Test.csv : The test data set. The trained model is applied to it to predict whether each passenger survived.

  • gender_submission.csv : Sample file for submission



Mounting Google Drive and reading the files:
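
A minimal sketch of how this step might look in a Google Colab notebook, assuming the three CSV files have been uploaded to a Drive folder. The helper names `mount_drive_if_colab` and `read_titanic_files`, as well as the folder path in the usage comment, are hypothetical and chosen only for illustration.

```python
from pathlib import Path

import pandas as pd


def mount_drive_if_colab(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; return True on success."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


def read_titanic_files(data_dir):
    """Read the three competition files from data_dir into data frames."""
    data_dir = Path(data_dir)
    train = pd.read_csv(data_dir / "Train.csv")
    test = pd.read_csv(data_dir / "Test.csv")
    sample = pd.read_csv(data_dir / "gender_submission.csv")
    return train, test, sample


# Example use in Colab (the folder below is an assumed location):
# mount_drive_if_colab()
# train, test, sample = read_titanic_files("/content/drive/MyDrive/titanic")
```

Keeping the mount step separate from the reading step means the same reading function also works on a local machine, where there is no Drive to mount.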



Data Preprocessing and Heat Map for data correlation:


Data preprocessing step: the training data set is checked for NA and null values. A correlation heat map is then generated to find the features that have a positive correlation with survival.
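
The check and the heat map described above could be sketched roughly as follows. This is an illustration under assumptions, not the notebook's actual code: the helper names are hypothetical, the plotting helper assumes seaborn is installed, and the small `toy` frame contains made-up rows rather than real Kaggle data.

```python
import pandas as pd


def missing_value_counts(df):
    """Count NA/null entries per column, keeping only columns that have any."""
    counts = df.isnull().sum()
    return counts[counts > 0]


def correlation_with_target(df, target="Survived"):
    """Pearson correlation of every numeric column with the target column."""
    corr = df.select_dtypes(include="number").corr()
    return corr[target].drop(target).sort_values(ascending=False)


def plot_correlation_heatmap(df):
    """Draw an annotated heat map of the numeric correlation matrix."""
    import seaborn as sns  # imported lazily so the helpers above work without it

    corr = df.select_dtypes(include="number").corr()
    return sns.heatmap(corr, annot=True, cmap="coolwarm")


# Made-up rows for illustration only; they are not taken from the Kaggle data.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Pclass": [3, 1, 1, 3, 2],
    "Age": [22.0, 38.0, None, 35.0, 27.0],
    "Fare": [7.25, 71.28, 53.10, 8.05, 11.13],
})
print(missing_value_counts(toy))     # lists Age with one missing value
print(correlation_with_target(toy))  # one correlation per numeric feature
```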


A new data frame is created from the existing one, and the NA and null values in the test data set are replaced with the mean value of the column. The Fare attribute has a positive correlation with survival.
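
Replacing missing values with the column mean can be sketched as below. The function name `fill_missing_with_mean` is hypothetical and the rows are made up for illustration; the sketch assumes that only numeric columns are imputed.

```python
import pandas as pd


def fill_missing_with_mean(df, columns=None):
    """Return a copy of df with NA values in numeric columns replaced by the column mean."""
    filled = df.copy()
    if columns is None:
        columns = filled.select_dtypes(include="number").columns
    for col in columns:
        filled[col] = filled[col].fillna(filled[col].mean())
    return filled


# Made-up rows for illustration only.
toy_test = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Fare": [8.0, None, 16.0],
})
clean_test = fill_missing_with_mean(toy_test)
print(clean_test["Fare"].tolist())  # [8.0, 12.0, 16.0]
```

Working on a copy leaves the frame that was read from disk untouched, which makes it easy to compare the data before and after imputation.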



Pattern Analysis:


Based on gender_submission.csv, the majority of female passengers survived and only a small fraction of male passengers survived, as depicted below.
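
A comparison like this can be computed with a single group-by. The sketch below assumes a data frame that holds both a Sex and a Survived column (for example Train.csv, or Test.csv merged with gender_submission.csv on PassengerId); the function name is hypothetical and the rows are made up for illustration.

```python
import pandas as pd


def survival_rate_by_sex(df):
    """Fraction of passengers with Survived == 1 within each Sex group."""
    return df.groupby("Sex")["Survived"].mean()


# Made-up rows for illustration only; real rates come from the Kaggle files.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})
rates = survival_rate_by_sex(toy)
print(rates)
# rates.plot(kind="bar") would draw the comparison as a bar chart
```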



Models Used:


The models listed below were used to predict the survival of passengers on the Titanic. Different features and hyper-parameters were tried to increase the accuracy of the predictions.

  1. Random Forest model

  2. K-Neighbors

  3. SVM
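
A rough sketch of how the three models could be set up with scikit-learn. The random forest settings are one of the configurations tried in this post and the other two models use library defaults; the helper names and the one-hot encoding of the Sex column are assumptions made for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

FEATURES = ["Pclass", "Sex", "SibSp", "Parch"]


def encode_features(df, features=FEATURES):
    """One-hot encode the text column (Sex) so every feature is numeric."""
    return pd.get_dummies(df[features], dtype=int)


def build_models():
    """One instance of each model family used in the post."""
    return {
        "RandomForest": RandomForestClassifier(
            n_estimators=100, max_depth=5, random_state=1
        ),
        "SVM": SVC(),
        "K Neighbors": KNeighborsClassifier(),
    }


def fit_and_predict(model, train_df, test_df, features=FEATURES):
    """Fit the model on the training frame and predict survival for the test frame."""
    X_train = encode_features(train_df, features)
    X_test = encode_features(test_df, features)
    # Align columns in case a category is missing from one of the two frames.
    X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
    model.fit(X_train, train_df["Survived"])
    return model.predict(X_test)


# Example use, given train and test data frames:
# predictions = fit_and_predict(build_models()["RandomForest"], train, test)
```

Passing a different `features` list, such as one that adds "Fare", is enough to try another feature set with the same code.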

Models and their accuracy: rows marked with * have a higher accuracy than the base accuracy of 0.77511.

| Model | Features | Hyper Parameters | Score |
|---|---|---|---|
| RandomForest | ["Pclass", "Sex", "SibSp", "Parch"] | n_estimators=100, max_depth=5, random_state=1 | 0.77511 |
| RandomForest | ["Pclass", "Sex", "SibSp", "Parch"] | n_estimators=1, max_depth=1, random_state=1 | 0.76555 |
| RandomForest | ["Pclass", "Sex", "SibSp", "Parch", "Fare"] | n_estimators=20, max_depth=5, random_state=1 | 0.66746 |
| RandomForest | ["Pclass", "Sex", "SibSp", "Parch", "Fare"] | n_estimators=100, max_depth=16, random_state=2 | 0.77272 |
| RandomForest | ["Pclass", "Sex", "SibSp", "Parch"] | n_estimators=100, max_depth=16, random_state=2 | 0.78229 * |
| SVM | ["Pclass", "Sex", "SibSp", "Parch"] | default | 0.77751 * |
| SVM | ["Pclass", "Sex", "SibSp", "Parch", "Fare"] | default | 0.66746 |
| K Neighbors | ["Pclass", "Sex", "SibSp", "Parch"] | default | 0.77751 * |
| K Neighbors | ["Pclass", "Sex", "SibSp", "Parch", "Fare"] | n_neighbors=10, algorithm='kd_tree' | 0.71052 |


Kaggle Submission:
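
A submission file follows the same two-column layout as gender_submission.csv (PassengerId and Survived). A minimal sketch of writing one is shown below; the function name and the default output file name are hypothetical.

```python
import pandas as pd


def make_submission(passenger_ids, predictions, path="submission.csv"):
    """Write predictions in the two-column layout used by gender_submission.csv."""
    submission = pd.DataFrame({
        "PassengerId": list(passenger_ids),
        "Survived": [int(p) for p in predictions],
    })
    submission.to_csv(path, index=False)
    return submission


# Example use, given a test data frame and an array of model predictions:
# make_submission(test["PassengerId"], predictions)
```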




