Kaggle Titanic Survival Prediction¶
This notebook is based on the Kaggle tutorial competition, Titanic: Machine Learning from Disaster. Much of the code is adapted from a Dataquest lesson and the Udacity Machine Learning Engineer Nanodegree.
In this example, we'll predict the survival of passengers by looking at different passenger data, such as age, gender, etc. We'll try three different models: logistic regression, random forest, and extreme gradient boosting.
Importing necessary packages¶
We'll need pandas and numpy to work with data frames and do math manipulations. We'll use sklearn for the machine learning models. Matplotlib will also be useful in case we want to plot anything!
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold
import matplotlib.pyplot as plt
%matplotlib inline
train = pd.read_csv('train.csv', header=0)
train.head()
test = pd.read_csv('test.csv', header=0)
test.head()
Numeric data type¶
Let's turn Sex and Embarked into numeric data. Also, we won't bother with Name, Ticket, and Cabin right now, so let's remove those columns.
'female' = 0, 'male' = 1
'S' = 0, 'C' = 1, 'Q' = 2
train = train.drop(['Name', 'Ticket', 'Cabin'], axis=1)
test = test.drop(['Name', 'Ticket', 'Cabin'], axis=1)
train.Sex = train.Sex.map({'female':0, 'male':1}).astype(int)
test.Sex = test.Sex.map({'female':0, 'male':1}).astype(int)
train.Embarked = train.Embarked.fillna('S')
test.Embarked = test.Embarked.fillna('S')
train.Embarked = train.Embarked.map({'S':0, 'C':1, 'Q':2})
test.Embarked = test.Embarked.map({'S':0, 'C':1, 'Q':2})
train.head()
test.head()
Quick break: correlation¶
Now all our data is numeric. A useful concept to look at is correlation. It informs us of which features are good predictors. We'll look at it both as a table and a plot. We'll have to import another package to make this plot.
import seaborn as sb
corr = train.corr()
sb.heatmap(corr, annot=True)
plt.show()
We can see that sex, passenger class, and fare are the three biggest predictors (fare and passenger class are most likely related to each other, but that's fine for now).
Dealing with missing values¶
print(train.isnull().sum())
print(test.isnull().sum())
Earlier, we filled the NA values in the Embarked column. I didn't do anything fancy: I already knew S was the most common port, so I just filled the missing values with that.
As shown above, we still have a lot of missing values for Age. We can try to do something fancier with age, since it appears age is significantly correlated with a few other features, namely Pclass and SibSp.
There's also one missing fare value in the test data set. Let's just fill it in with the median fare of the same Pclass.
# Fill missing ages with the median age of passengers in the same Pclass,
# split by whether they traveled with siblings/spouses (SibSp).
for i in range(len(train.Age)):
    if np.isnan(train.Age[i]):
        if train.SibSp[i] == 0:
            train.loc[i, "Age"] = np.nanmedian(train.Age[(train.Pclass == train.Pclass[i]) & (train.SibSp == 0)])
        elif train.SibSp[i] > 0:
            train.loc[i, "Age"] = np.nanmedian(train.Age[(train.Pclass == train.Pclass[i]) & (train.SibSp > 0)])
        else:
            train.loc[i, "Age"] = np.nanmedian(train.Age[train.Pclass == train.Pclass[i]])
# Do the same for the test set, using medians computed from the training data.
for i in range(len(test.Age)):
    if np.isnan(test.Age[i]):
        if test.SibSp[i] == 0:
            test.loc[i, "Age"] = np.nanmedian(train.Age[(train.Pclass == test.Pclass[i]) & (train.SibSp == 0)])
        elif test.SibSp[i] > 0:
            test.loc[i, "Age"] = np.nanmedian(train.Age[(train.Pclass == test.Pclass[i]) & (train.SibSp > 0)])
        else:
            test.loc[i, "Age"] = np.nanmedian(train.Age[train.Pclass == test.Pclass[i]])
# Fill the single missing fare in the test set with the training median fare of the same Pclass.
test.loc[test.Fare.isnull(), "Fare"] = np.median(train.Fare[train.Pclass == test.Pclass[test.Fare.isnull()].iloc[0]])
We could do more, such as extracting titles from names (Mr., Mrs., Dr., etc.) or computing family size (SibSp + Parch). Feature engineering is an important part of machine learning, but I haven't found these extra features to be very helpful for this example. If you're interested in how to implement them anyway, see the Dataquest mission, or the sketch below.
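As a quick illustration, here is a minimal sketch of those two extra features (Title and FamilySize are hypothetical column names, and none of this is used in the rest of the notebook). Since we already dropped Name from train, it reloads the raw CSV.
raw = pd.read_csv('train.csv')  # reload, since Name was dropped from `train` above
# Pull the title out of names like "Braund, Mr. Owen Harris" with a regex.
raw['Title'] = raw['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
# Family size: siblings/spouses + parents/children + the passenger themselves.
raw['FamilySize'] = raw['SibSp'] + raw['Parch'] + 1
print(raw['Title'].value_counts().head())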
Machine Learning¶
Now that we have the data ready, it's time to train our models. We most likely don't want to use all of the features as predictors, as that would overfit. We can play with different combinations of predictors, but let's just try the combination below.
features = ["Pclass", "Sex", "Fare", "Age"]
Let's create a function that takes an algorithm, trains it on data with features X and labels Y, and gives us the accuracy from 5-fold cross validation.
def cv_accuracy(X, Y, algorithm):
    # 5-fold cross validation: fit on 4 folds, predict the held-out fold,
    # then compare all held-out predictions against the true labels.
    kf = KFold(n_splits=5, random_state=1, shuffle=True)
    idx_shuf = []
    predictions = []
    for tr, cv in kf.split(X):
        idx_shuf.append(cv)
        x_train = X.iloc[tr, :]
        y_train = Y.iloc[tr]
        algorithm.fit(x_train, y_train)
        cv_pred = algorithm.predict(X.iloc[cv, :])
        predictions.append(cv_pred)
    idx_shuf = np.concatenate(idx_shuf, axis=0)
    predictions = np.concatenate(predictions, axis=0)
    accuracy = np.array(np.round(predictions) == Y.iloc[idx_shuf]).mean()
    return accuracy
Logistic regression¶
Let's first look at logistic regression (see the Wikipedia page for more information on it).
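As a quick refresher, logistic regression models the probability of survival as a sigmoid of a linear combination of the features, and predicts survival when that probability exceeds 0.5:

$$P(\mathrm{Survived} = 1 \mid x) = \frac{1}{1 + e^{-(w^\top x + b)}}$$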
alg = LogisticRegression(random_state=2015)
accuracy = cv_accuracy(train[features], train['Survived'], alg)
print(accuracy) #0.7845117845
An accuracy of 78.5% is okay, but we can try to improve it. Let's try this on the test set and see what we get.
predictions = alg.predict(test[features])
submission = pd.DataFrame({
"PassengerId": test['PassengerId'],
"Survived": predictions})
submission.to_csv("prediction_logistic.csv", index=False)
This gets 75.1% accuracy on the leaderboard, which is pretty close to the cross-validation accuracy.
Random Forest¶
Let's try a slightly more complicated model, a random forest ensemble, using the same predictors.
alg = RandomForestClassifier(random_state=13, n_estimators=15, min_samples_split=5)
accuracy = cv_accuracy(train[features], train['Survived'], alg)
print(accuracy)
We get 81.3%, a pretty good improvement over the simple logistic regression. Now, let's apply this to our test set and see what we get.
predictions = alg.predict(test[features])
submission = pd.DataFrame({
"PassengerId": test['PassengerId'],
"Survived": predictions})
submission.to_csv("prediction_randomforest.csv", index=False)
This actually gives 74.6% accuracy, lower than our logistic regression! Maybe it's overfitting. Let's see if we can play with the parameters to control the overfitting.
Parameter Tuning¶
I wrote some code, adapted from Udacity's Machine Learning Engineer Nanodegree, to visualize how accuracy changes as model complexity changes. I'll look at three parameters in particular: n_estimators, min_samples_split, and max_depth.
import parameter_optimization as po
po.NEstimatorComplexity(train[features],train['Survived'],'n_estimators')
po.NEstimatorComplexity(train[features],train['Survived'],'min_samples_split')
po.NEstimatorComplexity(train[features],train['Survived'],'max_depth')
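The parameter_optimization module isn't shown here, so as a reference, here is a minimal sketch of what such a complexity plot could look like using sklearn's validation_curve (plot_complexity is a hypothetical stand-in, not the actual po functions).
from sklearn.model_selection import validation_curve

def plot_complexity(X, Y, param_name, param_range):
    # Training vs. cross-validation accuracy as a function of one RandomForest parameter.
    train_scores, cv_scores = validation_curve(
        RandomForestClassifier(random_state=13), X, Y,
        param_name=param_name, param_range=param_range,
        cv=5, scoring='accuracy')
    plt.plot(param_range, train_scores.mean(axis=1), label='training')
    plt.plot(param_range, cv_scores.mean(axis=1), label='cross-validation')
    plt.xlabel(param_name)
    plt.ylabel('accuracy')
    plt.legend()
    plt.show()

# e.g. plot_complexity(train[features], train['Survived'], 'n_estimators', range(5, 51, 5))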
We can see that n_estimators and min_samples_split give relatively consistent cross-validation accuracy within some range. max_depth shows minor underfitting at very low values and overfitting at high values.
It's pretty hard to judge from these plots which parameter values are best. Let's try something called a grid search, which exhaustively evaluates combinations of different parameter values to find the ones that maximize cross-validation accuracy. (Let's also add min_samples_leaf in there too.)
For questions on what all these parameters mean, refer to the sklearn documentation.
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
acc_score = make_scorer(po.acc_score_func)
from sklearn.model_selection import ShuffleSplit
clf = RandomForestClassifier()
params = {'n_estimators': range(5,16),'min_samples_split': range(10,51,1),
'max_depth': range(1,11), 'min_samples_leaf': range(1,11)}
cv = ShuffleSplit(n_splits = 10, test_size = 0.2, random_state = 2017)
grid = GridSearchCV(clf, params, scoring=acc_score, cv=cv)  # pass the ShuffleSplit object itself so it's actually used
grid = grid.fit(train[features],train['Survived'])
grid.best_estimator_.get_params()
It has found the following optimum parameters:
max_depth = 7, n_estimators = 15, min_samples_split = 14, min_samples_leaf = 1
Let's run the algorithm again with parameter values in that neighborhood!
alg = RandomForestClassifier(random_state=13, max_depth=8, n_estimators=12, min_samples_split=14, min_samples_leaf=3)
accuracy = cv_accuracy(train[features], train['Survived'], alg)
print(accuracy)
82.9%, which is a small increase from our non-optimized classifier. That's a good sign!
predictions = alg.predict(test[features])
submission = pd.DataFrame({
"PassengerId": test['PassengerId'],
"Survived": predictions})
submission.to_csv("prediction_randomforest_optimized.csv", index=False)
This submission gave an accuracy of 77.0%, a 2.4 percentage-point increase over our last one, just from tuning the parameters. Good job!
Gradient Boosting¶
Now let's try a technique called gradient boosting, using the XGBoost library. The XGBoost documentation has a good introduction to gradient boosting as well.
import xgboost as xgb
param = {'max_depth': 4, 'gamma': 0.5, 'min_child_weight': 1,
         'eta': 1, 'objective': 'binary:logistic',
         'silent': 1, 'seed': 29}
from sklearn.model_selection import ShuffleSplit
Let's rewrite our accuracy function for XGBoost. This should look basically the same as the other one.
def xgb_cv_accuracy(X, Y, param):
    # Same 5-fold scheme as cv_accuracy, but using xgboost's native API.
    num_round = 3
    kf = KFold(n_splits=5, random_state=1, shuffle=True)
    idx_shuf = []
    predictions = []
    for tr, cv in kf.split(X):
        idx_shuf.append(cv)
        dtrain = xgb.DMatrix(X.iloc[tr], label=Y.iloc[tr])
        dtest = xgb.DMatrix(X.iloc[cv], label=Y.iloc[cv])
        watchlist = [(dtest, 'validation'), (dtrain, 'train')]
        bst = xgb.train(param, dtrain, num_round, watchlist)
        # predict() returns probabilities for binary:logistic; they get rounded below
        cv_pred = bst.predict(dtest)
        predictions.append(cv_pred)
    idx_shuf = np.concatenate(idx_shuf, axis=0)
    predictions = np.concatenate(predictions, axis=0)
    accuracy = np.array(np.round(predictions) == Y.iloc[idx_shuf]).mean()
    return accuracy
accuracy = xgb_cv_accuracy(train[features], train['Survived'],param)
print(accuracy)
Without tuning any parameters, we got 80.6% cross-validation accuracy. It's lower than the random forest ensemble, but let's look at the test accuracy.
dtrain = xgb.DMatrix(train[features],label=train['Survived'])
bst = xgb.train(param, dtrain)
predictions = np.round(bst.predict(xgb.DMatrix(test[features]))).astype(int)
submission = pd.DataFrame({
"PassengerId": test['PassengerId'],
"Survived": predictions})
submission.to_csv("prediction_xgboost.csv", index=False)
We get 76.6% test accuracy, close to our last result. Not bad!
The next step should be clear: let's do a grid search on XGBoost! Unfortunately, the xgboost library doesn't come with a grid search utility. However, someone on Kaggle graciously provided an easy way to make xgboost compatible with sklearn classifiers.
Let's do a grid search over a bunch of parameters... why not, if you have the time, right?
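For reference, here is a rough sketch (not the actual XGBoost_Classfier module imported below) of what such a wrapper can look like; it only needs fit and predict, since BaseEstimator supplies get_params/set_params for GridSearchCV. Recent xgboost releases also ship xgboost.XGBClassifier, which plays this role out of the box.
from sklearn.base import BaseEstimator, ClassifierMixin

class SketchXGBClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, num_boost_round=10, max_depth=6, eta=0.3,
                 gamma=0.0, min_child_weight=1, seed=29, silent=1):
        self.num_boost_round = num_boost_round
        self.max_depth = max_depth
        self.eta = eta
        self.gamma = gamma
        self.min_child_weight = min_child_weight
        self.seed = seed
        self.silent = silent

    def fit(self, X, y):
        # Translate the sklearn-style constructor arguments into xgboost's param dict.
        params = {'objective': 'binary:logistic', 'max_depth': self.max_depth,
                  'eta': self.eta, 'gamma': self.gamma,
                  'min_child_weight': self.min_child_weight,
                  'seed': self.seed, 'silent': self.silent}
        self.booster_ = xgb.train(params, xgb.DMatrix(X, label=y),
                                  self.num_boost_round)
        return self

    def predict(self, X):
        # The booster outputs probabilities; round them to 0/1 class labels.
        return np.round(self.booster_.predict(xgb.DMatrix(X))).astype(int)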
import XGBoost_Classfier as xgb_clf
clf = xgb_clf.XGBoostClassifier(num_class=2, silent=1, seed=29)
params = {'num_boost_round': [50,100,150,200],'max_depth': range(4,11), 'eta': np.logspace(-2,-0.4,num=9,base=10.),
'gamma': np.logspace(-1,1,num=11,base=10.), 'min_child_weight': range(1,6)}
cv = ShuffleSplit(n_splits = 10, test_size = 0.2, random_state = 2017)
clf = GridSearchCV(clf, params, cv=cv)
clf.fit(train[features],train['Survived'])
best_parameters = clf.best_params_
print(best_parameters, clf.best_score_)
Again, we use the parameters to see what we get on the test set!
dtrain = xgb.DMatrix(train[features], label=train['Survived'])
# carry over the fixed settings and pass the tuned number of boosting rounds explicitly
tuned = dict(best_parameters, objective='binary:logistic', silent=1, seed=29)
num_round = tuned.pop('num_boost_round', 10)
bst = xgb.train(tuned, dtrain, num_round)
predictions = np.round(bst.predict(xgb.DMatrix(test[features]))).astype(int)
submission = pd.DataFrame({
"PassengerId": test['PassengerId'],
"Survived": predictions})
submission.to_csv("prediction_xgboost_opt.csv", index=False)
This gives 78.5% test accuracy, nice!
I see that on the Kaggle public leaderboard, the benchmark for the Gender, Price and Class Based Model is pretty high (78.0%)! Are we overfitting by using the age feature? Well... let's try it without age!
features = ["Pclass", "Sex", "Fare"]
params = {'num_boost_round': [50,100,150,200],'max_depth': range(4,11), 'eta': np.logspace(-2,-0.4,num=9,base=10.),
'gamma': np.logspace(-1,1,num=11,base=10.), 'min_child_weight': range(1,6)}
clf = xgb_clf.XGBoostClassifier(num_class=2, silent=1, seed=29)
cv = ShuffleSplit(n_splits = 10, test_size = 0.2, random_state = 2017)
clf = GridSearchCV(clf, params, cv=cv)
clf.fit(train[features],train['Survived'])
best_parameters = clf.best_params_
print(best_parameters, clf.best_score_)
dtrain = xgb.DMatrix(train[features], label=train['Survived'])
# same as before: fixed settings plus the tuned number of boosting rounds
tuned = dict(best_parameters, objective='binary:logistic', silent=1, seed=29)
num_round = tuned.pop('num_boost_round', 10)
bst = xgb.train(tuned, dtrain, num_round)
predictions = np.round(bst.predict(xgb.DMatrix(test[features]))).astype(int)
submission = pd.DataFrame({
"PassengerId": test['PassengerId'],
"Survived": predictions})
submission.to_csv("prediction_xgboost_opt_noage.csv", index=False)
Hmm, this gave a slightly lower score of 77.5% test accuracy, even slightly lower than the Gender, Price and Class Based Model benchmark.
Last remarks¶
There is at least one thing wrong here, which I may come back to and fix when I have the time.
Some of the "optimized" parameters are at an extreme of the search range, which may indicate that I didn't look at a big enough range.