Titanic Survival

‘Essential’ Machine Learning Classification Coding Tutorials

This page now has its own GitHub repository, from which the code can be run on Binder without installing any software: https://github.com/MichaelAllen1966/2004_titanic

Could you predict which passengers would survive the Titanic?

This is a classic classification problem, and we will use it to explore methods of machine learning, from logistic regression and Random Forests through to ‘Deep Learning’ using TensorFlow and PyTorch. Along the way we’ll look at processing data, running replicates, measuring accuracy, avoiding over-fitting with regularisation, trading off false positives and false negatives, feature selection and expansion, finding out how much data you need, optimising model parameters, dealing with imbalanced data, etc.

These examples will use Kaggle’s Titanic Survival data to explore machine learning. The Kaggle page may be found at: https://www.kaggle.com/c/titanic.

All examples have links to Jupyter Notebooks. The aim of these notebooks is not to give an in-depth description of the methods, but to show what general methods should be employed when investigating/developing a machine-learning model, and to give working examples that may be used as the basis of further exploration.

If you haven’t already installed a scientific Python environment, then we recommend downloading from https://www.anaconda.com/distribution/

Classification ‘essentials’ using logistic regression

A note on downloading (rather than viewing) the notebooks: After you have clicked on the link to view the notebook, if you wish to download it, right-click on the ‘raw’ box just above the code and on the right. Your browser should then have a ‘save link as…’ or ‘download as…’ option, or something similar.

Data preprocessing (you may wish to skip this to begin with – as data processing, though important, is quite a lot of work with very little immediate reward! Later workbooks will contain a link to pre-processed data).

Logistic regression – our first machine learning model.

Stratified k-fold validation – how to run replicate assessments for a more reliable measurement of accuracy.

Regularisation – avoiding over-fitting to training data.

Accuracy measurements in machine learning – go beyond the percentage of cases identified correctly. Stand-alone example or Applied to Titanic survival.

Receiver Operator Characteristic (ROC) curves, and adjusting sensitivity of models.

Feature selection using univariate statistical selection – how many features do you need, and which ones? A simple and fast method.

Feature selection using a model-based method of forward selection – how many features do you need, and which ones? A method tailored to individual model performance.

Feature selection using a model-based method of backward elimination – how many features do you need, and which ones? A method tailored to individual model performance.

Feature expansion (followed by feature selection) – get extra features for free, and boost the Titanic logistic regression model.

Learning curves – How much data do you need? Will more data improve your model?

Optimising model parameters with grid search and random search – though default model parameters are often sensibly chosen, you can fine tune your model with grid search and random search.

Dealing with imbalanced data 1: Deal with imbalanced data by changing model weights.

Dealing with imbalanced data 2: Deal with imbalanced data by using under-sampling or over-sampling.

Dealing with imbalanced data 3: Deal with imbalanced data by changing classification thresholds.

Dealing with imbalanced data 4: Use SMOTE to create synthetic data to boost the minority class.
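As a taster for the accuracy-measurement and classification-threshold notebooks listed above, here is a minimal sketch in plain Python (standard library only). It is not taken from the notebooks: the function names `confusion_counts` and `summarise` are hypothetical, and the labels and probabilities are made-up values for illustration rather than Titanic data. It shows how sensitivity and specificity shift as the classification threshold is moved.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (true pos, false pos, true neg, false neg) at a threshold."""
    tp = fp = tn = fn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summarise(y_true, y_prob, threshold=0.5):
    """Accuracy, sensitivity and specificity at a given threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    total = tp + fp + tn + fn
    return {
        'accuracy': (tp + tn) / total,
        # Sensitivity: proportion of actual positives that were detected
        'sensitivity': tp / (tp + fn) if (tp + fn) else float('nan'),
        # Specificity: proportion of actual negatives that were detected
        'specificity': tn / (tn + fp) if (tn + fp) else float('nan'),
    }


# Made-up example: 1 = survived, with hypothetical model probabilities
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.3, 0.55, 0.8, 0.1, 0.45, 0.05]

for threshold in (0.3, 0.5, 0.7):
    print(threshold, summarise(y_true, y_prob, threshold))
```

In this toy data, lowering the threshold catches more of the survivors (higher sensitivity) at the cost of more false positives (lower specificity). Libraries such as scikit-learn provide ready-made equivalents of these calculations.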

Random Forest models

Random Forest Model: A popular robust method for classification with structured data.

Random Forest Receiver Operator Characteristic (ROC) curve and balancing of model classification

Neural Network Models with PyTorch and TensorFlow

PyTorch ‘sequential’ neural net: A simpler, but less flexible PyTorch neural network.

PyTorch ‘class-based’ neural net: A more flexible, but slightly less simple, PyTorch neural network.

TensorFlow ‘sequential’ neural net: A simpler, but less flexible TensorFlow neural network.

TensorFlow API-based neural net: A more flexible, but slightly less simple, TensorFlow neural network.

TensorFlow Receiver Operator Characteristic (ROC) curve and balancing of model classification

TensorFlow extras: Changing class weights, saving model checkpoints, stopping model when no further improvement found, accessing model weights.

TensorBoard: interactive visualisation of TensorFlow model training

A Wide and Deep TensorFlow model: combines shallow and deep learning in a single model.

Monte Carlo Dropout: Enhancing accuracy and getting a measure of uncertainty using an easy-to-implement method.

Bagging: Enhancing performance using an ensemble of nets trained on different bootstrap samples of data.
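The bagging idea mentioned above can be sketched without any neural network library. The example below is an illustration under stated assumptions, not the notebook's method: the base learner is a toy ‘survival rate per group’ model standing in for a neural net, all function names are hypothetical, and the rows are made-up values rather than Titanic data. What it does show is the core of bagging: each model is trained on a different bootstrap sample (drawn with replacement), and the ensemble prediction is the average of the individual predictions.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw a sample the same size as rows, with replacement."""
    return [rng.choice(rows) for _ in rows]


def fit_rate_model(rows):
    """Toy base learner: survival rate for each value of a single feature."""
    totals, survived = {}, {}
    for feature, label in rows:
        totals[feature] = totals.get(feature, 0) + 1
        survived[feature] = survived.get(feature, 0) + label
    rates = {key: survived[key] / totals[key] for key in totals}
    overall = sum(survived.values()) / len(rows)
    return rates, overall


def predict(model, feature):
    """Predicted probability; falls back to the overall rate if unseen."""
    rates, overall = model
    return rates.get(feature, overall)


def bagged_predict(models, feature):
    """Average the predictions of all models in the ensemble."""
    probs = [predict(model, feature) for model in models]
    return sum(probs) / len(probs)


# Made-up rows of (feature value, survived) for illustration only
rows = [('female', 1), ('female', 1), ('female', 0), ('female', 1),
        ('male', 0), ('male', 0), ('male', 1), ('male', 0)]

rng = random.Random(42)  # Fixed seed so the run is repeatable
models = [fit_rate_model(bootstrap_sample(rows, rng)) for _ in range(25)]

print('female:', bagged_predict(models, 'female'))
print('male:', bagged_predict(models, 'male'))
```

With real models, the spread of the individual predictions across the ensemble can also be inspected as a rough indication of how much the prediction depends on the particular training sample.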

Miscellaneous extras

Checking model probabilities: Do model predicted probabilities calibrate well with actual probabilities of survival?
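One simple way to check calibration is to group predictions into probability bins and compare the mean predicted probability in each bin with the fraction of cases that were actually positive. The sketch below uses plain Python with made-up values (not Titanic data); `calibration_table` is a hypothetical helper name, and the linked notebook may well take a different approach.

```python
def calibration_table(y_true, y_prob, n_bins=5):
    """Compare mean predicted probability with observed rate in each bin.

    Returns a list of (bin index, count, mean predicted, observed rate)
    for every non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for actual, prob in zip(y_true, y_prob):
        # A probability of exactly 1.0 is placed in the last bin
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append((prob, actual))

    table = []
    for index, members in enumerate(bins):
        if not members:
            continue  # Skip empty bins
        mean_predicted = sum(prob for prob, _ in members) / len(members)
        observed = sum(actual for _, actual in members) / len(members)
        table.append((index, len(members), mean_predicted, observed))
    return table


# Made-up example: 1 = survived, with hypothetical model probabilities
y_true = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
y_prob = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

for index, count, mean_predicted, observed in calibration_table(y_true, y_prob):
    print(f'bin {index}: n={count}, '
          f'predicted={mean_predicted:.2f}, observed={observed:.2f}')
```

For a well-calibrated model the predicted and observed columns should be close in every bin; large gaps suggest the probabilities should not be taken at face value.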

Remembering the Titanic…
