‘Essential’ Machine Learning Classification Coding Tutorials
Could you predict which passengers would survive the Titanic?
This is a classic classification problem, and we will use it to explore methods of machine learning, from logistic regression and Random Forests through to ‘Deep Learning’ using TensorFlow and PyTorch. Along the way we’ll look at processing data, running replicates, measuring accuracy, avoiding over-fitting with regularisation, trading off false positives and false negatives, feature selection and expansion, finding out how much data you need, optimising model parameters, dealing with imbalanced data, etc.
These examples will use Kaggle’s Titanic Survival data to explore machine learning. The Kaggle page may be found at: https://www.kaggle.com/c/titanic.
All examples have links to Jupyter Notebooks. The aim of these notebooks is not to give an in-depth description of the methods, but to show what general methods should be employed when investigating/developing a machine-learning model, and to give working examples that may be used as the basis of further exploration.
If you haven’t already installed a scientific Python environment, then we recommend downloading one from https://www.anaconda.com/distribution/
Classification ‘essentials’ using logistic regression
A note on downloading (rather than viewing) the notebooks: After you have clicked on the link to view the notebook, if you wish to download it, right-click on the ‘raw’ box just above the code and on the right. Your browser should then have a ‘save link as…’ or ‘download as…’ or some similar option.
Data preprocessing (you may wish to skip this to begin with – as data processing, though important, is quite a lot of work with very little immediate reward! Later notebooks will contain a link to pre-processed data).
Logistic regression – our first machine learning model.
Stratified k-fold validation – how to repeat the assessment of accuracy across several train/test splits for a more reliable measurement.
Regularisation – avoiding over-fitting to training data.
Receiver Operating Characteristic (ROC) curves, and adjusting sensitivity of models.
Feature selection using univariate statistical selection – how many features do you need, and which ones? A simple and fast method.
Feature selection using a model-based method of forward selection – how many features do you need, and which ones? A method tailored to individual model performance.
Feature selection using a model-based method of backwards elimination – how many features do you need, and which ones? A method tailored to individual model performance.
Feature expansion (followed by feature selection) – get extra features for free, and boost the Titanic logistic regression model.
Learning curves – How much data do you need? Will more data improve your model?
Optimising model parameters with grid search and random search – though default model parameters are often sensibly chosen, you can fine-tune your model with either method.
Dealing with imbalanced data 1: Changing model weights.
Dealing with imbalanced data 2: Using under-sampling or over-sampling.
Dealing with imbalanced data 3: Changing classification thresholds.
Dealing with imbalanced data 4: Using SMOTE to create synthetic data to boost the minority class.
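To give a flavour of how several of the ideas above fit together, here is a minimal sketch (not the code from the linked notebooks) that combines logistic regression, stratified k-fold validation and regularisation using scikit-learn. It uses synthetic stand-in data from `make_classification` rather than the Titanic data, and the function name `k_fold_accuracy`, the fold count and the values of `C` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the Titanic data), with mildly imbalanced classes
X, y = make_classification(n_samples=891, n_features=10, n_informative=5,
                           weights=[0.62, 0.38], random_state=42)


def k_fold_accuracy(X, y, c=1.0, n_splits=5):
    """Return per-fold training and test accuracy (hypothetical helper)."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    train_acc, test_acc = [], []
    for train_idx, test_idx in skf.split(X, y):
        # Fit the scaler on the training fold only, to avoid leaking test data
        scaler = StandardScaler().fit(X[train_idx])
        X_train = scaler.transform(X[train_idx])
        X_test = scaler.transform(X[test_idx])
        # In scikit-learn, a smaller C means stronger regularisation
        model = LogisticRegression(C=c, max_iter=1000)
        model.fit(X_train, y[train_idx])
        train_acc.append(model.score(X_train, y[train_idx]))
        test_acc.append(model.score(X_test, y[test_idx]))
    return np.array(train_acc), np.array(test_acc)


# Compare training and test accuracy at different regularisation strengths
for c in [0.01, 1.0, 100.0]:
    train_acc, test_acc = k_fold_accuracy(X, y, c=c)
    print(f"C={c}: train {train_acc.mean():.3f}, test {test_acc.mean():.3f}")
```

A large gap between training and test accuracy is the usual sign of over-fitting. Stratifying the folds keeps the proportion of each class roughly the same in every split.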
Random Forest models
Random Forest Model: A popular robust method for classification with structured data.
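As a rough idea of what fitting a Random Forest looks like in scikit-learn, the sketch below trains one on synthetic stand-in data (again, not the Titanic data and not the notebook's own code). The number of trees and the split size are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Titanic data)
X, y = make_classification(n_samples=891, n_features=10, n_informative=5,
                           random_state=42)

# Hold back 25% of the data for testing, keeping class proportions the same
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# 100 trees is an illustrative choice, not a tuned value
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

accuracy = forest.score(X_test, y_test)
importances = forest.feature_importances_  # one value per feature, sums to 1

print(f"Test accuracy: {accuracy:.3f}")
for i, importance in enumerate(importances):
    print(f"Feature {i}: importance {importance:.3f}")
```

Unlike logistic regression, a Random Forest does not need the features to be scaled first, which is one reason it is a convenient first model for structured data.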
Neural Network Models with PyTorch and TensorFlow
PyTorch ‘sequential’ neural net: A simpler, but less flexible PyTorch neural network.
PyTorch ‘class-based’ neural net: A more flexible, but slightly less simple, PyTorch neural network.
TensorFlow ‘sequential’ neural net: A simpler, but less flexible TensorFlow neural network.
TensorFlow API-based neural net: A more flexible, but slightly less simple, TensorFlow neural network.
Checking model probabilities: Do model predicted probabilities calibrate well with actual probabilities of survival?
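The calibration check in the last item can be sketched with scikit-learn's `calibration_curve`. To keep the example short and self-contained, a logistic regression model on synthetic data stands in for the neural networks and the Titanic data; both substitutions, and the choice of five bins, are assumptions made for illustration.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Titanic data)
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Stand-in model: any classifier that outputs probabilities would do here
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1

# Group predictions into bins and compare predicted vs observed frequency.
# Empty bins are dropped, so fewer than n_bins values may be returned.
frac_positive, mean_predicted = calibration_curve(y_test, probs, n_bins=5)

for predicted, observed in zip(mean_predicted, frac_positive):
    print(f"Mean predicted {predicted:.2f} -> observed fraction {observed:.2f}")
```

For a well-calibrated model the two columns should be close: of the cases given a predicted probability of around 0.8, roughly 80% should actually be in the positive class.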