113: Regression analysis with TensorFlow

This code comes from the TensorFlow tutorial here, with minor modifications (such as the addition of regularization to avoid over-fitting).

In a regression problem, we aim to predict the output of a continuous value, like a price or a probability. Contrast this with a classification problem, where we aim to predict a discrete label (for example, whether a picture contains an apple or an orange).

This notebook uses the classic Auto MPG Dataset and builds a model to predict the fuel efficiency of late-1970s and early-1980s automobiles. To do this, we’ll provide the model with descriptions of many automobiles from that time period. These descriptions include attributes like cylinders, displacement, horsepower, and weight.


# If needed install seaborn (conda install seaborn or pip install seaborn)

import pathlib
import pandas as pd
import seaborn as sns
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow import keras
from tensorflow.keras import layers

###############################################################################
############################## LOAD DATA ######################################
###############################################################################

# Load data from web and save locally
dataset_path = keras.utils.get_file("auto-mpg.data", 
    "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")
dataset_path
column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin'] 
raw_dataset = pd.read_csv(dataset_path, names=column_names,
                      na_values = "?", comment='\t',
                      sep=" ", skipinitialspace=True)

raw_dataset.to_csv('mpg.csv', index=False)

# Work with a copy of the raw data (or load the locally saved csv instead)
data = raw_dataset.copy()
# data = pd.read_csv('mpg.csv')

###############################################################################
############################## CLEAN DATA #####################################
###############################################################################

# The dataset contains some missing data (check with print(data.isna().sum()))
# Drop rows with missing data
data = data.dropna()

# The "Origin" column is really categorical, not numeric. 
# So convert that to a one-hot:

origin = data.pop('Origin')
data['USA'] = (origin == 1)*1.0
data['Europe'] = (origin == 2)*1.0
data['Japan'] = (origin == 3)*1.0
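# As an illustration of the conversion above, a car with Origin == 1 (USA)
# ends up with USA = 1.0, Europe = 0.0, Japan = 0.0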

###############################################################################
##################### SPLIT DATA INTO TRAIN AND TEST SETS #####################
###############################################################################

train_dataset = data.sample(frac=0.8,random_state=0)
test_dataset = data.drop(train_dataset.index)

###############################################################################
############################# EXAMINE DATA ####################################
###############################################################################

# Have a quick look at the joint distribution of a few pairs of columns from
# the training set.

g = sns.pairplot(
        train_dataset[["MPG", "Cylinders", "Displacement", "Weight"]], 
        diag_kind="kde")

fig = g.fig # get the underlying matplotlib figure (some seaborn plots use get_figure())
plt.show()

# Look at overall stats
train_stats = train_dataset.describe()
train_stats.pop("MPG")
train_stats = train_stats.transpose()
print (train_stats)

###############################################################################
###################### SPLIT FEATURES FROM LABELS #############################
###############################################################################

train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')

###############################################################################
########################### NORMALISE THE DATA ################################
###############################################################################

# Normalise using the mean and standard deviation from the training set

def norm(x):
  return (x - train_stats['mean']) / train_stats['std']

normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)
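# Note: after this transformation a value equal to the training-set mean maps
# to 0, and a value one standard deviation above the mean maps to 1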

###############################################################################
############################### BUILD MODEL ###################################
###############################################################################

# Here, we'll use a Sequential model with two densely connected hidden layers,
# and an output layer that returns a single, continuous value. Regularisation
# helps prevent over-fitting (try adjusting the values; higher numbers = more
# regularisation. Regularisation may be type l1 or l2.)

def build_model():
  model = keras.Sequential([
    layers.Dense(64, kernel_regularizer=keras.regularizers.l1(0.01),
                 activation=tf.nn.relu, 
                 input_shape=[len(train_dataset.keys())]),
                                
    keras.layers.Dense(64, kernel_regularizer=keras.regularizers.l1(0.01),
                 activation=tf.nn.relu),
                       
    keras.layers.Dense(1)])

  optimizer = tf.train.RMSPropOptimizer(0.001)

  model.compile(loss='mse',
                optimizer=optimizer,
                metrics=['mae', 'mse'])
  return model

model = build_model()

# Print a summary of the model

print (model.summary())
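
# As the comments above note, the regularisation type and strength can be
# changed. Below is a minimal sketch (illustrative values only, and the
# function name is ours) of the same model using l2 rather than l1
# regularisation. It is defined here for reference but not called.

def build_model_l2(weight=0.01):
  model = keras.Sequential([
    layers.Dense(64, kernel_regularizer=keras.regularizers.l2(weight),
                 activation=tf.nn.relu,
                 input_shape=[len(train_dataset.keys())]),
    layers.Dense(64, kernel_regularizer=keras.regularizers.l2(weight),
                 activation=tf.nn.relu),
    layers.Dense(1)])
  optimizer = tf.train.RMSPropOptimizer(0.001)
  model.compile(loss='mse', optimizer=optimizer, metrics=['mae', 'mse'])
  return model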

###############################################################################
############################### TRAIN MODEL ###################################
###############################################################################

# Display training progress by printing a single dot for each completed epoch
class PrintDot(keras.callbacks.Callback):
  def on_epoch_end(self, epoch, logs):
    if epoch % 100 == 0: print('')
    print('.', end='')

EPOCHS = 1000

history = model.fit(
  normed_train_data, train_labels,
  epochs=EPOCHS, validation_split = 0.2, verbose=0,
  callbacks=[PrintDot()])

# Show last few epochs in history
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
print(hist.tail())

###############################################################################
############################### PLOT TRAINING #################################
###############################################################################

def plot_history(history):
  hist = pd.DataFrame(history.history)
  hist['epoch'] = history.epoch

  plt.figure()
  plt.xlabel('Epoch')
  plt.ylabel('Mean Abs Error [MPG]')
  plt.plot(hist['epoch'], hist['mean_absolute_error'],
           label='Train Error')
  plt.plot(hist['epoch'], hist['val_mean_absolute_error'],
           label = 'Val Error')
  plt.legend()
  plt.ylim([0,5])
  
  plt.figure()
  plt.xlabel('Epoch')
  plt.ylabel('Mean Square Error [$MPG^2$]')
  plt.plot(hist['epoch'], hist['mean_squared_error'],
           label='Train Error')
  plt.plot(hist['epoch'], hist['val_mean_squared_error'],
           label = 'Val Error')
  plt.legend()
  plt.ylim([0,20])
  plt.show()

plot_history(history)

###############################################################################
############################# MAKE PREDICTIONS ################################
###############################################################################

# Make predictions from test-set

test_predictions = model.predict(normed_test_data).flatten()

# Scatter plot of predicted vs. true values
plt.scatter(test_labels, test_predictions)
plt.xlabel('True Values [MPG]')
plt.ylabel('Predictions [MPG]')
plt.axis('equal')
plt.axis('square')
plt.xlim([0,plt.xlim()[1]])
plt.ylim([0,plt.ylim()[1]])
_ = plt.plot([-100, 100], [-100, 100])
plt.show()

# Error plot
error = test_predictions - test_labels
plt.hist(error, bins = 25)
plt.xlabel("Prediction Error [MPG]")
_ = plt.ylabel("Count")
plt.show()

# Copyright (c) 2017 François Chollet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.

110. TensorFlow text-based classification – from raw text to prediction

Download the py file here: tensorflow.py

If you need help installing TensorFlow, see our guide on installing and using a TensorFlow environment.

Below is a worked example that uses text to classify whether a movie reviewer likes a movie or not.

The code goes through the following steps:
1. import libraries
2. load data
3. clean data
4. convert words to numbers
5. process data for tensorflow
6. build model
7. train model
8. predict outcome (like movie or not) for previously unseen reviews

Please also see the TensorFlow tutorials where the TensorFlow model building code came from:

https://www.tensorflow.org/tutorials/keras/basic_text_classification

https://www.tensorflow.org/tutorials/keras/overfit_and_underfit

"""
This example starts with raw text (movie reviews) and predicts whether the
reviewer liked the movie.

The code goes through the following steps:
    1. import libraries
    2. load data
    3. clean data
    4. convert words to numbers
    5. process data for tensorflow
    6. build model
    7. train model
    8. predict outcome (like movie or not) for previously unseen reviews

For information on installing a tensorflow environment in Anaconda see:
https://pythonhealthcare.org/2018/12/19/106-installing-and-using-tensorflow-using-anaconda/

For installing anaconda see:
https://www.anaconda.com/download

We import necessary libraries.

If you are missing a library then, if using Anaconda, from a command line
(after activating the tensorflow environment) use:
    conda install library-name

If you find you are missing an nltk download then from a command line (after
activating the tensorflow environment) use:
    python (to begin a command line python session)
    import nltk
    nltk.download('library name')
    or
    nltk.download() will open a dialogue box where you can install any/all
    nltk data
"""

###############################################################################
############################## IMPORT LIBRARIES ############################### 
###############################################################################

import numpy as np
import pandas as pd
import nltk
import tensorflow as tf

from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from tensorflow import keras


# If not previously performed:
# nltk.download('stopwords')

###############################################################################
################################## LOAD DATA ################################## 
###############################################################################

"""
Here we load up a csv file. Each line contains a text string and then a label.
An example is given to download the imdb dataset which contains 50,000 movie
reviews. The label is 0 or 1 depending on whether the reviewer liked the movie.
"""
print ('Loading data')

# If you do not already have the data locally you may download (and save) by:
file_location = ('https://gitlab.com/michaelallen1966/00_python_snippets' +
                 '_and_recipes/raw/master/machine_learning/data/IMDb.csv')
data = pd.read_csv(file_location)
# save to current directory
data.to_csv('imdb.csv', index=False)

# If you already have the data locally then you may run the following
# data = pd.read_csv('imdb.csv')

# Change headings of dataframe to make them more universal
data.columns=['text','label']

# We'll now hold back 5% of the data for a final test that has not been used
# in training

number_of_records = data.shape[0]
number_to_hold_back = int(number_of_records * 0.05)
number_to_use = number_of_records - number_to_hold_back
data_held_back = data.tail(number_to_hold_back)
data = data.head(number_to_use)

###############################################################################
################################## CLEAN DATA ################################# 
###############################################################################

"""
Here we process the data in the following ways:
  1) change all text to lower case
  2) tokenize (breaks text down into a list of words)
  3) remove punctuation and non-word text
  4) find word stems (e.g. running and run will both be converted to run)
  5) removes stop words (commonly occurring words of little value, e.g. 'the')
"""

stemming = PorterStemmer()
stops = set(stopwords.words("english"))

def apply_cleaning_function_to_list(X):
    cleaned_X = []
    for element in X:
        cleaned_X.append(clean_text(element))
    return cleaned_X

def clean_text(raw_text):
    """This function works on a raw text string, and:
        1) changes to lower case
        2) tokenizes (breaks text down into a list of words)
        3) removes punctuation and non-word text
        4) finds word stems
        5) removes stop words
        6) rejoins meaningful stem words"""
    
    # Convert to lower case
    text = raw_text.lower()
    
    # Tokenize
    tokens = nltk.word_tokenize(text)
    
    # Keep only words (removes punctuation + numbers)
    # use .isalnum to keep also numbers
    token_words = [w for w in tokens if w.isalpha()]
    
    # Stemming
    stemmed_words = [stemming.stem(w) for w in token_words]
    
    # Remove stop words
    meaningful_words = [w for w in stemmed_words if not w in stops]
      
    # Return cleaned data
    return meaningful_words
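
# A quick illustration on a made-up sentence (the exact output depends on the
# Porter stemmer, but will be close to ['runner', 'run', 'quickli']):
example_cleaned = clean_text('The runners were running quickly!')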

print ('Cleaning text')
# Get text to clean
text_to_clean = list(data['text'])

# Clean text and add to data
data['cleaned_text'] = apply_cleaning_function_to_list(text_to_clean)

###############################################################################
######################## CONVERT WORDS TO NUMBERS ############################# 
###############################################################################

"""
The frequency of all words is counted. Words are then given an index number so
that the most commonly occurring words have the lowest number (so the
dictionary may then be truncated at any point to keep the most common words).
We avoid using the index number zero as we will use that later to 'pad' out
short text.
"""

def training_text_to_numbers(text, cutoff_for_rare_words = 1):
    """Function to convert text to numbers. Text must be tokenzied so that
    test is presented as a list of words. The index number for a word
    is based on its frequency (words occuring more often have a lower index).
    If a word does not occur as many times as cutoff_for_rare_words,
    then it is given a word index of zero. All rare words will be zero.
    """

    # Flatten list if sublists are present
    if len(text) > 1:
        flat_text = [item for sublist in text for item in sublist]
    else:
        flat_text = text
    
    # Get word frequency
    fdist = nltk.FreqDist(flat_text)

    # Convert to Pandas dataframe
    df_fdist = pd.DataFrame.from_dict(fdist, orient='index')
    df_fdist.columns = ['Frequency']

    # Sort by word frequency
    df_fdist.sort_values(by=['Frequency'], ascending=False, inplace=True)

    # Add word index (1 = most frequent word)
    number_of_words = df_fdist.shape[0]
    df_fdist['word_index'] = list(np.arange(number_of_words) + 1)

    # Give words rarer than the cut-off a word index of zero
    df_fdist.loc[df_fdist['Frequency'] < cutoff_for_rare_words,
                 'word_index'] = 0

    # Convert pandas dataframe column to dictionary
    word_dict = df_fdist['word_index'].to_dict()
    
    # Use dictionary to convert words in text to numbers
    text_numbers = []
    for string in text:
        string_numbers = [word_dict[word] for word in string]
        text_numbers.append(string_numbers)
    
    return (text_numbers, df_fdist)
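
# A small worked illustration (toy input, not part of the IMDb data): for the
# tokenised text [['good', 'movie'], ['bad', 'movie']] the word 'movie' occurs
# twice so it is given index 1, while 'good' and 'bad' occur once each and
# receive indices 2 and 3 (the order of equally frequent words is not
# guaranteed)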

# Call function to convert training text to numbers
print ('Convert text to numbers')
numbered_text, dict_df = \
    training_text_to_numbers(data['cleaned_text'].values)

# Keep only the 10,000 most frequent words (word indices 1 to 10,000)
def limit_word_count(numbered_text):
    max_word_count = 10000
    filtered_text = []
    for number_list in numbered_text:
        filtered_line = \
            [number for number in number_list if number <=max_word_count]
        filtered_text.append(filtered_line)
        
    return filtered_text
    
data['numbered_text'] = limit_word_count(numbered_text)

# Pickle dataframe and dictionary dataframe (for later use if required)
data.to_pickle('data_numbered.p')
dict_df.to_pickle('data_dictionary_dataframe.p')

###############################################################################
######################### PROCESS DATA FOR TENSORFLOW ######################### 
###############################################################################

"""
Here we extract data from the pandas DataFrame, make all text vectors the same
length (by padding short texts and truncating long ones). We then split into
training and test data sets.
"""

print ('Process data for TensorFlow model')

# At this point pickled data (processed in an earlier run) might be loaded with 
# data=pd.read_pickle(file_name)
# dict_df=pd.read_pickle(filename)

# Get data from dataframe and put in X and y lists
X = list(data.numbered_text.values)
y = data.label.values

## MAKE ALL X DATA THE SAME LENGTH
# We will use keras to make all X data a length of 512.
# Shorter data will be padded with 0, longer data will be truncated.
# We have previously kept the value zero free from use.

processed_X = \
    keras.preprocessing.sequence.pad_sequences(X,
                                               value=0,
                                               padding='post',
                                               maxlen=512)
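# A quick illustration of the padding with a toy input:
# keras.preprocessing.sequence.pad_sequences([[5, 8, 2]], value=0,
#                                            padding='post', maxlen=6)
# returns array([[5, 8, 2, 0, 0, 0]])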

## SPLIT DATA INTO TRAINING AND TEST SETS

X_train, X_test, y_train, y_test=train_test_split(
        processed_X,y,test_size=0.2,random_state=999)

###############################################################################
########################## BUILD MODEL AND OPTIMIZER ########################## 
###############################################################################

"""
Here we construct a four-layer neural network with keras/tensorflow.
The first layer is the input layer, then we have two hidden layers, and an
output layer.
"""

print ('Build model')

# BUILD MODEL

# input shape is the vocabulary count used for the text-to-number conversion
# (10,000 words plus one for our zero padding)
vocab_size = 10001

###############################################################################
# Info on neural network layers
#
# The layers are stacked sequentially to build the classifier:
#
# The first layer is an Embedding layer. This layer takes the integer-encoded 
# vocabulary and looks up the embedding vector for each word-index. These 
# vectors are learned as the model trains. The vectors add a dimension to the 
# output array. The resulting dimensions are: (batch, sequence, embedding).
#
# Next, a GlobalAveragePooling1D layer returns a fixed-length output vector for
# each example by averaging over the sequence dimension. This allows the model 
# to handle input of variable length, in the simplest way possible.
#
# This fixed-length output vector is piped through a fully-connected (Dense) 
# layer with 16 hidden units.
#
# The last layer is densely connected with a single output node. Using the 
# sigmoid activation function, this value is a float between 0 and 1, 
# representing a probability, or confidence level.
#
# The regularizers help prevent over-fitting. Over-fitting is evident when the
# training data fit is significantly better than the test data fit. The level
# and the type of regularization may be adjusted to maximise test accuracy.
##############################################################################

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))        

model.add(keras.layers.GlobalAveragePooling1D())

model.add(keras.layers.Dense(16, activation=tf.nn.relu, 
                             kernel_regularizer=keras.regularizers.l2(0.01)))

model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid,
                             kernel_regularizer=keras.regularizers.l2(0.01)))

model.summary()

# CONFIGURE OPTIMIZER

model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='binary_crossentropy',
              metrics=['accuracy'])

###############################################################################
################################# TRAIN MODEL ################################# 
###############################################################################

"""
Here we train the model. Using more epochs may give higher accuracy.

In 'real life' you may wish to hold back other test data (e.g. 10% of the
original data) so that you may use the test set here to help optimise the
neural network parameters, and then test the final model on an independent
data set.

When verbose is set to 1, the model will show accuracy and loss for the
training and test data sets.

"""

print ('Train model')

# Train model (verbose = 1 shows training progress)
model.fit(X_train,
          y_train,
          epochs=100,
          batch_size=512,
          validation_data=(X_test, y_test),
          verbose=1)


results = model.evaluate(X_train, y_train)
print('\nTraining accuracy:', results[1])

results = model.evaluate(X_test, y_test)
print('\nTest accuracy:', results[1])

###############################################################################
######################### PREDICT RESULTS FOR NEW TEXT ######################## 
###############################################################################

"""
Here we make predictions on text that the model has never seen before. As we
are using data that has been held back, we may also check the predictions
against the known labels.
"""

print ('\nMake predictions')

# We held some data back from the original data set
# We will first clean the text

text_to_clean = list(data_held_back['text'].values)
X_clean = apply_cleaning_function_to_list(text_to_clean)
 
# Now we need to convert words to numbers.
# As these are new data it is possible that a word is not recognised, so we
# will check that each word is in the dictionary

# Convert pandas dataframe to dictionary
word_dict = dict_df['word_index'].to_dict()

# Use dictionary to convert words in text to numbers
text_numbers = []
for string in X_clean:
    string_numbers = []
    for word in string:
        if word in word_dict:
            string_numbers.append(word_dict[word])
    text_numbers.append(string_numbers)

# Keep only the top 10,000 words
# The function is repeated here for clarity (but would not usually be repeated)  

def limit_word_count(numbered_text):
    max_word_count = 10000
    filtered_text = []
    for number_list in numbered_text:
        filtered_line = \
            [number for number in number_list if number <=max_word_count]
        filtered_text.append(filtered_line)
        
    return filtered_text
    
text_numbers = limit_word_count(text_numbers)

# Process into fixed length arrays
    
processed_X = \
    keras.preprocessing.sequence.pad_sequences(text_numbers,
                                               value=0,
                                               padding='post',
                                               maxlen=512)

# Get prediction
predicted_classes = model.predict_classes(processed_X)
# The predicted classes are returned as a nested array of 0/1 values. As we
# have a single output node we 'flatten' this array to remove the nesting
predicted_classes = predicted_classes.flatten()

# Check prediction against known label
actual_classes = data_held_back['label'].values
accurate_prediction = predicted_classes == actual_classes
accuracy = accurate_prediction.mean()
print ('Accuracy on unseen data: %.2f' %accuracy)

107. Image recognition with TensorFlow

This code is based on TensorFlow’s own introductory example here, but with the addition of a ‘Confusion Matrix’ to better understand where mis-classification occurs.

For information on installing and using TensorFlow please see here. For more information on Confusion Matrices please see here.

This example will download the ‘MNIST fashion’ data set of images, which is a collection of 70,000 images of 10 different types of fashion items.

Load libraries

"""
Code, apart from the confusion matrix, taken from:
 https://www.tensorflow.org/tutorials/keras/basic_classification#
"""

# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras

# Helper libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
import itertools

Load MNIST fashion data set

If this is the first time you have used the data set it will automatically be downloaded from the internet. The data set loads as training and test images and labels.

# Load MNIST fashion data set
"""The MNIST fashion data set is a collection of 70k images of 10 different
fashion items. It is loaded as training and test images and labels (60K training
images and 10K test images).

0 	T-shirt/top
1 	Trouser
2 	Pullover
3 	Dress
4 	Coat
5 	Sandal
6 	Shirt
7 	Sneaker
8 	Bag
9 	Ankle boot 
"""
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = \
    fashion_mnist.load_data()

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

Show an example image.

# Show an example image (an ankle boot)
plt.figure()
plt.imshow(train_images[0])
plt.colorbar()
plt.grid(False)
plt.show()

The image pixels currently range from 0 to 255. We will normalise to 0-1

# Scale images to values 0-1 (currently 0-255)
train_images = train_images / 255.0
test_images = test_images / 255.0

Plot the first 25 images with labels.

# Plot first 25 images with labels
plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    plt.xlabel(class_names[train_labels[i]])
plt.show()

Build the model

# Set up neural network layers
"""The first layer flattess the 28x28 image to a 1D array (784 pixels).
The second layer is a fully connected (dense) layer of 128 nodes/neurones.
The last layer is a 10 node softmax layer, giving probability of each class.
Softmax adjusts probabilities so that they total 1."""

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax)])

# Compile the model
"""Optimizer: how model corrects itself and learns.
Loss function: How accurate the model is.
Metrics: How to monitor performance of model"""

model.compile(optimizer=tf.train.AdamOptimizer(), 
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
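
As a quick check of the point above that softmax outputs behave as probabilities, here is a small NumPy illustration (made-up scores, not taken from the model):

# Softmax maps arbitrary scores to positive values that sum to 1
scores = np.array([2.0, 1.0, 0.1])
softmax_scores = np.exp(scores) / np.sum(np.exp(scores))
print(softmax_scores)        # approximately [0.66 0.24 0.10]
print(softmax_scores.sum())  # 1.0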

Train the model

Change the number of passes to balance accuracy vs. speed.

# Train the model (epochs is the number of times the training data is applied)
model.fit(train_images, train_labels, epochs=5)

Evaluate accuracy

# Evaluate accuracy
test_loss, test_acc = model.evaluate(test_images, test_labels)
print('Test accuracy:', test_acc)

Out: Test accuracy: 0.8732

Make predictions and show examples

# Make predictions
predictions = model.predict(test_images)
print ('\nClass probabilities for test image 0')
print (predictions[0])
print ('\nPredicted class for test image 0:', np.argmax(predictions[0]))
print ('Actual classification for test image 0:', test_labels[0])

# Plot image and predictions
def plot_image(i, predictions_array, true_label, img):
  predictions_array, true_label, img = predictions_array[i], true_label[i], img[i]
  plt.grid(False)
  plt.xticks([])
  plt.yticks([])
  
  plt.imshow(img, cmap=plt.cm.binary)

  predicted_label = np.argmax(predictions_array)
  if predicted_label == true_label:
    color = 'blue'
  else:
    color = 'red'
  
  plt.xlabel("{} {:2.0f}% ({})".format(class_names[predicted_label],
                                100*np.max(predictions_array),
                                class_names[true_label]),
                                color=color)

def plot_value_array(i, predictions_array, true_label):
  predictions_array, true_label = predictions_array[i], true_label[i]
  plt.grid(False)
  plt.xticks([])
  plt.yticks([])
  thisplot = plt.bar(range(10), predictions_array, color="#777777")
  plt.ylim([0, 1]) 
  predicted_label = np.argmax(predictions_array)
 
  thisplot[predicted_label].set_color('red')
  thisplot[true_label].set_color('blue')

# Plot images and prediction bar charts for selected images
# A blue bar shows the true classification
# A red bar shows an incorrect predicted classification
num_rows = 6
num_cols = 3
num_images = num_rows*num_cols
plt.figure(figsize=(2*2*num_cols, 2*num_rows))
for i in range(num_images):
  plt.subplot(num_rows, 2*num_cols, 2*i+1)
  plot_image(i, predictions, test_labels, test_images)
  plt.subplot(num_rows, 2*num_cols, 2*i+2)
  plot_value_array(i, predictions, test_labels)
plt.show()

Calculate and show confusion matrix

You can see that misclassification is usually between similar types of objects, such as t-shirts and shirts.

# SHOW CONFUSION MATRIX

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
  """
  This function prints and plots the confusion matrix.
  Normalization can be applied by setting `normalize=True`.
  """
  if normalize:
      cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
      cm = cm * 100
      print("\nNormalized confusion matrix")
  else:
      print('\nConfusion matrix, without normalization')
  print(cm)
  print ()

  plt.imshow(cm, interpolation='nearest', cmap=cmap)
  plt.title(title)
  plt.colorbar()
  tick_marks = np.arange(len(classes))
  plt.xticks(tick_marks, classes, rotation=45)
  plt.yticks(tick_marks, classes)

  fmt = '.0f' if normalize else 'd'
  thresh = cm.max() / 2.
  for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
      plt.text(j, i, format(cm[i, j], fmt),
                horizontalalignment="center",
                color="white" if cm[i, j] > thresh else "black")

  plt.tight_layout()
  plt.ylabel('True label')
  plt.xlabel('Predicted label')
  plt.show()

# Compute confusion matrix
y_pred = np.argmax(predictions, axis=1)
cnf_matrix = confusion_matrix(test_labels, y_pred)
np.set_printoptions(precision=2) # set NumPy to 2 decimal places

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')


Out:

Confusion matrix, without normalization
[[810   2  17  20   4   2 135   0  10   0]
 [  1 975   1  17   3   0   2   0   1   0]
 [ 10   0 855  12  67   0  56   0   0   0]
 [ 16  19  16 858  52   1  34   0   4   0]
 [  0   1 172  21 770   0  36   0   0   0]
 [  0   0   0   0   0 951   0  33   0  16]
 [112   3 142  27  83   0 624   0   9   0]
 [  0   0   0   0   0  12   0 966   0  22]
 [  2   0   7   3   6   4   7   3 968   0]
 [  0   0   0   0   0   5   1  39   0 955]]

Normalized confusion matrix
[[81.   0.2  1.7  2.   0.4  0.2 13.5  0.   1.   0. ]
 [ 0.1 97.5  0.1  1.7  0.3  0.   0.2  0.   0.1  0. ]
 [ 1.   0.  85.5  1.2  6.7  0.   5.6  0.   0.   0. ]
 [ 1.6  1.9  1.6 85.8  5.2  0.1  3.4  0.   0.4  0. ]
 [ 0.   0.1 17.2  2.1 77.   0.   3.6  0.   0.   0. ]
 [ 0.   0.   0.   0.   0.  95.1  0.   3.3  0.   1.6]
 [11.2  0.3 14.2  2.7  8.3  0.  62.4  0.   0.9  0. ]
 [ 0.   0.   0.   0.   0.   1.2  0.  96.6  0.   2.2]
 [ 0.2  0.   0.7  0.3  0.6  0.4  0.7  0.3 96.8  0. ]
 [ 0.   0.   0.   0.   0.   0.5  0.1  3.9  0.  95.5]]

Making a prediction from a single image

# Making a prediction of a single image
"""tf.keras models are optimized to make predictions on a batch, or collection,
of examples at once. So even though we're using a single image, we need to add
it to a list:"""

# Grab an example image
img = test_images[0]
# Add the image to a batch where it's the only member.
img = (np.expand_dims(img,0))
# Make prediction
predictions_single = model.predict(img)
# Plot results
plot_value_array(0, predictions_single, test_labels)
_ = plt.xticks(range(10), class_names, rotation=45)
plt.show()

MIT license for TensorFlow code

# MIT License
#
# Copyright (c) 2017 François Chollet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.

75. Machine learning: Choosing between models with stratified k-fold validation

In previous examples we have used multiple random sampling in order to obtain a better measurement of accuracy for models (repeating the model with different random training/test splits).

A more robust method is to use ‘stratified k-fold validation’. In this method the model is repeated k times, so that all the data is used once, but only once, as part of the test set. This, alone, is k-fold validation. Stratified k-fold validation adds an extra level of robustness by ensuring that in each of the k training/test splits, the balance of outcomes represents the balance of outcomes in the overall data set. Most commonly 10 different splits of the data are used.

In this example we shall load up some data on treatment of acute stroke (data will be loaded from the internet). The model will try to predict whether patients are treated with a clot-busting drug. We will compare a number of different models using stratified k-fold validation.

We set this up with the commands:

from sklearn.model_selection import StratifiedKFold
splits = 10
skf = StratifiedKFold(n_splits = splits)
skf.get_n_splits(X, y)

And then we loop through the k splits with:

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

The full code:

"""Techniques applied:
    1. Random Forests
    2. Support Vector Machine (linear and rbf kernel)
    3. Logistic Regression
    4. Neural Network
"""

# %% Load modules

import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold


# %% Function to calculate sensitivity and specificity
def calculate_diagnostic_performance(actual_predicted):
    """ Calculate sensitivty and specificty.
    Takes a Numpy array of 1 and zero, two columns: actual and predicted
    Returns a tuple of results:
    1) accuracy: proportion of test results that are correct    
    2) sensitivity: proportion of true +ve identified
    3) specificity: proportion of true -ve identified
    4) positive likelihood: increased probability of true +ve if test +ve
    5) negative likelihood: reduced probability of true +ve if test -ve
    6) false positive rate: proportion of false +ves in true -ve patients
    7) false negative rate:  proportion of false -ves in true +ve patients
    8) positive predictive value: chance of true +ve if test +ve
    9) negative predictive value: chance of true -ve if test -ve
    10) Count of test positives

    * false positive rate is the percentage of healthy individuals who
    incorrectly receive a positive test result
    * false negative rate is the percentage of diseased individuals who
    incorrectly receive a negative test result
    
    """
    actual_positives = actual_predicted[:, 0] == 1
    actual_negatives = actual_predicted[:, 0] == 0
    test_positives = actual_predicted[:, 1] == 1
    test_negatives = actual_predicted[:, 1] == 0
    test_correct = actual_predicted[:, 0] == actual_predicted[:, 1]
    accuracy = np.average(test_correct)
    true_positives = actual_positives & test_positives
    true_negatives = actual_negatives & test_negatives
    sensitivity = np.sum(true_positives) / np.sum(actual_positives)
    specificity = np.sum(true_negatives) / np.sum(actual_negatives)
    positive_likelihood = sensitivity / (1 - specificity)
    negative_likelihood = (1 - sensitivity) / specificity
    false_positive_rate = 1 - specificity
    false_negative_rate = 1 - sensitivity
    positive_predictive_value = np.sum(true_positives) / np.sum(test_positives)
    negative_predictive_value = np.sum(true_negatives) / np.sum(test_negatives)
    positive_rate = np.mean(actual_predicted[:, 1])
    return (accuracy, sensitivity, specificity, positive_likelihood,
            negative_likelihood, false_positive_rate, false_negative_rate,
            positive_predictive_value, negative_predictive_value,
            positive_rate)
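
# As an illustration of the likelihood ratio calculations above (made-up
# numbers): a test with sensitivity 0.75 and specificity 0.90 has a positive
# likelihood ratio of 0.75 / (1 - 0.90) = 7.5 and a negative likelihood ratio
# of (1 - 0.75) / 0.90, which is roughly 0.28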


# %% Print diagnostics results
def print_diagnostic_results(results):
    # format all results to three decimal places
    three_decimals = ["%.3f" % v for v in results]
    print()
    print('Diagnostic results')
    print('  accuracy:\t\t\t', three_decimals[0])
    print('  sensitivity:\t\t\t', three_decimals[1])
    print('  specificity:\t\t\t', three_decimals[2])
    print('  positive likelihood:\t\t', three_decimals[3])
    print('  negative likelihood:\t\t', three_decimals[4])
    print('  false positive rate:\t\t', three_decimals[5])
    print('  false negative rate:\t\t', three_decimals[6])
    print('  positive predictive value:\t', three_decimals[7])
    print('  negative predictive value:\t', three_decimals[8])
    print()


# %% Calculate weights from weights ratio:
# Set up class weighting to bias for sensitivity vs. specificity
# Higher values increase sensitivity at the cost of specificity
def calculate_class_weights(positive_class_weight_ratio):
    positive_weight = ( positive_class_weight_ratio / 
                       (1 + positive_class_weight_ratio))
    
    negative_weight = 1 - positive_weight
    class_weights = {0: negative_weight, 1: positive_weight}
    return (class_weights)
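
# For example (illustrative only): calculate_class_weights(2) returns
# {0: 0.333..., 1: 0.666...}, weighting positive cases twice as heavily as
# negative cases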

#%% Create results folder if needed
# (Not used in this demo)   
# OUTPUT_LOCATION = 'results'
# if not os.path.exists(OUTPUT_LOCATION):
#    os.makedirs(OUTPUT_LOCATION)
    
# %% Import data
url = ("https://raw.githubusercontent.com/MichaelAllen1966/wordpress_blog" +
       "/master/jupyter_notebooks/stroke.csv")
df_stroke = pd.read_csv(url)
feat_labels = list(df_stroke)[1:]
number_of_features = len(feat_labels)
X, y = df_stroke.iloc[:, 1:].values, df_stroke.iloc[:, 0].values

# Set different weights for positive and negative results in SVM if required
# This will adjust balance between sensitivity and specificity
# For equal weighting, set at 1
positive_class_weight_ratio = 1
class_weights = calculate_class_weights(positive_class_weight_ratio)

# Set up stratified k-fold
splits = 10
skf = StratifiedKFold(n_splits = splits)
skf.get_n_splits(X, y)

# %% Set up results dataframes
forest_results = np.zeros((splits, 10))
forest_importance = np.zeros((splits, number_of_features))
svm_results_linear = np.zeros((splits, 10))
svm_results_rbf = np.zeros((splits, 10))
lr_results = np.zeros((splits, 10))
nn_results = np.zeros((splits, 10))

# %% Loop through the k splits of training/test data
loop_count = 0

for train_index, test_index in skf.split(X, y):
    
    print ('Split', loop_count + 1, 'out of', splits)

    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    sc = StandardScaler()  # new Standard Scalar object
    sc.fit(X_train)
    X_train_std = sc.transform(X_train)
    X_test_std = sc.transform(X_test)
    combined_results = pd.DataFrame()

    # %% Random forests
    forest = RandomForestClassifier(n_estimators=1000, n_jobs=-1, 
                                    class_weight='balanced')
    forest.fit(X_train, y_train)
    forest_importance[loop_count, :] = forest.feature_importances_
    y_pred = forest.predict(X_test)
    test_results = pd.DataFrame(np.vstack((y_test, y_pred)).T)
    diagnostic_performance = (calculate_diagnostic_performance
                              (test_results.values))
    forest_results[loop_count, :] = diagnostic_performance
    combined_results['Forest'] = y_pred

    # %% SVM (Support Vector Machine) Linear
    svm = SVC(kernel='linear', C=1.0, class_weight=class_weights)
    svm.fit(X_train_std, y_train)
    y_pred = svm.predict(X_test_std)
    test_results = pd.DataFrame(np.vstack((y_test, y_pred)).T)
    diagnostic_performance = (calculate_diagnostic_performance
                              (test_results.values))
    svm_results_linear[loop_count, :] = diagnostic_performance
    combined_results['SVM_linear'] = y_pred

    # %% SVM (Support Vector Machine) RBF
    svm = SVC(kernel='rbf', C=1.0)
    svm.fit(X_train_std, y_train)
    y_pred = svm.predict(X_test_std)
    test_results = pd.DataFrame(np.vstack((y_test, y_pred)).T)
    diagnostic_performance = (calculate_diagnostic_performance
                              (test_results.values))
    svm_results_rbf[loop_count, :] = diagnostic_performance
    combined_results['SVM_rbf'] = y_pred

    # %% Logistic Regression
    lr = LogisticRegression(C=100, class_weight=class_weights)
    lr.fit(X_train_std, y_train)
    y_pred = lr.predict(X_test_std)
    test_results = pd.DataFrame(np.vstack((y_test, y_pred)).T)
    diagnostic_performance = (calculate_diagnostic_performance
                              (test_results.values))
    lr_results[loop_count, :] = diagnostic_performance
    combined_results['LR'] = y_pred

    # %% Neural Network
    clf = MLPClassifier(solver='lbfgs', alpha=1e-8, hidden_layer_sizes=(50, 5),
                        max_iter=100000, shuffle=True, learning_rate_init=0.001,
                        activation='relu', learning_rate='constant', tol=1e-7)
    clf.fit(X_train_std, y_train)
    y_pred = clf.predict(X_test_std)
    test_results = pd.DataFrame(np.vstack((y_test, y_pred)).T)
    diagnostic_performance = (calculate_diagnostic_performance
                              (test_results.values))
    nn_results[loop_count, :] = diagnostic_performance
    combined_results['NN'] = y_pred
    
    # Increment loop count
    loop_count += 1

# %% Transfer results to Pandas arrays
results_summary = pd.DataFrame()

results_column_names = (['accuracy', 'sensitivity', 
                         'specificity',
                         'positive likelihood', 
                         'negative likelihood', 
                         'false positive rate', 
                         'false negative rate',
                         'positive predictive value',
                         'negative predictive value', 
                         'positive rate'])

forest_results_df = pd.DataFrame(forest_results)
forest_results_df.columns = results_column_names
forest_importance_df = pd.DataFrame(forest_importance)
forest_importance_df.columns = feat_labels
results_summary['Forest'] = forest_results_df.mean()

svm_results_lin_df = pd.DataFrame(svm_results_linear)
svm_results_lin_df.columns = results_column_names
results_summary['SVM_lin'] = svm_results_lin_df.mean()

svm_results_rbf_df = pd.DataFrame(svm_results_rbf)
svm_results_rbf_df.columns = results_column_names
results_summary['SVM_rbf'] = svm_results_rbf_df.mean()

lr_results_df = pd.DataFrame(lr_results)
lr_results_df.columns = results_column_names
results_summary['LR'] = lr_results_df.mean()

nn_results_df = pd.DataFrame(nn_results)
nn_results_df.columns = results_column_names
results_summary['Neural'] = nn_results_df.mean()


# %% Print summary results
print()
print('Results Summary:')
print(results_summary)

# %% Save files
# NOT USED IN THIS DEMO
# forest_results_df.to_csv('results/forest_results.csv')
# forest_importance_df.to_csv('results/forest_importance.csv')
# svm_results_lin_df.to_csv('results/svm_lin_results.csv')
# svm_results_rbf_df.to_csv('results/svm_rbf_results.csv')
# lr_results_df.to_csv('results/logistic_results.csv')
# nn_results_df.to_csv('results/neural_network_results.csv')
# results_summary.to_csv('results/results_summary.csv')

Output:
Results Summary:
                             Forest   SVM_lin   SVM_rbf        LR    Neural
accuracy                   0.851946  0.839995  0.843081  0.839610  0.801859
sensitivity                0.727978  0.767511  0.741951  0.753473  0.702353
specificity                0.905567  0.871350  0.886804  0.876865  0.844867
positive likelihood        8.799893  7.396559  7.384775  7.390298  4.909178
negative likelihood        0.297522  0.263269  0.287613  0.276478  0.349459
false positive rate        0.094433  0.128650  0.113196  0.123135  0.155133
false negative rate        0.272022  0.232489  0.258049  0.246527  0.297647
positive predictive value  0.775919  0.731363  0.747641  0.737093  0.669270
negative predictive value  0.887152  0.898619  0.890479  0.894471  0.869677
positive rate              0.285678  0.321505  0.302999  0.313414  0.320310

73. Machine learning: neural networks

The last of our machine learning methods that we will look at in this introduction is neural networks.

Neural networks power much of modern image and voice recongition. They can cope with highly complex data, but often take large amounts of data to train well. There are many parameters that can be changes, so fine-tuning a neural net can require extensive work. We will not go into all the ways they may be fine-tuned here, but just look at a simple example. Continue reading “73. Machine learning: neural networks”