104: Using free text for classification – ‘Bag of Words’

There may be times in healthcare where we would like to classify patients based on free text data we have for them. Maybe, for example, we would like to predict likely outcome based on free text clinical notes.

Using free text requires methods known as ‘Natural Language Processing’.

Here we start with one of the simplest techniques – ‘bag of words’.

In a ‘bag of words’, free text is reduced to a vector (a series of numbers) that represents the number of times each word is used in the text we are given. It is also possible to look at series of two, three or more consecutive words, in case the use of two or more words together helps to classify a patient.
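
As a quick illustration (a minimal sketch on two invented sentences, not part of the IMDb example below), a bag of words is just a table of word counts:

from sklearn.feature_extraction.text import CountVectorizer

toy_reviews = ['the film was great great fun', 'the film was dull']
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_reviews).toarray()

# One column per word, one row of counts per sentence
# (use get_feature_names_out in newer versions of scikit-learn)
print (toy_vectorizer.get_feature_names())
print (toy_counts)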

A classic ‘toy problem’ used to help teach or develop methods is to try to judge whether people rated a film as ‘liked’ or ‘did not like’ based on the free text they entered into a widely used internet film review database (www.imdb.com).

Here we will use 50,000 records from IMDb to convert each review into a ‘bag of words’, which we will then use in a simple logistic regression machine learning model.

We can use raw word counts, but in this case we’ll add an extra transformation called tf-idf (term frequency–inverse document frequency), which adjusts values according to the number of reviews that use the word. Words that occur across many reviews may be less discriminatory than words that occur more rarely, so tf-idf reduces the value of those words used frequently across reviews.
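
As an aside, a minimal sketch (again with invented sentences) of how the tf-idf transform down-weights words that appear in every review relative to rarer words:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

toy_reviews = ['the film was great', 'the film was dull', 'the film was fun']
toy_counts = CountVectorizer().fit_transform(toy_reviews)
toy_tfidf = TfidfTransformer().fit_transform(toy_counts).toarray()

# Within each row, 'the', 'film' and 'was' (used in every review) receive
# lower tf-idf values than the word unique to that review
print (toy_tfidf)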

This code will take us through the following steps:

1) Load data from internet

2) Clean data – remove non-text, convert to lower case, reduce words to their ‘stems’ (see below for details), and remove common ‘stop-words’ (such as ‘as’, ‘the’, ‘of’).

3) Split data into training and test data sets

4) Convert cleaned reviews into word vectors (‘bag of words’), and apply the tf-idf transform.

5) Train a logistic regression model on the tf-idf transformed word vectors.

6) Apply the logistic regression model to our previously unseen test cases, and calculate the accuracy of our model.

Load data

import pandas as pd

# If you do not already have the data locally you may download (and save) it by running:

file_location = 'https://gitlab.com/michaelallen1966/00_python_snippets' +\
    '_and_recipes/raw/master/machine_learning/data/IMDb.csv'
imdb = pd.read_csv(file_location)
# save to current directory
imdb.to_csv('imdb.csv', index=False)

# If you already have the data locally then you may run the following

# Load data example
imdb = pd.read_csv('imdb.csv')

# Truncate data if you want to speed up the example
# imdb = imdb.head(5000)

Define Function to preprocess data

This function works on raw text strings, and:

1) changes to lower case

2) tokenizes (breaks text down into words)

3) removes punctuation and non-word text

4) finds word stems

5) removes stop words

6) rejoins meaningful stem words

import nltk
import pandas as pd
import numpy as np
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# If not previously performed:
# nltk.download('stopwords')

stemming = PorterStemmer()
stops = set(stopwords.words("english"))

def apply_cleaning_function_to_list(X):
    cleaned_X = []
    for element in X:
        cleaned_X.append(clean_text(element))
    return cleaned_X


def clean_text(raw_text):
    """This function works on a raw text string, and:
        1) changes to lower case
        2) tokenizes (breaks text down into words)
        3) removes punctuation and non-word text
        4) finds word stems
        5) removes stop words
        6) rejoins meaningful stem words"""
    
    # Convert to lower case
    text = raw_text.lower()
    
    # Tokenize
    tokens = nltk.word_tokenize(text)
    
    # Keep only words (removes punctuation + numbers)
    # use .isalnum() to also keep numbers
    token_words = [w for w in tokens if w.isalpha()]
    
    # Stemming
    stemmed_words = [stemming.stem(w) for w in token_words]
    
    # Remove stop words
    meaningful_words = [w for w in stemmed_words if not w in stops]
    
    # Rejoin meaningful stemmed words
    joined_words = ( " ".join(meaningful_words))
    
    # Return cleaned data
    return joined_words
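
As a quick check (using an invented sentence rather than an IMDb review), the cleaning function strips punctuation and stop words and reduces words to their stems:

print (clean_text('The runners were running happily through the woods!'))
# Expected output (approximately): 'runner run happili wood'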

Apply the data cleaning function (this may take a few minutes if you are using the full 50,000 reviews).

# Get text to clean
text_to_clean = list(imdb['review'])

# Clean text
cleaned_text = apply_cleaning_function_to_list(text_to_clean)

# Add cleaned data back into DataFrame
imdb['cleaned_review'] = cleaned_text

# Remove temporary cleaned_text list (after transfer to DataFrame)
del cleaned_text

Split data into training and test data sets

from sklearn.model_selection import train_test_split
X = list(imdb['cleaned_review'])
y = list(imdb['sentiment'])
X_train, X_test, y_train, y_test = train_test_split(
    X,y, test_size = 0.25)

Create ‘bag of words’

The ‘bag of words’ is the word vector for each review. This may be a simple word count for each review where each position of the vector represents a word (returned in the ‘vocab’ list) and the value of that position represents the number of times that word is used in the review.

The function below also returns a tf-idf (term frequency–inverse document frequency) transform, which adjusts values according to the number of reviews that use the word. Words that occur across many reviews may be less discriminatory than words that occur more rarely. The tf-idf transform reduces the value of a given word in proportion to the number of documents that it appears in.

The function returns the following:

1) vectorizer – this may be applied to any new reviews to convert the review into the same word vector as the training set.

2) vocab – the list of words that the word vectors refer to.

3) train_data_features – raw word count vectors for each review

4) tfidf_features – tf-idf transformed word vectors

5) tfidf – the tf-idf transformation that may be applied to new reviews to convert the raw word counts into the transformed word counts in the same way as the training data.

Our vectorizer has an argument called ‘ngram_range’. A simple bag of words divides reviews into single words. If we have an ngram_range of (1,2) it means that the review is divided into single words and also pairs of consecutive words. This may be useful when pairs of words carry meaning, such as ‘very good’. The max_features argument limits the size of the word vector, in this case to a maximum of 10,000 words (or 10,000 ngrams, where an ngram may be more than one word).
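
As an illustration of what ngram_range = (1,2) produces, a minimal sketch on a single invented phrase (separate from the IMDb pipeline):

from sklearn.feature_extraction.text import CountVectorizer

ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))
ngram_vectorizer.fit(['not very good film'])

# Single words plus pairs of consecutive words
# (use get_feature_names_out in newer versions of scikit-learn)
print (ngram_vectorizer.get_feature_names())
# ['film', 'good', 'good film', 'not', 'not very', 'very', 'very good']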

def create_bag_of_words(X):
    from sklearn.feature_extraction.text import CountVectorizer
    
    print ('Creating bag of words...')
    # Initialize the "CountVectorizer" object, which is scikit-learn's
    # bag of words tool.  
    
    # In this example features may be single words or two consecutive words
    # (as set by ngram_range = (1,2))
    vectorizer = CountVectorizer(analyzer = "word",   \
                                 tokenizer = None,    \
                                 preprocessor = None, \
                                 stop_words = None,   \
                                 ngram_range = (1,2), \
                                 max_features = 10000
                                ) 

    # fit_transform() does two functions: First, it fits the model
    # and learns the vocabulary; second, it transforms our training data
    # into feature vectors. The input to fit_transform should be a list of 
    # strings. The output is a sparse array
    train_data_features = vectorizer.fit_transform(X)
    
    # Convert to a NumPy array for ease of handling
    train_data_features = train_data_features.toarray()
    
    # tfidf transform
    from sklearn.feature_extraction.text import TfidfTransformer
    tfidf = TfidfTransformer()
    tfidf_features = tfidf.fit_transform(train_data_features).toarray()

    # Get words in the vocabulary
    vocab = vectorizer.get_feature_names()
   
    return vectorizer, vocab, train_data_features, tfidf_features, tfidf

Apply our bag of words function to our training set.

vectorizer, vocab, train_data_features, tfidf_features, tfidf  = \
    create_bag_of_words(X_train)

We can create a DataFrame of our words and counts, so that we may sort and view them. The count and tfidf_features exist for each X (each review in this case) – here we will look at just the first review (index 0).

Note that the tfidf_features differ from the count; that is because of the adjustment for how commonly they occur across reviews.

(Try changing the sort to sort by tfidf_features; an example is shown after the output below.)

bag_dictionary = pd.DataFrame()
bag_dictionary['ngram'] = vocab
bag_dictionary['count'] = train_data_features[0]
bag_dictionary['tfidf_features'] = tfidf_features[0]

# Sort by raw count
bag_dictionary.sort_values(by=['count'], ascending=False, inplace=True)
# Show top 10
print(bag_dictionary.head(10))

Out:

         ngram  count  tfidf_features
9320        wa      4        0.139373
5528      movi      3        0.105926
9728     whole      2        0.160024
3473    german      2        0.249079
6327      part      2        0.140005
293   american      1        0.089644
9409   wa kind      1        0.160155
9576      wast      1        0.087894
7380       saw      1        0.078477
7599      sens      1        0.085879
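
For comparison, the same table can be sorted by the tf-idf value rather than the raw count (a small usage sketch using the bag_dictionary DataFrame above):

# Sort by tf-idf value instead of raw count
bag_dictionary.sort_values(by=['tfidf_features'], ascending=False, inplace=True)
print(bag_dictionary.head(10))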

Training a machine learning model on the bag of words

Now we have transformed our free text reviews into vectors of numbers (representing words), we can apply many different machine learning techniques. Here we will use a relatively simple one, logistic regression.

We’ll set up a function to train a logistic regression model.

def train_logistic_regression(features, label):
    print ("Training the logistic regression model...")
    from sklearn.linear_model import LogisticRegression
    ml_model = LogisticRegression(C = 100,random_state = 0)
    ml_model.fit(features, label)
    print ('Finished')
    return ml_model

Now we will use the tf-idf transformed word vectors to train the model (we could instead use the plain word counts contained in ‘train_data_features’ rather than ‘tfidf_features’). We pass both the features and the known label corresponding to each review (the sentiment, either 0 or 1 depending on whether a person liked the film or not).

ml_model = train_logistic_regression(tfidf_features, y_train)

Applying the bag of words model to test reviews

We will now apply the bag of words model to test reviews, and assess the accuracy.

We’ll first apply our vectorizer to create a word vector for each review in the test data set.

test_data_features = vectorizer.transform(X_test)
# Convert to numpy array
test_data_features = test_data_features.toarray()

As we are using the tf-idf transform, we’ll apply the tf-idf transformer so that word vectors are transformed in the same way as the training data set.

# Use transform (not fit_transform) so that the idf weights learned from the
# training data are applied to the test data
test_data_tfidf_features = tfidf.transform(test_data_features)
# Convert to numpy array
test_data_tfidf_features = test_data_tfidf_features.toarray()

Now the bit that we really want to do – we’ll predict the sentiment of all the test reviews (and it’s just a single line of code!). Did they like the film or not?

predicted_y = ml_model.predict(test_data_tfidf_features)
correctly_identified_y = predicted_y == y_test
accuracy = np.mean(correctly_identified_y) * 100
print ('Accuracy = %.0f%%' %accuracy)

Out:

Accuracy = 87%

97: Simple machine learning model to predict emergency department (ED) breaches of the four-hour target

In England, emergency departments have a target that 95% of patients should be admitted or discharged from the ED within four hours. Patients waiting more than four hours are known as ‘breaches’.

This notebook explores predicting emergency department (ED) breaches (patients taking more than 4 hours to be discharged or admitted). The data is from a real mid-sized acute hospital in England.

The model receives data every 2 hours and predicts whether there will be a breach in the next 2 hours.

It uses some basic ED data alongside whole-hospital data (number of occupied beds and total beds) to try to predict whether there are likely to be breaches in the next two hours. It uses a simple logistic regression model to achieve 80% accuracy in predicting breaches. Sensitivity may be adjusted to balance accuracy in predicting breaching and non-breaching episodes (80% accuracy may be simultaneously achieved in both).

import pandas as pd
data = pd.read_csv('ed_1.csv')

Show data columns:

print (list(data))
['snapshot_id', 'snapshot_date', 'snapshot_time', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'Number of Patients In department >= 4 Hours', 'Total Number of Patients in the Department', 'Number of Patients in Resus', 'Number of Patients Registered in Last 60 Minutes', 'Number of Patients Waiting Triage', 'Number of Patients Waiting to be Seen (ED)', 'Number of Patients Waiting to be Seen (Medical)', 'Number of Patients Waiting to be Seen (Surgery)', 'Number of Patients > 3 Hours', 'Number of Patients Waiting a Bed', 'Number of Patients Left Department in Last 60 Minutes', 'Free_beds', 'Breach_in_next_timeslot']

Separate data into features (X) and the label (y) to predict. y is whether there are breaches in the following 2 hours.

X = data.loc[:,"Monday":"Free_beds"]
y = data['Breach_in_next_timeslot']

Let’s see what proportion of 2 hour epochs have a breach:

print (data['Breach_in_next_timeslot'].mean())
0.6575510659671838
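
As an aside, about 66% of 2 hour epochs contain a breach, so a model that always predicted ‘breach’ would already be about 66% accurate; the model below needs to beat that baseline. The split can also be shown with value_counts:

# Proportion of epochs with (1) and without (0) a breach
print (data['Breach_in_next_timeslot'].value_counts(normalize=True))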

Split data into training and test sets

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25)

Normalise data with standard scaling

from sklearn.preprocessing import StandardScaler

# Initialise a new scaling object for normalising input data
sc=StandardScaler()

# Set up the scaler just on the training set
sc.fit(X_train)

# Apply the scaler to the training and test sets
X_train_std=sc.transform(X_train)
X_test_std=sc.transform(X_test)

Build a logistic regression model

C=1000 sets low regularisation. If the accuracy of the training data is significantly higher than the accuracy of the test data, C should be reduced in 10-fold or 3-fold steps to maximise the accuracy of the test data.

(Note: the ‘;’ at the end of the last line suppresses model description output in the Jupyter Notebook)

from sklearn.linear_model import LogisticRegression

ml = LogisticRegression(C=1000)
ml.fit(X_train_std,y_train);

Predict training and test set labels

Our model is now built. We can now predict breaches for the training and test sets. The results for the test set give the better measure of accuracy, but it is useful to calculate both to look for ‘over-fitting’. If the training data has significantly better accuracy than the test data then it is likely the model is ‘over-fitted’ to the training data, and the regularisation term (C) in the model fit above should be reduced step-wise. This will reduce the accuracy of predicting the training data, but will increase the accuracy of predicting the test data, though too much regularisation (too low a C) will reduce the accuracy of predicting both. A sketch of stepping through a range of C values is shown after the accuracy results below.

# Predict training and test set labels
y_pred_train = ml.predict(X_train_std)
y_pred_test = ml.predict(X_test_std)

Test accuracy

import numpy as np
accuracy_train = np.mean(y_pred_train == y_train)
accuracy_test = np.mean(y_pred_test == y_test)
print ('Accuracy of predicting training data =', accuracy_train)
print ('Accuracy of predicting test data =', accuracy_test)
Accuracy of predicting training data = 0.8111326090191993
Accuracy of predicting test data = 0.8151785714285714
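
If over-fitting were suspected, we could step through a range of C values as described above and compare training and test accuracy. A minimal sketch (not part of the original analysis) using the variables already defined:

for c in [0.01, 0.1, 1, 10, 100, 1000]:
    ml_c = LogisticRegression(C=c)
    ml_c.fit(X_train_std, y_train)
    train_acc = np.mean(ml_c.predict(X_train_std) == y_train)
    test_acc = np.mean(ml_c.predict(X_test_std) == y_test)
    print ('C =', c, 'training accuracy =', train_acc,
           'test accuracy =', test_acc)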

Display weights (coefficients) of model.

# Create table of weights
weights_table = pd.DataFrame()
weights_table['feature'] = list(X)
weights_table['weight'] = ml.coef_[0]
print(weights_table)
                                              feature    weight
0                                              Monday  0.038918
1                                             Tuesday -0.026935
2                                           Wednesday  0.001615
3                                            Thursday  0.001543
4                                              Friday -0.014975
5                                            Saturday  0.011287
6                                              Sunday -0.011401
7         Number of Patients In department >= 4 Hours  1.515722
8          Total Number of Patients in the Department  0.544407
9                         Number of Patients in Resus  0.307983
10   Number of Patients Registered in Last 60 Minutes -0.444304
11                  Number of Patients Waiting Triage  0.028371
12         Number of Patients Waiting to be Seen (ED)  0.138082
13    Number of Patients Waiting to be Seen (Medical) -0.036093
14    Number of Patients Waiting to be Seen (Surgery)  0.022757
15                       Number of Patients > 3 Hours  1.265580
16                   Number of Patients Waiting a Bed  0.013085
17  Number of Patients Left Department in Last 60 ... -0.001884
18                                          Free_beds -0.369558
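
To see which features influence the prediction most, the weights can be sorted by absolute size (a small usage sketch using the weights_table above):

# Sort features by absolute weight (most influential first)
weights_table['abs_weight'] = weights_table['weight'].abs()
print(weights_table.sort_values(by='abs_weight', ascending=False).head())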

 

Define a function for sensitivity and specificity

Sensitivity = proportion of breaching periods correctly identified
Specificity = proportion of non-breaching periods correctly identified

def calculate_sensitivity_specificity(y_test, y_pred_test):
    # Note: accuracy is also calculated and returned here, so measures other
    # than sensitivity and specificity may be reported if needed
    
    # Get true/false for whether a breach actually occurred
    actual_pos = y_test == 1
    actual_neg = y_test == 0
    
    # Get true and false test (true test match actual, false tests differ from actual)
    true_pos = (y_pred_test == 1) & (actual_pos)
    false_pos = (y_pred_test == 1) & (actual_neg)
    true_neg = (y_pred_test == 0) & (actual_neg)
    false_neg = (y_pred_test == 0) & (actual_pos)
    
    # Calculate accuracy
    accuracy = np.mean(y_pred_test == y_test)
    
    # Calculate sensitivity and specificity
    sensitivity = np.sum(true_pos) / np.sum(actual_pos)
    specificity = np.sum(true_neg) / np.sum(actual_neg)
    
    return sensitivity, specificity, accuracy

Show sensitivity and specificity:

sensitivity, specificity, accuracy = calculate_sensitivity_specificity(y_test, y_pred_test)
print ('Sensitivity:', sensitivity)
print ('Specificity:', specificity)
print ('Accuracy:', accuracy)
Sensitivity: 0.8488529014844804
Specificity: 0.7493403693931399
Accuracy: 0.8151785714285714

So we are better at detecting breaches than non-breaches. This is likely because breaching periods occur more often. Let’s adjust our model cut-off to balance the accuracy out. We’ll vary the cut-off we use and construct a sensitivity/specificity plot (very similar to a ‘Receiver Operating Characteristic’ or ‘ROC’ curve).

Balancing sensitivity and specificity

cutoff = np.arange(0.01, 1.01, 0.01)
sensitivity_results = []
specificity_results = []

# The logistic regression model has a .predict_proba method to return the
# probability of each outcome. (Some methods, such as SVC, use
# .decision_function to return a score instead.)
y_pred_probability = ml.predict_proba(X_test_std)

for threshold in cutoff:
    # Classify as positive when probability of positive class >= threshold
    y_pred_test = (y_pred_probability[:, 1] >= threshold)
    
    # Convert boolean to 0/1 (could also simply multiply by 1)
    y_pred_test = y_pred_test.astype(int)
    
    # Get sensitivity and specificity
    sensitivity, specificity, accuracy = \
        calculate_sensitivity_specificity(y_test, y_pred_test)
    
    # Add results to list of results
    sensitivity_results.append(sensitivity)
    specificity_results.append(specificity)
    

Plotting specificity against sensitivity:

import matplotlib.pyplot as plt

%matplotlib inline

fig = plt.figure(figsize=(5,5))
ax1 = fig.add_subplot(111)

x = sensitivity_results
y = specificity_results

ax1.grid(True, which='both')
ax1.set_xlabel('Sensitivity (proportion of breaching\nperiods predicted correctly)')
ax1.set_ylabel('Specificity (proportion of non-breaching\nperiods predicted correctly)')


plt.plot(x,y)
plt.show()
[Figure: plot of specificity against sensitivity as the classification cut-off is varied]

Plotting specificity against sensitivity shows that we can adjust our machine learning cut-off so that sensitivity and specificity are both around 80%, that is, 80% accuracy in predicting both breaching and non-breaching periods in the next 2 hours.
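
A small follow-up sketch (using the cutoff, sensitivity_results and specificity_results arrays from above) to find the threshold where sensitivity and specificity are closest to each other:

# Find the threshold where sensitivity and specificity are most nearly equal
difference = np.abs(np.array(sensitivity_results) - np.array(specificity_results))
best_index = np.argmin(difference)
print ('Balanced threshold: %.2f' %cutoff[best_index])
print ('Sensitivity: %.3f' %sensitivity_results[best_index])
print ('Specificity: %.3f' %specificity_results[best_index])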

85. Using free text for classification – ‘Bag of Words’

There may be times in healthcare where we would like to classify patients based on free text data we have for them. Maybe, for example, we would like to predict likely outcome based on free text clinical notes.

Using free text requires methods known as ‘Natural Language Processing’.

Here we start with one of the simplest techniques – ‘bag of words’.

In a ‘bag of words’, free text is reduced to a vector (a series of numbers) that represents the number of times each word is used in the text we are given. It is also possible to look at series of two, three or more consecutive words, in case the use of two or more words together helps to classify a patient.

A classic ‘toy problem’ used to help teach or develop methods is to try to judge whether people rated a film as ‘liked’ or ‘did not like’ based on the free text they entered into a widely used internet film review database (www.imdb.com).

Here we will use 50,000 records from IMDb to convert each review into a ‘bag of words’, which we will then use in a simple logistic regression machine learning model.

We can use raw word counts, but in this case we’ll add an extra transformation called tf-idf (term frequency–inverse document frequency), which adjusts values according to the number of reviews that use the word. Words that occur across many reviews may be less discriminatory than words that occur more rarely, so tf-idf reduces the value of those words used frequently across reviews.

This code will take us through the following steps:

1) Load data from internet, and split into training and test sets.

2) Clean data – remove non-text, convert to lower case, reduce words to their ‘stems’ (see below for details), and remove common ‘stop-words’ (such as ‘as’, ‘the’, ‘of’).

3) Convert cleaned reviews into word vectors (‘bag of words’), and apply the tf-idf transform.

4) Train a logistic regression model on the tf-idf transformed word vectors.

5) Apply the logistic regression model to our previously unseen test cases, and calculate accuracy of our model.

Load data

This will load the IMDb data from the web. It is loaded into a Pandas DataFrame.

import pandas as pd
import numpy as np

file_location = 'https://gitlab.com/michaelallen1966/00_python_snippets_and_recipes/raw/master/machine_learning/data/IMDb.csv'
data = pd.read_csv(file_location)

Show the size of the data set (rows, columns).

data.shape
Out:
(50000, 2)

Show the data fields.

list(data)
Out:
['review', 'sentiment']

Show the first record review and recorded sentiment (which will be 0 for not liked, or 1 for liked).

print ('Review:')
print (data['review'].iloc[0])
print ('\nSentiment (label):')
print (data['sentiment'].iloc[0])
  Out:
Review:
I have no read the novel on which "The Kite Runner" is based. My wife and daughter, who did, thought the movie fell a long way short of the book, and I'm prepared to take their word for it. But, on its own, the movie is good -- not great but good. How accurately does it portray the havoc created by the Soviet invasion of Afghanistan? How convincingly does it show the intolerant Taliban regime that followed? I'd rate it C+ on the first and B+ on the second. The human story, the Afghan-American who returned to the country to rescue the son of his childhood playmate, is well done but it is on this count particularly that I'm told the book was far more convincing than the movie. The most exciting part of the film, however -- the kite contests in Kabul and, later, a mini-contest in California -- cannot have been equaled by the book. I'd wager money on that.

Sentiment (label):
1

Splitting the data into training and test sets

Split the data into 70% training data and 30% test data. The model will be trained using the training data, and accuracy will be tested using the independent test data.

from sklearn.model_selection import train_test_split
X = list(data['review'])
y = list(data['sentiment'])
X_train, X_test, y_train, y_test = train_test_split(
    X,y, test_size = 0.3, random_state = 0)

Defining a function to clean the text

This function will:

1) Remove any HTML tags in the text

2) Remove non-letters (e.g. punctuation)

3) Convert all words to lower case

4) Remove stop words (stop words are commonly used words like ‘and’ and ‘the’ which have little value in a bag of words). If the stop words list is not already installed then open a Python terminal and type the two following lines of code (these instructions will also be given when running this code if the stop words have not already been downloaded onto the computer running this code):

import nltk

nltk.download("stopwords")

5) Reduce words to their stems (e.g. ‘runner’, ‘running’, and ‘runs’ will all be converted to ‘run’)

6) Join words back up into a single string

def clean_text(raw_review):
    # Function to convert a raw review to a string of words
    
    # Import modules
    from bs4 import BeautifulSoup
    import re

    # Remove HTML
    review_text = BeautifulSoup(raw_review, 'lxml').get_text() 

    # Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", review_text) 
    
    # Convert to lower case, split into individual words
    words = letters_only.lower().split()   

    # Remove stop words (use of sets makes this faster)
    from nltk.corpus import stopwords
    stops = set(stopwords.words("english"))                  
    meaningful_words = [w for w in words if not w in stops]                             

    # Reduce word to stem of word
    from nltk.stem.porter import PorterStemmer
    porter = PorterStemmer()
    stemmed_words = [porter.stem(w) for w in meaningful_words]

    # Join the words back into one string separated by space
    joined_words = ( " ".join( stemmed_words ))
    return joined_words 

Now we will define a function that applies the cleaning function to a series of records (the clean_text function works on one string of text at a time).

def apply_cleaning_function_to_series(X):
    print('Cleaning data')
    cleaned_X = []
    for element in X:
        cleaned_X.append(clean_text(element))
    print ('Finished')
    return cleaned_X

We will call the cleaning functions to clean the text of both the training and the test data. This may take a little time.

X_train_clean = apply_cleaning_function_to_series(X_train)
X_test_clean = apply_cleaning_function_to_series(X_test)
  Out:
Cleaning data
Finished
Cleaning data
Finished

Create ‘bag of words’

The ‘bag of words’ is the word vector for each review. This may be a simple word count for each review where each position of the vector represents a word (returned in the ‘vocab’ list) and the value of that position represents the number of times that word is used in the review.

The function below also returns a tf-idf (term frequency–inverse document frequency) transform, which adjusts values according to the number of reviews that use the word. Words that occur across many reviews may be less discriminatory than words that occur more rarely. The tf-idf transform reduces the value of a given word in proportion to the number of documents that it appears in.

The function returns the following:

1) vectorizer – this may be applied to any new reviews to convert the review into the same word vector as the training set.

2) vocab – the list of words that the word vectors refer to.

3) train_data_features – raw word count vectors for each review

4) tfidf_features – tf-idf transformed word vectors

5) tfidf – the tf-idf transformation that may be applied to new reviews to convert the raw word counts into the transformed word counts in the same way as the training data.

Our vectorizer has an argument called ‘ngram_range’. A simple bag of words divides reviews into single words. If we have an ngram_range of (1,2) it means that the review is divided into single words and also pairs of consecutive words. This may be useful when pairs of words carry meaning, such as ‘very good’. The max_features argument limits the size of the word vector, in this case to a maximum of 10,000 words (or 10,000 ngrams, where an ngram may be more than one word).

def create_bag_of_words(X):
    from sklearn.feature_extraction.text import CountVectorizer
    
    print ('Creating bag of words...')
    # Initialize the "CountVectorizer" object, which is scikit-learn's
    # bag of words tool.  
    
    # In this example features may be single words or two consecutive words
    vectorizer = CountVectorizer(analyzer = "word",   \
                                 tokenizer = None,    \
                                 preprocessor = None, \
                                 stop_words = None,   \
                                 ngram_range = (1,2), \
                                 max_features = 10000) 

    # fit_transform() does two functions: First, it fits the model
    # and learns the vocabulary; second, it transforms our training data
    # into feature vectors. The input to fit_transform should be a list of 
    # strings. The output is a sparse array
    train_data_features = vectorizer.fit_transform(X)
    
    # Convert to a NumPy array for ease of handling
    train_data_features = train_data_features.toarray()
    
    # tfidf transform
    from sklearn.feature_extraction.text import TfidfTransformer
    tfidf = TfidfTransformer()
    tfidf_features = tfidf.fit_transform(train_data_features).toarray()

    # Take a look at the words in the vocabulary
    vocab = vectorizer.get_feature_names()
   
    return vectorizer, vocab, train_data_features, tfidf_features, tfidf

We will apply our bag_of_words function to our training set. Again this might take a little time.

vectorizer, vocab, train_data_features, tfidf_features, tfidf  = (
        create_bag_of_words(X_train_clean))
  Out:
Creating bag of words...

Let’s look at some items from the vocab list (positions 40-44). Some of the words may seem odd; that is because of the stemming.

vocab[40:45]
Out:
['accomplish', 'accord', 'account', 'accur', 'accuraci']

And we can see the raw word count represented in train_data_features.

train_data_features[0][40:45]
Out:
array([0, 0, 1, 0, 0], dtype=int64)

If we look at the tf-idf transform we can see the values reduced (words occurring in many documents will have their value reduced the most).

tfidf_features[0][40:45]
Out:
array([0.        , 0.        , 0.06988648, 0.        , 0.        ])

Training a machine learning model on the bag of words

Now we have transformed our free text reviews into vectors of numbers (representing words), we can apply many different machine learning techniques. Here we will use a relatively simple one, logistic regression.

We’ll set up a function to train a logistic regression model.

def train_logistic_regression(features, label):
    print ("Training the logistic regression model...")
    from sklearn.linear_model import LogisticRegression
    ml_model = LogisticRegression(C = 100,random_state = 0)
    ml_model.fit(features, label)
    print ('Finished')
    return ml_model

Now we will use the tf-idf transformed word vectors to train the model (we could instead use the plain word counts contained in ‘train_data_features’ rather than ‘tfidf_features’). We pass both the features and the known label corresponding to each review (the sentiment, either 0 or 1 depending on whether a person liked the film or not).

ml_model = train_logistic_regression(tfidf_features, y_train)
  Out:
Training the logistic regression model...
Finished

Applying the bag of words model to test reviews

We will now apply the bag of words model to test reviews, and assess the accuracy.

We’ll first apply our vectorizer to create a word vector for each review in the test data set.

test_data_features = vectorizer.transform(X_test_clean)
# Convert to numpy array
test_data_features = test_data_features.toarray()

As we are using the tf-idf transform, we’ll apply the tf-idf transformer so that word vectors are transformed in the same way as the training data set.

# Use transform (not fit_transform) so that the idf weights learned from the
# training data are applied to the test data
test_data_tfidf_features = tfidf.transform(test_data_features)
# Convert to numpy array
test_data_tfidf_features = test_data_tfidf_features.toarray()

Now the bit that we really want to do – we’ll predict the sentiment of all the test reviews (and it’s just a single line of code!). Did they like the film or not?

predicted_y = ml_model.predict(test_data_tfidf_features)

Now we’ll compare the predicted sentiment to the actual sentiment, and show the overall accuracy of this model.

correctly_identified_y = predicted_y == y_test
accuracy = np.mean(correctly_identified_y) * 100
print ('Accuracy = %.0f%%' %accuracy)
 Out:
Accuracy = 87%

87% accuracy. That’s not bad for a simple Natural Language Processing model, using logistic regression.

75. Machine learning: Choosing between models with stratified k-fold validation

In previous examples we have used multiple random sampling in order to obtain a better measurement of accuracy for models (repeating the model with different random training/test splits).

A more robust method is to use ‘stratified k-fold validation’. In this method the model is repeated k times, so that all the data is used once, but only once, as part of the test set. This, alone, is k-fold validation. Stratified k-fold validation adds an extra level of robustness by ensuring that in each of the k training/test splits, the balance of outcomes represents the balance of outcomes in the overall data set. Most commonly 10 different splits of the data are used.

In this example we shall load up some data on treatment of acute stroke (data will be loaded from the internet). The model will try to predict whether patients are treated with a clot-busting drug. We will compare a number of different models using stratified k-fold validation.

We set this up with the commands:

from sklearn.model_selection import StratifiedKFold
splits = 10
skf = StratifiedKFold(n_splits = splits)
skf.get_n_splits(X, y)

And then we loop through the k splits with:

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

The full code:

"""Techniques applied:
    1. Random Forests
    2. Support Vector Machine (linear and rbf kernel)
    3. Logistic Regression
    4. Neural Network
"""

# %% Load modules

import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold


# %% Function to calculate sensitivity and specificity
def calculate_diagnostic_performance(actual_predicted):
    """ Calculate sensitivty and specificty.
    Takes a Numpy array of 1 and zero, two columns: actual and predicted
    Returns a tuple of results:
    1) accuracy: proportion of test results that are correct    
    2) sensitivity: proportion of true +ve identified
    3) specificity: proportion of true -ve identified
    4) positive likelihood: increased probability of true +ve if test +ve
    5) negative likelihood: reduced probability of true +ve if test -ve
    6) false positive rate: proportion of false +ves in true -ve patients
    7) false negative rate:  proportion of false -ves in true +ve patients
    8) positive predictive value: chance of true +ve if test +ve
    9) negative predictive value: chance of true -ve if test -ve
    10) Count of test positives

    * false positive rate is the percentage of healthy individuals who
    incorrectly receive a positive test result
    * false negative rate is the percentage of diseased individuals who
    incorrectly receive a negative test result
    
    """
    actual_positives = actual_predicted[:, 0] == 1
    actual_negatives = actual_predicted[:, 0] == 0
    test_positives = actual_predicted[:, 1] == 1
    test_negatives = actual_predicted[:, 1] == 0
    test_correct = actual_predicted[:, 0] == actual_predicted[:, 1]
    accuracy = np.average(test_correct)
    true_positives = actual_positives & test_positives
    true_negatives = actual_negatives & test_negatives
    sensitivity = np.sum(true_positives) / np.sum(actual_positives)
    specificity = np.sum(true_negatives) / np.sum(actual_negatives)
    positive_likelihood = sensitivity / (1 - specificity)
    negative_likelihood = (1 - sensitivity) / specificity
    false_positive_rate = 1 - specificity
    false_negative_rate = 1 - sensitivity
    positive_predictive_value = np.sum(true_positives) / np.sum(test_positives)
    negative_predictive_value = np.sum(true_negatives) / np.sum(test_negatives)
    positive_rate = np.mean(actual_predicted[:, 1])
    return (accuracy, sensitivity, specificity, positive_likelihood,
            negative_likelihood, false_positive_rate, false_negative_rate,
            positive_predictive_value, negative_predictive_value,
            positive_rate)


# %% Print diagnostics results
def print_diagnostic_results(results):
    # format all results to three decimal places
    three_decimals = ["%.3f" % v for v in results]
    print()
    print('Diagnostic results')
    print('  accuracy:\t\t\t', three_decimals[0])
    print('  sensitivity:\t\t\t', three_decimals[1])
    print('  specificity:\t\t\t', three_decimals[2])
    print('  positive likelihood:\t\t', three_decimals[3])
    print('  negative likelihood:\t\t', three_decimals[4])
    print('  false positive rate:\t\t', three_decimals[5])
    print('  false negative rate:\t\t', three_decimals[6])
    print('  positive predictive value:\t', three_decimals[7])
    print('  negative predictive value:\t', three_decimals[8])
    print()


# %% Calculate weights from weights ratio:
# Set up class weighting to bias for sensitivity vs. specificity
# Higher values increase sensitivity at the cost of specificity
def calculate_class_weights(positive_class_weight_ratio):
    positive_weight = ( positive_class_weight_ratio / 
                       (1 + positive_class_weight_ratio))
    
    negative_weight = 1 - positive_weight
    class_weights = {0: negative_weight, 1: positive_weight}
    return (class_weights)

#%% Create results folder if needed
# (Not used in this demo)   
# OUTPUT_LOCATION = 'results'
# if not os.path.exists(OUTPUT_LOCATION):
#    os.makedirs(OUTPUT_LOCATION)
    
# %% Import data
url = ("https://raw.githubusercontent.com/MichaelAllen1966/wordpress_blog" +
       "/master/jupyter_notebooks/stroke.csv")
df_stroke = pd.read_csv(url)
feat_labels = list(df_stroke)[1:]
number_of_features = len(feat_labels)
X, y = df_stroke.iloc[:, 1:].values, df_stroke.iloc[:, 0].values

# Set different weights for positive and negative results in the SVM if required
# This will adjust balance between sensitivity and specificity
# For equal weighting, set at 1
positive_class_weight_ratio = 1
class_weights = calculate_class_weights(positive_class_weight_ratio)

# Set up stratified k-fold
splits = 10
skf = StratifiedKFold(n_splits = splits)
skf.get_n_splits(X, y)

# %% Set up results arrays (one row per k-fold split)
forest_results = np.zeros((splits, 10))
forest_importance = np.zeros((splits, number_of_features))
svm_results_linear = np.zeros((splits, 10))
svm_results_rbf = np.zeros((splits, 10))
lr_results = np.zeros((splits, 10))
nn_results = np.zeros((splits, 10))

# %% Loop through the k splits of training/test data
loop_count = 0

for train_index, test_index in skf.split(X, y):
    
    print ('Split', loop_count + 1, 'out of', splits)

    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    sc = StandardScaler()  # new StandardScaler object
    sc.fit(X_train)
    X_train_std = sc.transform(X_train)
    X_test_std = sc.transform(X_test)
    combined_results = pd.DataFrame()

    # %% Random forests
    forest = RandomForestClassifier(n_estimators=1000, n_jobs=-1, 
                                    class_weight='balanced')
    forest.fit(X_train, y_train)
    forest_importance[loop_count, :] = forest.feature_importances_
    y_pred = forest.predict(X_test)
    test_results = pd.DataFrame(np.vstack((y_test, y_pred)).T)
    diagnostic_performance = (calculate_diagnostic_performance
                              (test_results.values))
    forest_results[loop_count, :] = diagnostic_performance
    combined_results['Forest'] = y_pred

    # %% SVM (Support Vector Machine) Linear
    svm = SVC(kernel='linear', C=1.0, class_weight=class_weights)
    svm.fit(X_train_std, y_train)
    y_pred = svm.predict(X_test_std)
    test_results = pd.DataFrame(np.vstack((y_test, y_pred)).T)
    diagnostic_performance = (calculate_diagnostic_performance
                              (test_results.values))
    svm_results_linear[loop_count, :] = diagnostic_performance
    combined_results['SVM_linear'] = y_pred

    # %% SVM (Support Vector Machine) RBF
    svm = SVC(kernel='rbf', C=1.0)
    svm.fit(X_train_std, y_train)
    y_pred = svm.predict(X_test_std)
    test_results = pd.DataFrame(np.vstack((y_test, y_pred)).T)
    diagnostic_performance = (calculate_diagnostic_performance
                              (test_results.values))
    svm_results_rbf[loop_count, :] = diagnostic_performance
    combined_results['SVM_rbf'] = y_pred

    # %% Logistic Regression
    lr = LogisticRegression(C=100, class_weight=class_weights)
    lr.fit(X_train_std, y_train)
    y_pred = lr.predict(X_test_std)
    test_results = pd.DataFrame(np.vstack((y_test, y_pred)).T)
    diagnostic_performance = (calculate_diagnostic_performance
                              (test_results.values))
    lr_results[loop_count, :] = diagnostic_performance
    combined_results['LR'] = y_pred

    # %% Neural Network
    clf = MLPClassifier(solver='lbfgs', alpha=1e-8, hidden_layer_sizes=(50, 5),
                        max_iter=100000, shuffle=True, learning_rate_init=0.001,
                        activation='relu', learning_rate='constant', tol=1e-7)
    clf.fit(X_train_std, y_train)
    y_pred = clf.predict(X_test_std)
    test_results = pd.DataFrame(np.vstack((y_test, y_pred)).T)
    diagnostic_performance = (calculate_diagnostic_performance
                              (test_results.values))
    nn_results[loop_count, :] = diagnostic_performance
    combined_results['NN'] = y_pred
    
    # Increment loop count
    loop_count += 1

# %% Transfer results to Pandas DataFrames
results_summary = pd.DataFrame()

results_column_names = (['accuracy', 'sensitivity', 
                         'specificity',
                         'positive likelihood', 
                         'negative likelihood', 
                         'false positive rate', 
                         'false negative rate',
                         'positive predictive value',
                         'negative predictive value', 
                         'positive rate'])

forest_results_df = pd.DataFrame(forest_results)
forest_results_df.columns = results_column_names
forest_importance_df = pd.DataFrame(forest_importance)
forest_importance_df.columns = feat_labels
results_summary['Forest'] = forest_results_df.mean()

svm_results_lin_df = pd.DataFrame(svm_results_linear)
svm_results_lin_df.columns = results_column_names
results_summary['SVM_lin'] = svm_results_lin_df.mean()

svm_results_rbf_df = pd.DataFrame(svm_results_rbf)
svm_results_rbf_df.columns = results_column_names
results_summary['SVM_rbf'] = svm_results_rbf_df.mean()

lr_results_df = pd.DataFrame(lr_results)
lr_results_df.columns = results_column_names
results_summary['LR'] = lr_results_df.mean()

nn_results_df = pd.DataFrame(nn_results)
nn_results_df.columns = results_column_names
results_summary['Neural'] = nn_results_df.mean()


# %% Print summary results
print()
print('Results Summary:')
print(results_summary)

# %% Save files
# NOT USED IN THIS DEMO
# forest_results_df.to_csv('results/forest_results.csv')
# forest_importance_df.to_csv('results/forest_importance.csv')
# svm_results_lin_df.to_csv('results/svm_lin_results.csv')
# svm_results_rbf_df.to_csv('results/svm_rbf_results.csv')
# lr_results_df.to_csv('results/logistic_results.csv')
# nn_results_df.to_csv('results/neural_network_results.csv')
# results_summary.to_csv('results/results_summary.csv')
 Output:
Results Summary:
                             Forest   SVM_lin   SVM_rbf        LR    Neural
accuracy                   0.851946  0.839995  0.843081  0.839610  0.801859
sensitivity                0.727978  0.767511  0.741951  0.753473  0.702353
specificity                0.905567  0.871350  0.886804  0.876865  0.844867
positive likelihood        8.799893  7.396559  7.384775  7.390298  4.909178
negative likelihood        0.297522  0.263269  0.287613  0.276478  0.349459
false positive rate        0.094433  0.128650  0.113196  0.123135  0.155133
false negative rate        0.272022  0.232489  0.258049  0.246527  0.297647
positive predictive value  0.775919  0.731363  0.747641  0.737093  0.669270
negative predictive value  0.887152  0.898619  0.890479  0.894471  0.869677
positive rate              0.285678  0.321505  0.302999  0.313414  0.320310
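
A small follow-up sketch (using the results_summary DataFrame above, where rows are metrics and columns are models) to pick out the model with the highest mean accuracy:

# Model with the highest mean accuracy across the k folds
best_model = results_summary.loc['accuracy'].idxmax()
print ('Best model by accuracy:', best_model)
print (results_summary[best_model])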

68. Machine learning: Using regularisation to improve accuracy


Many machine learning techniques include an option to fine-tune regularisation. Regularisation helps to avoid over-fitting of the model to the training set, at the cost of accuracy of prediction for previously unseen samples in the test set. In the logistic regression method that we have been looking at, the regularisation term in the model fit is ‘C’. The lower the C value, the greater the regularisation. The previous code has been amended to loop through a series of C values. For each value of C the model fit is run 100 times with different random train/test splits, and the average results are presented.

67. Machine learning: Adding standard diagnostic performance metrics to a ml diagnosis model

Machine learning diagnostic performance measures:
accuracy = 0.937
sensitivity = 0.933
specificity = 0.943
positive_likelihood = 16.489
negative_likelihood = 0.071
false_positive_rate = 0.057
false_negative_rate = 0.067
positive_predictive_value = 0.966
negative_predictive_value = 0.893
precision = 0.966
recall = 0.933
f1 = 0.949

 


66. Machine learning. Your first ml model! Using logistic regression to diagnose breast cancer.

Here we will use the first of our machine learning algorithms to diagnose whether someone has a benign or malignant tumour. We are using a form of logistic regression. In common with many machine learning models, it incorporates a regularisation term which sacrifices a little accuracy in predicting outcomes in the training set for improved accuracy in predicting the outcomes of patients not used in the training set.