110. TensorFlow text-based classification – from raw text to prediction

Download the py file here: tensorflow.py

If you need help installing TensorFlow, see our guide on installing and using a TensorFlow environment.

Below is a worked example that uses text to classify whether a movie reviewer likes a movie or not.

The code goes through the following steps:
1. import libraries
2. load data
3. clean data
4. convert words to numbers
5. process data for tensorflow
6. build model
7. train model
8. predict outcome (like movie or not) for previously unseen reviews

Please also see the TensorFlow tutorials where the TensorFlow model building code came from:

https://www.tensorflow.org/tutorials/keras/basic_text_classification

https://www.tensorflow.org/tutorials/keras/overfit_and_underfit

"""
This example starts with raw text (movie reviews) and predicts whether the 
reviewer liked the movie.

The code goes through the following steps:
    1. import libraries
    2. load data
    3. clean data
    4. convert words to numbers
    5. process data for tensorflow
    6. build model
    7. train model
    8. predict outcome (like movie or not) for previously unseen reviews

For information on installing a tensorflow environment in Anaconda see:
https://pythonhealthcare.org/2018/12/19/106-installing-and-using-tensorflow-using-anaconda/

For installing anaconda see:
https://www.anaconda.com/download

We import necessary libraries.

If you are missing a library then, if using Anaconda, from a command line
(after activating the tensorflow environment) use:
    conda install library-name

If you find you are missing an nltk download then from a command line (after
activating the tensorflow environment) use:
    python (to begin a command line Python session)
    import nltk
    nltk.download('library name')
    or
    nltk.download() will open a dialogue box where you can install any/all
    nltk libraries
"""

###############################################################################
############################## IMPORT LIBRARIES ############################### 
###############################################################################

import numpy as np
import pandas as pd
import nltk
import tensorflow as tf

from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from tensorflow import keras


# If not previously performed:
# nltk.download('stopwords')

###############################################################################
################################## LOAD DATA ################################## 
###############################################################################

"""
Here we load up a csv file. Each line contains a text string and then a label.
An example is given to download the imdb dataset, which contains 50,000 movie
reviews. The label is 0 or 1 depending on whether the reviewer liked the movie.
"""
print ('Loading data')

# If you do not already have the data locally you may download (and save) by:
file_location = 'https://gitlab.com/michaelallen1966/00_python_snippets' +\
    '_and_recipes/raw/master/machine_learning/data/IMDb.csv'
data = pd.read_csv(file_location)
# save to current directory
data.to_csv('imdb.csv', index=False)

# If you already have the data locally then you may run the following
# data = pd.read_csv('imdb.csv')

# Change headings of dataframe to make them more universal
data.columns=['text','label']

# We'll now hold back 5% of the data for a final test that has not been used
# in training

number_of_records = data.shape[0]
number_to_hold_back = int(number_of_records * 0.05)
number_to_use = number_of_records - number_to_hold_back
# Take the held-back data from the end of the set before truncating, so that
# it does not overlap with the data used for training
data_held_back = data.tail(number_to_hold_back)
data = data.head(number_to_use)

###############################################################################
################################## CLEAN DATA ################################# 
###############################################################################

"""
Here we process the data in the following ways:
  1) change all text to lower case
  2) tokenize (breaks text down into a list of words)
  3) remove punctuation and non-word text
  4) find word stems (e.g. running, runs and run will all be converted to run)
  5) remove stop words (commonly occurring words of little value, e.g. 'the')
"""

stemming = PorterStemmer()
stops = set(stopwords.words("english"))

def apply_cleaning_function_to_list(X):
    cleaned_X = []
    for element in X:
        cleaned_X.append(clean_text(element))
    return cleaned_X

def clean_text(raw_text):
    """This function works on a raw text string, and:
        1) changes to lower case
        2) tokenizes (breaks text down into a list of words)
        3) removes punctuation and non-word text
        4) finds word stems
        5) removes stop words
        6) rejoins meaningful stem words"""
    
    # Convert to lower case
    text = raw_text.lower()
    
    # Tokenize
    tokens = nltk.word_tokenize(text)
    
    # Keep only words (removes punctuation + numbers)
    # use .isalnum to keep also numbers
    token_words = [w for w in tokens if w.isalpha()]
    
    # Stemming
    stemmed_words = [stemming.stem(w) for w in token_words]
    
    # Remove stop words
    meaningful_words = [w for w in stemmed_words if not w in stops]
      
    # Return cleaned data
    return meaningful_words

print ('Cleaning text')
# Get text to clean
text_to_clean = list(data['text'])

# Clean text and add to data
data['cleaned_text'] = apply_cleaning_function_to_list(text_to_clean)

###############################################################################
######################## CONVERT WORDS TO NUMBERS ############################# 
###############################################################################

"""
The frequency of all words is counted. Words are then given an index number so
that the most commonly occurring words have the lowest number (so the 
dictionary may then be truncated at any point to keep the most common words).
We avoid using the index number zero as we will use that later to 'pad' out
short text.
"""

def training_text_to_numbers(text, cutoff_for_rare_words = 1):
    """Function to convert text to numbers. Text must be tokenzied so that
    test is presented as a list of words. The index number for a word
    is based on its frequency (words occuring more often have a lower index).
    If a word does not occur as many times as cutoff_for_rare_words,
    then it is given a word index of zero. All rare words will be zero.
    """

    # Flatten list if sublists are present
    if len(text) > 1:
        flat_text = [item for sublist in text for item in sublist]
    else:
        flat_text = text
    
    # get word frequency
    fdist = nltk.FreqDist(flat_text)

    # Convert to Pandas dataframe
    df_fdist = pd.DataFrame.from_dict(fdist, orient='index')
    df_fdist.columns = ['Frequency']

    # Sort by word frequency
    df_fdist.sort_values(by=['Frequency'], ascending=False, inplace=True)

    # Add word index
    number_of_words = df_fdist.shape[0]
    df_fdist['word_index'] = list(np.arange(number_of_words)+1)
    
    # Convert pandas to dictionary
    word_dict = df_fdist['word_index'].to_dict()
    
    # Use dictionary to convert words in text to numbers
    text_numbers = []
    for string in text:
        string_numbers = [word_dict[word] for word in string]
        text_numbers.append(string_numbers)
    
    return (text_numbers, df_fdist)

# Call function to convert training text to numbers
print ('Convert text to numbers')
numbered_text, dict_df = \
    training_text_to_numbers(data['cleaned_text'].values)

# Keep only the 10,000 most frequent words (word index 1 to 10,000)
def limit_word_count(numbered_text):
    max_word_count = 10000
    filtered_text = []
    for number_list in numbered_text:
        filtered_line = \
            [number for number in number_list if number <=max_word_count]
        filtered_text.append(filtered_line)
        
    return filtered_text
    
data['numbered_text'] = limit_word_count(numbered_text)

# Pickle dataframe and dictionary dataframe (for later use if required)
data.to_pickle('data_numbered.p')
dict_df.to_pickle('data_dictionary_dataframe.p')

###############################################################################
######################### PROCESS DATA FOR TENSORFLOW ######################### 
###############################################################################

"""
Here we extract data from the pandas DataFrame, make all text vectors the same
length (by padding short texts and truncating long ones). We then split into
training and test data sets.
"""

print ('Process data for TensorFlow model')

# At this point pickled data (processed in an earlier run) might be loaded with 
# data=pd.read_pickle(file_name)
# dict_df=pd.read_pickle(filename)

# Get data from dataframe and put in X and y lists
X = list(data.numbered_text.values)
y = data.label.values

## MAKE ALL X DATA THE SAME LENGTH
# We will use keras to make all X data a length of 512.
# Shorter data will be padded with 0, longer data will be truncated.
# We have previously kept the index value zero free for this use.

processed_X = \
    keras.preprocessing.sequence.pad_sequences(X,
                                               value=0,
                                               padding='post',
                                               maxlen=512)

## SPLIT DATA INTO TRAINING AND TEST SETS

X_train, X_test, y_train, y_test=train_test_split(
        processed_X,y,test_size=0.2,random_state=999)

###############################################################################
########################## BUILD MODEL AND OPTIMIZER ########################## 
###############################################################################

"""
Here we construct a four-layer neural network with keras/tensorflow.
The first layer is the input layer, then we have two hidden layers, and an
output layer.
"""

print ('Build model')

# BUILD MODEL

# input shape is the vocabulary count used for the text-to-number conversion 
# (10,000 words plus one for our zero padding)
vocab_size = 10001

###############################################################################
# Info on neural network layers
#
# The layers are stacked sequentially to build the classifier:
#
# The first layer is an Embedding layer. This layer takes the integer-encoded 
# vocabulary and looks up the embedding vector for each word-index. These 
# vectors are learned as the model trains. The vectors add a dimension to the 
# output array. The resulting dimensions are: (batch, sequence, embedding).
#
# Next, a GlobalAveragePooling1D layer returns a fixed-length output vector for
# each example by averaging over the sequence dimension. This allows the model 
# to handle input of variable length, in the simplest way possible.
#
# This fixed-length output vector is piped through a fully-connected (Dense) 
# layer with 16 hidden units.
#
# The last layer is densely connected with a single output node. Using the 
# sigmoid activation function, this value is a float between 0 and 1, 
# representing a probability, or confidence level.
#
# The regularizers help prevent over-fitting. Over-fitting is evident when the
# training data fit is significantly better than the test data fit. The level
# and type of regularization may be adjusted to maximise test accuracy.
##############################################################################

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))        

model.add(keras.layers.GlobalAveragePooling1D())

model.add(keras.layers.Dense(16, activation=tf.nn.relu, 
                             kernel_regularizer=keras.regularizers.l2(0.01)))

model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid,
                             kernel_regularizer=keras.regularizers.l2(0.01)))

model.summary()

# CONFIGURE OPTIMIZER

# (tf.train.AdamOptimizer is the TensorFlow 1.x optimizer used in the original
# tutorial; with TensorFlow 2.x you would use optimizer='adam' or
# keras.optimizers.Adam() instead)
model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='binary_crossentropy',
              metrics=['accuracy'])

###############################################################################
################################# TRAIN MODEL ################################# 
###############################################################################

"""
Here we train the model. Using more epochs may give higher accuracy.

In 'real life' you may wish to hold back other test data (e.g. 10% of the
original data) so that you may use the test set here to help optimise the
neural network parameters and then test the final model on an independent data
set.

When verbose is set to 1, the model will show accuracy and loss for training 
and test data sets

"""

print ('Train model')

# Train model (verbose = 1 shows training progress)
model.fit(X_train,
          y_train,
          epochs=100,
          batch_size=512,
          validation_data=(X_test, y_test),
          verbose=1)


results = model.evaluate(X_train, y_train)
print('\nTraining accuracy:', results[1])

results = model.evaluate(X_test, y_test)
print('\nTest accuracy:', results[1])
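
The overfit/underfit tutorial linked above suggests watching how training and validation loss diverge across epochs. Below is a minimal optional sketch of that check (not part of the original script); it assumes matplotlib is installed and captures the return value of model.fit (a Keras History object).

import matplotlib.pyplot as plt

# Capture the training history (this re-runs the fit shown above)
history = model.fit(X_train,
                    y_train,
                    epochs=100,
                    batch_size=512,
                    validation_data=(X_test, y_test),
                    verbose=0)

# If validation loss levels off (or rises) while training loss keeps falling,
# the model is over-fitting and the regularization may need adjusting
plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()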

###############################################################################
######################### PREDICT RESULTS FOR NEW TEXT ######################## 
###############################################################################

"""
Here we make predictions on text that the model has not seen before. As we
are using data that has been held back we may also check the predictions
against the known labels.
"""

print ('\nMake predictions')

# We held some data back from the original data set
# We will first clean the text

text_to_clean = list(data_held_back['text'].values)
X_clean = apply_cleaning_function_to_list(text_to_clean)
 
# Now we need to convert words to numbers.
# As these are new data it is possible that a word is not recognised, so we
# will check that each word is in the dictionary

# Convert pandas dataframe to dictionary
word_dict = dict_df['word_index'].to_dict()

# Use dictionary to convert words in text to numbers
text_numbers = []
for string in X_clean:
    string_numbers = []
    for word in string:
        if word in word_dict:
            string_numbers.append(word_dict[word])
    text_numbers.append(string_numbers)

# Keep only the top 10,000 words
# The function is repeated here for clarity (but would not usually be repeated)  

def limit_word_count(numbered_text):
    max_word_count = 10000
    filtered_text = []
    for number_list in numbered_text:
        filtered_line = \
            [number for number in number_list if number <=max_word_count]
        filtered_text.append(filtered_line)
        
    return filtered_text
    
text_numbers = limit_word_count(text_numbers)

# Process into fixed length arrays
    
processed_X = \
    keras.preprocessing.sequence.pad_sequences(text_numbers,
                                               value=0,
                                               padding='post',
                                               maxlen=512)

# Get prediction
# (model.predict_classes was removed in later versions of Keras; with newer
# versions use: predicted_classes = (model.predict(processed_X) > 0.5).astype(int))
predicted_classes = model.predict_classes(processed_X)
# predict_classes returns a nested array (one value per sample). As we have a
# single output node we 'flatten' this array to remove the nesting
predicted_classes = predicted_classes.flatten()

# Check prediction against known label
actual_classes = data_held_back['label'].values
accurate_prediction = predicted_classes == actual_classes
accuracy = accurate_prediction.mean()
print ('Accuracy on unseen data: %.2f' %accuracy)

108. Converting text to numbers

Machine learning routines work on numbers rather than text, so we frequently have to convert our text to numbers. Below is a function for one of the simplest ways to convert text to numbers: each word is given an index number (and here we give more frequent words lower index numbers).

This function uses ‘tokenized’ text – that is, text that has been pre-processed into lists of words. Tokenization also usually involves other cleaning steps, such as converting all words to lower case and removing ‘stop words’, that is words such as ‘the’ that have little value in machine learning. If you need code for tokenization, please see here, though if all you need to do is break a sentence into words then this may be done with:

import nltk
tokens = nltk.word_tokenize(text)
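
For example (a minimal sketch; nltk.word_tokenize needs the ‘punkt’ tokenizer data, which may be downloaded once with nltk.download):

import nltk
# nltk.download('punkt')  # required once for word_tokenize

text = "michael makes a good cup of tea"
print(nltk.word_tokenize(text))

Out:

['michael', 'makes', 'a', 'good', 'cup', 'of', 'tea']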

Here is the function to convert strings of tokenized text:

import nltk
import numpy as np
import pandas as pd

def text_to_numbers(text, cutoff_for_rare_words = 1):
    """Function to convert text to numbers. Text must be tokenzied so that
    test is presented as a list of words. The index number for a word
    is based on its frequency (words occuring more often have a lower index).
    If a word does not occur as many times as cutoff_for_rare_words,
    then it is given a word index of zero. All rare words will be zero.
    """
    
    # Flatten list if sublists are present
    if len(text) > 1:
        flat_text = [item for sublist in text for item in sublist]
    else:
        flat_text = text
    
    # get word frequency
    fdist = nltk.FreqDist(flat_text)

    # Convert to Pandas dataframe
    df_fdist = pd.DataFrame.from_dict(fdist, orient='index')
    df_fdist.columns = ['Frequency']

    # Sort by word frequency
    df_fdist.sort_values(by=['Frequency'], ascending=False, inplace=True)

    # Add word index
    number_of_words = df_fdist.shape[0]
    df_fdist['word_index'] = list(np.arange(number_of_words)+1)

    # replace rare words with index zero
    frequency = df_fdist['Frequency'].values
    word_index = df_fdist['word_index'].values
    mask = frequency <= cutoff_for_rare_words
    word_index[mask] = 0
    df_fdist['word_index'] =  word_index
    
    # Convert pandas to dictionary
    word_dict = df_fdist['word_index'].to_dict()
    
    # Use dictionary to convert words in text to numbers
    text_numbers = []
    for string in text:
        string_numbers = [word_dict[word] for word in string]
        text_numbers.append(string_numbers)  
    
    return (text_numbers)

Now let’s see the function in action.

# An example tokenised list

text = [['hello', 'world', 'Michael'],
         ['hello', 'world', 'sam'],
         ['hello', 'universe'],
         ['michael', 'makes', 'a', 'good', 'cup', 'of', 'tea'],
         ['tea', 'is', 'nice'],
         ['michael', 'is', 'nice']]

text_numbers = text_to_numbers(text)
print (text_numbers)

Out:

[[1, 2, 0], [1, 2, 0], [1, 0], [3, 0, 0, 0, 0, 0, 4], [4, 5, 6], [3, 5, 6]]

105: Topic modelling (dividing documents into topic groups) with Gensim

Gensim is a library that can sort documents into groups. It is an ‘unsupervised’ method, meaning that documents do not need to be pre-labelled.

Here we will use gensim to group titles or keywords from PubMed scientific paper references.

Gensim is not part of the standard Anaconda Python installation, but it may be installed from the command line with:

conda install gensim

If you are not using an Anaconda installation of Python then you can install with pip:

pip install gensim

Import libraries

import pandas as pd
import gensim
import nltk
from nltk.corpus import stopwords

Load data

Now we will load our data. The script below downloads a subset of the PubMed data from the internet and saves it locally, but instructions are also given for loading a previously saved local copy.

In this example we will use a portion of a large dataset of pubmed medical paper titles and keywords. The full data set may be downloaded from the link below (1.2GB download).

https://www.kaggle.com/hsrobo/titlebased-semantic-subject-indexing#pubmed.csv

The code section below downloads a 50k subset of the full data. It will download and save locally.

## LOAD 50k DATA SET FROM INTERNET

file_location = 'https://gitlab.com/michaelallen1966/1804_python_healthcare_wordpress' + \
    '/raw/master/jupyter_notebooks/pubmed_50k.csv'
data = pd.read_csv(file_location)
# save to current directory
data.to_csv('pubmed_50k.csv', index=False)


# If you already have the data locally, you may load with:
# data = pd.read_csv('pubmed_50k.csv')

Clean data

We will clean data by applying the following steps.

  • convert all text to lower case
  • divide strings/sentences into individual words (‘tokenize’)
  • remove non-text words
  • remove ‘stop words’ (commonly occurring words that have little value in model)

In the example here we will use the keywords (called ‘labels’) for each paper.

stops = set(stopwords.words("english"))

# Define function to clean text
def pre_process_text(X):
    cleaned_X = []
    for raw_text in X:
        # Convert to lower case
        text = raw_text.lower()

        # Tokenize
        tokens = nltk.word_tokenize(text)

        # Keep only words (removes punctuation + numbers)
        token_words = [w for w in tokens if w.isalpha()]

        # Remove stop words
        meaningful_words = [w for w in token_words if not w in stops]
        
        cleaned_X.append(meaningful_words)
    return cleaned_X


# Clean text
raw_text = list(data['labels'])
processed_text = pre_process_text(raw_text)

Create topic model

The following will create our topic model. We will divide the references into 50 different topic areas. This may take a few minutes.

dictionary = gensim.corpora.Dictionary(processed_text)
corpus = [dictionary.doc2bow(text) for text in processed_text]
model = gensim.models.LdaModel(corpus=corpus, 
                               id2word=dictionary,
                               num_topics=50,
                               passes=10)

Show topics

top_topics = model.top_topics(corpus)

When we look at the first topic, we see that keywords are largely related to molecular biology.

# Print the keywords for the first topic
from pprint import pprint # makes the output easier to read
pprint(top_topics[0])

Out:

([(0.14238602, 'sequence'),
  (0.07095096, 'molecular'),
  (0.068985716, 'acid'),
  (0.06632707, 'dna'),
  (0.052034967, 'amino'),
  (0.045958135, 'data'),
  (0.03165856, 'proteins'),
  (0.03090581, 'base'),
  (0.02128076, 'rna'),
  (0.018465547, 'genetic'),
  (0.01786064, 'bacterial'),
  (0.017836036, 'viral'),
  (0.014751778, 'animals'),
  (0.013531377, 'nucleic'),
  (0.013407393, 'genes'),
  (0.013192138, 'analysis'),
  (0.012144415, 'humans'),
  (0.011643986, 'cloning'),
  (0.011388958, 'phylogeny'),
  (0.011024448, 'protein')],
 -1.8900328727623419)

If we look at another topic (topic 10) we see keywords that are associated with cardiac surgical procedures.

pprint(top_topics[10])

Out:

([(0.056376483, 'outcome'),
  (0.0558772, 'treatment'),
  (0.05120605, 'humans'),
  (0.03075589, 'surgical'),
  (0.030704997, 'complications'),
  (0.028222634, 'postoperative'),
  (0.026755776, 'coronary'),
  (0.02651835, 'tomography'),
  (0.026010627, 'heart'),
  (0.023422625, 'male'),
  (0.022875749, 'computed'),
  (0.021913974, 'studies'),
  (0.02180667, 'procedures'),
  (0.019811377, 'myocardial'),
  (0.01757028, 'cardiac'),
  (0.015987962, 'artery'),
  (0.015796969, 'female'),
  (0.013579166, 'prosthesis'),
  (0.012545248, 'valve'),
  (0.01180034, 'history')],
 -3.1752669709698886)

Show topics present in each document

Each document may contain one or more topics. The first paper is highlighted as containing topics 4, 8, 19 and 40.

model[corpus[0]]

Out:

[(4, 0.0967338), (8, 0.27275458), (19, 0.4578644), (40, 0.09598056)]
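
The same model may also be queried for a document that was not in the training data. Below is a minimal sketch, assuming the dictionary, model and pre_process_text objects created above are still in memory; the keyword string is made up for illustration, and the topic numbers returned will differ between runs because LDA training is not deterministic.

# A made-up keyword string for illustration
new_labels = ['Humans; Coronary Artery Disease; Surgical Procedures; Treatment Outcome']

# Clean in the same way as the training data
new_tokens = pre_process_text(new_labels)[0]

# Convert to bag-of-words with the existing dictionary, then query the model
new_bow = dictionary.doc2bow(new_tokens)
print(model.get_document_topics(new_bow))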

104: Using free text for classification – ‘Bag of Words’

There may be times in healthcare where we would like to classify patients based on free text data we have for them. Maybe, for example, we would like to predict likely outcome based on free text clinical notes.

Using free text requires methods known as ‘Natural Language Processing’.

Here we start with one of the simplest techniques – ‘bag of words’.

In a ‘bag of words’, free text is reduced to a vector (a series of numbers) that represents the number of times each word is used in the text we are given. It is also possible to look at series of two, three or more words, in case the use of two or more words together helps to classify a patient.

A classic ‘toy problem’ used to help teach or develop methods is to try to judge whether a person rated a film as ‘like’ or ‘did not like’ based on the free text they entered into a widely used internet film review database (www.imdb.com).

Here we will use 50,000 records from IMDb to convert each review into a ‘bag of words’, which we will then use in a simple logistic regression machine learning model.

We can use raw word counts, but in this case we’ll add an extra transformation called tf-idf (term frequency–inverse document frequency), which adjusts values according to the number of reviews that use the word. Words that occur across many reviews may be less discriminatory than words that occur more rarely, so tf-idf reduces the value of those words used frequently across reviews.
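
As a small illustration of that adjustment, the sketch below applies scikit-learn’s TfidfTransformer (the same transformer the code in this example uses) to a tiny hand-made count matrix; the counts and two-word vocabulary are made up for illustration.

import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Raw word counts for three documents and a two-word vocabulary:
# column 0 = a word that appears in every document, column 1 = a rarer word
counts = np.array([[2, 0],
                   [1, 1],
                   [3, 0]])

tfidf = TfidfTransformer()
weighted = tfidf.fit_transform(counts).toarray()
print(np.round(weighted, 3))

In the second document both words occur once, but the word that appears in every document receives the lower tf-idf weight.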

This code will take us through the following steps:

1) Load data from internet

2) Clean data – remove non-text, convert to lower case, reduce words to their ‘stems’ (see below for details), and remove common ‘stop-words’ (such as ‘as’, ‘the’, ‘of’).

3) Split data into training and test data sets

4) Convert cleaned reviews into word vectors (‘bag of words’), and apply the tf-idf transform.

5) Train a logistic regression model on the tf-idf transformed word vectors.

6) Apply the logistic regression model to our previously unseen test cases, and calculate the accuracy of our model.

Load data

import pandas as pd

# If you do not already have the data locally you may download (and save) by

file_location = 'https://gitlab.com/michaelallen1966/00_python_snippets' +\
    '_and_recipes/raw/master/machine_learning/data/IMDb.csv'
imdb = pd.read_csv(file_location)
# save to current directory
imdb.to_csv('imdb.csv', index=False)

# If you already have the data locally then you may run the following

# Load data example
imdb = pd.read_csv('imdb.csv')

# Truncate data for example if you want to speed up the example
# imdb = imdb.head(5000)

Define Function to preprocess data

This function, as previously described, works on raw text strings, and:

1) changes to lower case

2) tokenizes (breaks down into words)

3) removes punctuation and non-word text

4) finds word stems

5) removes stop words

6) rejoins meaningful stem words

import nltk
import pandas as pd
import numpy as np
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# If not previously performed:
# nltk.download('stopwords')

stemming = PorterStemmer()
stops = set(stopwords.words("english"))

def apply_cleaning_function_to_list(X):
    cleaned_X = []
    for element in X:
        cleaned_X.append(clean_text(element))
    return cleaned_X


def clean_text(raw_text):
    """This function works on a raw text string, and:
        1) changes to lower case
        2) tokenizes (breaks down into words)
        3) removes punctuation and non-word text
        4) finds word stems
        5) removes stop words
        6) rejoins meaningful stem words"""
    
    # Convert to lower case
    text = raw_text.lower()
    
    # Tokenize
    tokens = nltk.word_tokenize(text)
    
    # Keep only words (removes punctuation + numbers)
    # use .isalnum to keep also numbers
    token_words = [w for w in tokens if w.isalpha()]
    
    # Stemming
    stemmed_words = [stemming.stem(w) for w in token_words]
    
    # Remove stop words
    meaningful_words = [w for w in stemmed_words if not w in stops]
    
    # Rejoin meaningful stemmed words
    joined_words = ( " ".join(meaningful_words))
    
    # Return cleaned data
    return joined_words

Apply the data cleaning function (this may take a few minutes if you are using the full 50,000 reviews).

# Get text to clean
text_to_clean = list(imdb['review'])

# Clean text
cleaned_text = apply_cleaning_function_to_list(text_to_clean)

# Add cleaned data back into DataFrame
imdb['cleaned_review'] = cleaned_text

# Remove temporary cleaned_text list (after transfer to DataFrame)
del cleaned_text

Split data into training and test data sets

from sklearn.model_selection import train_test_split
X = list(imdb['cleaned_review'])
y = list(imdb['sentiment'])
X_train, X_test, y_train, y_test = train_test_split(
    X,y, test_size = 0.25)

Create ‘bag of words’

The ‘bag of words’ is the word vector for each review. This may be a simple word count for each review where each position of the vector represents a word (returned in the ‘vocab’ list) and the value of that position represents the number of times that word is used in the review.

The function below also returns a tf-idf (term frequency–inverse document frequency), which adjusts values according to the number of reviews that use the word. Words that occur across many reviews may be less discriminatory than words that occur more rarely. The tf-idf transform reduces the value of a given word in proportion to the number of documents that it appears in.

The function returns the following:

1) vectorizer – this may be applied to any new reviews to convert the review into the same word vector form as the training set.

2) vocab – the list of words that the word vectors refer to.

3) train_data_features – raw word count vectors for each review

4) tfidf_features – tf-idf transformed word vectors

5) tfidf – the tf-idf transformation that may be applied to new reviews to convert the raw word counts into the transformed word counts in the same way as the training data.

Our vectorizer has an argument called ‘ngram_range’. A simple bag of words divides reviews into single words. If we have an ngram_range of (1,2) it means that the review is divided into single words and also pairs of consecutive words. This may be useful if pairs of words are useful, such as ‘very good’. The max_features argument limits the size of the word vector, in this case to a maximum of 10,000 words (or 10,000 ngrams of words if an ngram may be more than one word).
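
To make the ngram_range idea concrete, here is a tiny sketch on a single made-up sentence (the method for listing the vocabulary is get_feature_names in older versions of scikit-learn, as used below, and get_feature_names_out in newer versions).

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(analyzer='word', ngram_range=(1, 2))
vec.fit(['the film was very good'])

# The vocabulary contains single words plus pairs of consecutive words,
# e.g. 'very' and 'very good'
print(vec.get_feature_names())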

def create_bag_of_words(X):
    from sklearn.feature_extraction.text import CountVectorizer
    
    print ('Creating bag of words...')
    # Initialize the "CountVectorizer" object, which is scikit-learn's
    # bag of words tool.  
    
    # In this example features may be single words or two consecutive words
    # (as shown by ngram_range = 1,2)
    vectorizer = CountVectorizer(analyzer = "word",   \
                                 tokenizer = None,    \
                                 preprocessor = None, \
                                 stop_words = None,   \
                                 ngram_range = (1,2), \
                                 max_features = 10000
                                ) 

    # fit_transform() does two functions: First, it fits the model
    # and learns the vocabulary; second, it transforms our training data
    # into feature vectors. The input to fit_transform should be a list of 
    # strings. The output is a sparse array
    train_data_features = vectorizer.fit_transform(X)
    
    # Convert to a NumPy array for ease of handling
    train_data_features = train_data_features.toarray()
    
    # tfidf transform
    from sklearn.feature_extraction.text import TfidfTransformer
    tfidf = TfidfTransformer()
    tfidf_features = tfidf.fit_transform(train_data_features).toarray()

    # Get words in the vocabulary
    # (in newer versions of scikit-learn this method is get_feature_names_out)
    vocab = vectorizer.get_feature_names()
   
    return vectorizer, vocab, train_data_features, tfidf_features, tfidf

Apply our bag of words function to our training set.

vectorizer, vocab, train_data_features, tfidf_features, tfidf  = \
    create_bag_of_words(X_train)

We can create a DataFrame of our words and counts, so that we may sort and view them. The count and tfidf_features exist for each X (each review in this case) – here we will look at just the first review (index 0).

Note that the tfidf_features differ from the count; that is because of the adjustment for how commonly they occur across reviews.

(Try changing the sort to sort by tfidf_features).

bag_dictionary = pd.DataFrame()
bag_dictionary['ngram'] = vocab
bag_dictionary['count'] = train_data_features[0]
bag_dictionary['tfidf_features'] = tfidf_features[0]

# Sort by raw count
bag_dictionary.sort_values(by=['count'], ascending=False, inplace=True)
# Show top 10
print(bag_dictionary.head(10))

Out:

         ngram  count  tfidf_features
9320        wa      4        0.139373
5528      movi      3        0.105926
9728     whole      2        0.160024
3473    german      2        0.249079
6327      part      2        0.140005
293   american      1        0.089644
9409   wa kind      1        0.160155
9576      wast      1        0.087894
7380       saw      1        0.078477
7599      sens      1        0.085879

Training a machine learning model on the bag of words

Now that we have transformed our free text reviews into vectors of numbers (representing words) we can apply many different machine learning techniques. Here we will use a relatively simple one, logistic regression.

We’ll set up a function to train a logistic regression model.

def train_logistic_regression(features, label):
    print ("Training the logistic regression model...")
    from sklearn.linear_model import LogisticRegression
    ml_model = LogisticRegression(C = 100,random_state = 0)
    ml_model.fit(features, label)
    print ('Finished')
    return ml_model

Now we will use the tf-idf transformed word vectors to train the model (we could instead use the plain word counts contained in ‘train_data_features’ rather than ‘tfidf_features’). We pass both the features and the known label corresponding to each review (the sentiment, either 0 or 1 depending on whether a person liked the film or not).

ml_model = train_logistic_regression(tfidf_features, y_train)

Applying the bag of words model to test reviews

We will now apply the bag of words model to test reviews, and assess the accuracy.

We’ll first apply our vectorizer to create a word vector for each review in the test data set.

test_data_features = vectorizer.transform(X_test)
# Convert to numpy array
test_data_features = test_data_features.toarray()

As we are using the tf-idf transform, we’ll apply the tf-idf transformer so that the word vectors are transformed in the same way as the training data set.

test_data_tfidf_features = tfidf.transform(test_data_features)
# Convert to numpy array
test_data_tfidf_features = test_data_tfidf_features.toarray()

Now the bit that we really want to do – we’ll predict the sentiment of all the test reviews (and it’s just a single line of code!). Did they like the film or not?

predicted_y = ml_model.predict(test_data_tfidf_features)
correctly_identified_y = predicted_y == y_test
accuracy = np.mean(correctly_identified_y) * 100
print ('Accuracy = %.0f%%' %accuracy)

Out:

Accuracy = 87%
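
A new, previously unseen review may be pushed through the same pipeline. Below is a minimal sketch, assuming the clean_text function, vectorizer, tfidf transform and ml_model created above are still in memory; the review text itself is made up for illustration.

# A made-up review for illustration
new_review = "A wonderful film with a great cast, I enjoyed every minute of it"

# Clean, vectorise and tf-idf transform in the same way as the training data
cleaned = clean_text(new_review)
new_counts = vectorizer.transform([cleaned]).toarray()
new_tfidf = tfidf.transform(new_counts).toarray()

# Predict sentiment (1 = liked the film, 0 = did not like the film)
print(ml_model.predict(new_tfidf))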

103: POS (Parts of Speech) tagging – labeling words as nouns, verbs, adjectives, etc.

POS tagging labels words by type of word, which may enhance the quality of information that may be extracted from a piece of text.

There are varying sets of tags, but the common universal set is:

ADJ: adjective
ADP: adposition (prepositions and postpositions)
ADV: adverb
AUX: auxiliary
CCONJ: coordinating conjunction
DET: determiner
INTJ: interjection
NOUN: noun
NUM: numeral
PRT: particle or other function words
PRON: pronoun
VERB: verb
X: other
.: Punctuation

Other, more granular sets of tags include those used in the Brown Corpus (a corpus of tagged text). NLTK can convert these more granular tag sets to the universal tag set.

The two most commonly used tagged corpus datasets in NLTK are Penn Treebank and Brown Corpus. Both take text from a wide range of sources and tag words.

Details of the Brown Corpus and Penn Treebank tag sets may be found online (a link to the Brown Corpus tags is given below).

An example of tagging from the Brown corpus, and conversion to the universal tag set

import nltk
# Download the brown corpus if it has not previously been downloaded
nltk.download('brown');

from nltk.corpus import brown
# Show a set of tagged words from the Brown corpus
print(brown.tagged_words()[20:40])

Out:

[('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.'), ('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD')]

Convert more granular brown tagging to universal tagging.

print(brown.tagged_words(tagset='universal')[20:40])

Out:

[('any', 'DET'), ('irregularities', 'NOUN'), ('took', 'VERB'), ('place', 'NOUN'), ('.', '.'), ('The', 'DET'), ('jury', 'NOUN'), ('further', 'ADV'), ('said', 'VERB'), ('in', 'ADP'), ('term-end', 'NOUN'), ('presentments', 'NOUN'), ('that', 'ADP'), ('the', 'DET'), ('City', 'NOUN'), ('Executive', 'ADJ'), ('Committee', 'NOUN'), (',', '.'), ('which', 'DET'), ('had', 'VERB')]

Details of the brown corpus tags may be found here:

https://en.wikipedia.org/wiki/Brown_Corpus

In the above example the Brown tags NNS (plural noun), NN (singular noun) and NN-TL (singular noun found in a title) are all converted to the universal tag NOUN.

Use of tagging to distinguish between different meanings of the same word

Consider the two uses of the word ‘left’ in the sentence below:

text = "I left the hotel to go to the coffee shop which is on the left of the church"

Let’s look at how ‘left’ is tagged in this sentence:

# Split text into words
tokens = nltk.word_tokenize(text)

print ('Word tags for text:', nltk.pos_tag(tokens, tagset="universal"))

OUT:

Word tags for text: [('I', 'PRON'), ('left', 'VERB'), ('the', 'DET'), ('hotel', 'NOUN'), ('to', 'PRT'), ('go', 'VERB'), ('to', 'PRT'), ('the', 'DET'), ('coffee', 'NOUN'), ('shop', 'NOUN'), ('which', 'DET'), ('is', 'VERB'), ('on', 'ADP'), ('the', 'DET'), ('left', 'NOUN'), ('of', 'ADP'), ('the', 'DET'), ('church', 'NOUN')]

The first use of ‘left’ has been identified as a verb, and the second use as a noun.

So POS-tagging may be used to enhance simple text-based methods, by providing additional information about words taking into account the context of the word.
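
As a small illustration of that idea, the tags may be used as a filter, for example keeping only the nouns from a piece of text. The sketch below uses the same universal tag set as above (pos_tag needs the ‘averaged_perceptron_tagger’ data, which may be downloaded once with nltk.download); the exact output may vary slightly with the tagger version.

import nltk
# nltk.download('averaged_perceptron_tagger')  # required once for pos_tag

text = "I left the hotel to go to the coffee shop which is on the left of the church"
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens, tagset='universal')

# Keep only the words tagged as nouns
nouns = [word for word, tag in tagged if tag == 'NOUN']
print(nouns)

Out (may vary with tagger version):

['hotel', 'coffee', 'shop', 'left', 'church']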

102: Pre-processing data: tokenization, stemming, and removal of stop words (compressed code)

In the previous code example (here) we went through each of the steps of cleaning text, showing what each step does. Below is compressed code that does the same, and can be applied to any list of text strings. Here we import the imdb data set, extract the review text and clean it, and put the cleaned reviews back into the imdb DataFrame.

import nltk
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# If not previously performed:
# nltk.download('stopwords')

stemming = PorterStemmer()
stops = set(stopwords.words("english"))

def apply_cleaning_function_to_list(X):
    cleaned_X = []
    for element in X:
        cleaned_X.append(clean_text(element))
    return cleaned_X


def clean_text(raw_text):
    """This function works on a raw text string, and:
        1) changes to lower case
        2) tokenizes (breaks down into words)
        3) removes punctuation and non-word text
        4) finds word stems
        5) removes stop words
        6) rejoins meaningful stem words"""
    
    # Convert to lower case
    text = raw_text.lower()
    
    # Tokenize
    tokens = nltk.word_tokenize(text)
    
    # Keep only words (removes punctuation + numbers)
    # use .isalnum to keep also numbers
    token_words = [w for w in tokens if w.isalpha()]
    
    # Stemming
    stemmed_words = [stemming.stem(w) for w in token_words]
    
    # Remove stop words
    meaningful_words = [w for w in stemmed_words if not w in stops]
    
    # Rejoin meaningful stemmed words
    joined_words = ( " ".join(meaningful_words))
    
    # Return cleaned data
    return joined_words


### APPLY FUNCTIONS TO EXAMPLE DATA

# Load data example
imdb = pd.read_csv('imdb.csv')

# If you do not already have the data locally you may download (and save) by
# uncommenting and running the following lines

# file_location = 'https://gitlab.com/michaelallen1966/00_python_snippets' +\
#     '_and_recipes/raw/master/machine_learning/data/IMDb.csv'
# imdb = pd.read_csv(file_location)
# save to current directory
# imdb.to_csv('imdb.csv', index=False)

# Truncate data for example
imdb = imdb.head(100)

# Get text to clean
text_to_clean = list(imdb['review'])

# Clean text
cleaned_text = apply_cleaning_function_to_list(text_to_clean)

# Show first example
print ('Original text:',text_to_clean[0])
print ('\nCleaned text:', cleaned_text[0])

# Add cleaned data back into DataFrame
imdb['cleaned_review'] = cleaned_text


OUT:

Original text: I have no read the novel on which "The Kite Runner" is based. My wife and daughter, who did, thought the movie fell a long way short of the book, and I'm prepared to take their word for it. But, on its own, the movie is good -- not great but good. How accurately does it portray the havoc created by the Soviet invasion of Afghanistan? How convincingly does it show the intolerant Taliban regime that followed? I'd rate it C+ on the first and B+ on the second. The human story, the Afghan-American who returned to the country to rescue the son of his childhood playmate, is well done but it is on this count particularly that I'm told the book was far more convincing than the movie. The most exciting part of the film, however -- the kite contests in Kabul and, later, a mini-contest in California -- cannot have been equaled by the book. I'd wager money on that.

Cleaned text: read novel kite runner base wife daughter thought movi fell long way short book prepar take word movi good great good accur doe portray havoc creat soviet invas afghanistan convincingli doe show intoler taliban regim follow rate first second human stori return countri rescu son hi childhood playmat well done thi count particularli told book wa far convinc movi excit part film howev kite contest kabul later california equal book wager money

101: Pre-processing data: tokenization, stemming, and removal of stop words

Here we will look at three common pre-processing steps in natural language processing:

1) Tokenization: the process of segmenting text into words, clauses or sentences (here we will separate out words and remove punctuation).

2) Stemming: reducing related words to a common stem.

3) Removal of stop words: removal of commonly used words unlikely to be useful for learning.

We will load up 50,000 examples from the movie review database, imdb, and use the NLTK library for text pre-processing. The NLTK library comes with a standard Anaconda Python installation (www.anaconda.com), but we will need to use it to install the ‘stopwords’ corpus of words.

Downloading the NLTK library

This command will open the NLTK downloader. You may download everything from the collections tab. Otherwise, for this example you may just download ‘stopwords’ from the ‘Corpora’ tab.

import nltk
# To open the download dialog:
nltk.download();
# To download just stopwords:
nltk.download('stopwords');

Load data

If you have not previously loaded and saved the imdb data, run the following, which will load the file from the internet and save it locally to the same location this code is run from.

We will load data into a pandas DataFrame.

import pandas as pd
file_location = 'https://gitlab.com/michaelallen1966/00_python_snippets' +\
    '_and_recipes/raw/master/machine_learning/data/IMDb.csv'
imdb = pd.read_csv(file_location)
# save to current directory
imdb.to_csv('imdb.csv', index=False)

If you have already saved the data locally, load it up into memory:

import pandas as pd

imdb = pd.read_csv('imdb.csv')

Let’s look at what columns exist in the imdb data:

print (list(imdb))

Out:
['review', 'sentiment']

We will convert all text to lower case.

imdb['review'] = imdb['review'].str.lower()

We’ll pull out the first review and sentiment to look at the contents. The review is text and the sentiment label is either 0 (negative) or 1 (positive) based on how the reviewer rated it on imdb.

example_review = imdb.iloc[0]
print(example_review['review'])

Out:
i have no read the novel on which "the kite runner" is based. my wife and daughter, who did, thought the movie fell a long way short of the book, and i'm prepared to take their word for it. but, on its own, the movie is good -- not great but good. how accurately does it portray the havoc created by the soviet invasion of afghanistan? how convincingly does it show the intolerant taliban regime that followed? i'd rate it c+ on the first and b+ on the second. the human story, the afghan-american who returned to the country to rescue the son of his childhood playmate, is well done but it is on this count particularly that i'm told the book was far more convincing than the movie. the most exciting part of the film, however -- the kite contests in kabul and, later, a mini-contest in california -- cannot have been equaled by the book. i'd wager money on that.


print(example_review['sentiment'])

Out:
1

Tokenization

We will use the word_tokenize method from NLTK to split the review text into individual words (and you will see that punctuation is also produced as separate ‘words’). Let’s look at our example row.

import nltk
print (nltk.word_tokenize(example_review['review']))

Out:
['i', 'have', 'no', 'read', 'the', 'novel', 'on', 'which', '``', 'the', 'kite', 'runner', "''", 'is', 'based', '.', 'my', 'wife', 'and', 'daughter', ',', 'who', 'did', ',', 'thought', 'the', 'movie', 'fell', 'a', 'long', 'way', 'short', 'of', 'the', 'book', ',', 'and', 'i', "'m", 'prepared', 'to', 'take', 'their', 'word', 'for', 'it', '.', 'but', ',', 'on', 'its', 'own', ',', 'the', 'movie', 'is', 'good', '--', 'not', 'great', 'but', 'good', '.', 'how', 'accurately', 'does', 'it', 'portray', 'the', 'havoc', 'created', 'by', 'the', 'soviet', 'invasion', 'of', 'afghanistan', '?', 'how', 'convincingly', 'does', 'it', 'show', 'the', 'intolerant', 'taliban', 'regime', 'that', 'followed', '?', 'i', "'d", 'rate', 'it', 'c+', 'on', 'the', 'first', 'and', 'b+', 'on', 'the', 'second', '.', 'the', 'human', 'story', ',', 'the', 'afghan-american', 'who', 'returned', 'to', 'the', 'country', 'to', 'rescue', 'the', 'son', 'of', 'his', 'childhood', 'playmate', ',', 'is', 'well', 'done', 'but', 'it', 'is', 'on', 'this', 'count', 'particularly', 'that', 'i', "'m", 'told', 'the', 'book', 'was', 'far', 'more', 'convincing', 'than', 'the', 'movie', '.', 'the', 'most', 'exciting', 'part', 'of', 'the', 'film', ',', 'however', '--', 'the', 'kite', 'contests', 'in', 'kabul', 'and', ',', 'later', ',', 'a', 'mini-contest', 'in', 'california', '--', 'can', 'not', 'have', 'been', 'equaled', 'by', 'the', 'book', '.', 'i', "'d", 'wager', 'money', 'on', 'that', '.']

We will now apply the word_tokenize to all records, making a new column in our imdb DataFrame. Each entry will be a list of words. Here we will also strip out non alphanumeric words/characters (such as numbers and punctuation) using .isalpha (you could use .isalnum if you wanted to keep in numbers as well).

def identify_tokens(row):
    review = row['review']
    tokens = nltk.word_tokenize(review)
    # take only words (not punctuation)
    token_words = [w for w in tokens if w.isalpha()]
    return token_words

imdb['words'] = imdb.apply(identify_tokens, axis=1)

Stemming

Stemming reduces related words to a common stem. It is an optional process step, and it is useful to test accuracy with and without stemming. Let’s look at an example.

from nltk.stem import PorterStemmer
stemming = PorterStemmer()

my_list = ['frightening', 'frightened', 'frightens']

# Using a Python list comprehension method to apply to all words in my_list

print ([stemming.stem(word) for word in my_list])


Out:
['frighten', 'frighten', 'frighten']

To apply this to all rows in our imdb DataFrame we will again define a function and apply it to our DataFrame.

def stem_list(row):
    my_list = row['words']
    stemmed_list = [stemming.stem(word) for word in my_list]
    return (stemmed_list)

imdb['stemmed_words'] = imdb.apply(stem_list, axis=1)

Let’s check our stemmed words (using the pandas DataFrame .iloc method to select the first row).

print(imdb['stemmed_words'].iloc[0])

Out:
['i', 'have', 'no', 'read', 'the', 'novel', 'on', 'which', 'the', 'kite', 'runner', 'is', 'base', 'my', 'wife', 'and', 'daughter', 'who', 'did', 'thought', 'the', 'movi', 'fell', 'a', 'long', 'way', 'short', 'of', 'the', 'book', 'and', 'i', 'prepar', 'to', 'take', 'their', 'word', 'for', 'it', 'but', 'on', 'it', 'own', 'the', 'movi', 'is', 'good', 'not', 'great', 'but', 'good', 'how', 'accur', 'doe', 'it', 'portray', 'the', 'havoc', 'creat', 'by', 'the', 'soviet', 'invas', 'of', 'afghanistan', 'how', 'convincingli', 'doe', 'it', 'show', 'the', 'intoler', 'taliban', 'regim', 'that', 'follow', 'i', 'rate', 'it', 'on', 'the', 'first', 'and', 'on', 'the', 'second', 'the', 'human', 'stori', 'the', 'who', 'return', 'to', 'the', 'countri', 'to', 'rescu', 'the', 'son', 'of', 'hi', 'childhood', 'playmat', 'is', 'well', 'done', 'but', 'it', 'is', 'on', 'thi', 'count', 'particularli', 'that', 'i', 'told', 'the', 'book', 'wa', 'far', 'more', 'convinc', 'than', 'the', 'movi', 'the', 'most', 'excit', 'part', 'of', 'the', 'film', 'howev', 'the', 'kite', 'contest', 'in', 'kabul', 'and', 'later', 'a', 'in', 'california', 'can', 'not', 'have', 'been', 'equal', 'by', 'the', 'book', 'i', 'wager', 'money', 'on', 'that']

Removing stop words

‘Stop words’ are commonly used words that are unlikely to have any benefit in natural language processing. These includes words such as ‘a’, ‘the’, ‘is’.

As before we will define a function and apply it to our DataFrame.

We create a set of words that we will call ‘stops’ (using a set helps to speed up removing stop words).

from nltk.corpus import stopwords
stops = set(stopwords.words("english"))                  

def remove_stops(row):
    my_list = row['stemmed_words']
    meaningful_words = [w for w in my_list if not w in stops]
    return (meaningful_words)

imdb['stem_meaningful'] = imdb.apply(remove_stops, axis=1)

Show the stemmed words, without stop words, from the first record.

print(imdb['stem_meaningful'][0])

Out:
['read', 'novel', 'kite', 'runner', 'base', 'wife', 'daughter', 'thought', 'movi', 'fell', 'long', 'way', 'short', 'book', 'prepar', 'take', 'word', 'movi', 'good', 'great', 'good', 'accur', 'doe', 'portray', 'havoc', 'creat', 'soviet', 'invas', 'afghanistan', 'convincingli', 'doe', 'show', 'intoler', 'taliban', 'regim', 'follow', 'rate', 'first', 'second', 'human', 'stori', 'return', 'countri', 'rescu', 'son', 'hi', 'childhood', 'playmat', 'well', 'done', 'thi', 'count', 'particularli', 'told', 'book', 'wa', 'far', 'convinc', 'movi', 'excit', 'part', 'film', 'howev', 'kite', 'contest', 'kabul', 'later', 'california', 'equal', 'book', 'wager', 'money']

Rejoin words

Now we will rejoin our meaningful stemmed words into a single string.

def rejoin_words(row):
    my_list = row['stem_meaningful']
    joined_words = ( " ".join(my_list))
    return joined_words

imdb['processed'] = imdb.apply(rejoin_words, axis=1)

Save processed data

Now we’ll save our processed data as a csv. We’ll drop the intermediate columns in our Pandas DataFrame.

cols_to_drop = ['Unnamed: 0', 'review', 'words', 'stemmed_words', 'stem_meaningful']
imdb.drop(cols_to_drop, axis=1, inplace=True)

imdb.to_csv('imdb_processed.csv', index=False)

85. Using free text for classification – ‘Bag of Words’

There may be times in healthcare where we would like to classify patients based on free text data we have for them. Maybe, for example, we would like to predict likely outcome based on free text clinical notes.

Using free text requires methods known as ‘Natural Language Processing’.

Here we start with one of the simplest techniques – ‘bag of words’.

In a ‘bag of words’, free text is reduced to a vector (a series of numbers) that represents the number of times each word is used in the text we are given. It is also possible to look at series of two, three or more words, in case the use of two or more words together helps to classify a patient.

A classic ‘toy problem’ used to help teach or develop methods is to try to judge whether a person rated a film as ‘like’ or ‘did not like’ based on the free text they entered into a widely used internet film review database (www.imdb.com).

Here we will use 50,000 records from IMDb to convert each review into a ‘bag of words’, which we will then use in a simple logistic regression machine learning model.

We can use raw word counts, but in this case we’ll add an extra transformation called tf-idf (term frequency–inverse document frequency), which adjusts values according to the number of reviews that use the word. Words that occur across many reviews may be less discriminatory than words that occur more rarely, so tf-idf reduces the value of those words used frequently across reviews.

This code will take us through the following steps:

1) Load data from internet, and split into training and test sets.

2) Clean data – remove non-text, convert to lower case, reduce words to their ‘stems’ (see below for details), and remove common ‘stop-words’ (such as ‘as’, ‘the’, ‘of’).

3) Convert cleaned reviews into word vectors (‘bag of words’), and apply the tf-idf transform.

4) Train a logistic regression model on the tf-idf transformed word vectors.

5) Apply the logistic regression model to our previously unseen test cases, and calculate the accuracy of our model.

Load data

This will load the IMDb data from the web. It is loaded into a Pandas DataFrame.

import pandas as pd
import numpy as np

file_location = 'https://gitlab.com/michaelallen1966/00_python_snippets_and_recipes/raw/master/machine_learning/data/IMDb.csv'
data = pd.read_csv(file_location)

Show the size of the data set (rows, columns).

data.shape
Out:
(50000, 2)

Show the data fields.

list(data)
Out:
['review', 'sentiment']

Show the first record review and recorded sentiment (which will be 0 for not liked, or 1 for liked)

print ('Review:')
print (data['review'].iloc[0])
print ('\nSentiment (label):')
print (data['sentiment'].iloc[0])
  Out:
Review:
I have no read the novel on which "The Kite Runner" is based. My wife and daughter, who did, thought the movie fell a long way short of the book, and I'm prepared to take their word for it. But, on its own, the movie is good -- not great but good. How accurately does it portray the havoc created by the Soviet invasion of Afghanistan? How convincingly does it show the intolerant Taliban regime that followed? I'd rate it C+ on the first and B+ on the second. The human story, the Afghan-American who returned to the country to rescue the son of his childhood playmate, is well done but it is on this count particularly that I'm told the book was far more convincing than the movie. The most exciting part of the film, however -- the kite contests in Kabul and, later, a mini-contest in California -- cannot have been equaled by the book. I'd wager money on that.

Sentiment (label):
1

Splitting the data into training and test sets

Split the data into 70% training data and 30% test data. The model will be trained using the training data, and accuracy will be tested using the independent test data.

from sklearn.model_selection import train_test_split
X = list(data['review'])
y = list(data['sentiment'])
X_train, X_test, y_train, y_test = train_test_split(
    X,y, test_size = 0.3, random_state = 0)

Defining a function to clean the text

This function will:

1) Remove any HTML markup in the text

2) Remove non-letters (e.g. punctuation)

3) Convert all words to lower case

4) Remove stop words (stop words are commonly used words like ‘and’ and ‘the’ which have little value in a bag of words). If stop words are not already installed then open a python terminal and type the two following lines of code (these instructions will also be given when running this code if the stopwords have not already been downloaded onto the computer running this code).

import nltk

nltk.download('stopwords')

5) Reduce words to their stems (e.g. ‘running’ and ‘runs’ will both be converted to ‘run’)

6) Join words back up into a single string

def clean_text(raw_review):
    # Function to convert a raw review to a string of words
    
    # Import modules
    from bs4 import BeautifulSoup
    import re

    # Remove HTML
    review_text = BeautifulSoup(raw_review, 'lxml').get_text() 

    # Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", review_text) 
    
    # Convert to lower case, split into individual words
    words = letters_only.lower().split()   

    # Remove stop words (use of sets makes this faster)
    from nltk.corpus import stopwords
    stops = set(stopwords.words("english"))                  
    meaningful_words = [w for w in words if not w in stops]                             

    # Reduce word to stem of word
    from nltk.stem.porter import PorterStemmer
    porter = PorterStemmer()
    stemmed_words = [porter.stem(w) for w in meaningful_words]

    # Join the words back into one string separated by space
    joined_words = ( " ".join( stemmed_words ))
    return joined_words 
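
As a quick illustration (the example review text here is made up, and this step is not part of the original code), the cleaning function can be tried on a single string:

example_review = "This movie was <br />REALLY good - I was running to tell my friends!"
print (clean_text(example_review))
# The result should be lower-case, stemmed words with the stop words removed,
# something like: 'movi realli good run tell friend'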

Now we will define a function that will apply the cleaning function to a series of records (the clean text function works on one string of text at a time).

def apply_cleaning_function_to_series(X):
    print('Cleaning data')
    cleaned_X = []
    for element in X:
        cleaned_X.append(clean_text(element))
    print ('Finished')
    return cleaned_X

We will call the cleaning functions to clean the text of both the training and the test data. This may take a little time.

X_train_clean = apply_cleaning_function_to_series(X_train)
X_test_clean = apply_cleaning_function_to_series(X_test)
Out:
Cleaning data
Finished
Cleaning data
Finished
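
It may also be worth looking at one of the cleaned reviews to see the effect of the cleaning (an optional step, not in the original code):

# Show the first cleaned training review (lower case, stemmed, stop words removed)
print (X_train_clean[0])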

Create ‘bag of words’

The ‘bag of words’ is the word vector for each review. This may be a simple word count for each review, where each position of the vector represents a word (returned in the ‘vocab’ list) and the value at that position is the number of times that word is used in the review.

The function below also returns a tf-idf (term frequency–inverse document frequency) version, which adjusts values according to the number of reviews that use the word. Words that occur across many reviews may be less discriminatory than words that occur more rarely. The tf-idf transform reduces the value of a given word in proportion to the number of documents that it appears in.

The function returns the following:

1) vectorizer – this may be applied to any new reviews to convert the review into the same word vector form as the training set.

2) vocab – the list of words that the word vectors refer to.

3) train_data_features – raw word count vectors for each review

4) tfidf_features – tf-idf transformed word vectors

5) tfidf – the tf-idf transformation that may be applied to new reviews to convert the raw word counts into the transformed word counts in the same way as the training data.

Our vectorizer has an argument called ‘ngram_range’. A simple bag of words divides reviews into single words. If we use an ngram_range of (1,2) the review is divided into single words and also pairs of consecutive words. This may be useful when pairs of words carry meaning, such as ‘very good’. The max_features argument limits the size of the word vector, in this case to a maximum of 10,000 words (or 10,000 ngrams of words if an ngram may be more than one word).
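
To make the ngram and tf-idf ideas more concrete, here is a small illustrative sketch on a made-up two-review corpus (this is not part of the original code; the reviews are invented purely to show the mechanics):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

toy_reviews = ['very good film', 'very bad film']

# ngram_range=(1,2) gives single words and consecutive word pairs
toy_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 2))
toy_counts = toy_vectorizer.fit_transform(toy_reviews)

# The vocabulary includes e.g. 'very', 'good', 'film', 'very good', 'bad film'
# (newer versions of scikit-learn use get_feature_names_out instead)
print (toy_vectorizer.get_feature_names())

# tf-idf down-weights ngrams that appear in both reviews (such as 'very'
# and 'film') relative to ngrams that are unique to one review (such as 'good')
toy_tfidf = TfidfTransformer().fit_transform(toy_counts)
print (toy_tfidf.toarray())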

def create_bag_of_words(X):
    from sklearn.feature_extraction.text import CountVectorizer
    
    print ('Creating bag of words...')
    # Initialize the "CountVectorizer" object, which is scikit-learn's
    # bag of words tool.  
    
    # In this example features may be single words or two consecutive words
    vectorizer = CountVectorizer(analyzer = "word",   \
                                 tokenizer = None,    \
                                 preprocessor = None, \
                                 stop_words = None,   \
                                 ngram_range = (1,2), \
                                 max_features = 10000) 

    # fit_transform() does two things: first, it fits the model
    # and learns the vocabulary; second, it transforms our training data
    # into feature vectors. The input to fit_transform should be a list of 
    # strings. The output is a sparse array
    train_data_features = vectorizer.fit_transform(X)
    
    # Convert to a NumPy array for ease of handling
    train_data_features = train_data_features.toarray()
    
    # tfidf transform
    from sklearn.feature_extraction.text import TfidfTransformer
    tfidf = TfidfTransformer()
    tfidf_features = tfidf.fit_transform(train_data_features).toarray()

    # Take a look at the words in the vocabulary
    vocab = vectorizer.get_feature_names()
   
    return vectorizer, vocab, train_data_features, tfidf_features, tfidf

We will apply our create_bag_of_words function to our training set. Again, this might take a little time.

vectorizer, vocab, train_data_features, tfidf_features, tfidf  = (
        create_bag_of_words(X_train_clean))
Out:
Creating bag of words...

Let’s look at some items from the vocab list (positions 40-44). Some of the words may seem odd. That is because of the stemming.

vocab[40:45]
Out:
['accomplish', 'accord', 'account', 'accur', 'accuraci']

And we can see the raw word count represented in train_data_features.

train_data_features[0][40:45]
Out:
array([0, 0, 1, 0, 0], dtype=int64)

If we look at the tf-idf transform we can see that values are reduced (words occurring in many documents have their value reduced the most).

tfidf_features[0][40:45]
Out:
array([0.        , 0.        , 0.06988648, 0.        , 0.        ])

Training a machine learning model on the bag of words

Now that we have transformed our free text reviews into vectors of numbers (representing words) we can apply many different machine learning techniques. Here we will use a relatively simple one: logistic regression.

We’ll set up a function to train a logistic regression model.

def train_logistic_regression(features, label):
    print ("Training the logistic regression model...")
    from sklearn.linear_model import LogisticRegression
    ml_model = LogisticRegression(C = 100,random_state = 0)
    ml_model.fit(features, label)
    print ('Finished')
    return ml_model

Now we will use the tf-idf transformed word vectors to train the model (we could instead use the plain word counts contained in ‘train_data_features’ rather than ‘tfidf_features’). We pass both the features and the known label corresponding to each review (the sentiment, either 0 or 1, depending on whether a person liked the film or not).

ml_model = train_logistic_regression(tfidf_features, y_train)
Out:
Training the logistic regression model...
Finished

Applying the bag of words model to test reviews

We will now apply the bag of words model to test reviews, and assess the accuracy.

We’ll first apply our vectorizer to create a word vector for each review in the test data set.

test_data_features = vectorizer.transform(X_test_clean)
# Convert to numpy array
test_data_features = test_data_features.toarray()

As we are using the tf-idf transform, we’ll apply the fitted tf-idf transformer so that the test word vectors are transformed in the same way as the training data set. Note that we call transform (not fit_transform) so that the weights learned from the training data are reused.

test_data_tfidf_features = tfidf.transform(test_data_features)
# Convert to numpy array
test_data_tfidf_features = test_data_tfidf_features.toarray()

Now the bit that we really want to do – we’ll predict the sentiment of all the test reviews (and it’s just a single line of code!). Did they like the film or not?

predicted_y = ml_model.predict(test_data_tfidf_features)

Now we’ll compare the predicted sentiment to the actual sentiment, and show the overall accuracy of this model.

correctly_identified_y = predicted_y == y_test
accuracy = np.mean(correctly_identified_y) * 100
print ('Accuracy = %.0f%%' %accuracy)
Out:
Accuracy = 87%

87% accuracy. That’s not bad for a simple natural language processing model using a bag of words and logistic regression.
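
As a closing illustration (not part of the original post), the fitted objects above (clean_text, vectorizer, tfidf and ml_model) could be reused to classify a brand-new, made-up review:

# Clean the new review, convert it to the training word-vector form,
# apply the same tf-idf weighting, then predict the sentiment
new_review = 'A wonderful film with a great cast - I loved every minute'
cleaned_review = clean_text(new_review)
new_counts = vectorizer.transform([cleaned_review]).toarray()
new_tfidf = tfidf.transform(new_counts)
print (ml_model.predict(new_tfidf))
# An output of [1] would suggest the model predicts the reviewer liked the film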