115. A short function to replace (impute) missing numerical data in Pandas DataFrames with the median of column values

When we import data into NumPy or Pandas, any empty cells of numerical data will be labelled np.NaN on import. In techniques such as machine learning we may wish to either 1) remove rows with any missing data, or 2) fill in the missing data with a set value, often the median of all other values in that data column. The latter has the advantage that the technique can be used both in training the machine learning model, and in predicting output when we are given examples with some missing data.

Here we define a function that goes through the data columns in a Pandas DataFrame, looks to see if there is any missing data and, if there is, replaces np.NaN with the median of all other values in that data column.

import pandas as pd
import numpy as np

def impute_with_median(df):
    """Iterate through columns of Pandas DataFrame.
    Where NaNs exist replace with median"""

    # Get list of DataFrame column names
    cols = list(df)
    # Loop through columns
    for column in cols:
        # Transfer column to independent series
        col_data = df[column]
        # Look to see if there is any missing numerical data
        missing_data = sum(col_data.isna())
        if missing_data > 0:
            # Get median and replace missing numerical data with median
            col_median = col_data.median()
            df[column] = col_data.fillna(col_median)
    return df

We will mimic importing data with missing numerical data.

name = ['Bob', 'Jim', 'Anne', 'Rosie', 'Ben', 'Tom']
colour = ['red', 'red', 'red', 'blue', 'red', 'blue']
age = [23, 45, np.NaN, 21, 18, 20]
height = [1.80, np.NaN, 1.65, 1.71, 1.61, 1.76] 

data = pd.DataFrame()
data['name'] = name
data['colour'] = colour
data['age'] = age
data['height'] = height

View the data with missing values.

print (data)

Out:

    name colour   age  height
0    Bob    red  23.0    1.80
1    Jim    red  45.0     NaN
2   Anne    red   NaN    1.65
3  Rosie   blue  21.0    1.71
4    Ben    red  18.0    1.61
5    Tom   blue  20.0    1.76

Call the function to replace missing data with the median, and re-examine data.

data = impute_with_median(data)
print (data)

Out:

    name colour   age  height
0    Bob    red  23.0    1.80
1    Jim    red  45.0    1.71
2   Anne    red  21.0    1.65
3  Rosie   blue  21.0    1.71
4    Ben    red  18.0    1.61
5    Tom   blue  20.0    1.76
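
If you are already using scikit-learn, its SimpleImputer offers the same median strategy. A minimal sketch applied to the numerical columns only (the column selection below is our own illustration, not part of the original snippet):

from sklearn.impute import SimpleImputer

# Replace NaNs in the numerical columns with each column's median
imputer = SimpleImputer(strategy='median')
data[['age', 'height']] = imputer.fit_transform(data[['age', 'height']])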

112. Splitting a data set into training and test sets using Pandas DataFrame methods

Note: this may also be performed using scikit-learn's train_test_split method (a short sketch is shown at the end of this section), but here we will use native Pandas methods.

Create a DataFrame

# Create pandas data frame

import pandas as pd

name = ['Sam', 'Bill', 'Bob', 'Ian', 'Jo', 'Anne', 'Carl', 'Toni']
age = [22, 34, 18, 34, 76, 54, 21, 8]
gender = ['f', 'm', 'm', 'm', 'f', 'f', 'm', 'f']
height = [1.64, 1.85, 1.70, 1.75, 1.63, 1.79, 1.70, 1.68]
passed_physical = [0, 1, 1, 1, 0, 1, 1, 0]

people = pd.DataFrame()
people['name'] = name
people['age'] = age
people['gender'] = gender
people['height'] = height
people['passed'] = passed_physical

print(people)

Out:

   name  age gender  height  passed
0   Sam   22      f    1.64       0
1  Bill   34      m    1.85       1
2   Bob   18      m    1.70       1
3   Ian   34      m    1.75       1
4    Jo   76      f    1.63       0
5  Anne   54      f    1.79       1
6  Carl   21      m    1.70       1
7  Toni    8      f    1.68       0

Split training and test sets

Here we take a random sample (75%) of rows for the training set, and create the test set from the remaining 25% by dropping the sampled index values from a copy of the DataFrame.

# Create a copy of the DataFrame to work from
# Omit random state to have different random split each run

people_copy = people.copy()
train_set = people_copy.sample(frac=0.75, random_state=0)
test_set = people_copy.drop(train_set.index)

print ('Training set')
print (train_set)
print ('\nTest set')
print (test_set)
print ('\nOriginal DataFrame')
print (people)

Out:

Training set
   name  age gender  height  passed
6  Carl   21      m    1.70       1
2   Bob   18      m    1.70       1
1  Bill   34      m    1.85       1
7  Toni    8      f    1.68       0
3   Ian   34      m    1.75       1
0   Sam   22      f    1.64       0

Test set
   name  age gender  height  passed
4    Jo   76      f    1.63       0
5  Anne   54      f    1.79       1

Original DataFrame
   name  age gender  height  passed
0   Sam   22      f    1.64       0
1  Bill   34      m    1.85       1
2   Bob   18      m    1.70       1
3   Ian   34      m    1.75       1
4    Jo   76      f    1.63       0
5  Anne   54      f    1.79       1
6  Carl   21      m    1.70       1
7  Toni    8      f    1.68       0

Use ‘pop’ to extract the labels

‘Pop’ will remove a column from the DataFrame, and transfer it to a new variable.

train_set_labels = train_set.pop('passed')
test_set_labels = test_set.pop('passed')

print ('Training set')
print (train_set)
print ('\nTraining set label (y)')
print (train_set_labels)

Out:

Training set
   name  age gender  height
6  Carl   21      m    1.70
2   Bob   18      m    1.70
1  Bill   34      m    1.85
7  Toni    8      f    1.68
3   Ian   34      m    1.75
0   Sam   22      f    1.64

Training set label (y)
6    1
2    1
1    1
7    0
3    1
0    0
Name: passed, dtype: int64
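
As noted at the start of this section, scikit-learn's train_test_split can perform the split (and the shuffle) in a single call. A minimal sketch using the people DataFrame above:

from sklearn.model_selection import train_test_split

# Separate features (X) and label (y), then split 75% / 25%
# random_state fixes the shuffle so the split is reproducible
X = people.drop('passed', axis=1)
y = people['passed']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)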

111. Using ‘pop’ to remove a Pandas DataFrame column and transfer to a new variable

Sometimes we may want to remove a column from a DataFrame, but at the same time transfer that column to a new variable to perform some work on it. An example is re-coding a column, as shown below, where we will convert a text male/female column into a numeric 0/1 'male' column.

Create a DataFrame

import pandas as pd

name = ['Sam', 'Bill', 'Bob', 'Ian', 'Jo', 'Anne', 'Carl', 'Toni']
age = [22, 34, 18, 34, 76, 54, 21, 8]
gender = ['f', 'm', 'm', 'm', 'f', 'f', 'm', 'f']
height = [1.64, 1.85, 1.70, 1.75, 1.63, 1.79, 1.70, 1.68]

people = pd.DataFrame()
people['name'] = name
people['age'] = age
people['gender'] = gender
people['height'] = height

print(people)

Out:

   name  age gender  height
0   Sam   22      f    1.64
1  Bill   34      m    1.85
2   Bob   18      m    1.70
3   Ian   34      m    1.75
4    Jo   76      f    1.63
5  Anne   54      f    1.79
6  Carl   21      m    1.70
7  Toni    8      f    1.68

Pop a column (to code differently)

# Pop column
people_gender = people.pop('gender') # extracts and removes gender

# Recode (using == gives True/False, which in Python also has numerical values of 1/0)

male = (people_gender == 'm') * 1 # True (for 'm') becomes 1, False becomes 0

# Put new column into DataFrame and print
people['male'] = male
print (people)

Out:

   name  age  height  male
0   Sam   22    1.64     0
1  Bill   34    1.85     1
2   Bob   18    1.70     1
3   Ian   34    1.75     1
4    Jo   76    1.63     0
5  Anne   54    1.79     0
6  Carl   21    1.70     1
7  Toni    8    1.68     0
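
As an aside (a sketch only, not part of the original snippet), the same recoding can be done with the pandas Series.map method, which makes the mapping explicit:

# Map each category to a number; values not in the dictionary would become NaN
people['male'] = people_gender.map({'m': 1, 'f': 0})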

109. Saving intact Pandas DataFrames using ‘pickle’

Sometimes a DataFrame may have content in it that will not save well in text (e.g. csv) format. For example, a DataFrame may contain lists, and these will be saved as text strings in a text format.

Python has a library (pickle) for saving Python objects intact so that they may be saved and loaded without having to generate them again.

Pandas has built-in 'pickling' capability which makes it very easy to save and load intact DataFrames.

Let’s first generate a dataframe that contains lists.

import pandas as pd

my_df = pd.DataFrame()

names = ['Bob', 'Sam', 'Jo', 'Bill']

favourite_sports = [['Tennis', 'Motorsports'],
                   ['Football', 'Rugby', 'Hockey'],
                   ['Table tennis', 'Swimming', 'Athletics'],
                   ['Eating cheese']]

my_df['name'] = names
my_df['favourite_sport'] = favourite_sports

print(my_df)

Out:

   name                      favourite_sport
0   Bob                [Tennis, Motorsports]
1   Sam            [Football, Rugby, Hockey]
2    Jo  [Table tennis, Swimming, Athletics]
3  Bill                      [Eating cheese]

Save and load a DataFrame using Pandas built-in pickle methods (recommended)

# Save DataFrame to pickle object
my_df.to_pickle('test_df.p')

# Load DataFrame with pickle object
test_df_load_1 = pd.read_pickle('test_df.p')

print (test_df_load_1)

Out:

   name                      favourite_sport
0   Bob                [Tennis, Motorsports]
1   Sam            [Football, Rugby, Hockey]
2    Jo  [Table tennis, Swimming, Athletics]
3  Bill                      [Eating cheese]

Save and load DataFrame using standard Python pickle library

With DataFrames you will probably always want to use the df.to_pickle and pd.read_pickle methods for ease. But below is an example of using the Python pickle library – this method can be used with other types of complex Python objects (such as trained machine learning models) as well.

import pickle

# Save using pickle 
# (the 'b' in 'wb'/'rb' denotes binary mode, which is required for pickled objects)
filename = 'pickled_df.p'
with open(filename, 'wb') as filehandler:
    pickle.dump(my_df, filehandler)

# Load using pickle
filename = 'pickled_df.p'
with open(filename, 'rb') as filehandler: 
    reloaded_df = pickle.load(filehandler)

print (reloaded_df)

Out:

   name                      favourite_sport
0   Bob                [Tennis, Motorsports]
1   Sam            [Football, Rugby, Hockey]
2    Jo  [Table tennis, Swimming, Athletics]
3  Bill                      [Eating cheese]
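
As mentioned above, the same open/dump/load pattern works for other complex Python objects, such as a trained machine learning model. A minimal sketch (the toy model and the 'model.p' file name are illustrative only):

import pickle
from sklearn.linear_model import LogisticRegression

# Fit a tiny toy model, save it to disk, then reload and use it
model = LogisticRegression()
model.fit([[0], [1], [2], [3]], [0, 0, 1, 1])

with open('model.p', 'wb') as filehandler:
    pickle.dump(model, filehandler)

with open('model.p', 'rb') as filehandler:
    reloaded_model = pickle.load(filehandler)

print(reloaded_model.predict([[2.5]]))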

104: Using free text for classification – ‘Bag of Words’

There may be times in healthcare where we would like to classify patients based on free text data we have for them. Maybe, for example, we would like to predict likely outcome based on free text clinical notes.

Using free text requires methods known as ‘Natural Language Processing’.

Here we start with one of the simplest techniques – ‘bag of words’.

In a ‘bag of words’ free text is reduced to a vector (a series of numbers) that represents the number of times each word is used in the text we are given. It is also possible to look at series of two, three or more words, in case the use of two or more words together helps to classify a patient.

A classic ‘toy problem’ used to help teach or develop methods is to try to judge whether people rated a film as ‘like’ or ‘did not like’ based on the free text they entered into a widely used internet film review database (www.imdb.com).

Here we will use 50,000 records from IMDb to convert each review into a ‘bag of words’, which we will then use in a simple logistic regression machine learning model.

We can use raw word counts, but in this case we’ll add an extra transformation called tf-idf (term frequency–inverse document frequency) which adjusts values according to the number of reviews that use the word. Words that occur across many reviews may be less discriminatory than words that occur more rarely, so tf-idf reduces the value of those words used frequently across reviews.
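
As a rough sketch of the idea (scikit-learn's TfidfTransformer additionally applies smoothing and L2 normalisation, so the exact numbers will differ), each raw word count can be multiplied by an inverse-document-frequency weight:

import numpy as np

# Toy word-count matrix: 3 reviews (rows) x 3 words (columns)
counts = np.array([[3, 0, 1],
                   [2, 1, 0],
                   [3, 1, 1]])

n_reviews = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of reviews that use each word
idf = np.log((1 + n_reviews) / (1 + doc_freq)) + 1
tfidf = counts * idf  # words used in fewer reviews gain relative weight
print(tfidf)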

This code will take us through the following steps:

1) Load data from internet

2) Clean data – remove non-text, convert to lower case, reduce words to their ‘stems’ (see below for details), and remove common ‘stop-words’ (such as ‘as’, ‘the’, ‘of’).

3) Split data into training and test data sets

4) Convert cleaned reviews into word vectors (‘bag of words’), and apply the tf-idf transform.

5) Train a logistic regression model on the tf-idf transformed word vectors.

6) Apply the logistic regression model to our previously unseen test cases, and calculate accuracy of our model

Load data

import pandas as pd

# If you do not already have the data locally you may download (and save) it by running the following lines

file_location = 'https://gitlab.com/michaelallen1966/00_python_snippets' +\
    '_and_recipes/raw/master/machine_learning/data/IMDb.csv'
imdb = pd.read_csv(file_location)
# save to current directory
imdb.to_csv('imdb.csv', index=False)

# If you already have the data locally then you may run the following

# Load data example
imdb = pd.read_csv('imdb.csv')

# Truncate data if you want to speed up the example
# imdb = imdb.head(5000)

Define Function to preprocess data

This function, as previously described, works on raw text strings, and:

1) changes to lower case

2) tokenizes (breaks down into words)

3) removes punctuation and non-word text

4) finds word stems

5) removes stop words

6) rejoins meaningful stem words

import nltk
import pandas as pd
import numpy as np
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# If not previously performed:
# nltk.download('stopwords')

stemming = PorterStemmer()
stops = set(stopwords.words("english"))

def apply_cleaning_function_to_list(X):
    cleaned_X = []
    for element in X:
        cleaned_X.append(clean_text(element))
    return cleaned_X


def clean_text(raw_text):
    """This function works on a raw text string, and:
        1) changes to lower case
        2) tokenizes (breaks down into words)
        3) removes punctuation and non-word text
        4) finds word stems
        5) removes stop words
        6) rejoins meaningful stem words"""
    
    # Convert to lower case
    text = raw_text.lower()
    
    # Tokenize
    tokens = nltk.word_tokenize(text)
    
    # Keep only words (removes punctuation + numbers)
    # use .isalnum to keep also numbers
    token_words = [w for w in tokens if w.isalpha()]
    
    # Stemming
    stemmed_words = [stemming.stem(w) for w in token_words]
    
    # Remove stop words
    meaningful_words = [w for w in stemmed_words if not w in stops]
    
    # Rejoin meaningful stemmed words
    joined_words = ( " ".join(meaningful_words))
    
    # Return cleaned data
    return joined_words

Apply the data cleaning function (this may take a few minutes if you are using the full 50,000 reviews).

# Get text to clean
text_to_clean = list(imdb['review'])

# Clean text
cleaned_text = apply_cleaning_function_to_list(text_to_clean)

# Add cleaned data back into DataFrame
imdb['cleaned_review'] = cleaned_text

# Remove temporary cleaned_text list (after transfer to DataFrame)
del cleaned_text

Split data into training and test data sets

from sklearn.model_selection import train_test_split
X = list(imdb['cleaned_review'])
y = list(imdb['sentiment'])
X_train, X_test, y_train, y_test = train_test_split(
    X,y, test_size = 0.25)

Create ‘bag of words’

The ‘bag of words’ is the word vector for each review. This may be a simple word count for each review where each position of the vector represents a word (returned in the ‘vocab’ list) and the value of that position represents the number of times that word is used in the review.

The function below also returns a tf-idf (term frequency–inverse document frequency) transform which adjusts values according to the number of reviews that use the word. Words that occur across many reviews may be less discriminatory than words that occur more rarely. The tf-idf transform reduces the value of a given word in proportion to the number of documents that it appears in.

The function returns the following:

1) vectorizer – this may be applied to any new reviews to convert the review into the same word vector as the training set.

2) vocab – the list of words that the word vectors refer to.

3) train_data_features – raw word count vectors for each review

4) tfidf_features – tf-idf transformed word vectors

5) tfidf – the tf-idf transformation that may be applied to new reviews to convert the raw word counts into the transformed word counts in the same way as the training data.

Our vectorizer has an argument called ‘ngram_range’. A simple bag of words divides reviews into single words. If we have an ngram_range of (1,2) it means that the review is divided into single words and also pairs of consecutive words. This may be useful if pairs of words are useful, such as ‘very good’. The max_features argument limits the size of the word vector, in this case to a maximum of 10,000 words (or 10,000 ngrams of words if an ngram may be more than one word).
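
As a small illustration of what ngram_range=(1,2) produces (the toy sentence below is our own example, not part of the review data):

from sklearn.feature_extraction.text import CountVectorizer

# The vocabulary contains single words plus consecutive word pairs
toy = CountVectorizer(ngram_range=(1, 2))
toy.fit(['the film was very good'])
print(sorted(toy.vocabulary_))
# expected (approximately):
# ['film', 'film was', 'good', 'the', 'the film', 'very', 'very good', 'was', 'was very']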

def create_bag_of_words(X):
    from sklearn.feature_extraction.text import CountVectorizer
    
    print ('Creating bag of words...')
    # Initialize the "CountVectorizer" object, which is scikit-learn's
    # bag of words tool.  
    
    # In this example features may be single words or two consecutive words
    # (as shown by ngram_range = 1,2)
    vectorizer = CountVectorizer(analyzer = "word",   \
                                 tokenizer = None,    \
                                 preprocessor = None, \
                                 stop_words = None,   \
                                 ngram_range = (1,2), \
                                 max_features = 10000
                                ) 

    # fit_transform() does two functions: First, it fits the model
    # and learns the vocabulary; second, it transforms our training data
    # into feature vectors. The input to fit_transform should be a list of 
    # strings. The output is a sparse array
    train_data_features = vectorizer.fit_transform(X)
    
    # Convert to a NumPy array for ease of handling
    train_data_features = train_data_features.toarray()
    
    # tfidf transform
    from sklearn.feature_extraction.text import TfidfTransformer
    tfidf = TfidfTransformer()
    tfidf_features = tfidf.fit_transform(train_data_features).toarray()

    # Get words in the vocabulary
    # (newer versions of scikit-learn use vectorizer.get_feature_names_out() instead)
    vocab = vectorizer.get_feature_names()
   
    return vectorizer, vocab, train_data_features, tfidf_features, tfidf

Apply our bag of words function to our training set.

vectorizer, vocab, train_data_features, tfidf_features, tfidf  = \
    create_bag_of_words(X_train)

We can create a DataFrame of our words and counts, so that we may sort and view them. The count and tfidf_features exist for each X (each review in this case) – here we will look at just the first review (index 0).

Note that the tfidf_features differ from the count; that is because of the adjustment for how commonly they occur across reviews.

(Try changing the sort to sort by tfidf_features).

bag_dictionary = pd.DataFrame()
bag_dictionary['ngram'] = vocab
bag_dictionary['count'] = train_data_features[0]
bag_dictionary['tfidf_features'] = tfidf_features[0]

# Sort by raw count
bag_dictionary.sort_values(by=['count'], ascending=False, inplace=True)
# Show top 10
print(bag_dictionary.head(10))

Out:

         ngram  count  tfidf_features
9320        wa      4        0.139373
5528      movi      3        0.105926
9728     whole      2        0.160024
3473    german      2        0.249079
6327      part      2        0.140005
293   american      1        0.089644
9409   wa kind      1        0.160155
9576      wast      1        0.087894
7380       saw      1        0.078477
7599      sens      1        0.085879

Training a machine learning model on the bag of words

Now we have transformed our free text reviews into vectors of numbers (representing words) we can apply many different machine learning techniques. Here we will use a relatively simple one, logistic regression.

We’ll set up a function to train a logistic regression model.

def train_logistic_regression(features, label):
    print ("Training the logistic regression model...")
    from sklearn.linear_model import LogisticRegression
    ml_model = LogisticRegression(C = 100,random_state = 0)
    ml_model.fit(features, label)
    print ('Finished')
    return ml_model

Now we will use the tf-idf transformed word vectors to train the model (we could instead use the plain word counts contained in ‘train_data_features’ rather than ‘tfidf_features’). We pass both the features and the known label corresponding to each review (the sentiment, either 0 or 1, depending on whether a person liked the film or not).

ml_model = train_logistic_regression(tfidf_features, y_train)

Applying the bag of words model to test reviews

We will now apply the bag of words model to test reviews, and assess the accuracy.

We’ll first apply our vectorizer to create a word vector for each review in the test data set.

test_data_features = vectorizer.transform(X_test)
# Convert to numpy array
test_data_features = test_data_features.toarray()

As we are using the tf-idf transform, we’ll apply the tf-idf transformer so that word vectors are transformed in the same way as the training data set.

# Use transform (not fit_transform) so the weights learned from the training data are applied
test_data_tfidf_features = tfidf.transform(test_data_features)
# Convert to numpy array
test_data_tfidf_features = test_data_tfidf_features.toarray()

Now the bit that we really want to do – we’ll predict the sentiment of all the test reviews (and it’s just a single line of code!). Did they like the film or not?

predicted_y = ml_model.predict(test_data_tfidf_features)
correctly_identified_y = predicted_y == y_test
accuracy = np.mean(correctly_identified_y) * 100
print ('Accuracy = %.0f%%' %accuracy)

Out:

Accuracy = 87%

102: Pre-processing data: tokenization, stemming, and removal of stop words (compressed code)

In the previous code example (here) we went through each of the steps of cleaning text, showing what each step does. Below is compressed code that does the same, and can be applied to any list of text strings. Here we import the imdb data set, extract the review text and clean it, and put the cleaned reviews back into the imdb DataFrame.

import nltk
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# If not previously performed:
# nltk.download('stopwords')

stemming = PorterStemmer()
stops = set(stopwords.words("english"))

def apply_cleaning_function_to_list(X):
    cleaned_X = []
    for element in X:
        cleaned_X.append(clean_text(element))
    return cleaned_X


def clean_text(raw_text):
    """This function works on a raw text string, and:
        1) changes to lower case
        2) tokenizes (breaks down into words)
        3) removes punctuation and non-word text
        4) finds word stems
        5) removes stop words
        6) rejoins meaningful stem words"""
    
    # Convert to lower case
    text = raw_text.lower()
    
    # Tokenize
    tokens = nltk.word_tokenize(text)
    
    # Keep only words (removes punctuation + numbers)
    # use .isalnum to keep also numbers
    token_words = [w for w in tokens if w.isalpha()]
    
    # Stemming
    stemmed_words = [stemming.stem(w) for w in token_words]
    
    # Remove stop words
    meaningful_words = [w for w in stemmed_words if not w in stops]
    
    # Rejoin meaningful stemmed words
    joined_words = ( " ".join(meaningful_words))
    
    # Return cleaned data
    return joined_words


### APPLY FUNCTIONS TO EXAMPLE DATA

# Load data example
imdb = pd.read_csv('imdb.csv')

# If you do not already have the data locally you may download (and save) by
# uncommenting and running the following lines

# file_location = 'https://gitlab.com/michaelallen1966/00_python_snippets' +\
#     '_and_recipes/raw/master/machine_learning/data/IMDb.csv'
# imdb = pd.read_csv(file_location)
# save to current directory
# imdb.to_csv('imdb.csv', index=False)

# Truncate data for example
imdb = imdb.head(100)

# Get text to clean
text_to_clean = list(imdb['review'])

# Clean text
cleaned_text = apply_cleaning_function_to_list(text_to_clean)

# Show first example
print ('Original text:',text_to_clean[0])
print ('\nCleaned text:', cleaned_text[0])

# Add cleaned data back into DataFrame
imdb['cleaned_review'] = cleaned_text


Out:

Original text: I have no read the novel on which "The Kite Runner" is based. My wife and daughter, who did, thought the movie fell a long way short of the book, and I'm prepared to take their word for it. But, on its own, the movie is good -- not great but good. How accurately does it portray the havoc created by the Soviet invasion of Afghanistan? How convincingly does it show the intolerant Taliban regime that followed? I'd rate it C+ on the first and B+ on the second. The human story, the Afghan-American who returned to the country to rescue the son of his childhood playmate, is well done but it is on this count particularly that I'm told the book was far more convincing than the movie. The most exciting part of the film, however -- the kite contests in Kabul and, later, a mini-contest in California -- cannot have been equaled by the book. I'd wager money on that.

Cleaned text: read novel kite runner base wife daughter thought movi fell long way short book prepar take word movi good great good accur doe portray havoc creat soviet invas afghanistan convincingli doe show intoler taliban regim follow rate first second human stori return countri rescu son hi childhood playmat well done thi count particularli told book wa far convinc movi excit part film howev kite contest kabul later california equal book wager money

101: Pre-processing data: tokenization, stemming, and removal of stop words

Here we will look at three common pre-processing steps in natural language processing:

1) Tokenization: the process of segmenting text into words, clauses or sentences (here we will separate out words and remove punctuation).

2) Stemming: reducing related words to a common stem.

3) Removal of stop words: removal of commonly used words unlikely to be useful for learning.

We will load up 50,000 examples from the movie review database, imdb, and use the NLTK library for text pre-processing. The NLTK library comes with a standard Anaconda Python installation (www.anaconda.com), but we will need to use it to install the ‘stopwords’ corpus of words.

Downloading the NLTK library

This command will open the NLTK downloader. You may download everything from the collections tab. Otherwise, for this example you may just download ‘stopwords’ from the ‘Corpora’ tab.

import nltk

# To open the download dialog:
# nltk.download()

# To download just stopwords:
nltk.download('stopwords')

Load data

If you have not previously loaded and saved the imdb data, run the following, which will load the file from the internet and save it locally to the same location this code is run from.

We will load data into a pandas DataFrame.

import pandas as pd
file_location = 'https://gitlab.com/michaelallen1966/00_python_snippets' +\
    '_and_recipes/raw/master/machine_learning/data/IMDb.csv'
imdb = pd.read_csv(file_location)
# save to current directory
imdb.to_csv('imdb.csv', index=False)

If you have already saved the data locally, load it up into memory:

import pandas as pd

imdb = pd.read_csv('imdb.csv')

Let’s look at what columns exist in the imdb data:

print (list(imdb))

Out:
['review', 'sentiment']

We will convert all text to lower case.

imdb['review'] = imdb['review'].str.lower()

We’ll pull out the first review and sentiment to look at the contents. The review is text and the sentiment label is either 0 (negative) or 1 (positive) based on how the reviewer rated it on imdb.

example_review = imdb.iloc[0]
print(example_review['review'])

Out:
i have no read the novel on which "the kite runner" is based. my wife and daughter, who did, thought the movie fell a long way short of the book, and i'm prepared to take their word for it. but, on its own, the movie is good -- not great but good. how accurately does it portray the havoc created by the soviet invasion of afghanistan? how convincingly does it show the intolerant taliban regime that followed? i'd rate it c+ on the first and b+ on the second. the human story, the afghan-american who returned to the country to rescue the son of his childhood playmate, is well done but it is on this count particularly that i'm told the book was far more convincing than the movie. the most exciting part of the film, however -- the kite contests in kabul and, later, a mini-contest in california -- cannot have been equaled by the book. i'd wager money on that.


print(example_review['sentiment'])

Out:
1

Tokenization

We will use the word_tokenize method from NLTK to split the review text into individual words (and you will see that punctuation is also produced as separate ‘words’). Let’s look at our example row.

import nltk
print (nltk.word_tokenize(example_review['review']))

Out:
['i', 'have', 'no', 'read', 'the', 'novel', 'on', 'which', '``', 'the', 'kite', 'runner', "''", 'is', 'based', '.', 'my', 'wife', 'and', 'daughter', ',', 'who', 'did', ',', 'thought', 'the', 'movie', 'fell', 'a', 'long', 'way', 'short', 'of', 'the', 'book', ',', 'and', 'i', "'m", 'prepared', 'to', 'take', 'their', 'word', 'for', 'it', '.', 'but', ',', 'on', 'its', 'own', ',', 'the', 'movie', 'is', 'good', '--', 'not', 'great', 'but', 'good', '.', 'how', 'accurately', 'does', 'it', 'portray', 'the', 'havoc', 'created', 'by', 'the', 'soviet', 'invasion', 'of', 'afghanistan', '?', 'how', 'convincingly', 'does', 'it', 'show', 'the', 'intolerant', 'taliban', 'regime', 'that', 'followed', '?', 'i', "'d", 'rate', 'it', 'c+', 'on', 'the', 'first', 'and', 'b+', 'on', 'the', 'second', '.', 'the', 'human', 'story', ',', 'the', 'afghan-american', 'who', 'returned', 'to', 'the', 'country', 'to', 'rescue', 'the', 'son', 'of', 'his', 'childhood', 'playmate', ',', 'is', 'well', 'done', 'but', 'it', 'is', 'on', 'this', 'count', 'particularly', 'that', 'i', "'m", 'told', 'the', 'book', 'was', 'far', 'more', 'convincing', 'than', 'the', 'movie', '.', 'the', 'most', 'exciting', 'part', 'of', 'the', 'film', ',', 'however', '--', 'the', 'kite', 'contests', 'in', 'kabul', 'and', ',', 'later', ',', 'a', 'mini-contest', 'in', 'california', '--', 'can', 'not', 'have', 'been', 'equaled', 'by', 'the', 'book', '.', 'i', "'d", 'wager', 'money', 'on', 'that', '.']

We will now apply word_tokenize to all records, making a new column in our imdb DataFrame. Each entry will be a list of words. Here we will also strip out non-alphabetic tokens (such as numbers and punctuation) using .isalpha (you could use .isalnum if you wanted to keep numbers as well).

def identify_tokens(row):
    review = row['review']
    tokens = nltk.word_tokenize(review)
    # take only words (not punctuation)
    token_words = [w for w in tokens if w.isalpha()]
    return token_words

imdb['words'] = imdb.apply(identify_tokens, axis=1)

Stemming

Stemming reduces related words to a common stem. It is an optional process step, and it is useful to test accuracy with and without stemming. Let’s look at an example.

from nltk.stem import PorterStemmer
stemming = PorterStemmer()

my_list = ['frightening', 'frightened', 'frightens']

# Using a Python list comprehension method to apply to all words in my_list

print ([stemming.stem(word) for word in my_list])


Out:
['frighten', 'frighten', 'frighten']

To apply this to all rows in our imdb DataFrame we will again define a function and apply it to our DataFrame.

def stem_list(row):
    my_list = row['words']
    stemmed_list = [stemming.stem(word) for word in my_list]
    return (stemmed_list)

imdb['stemmed_words'] = imdb.apply(stem_list, axis=1)

Let’s check our stemmed words (using the pandas DataFrame .iloc method to select the first row); a print such as the one below reproduces the output shown.
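
print(imdb['stemmed_words'].iloc[0])

Out: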

['i', 'have', 'no', 'read', 'the', 'novel', 'on', 'which', 'the', 'kite', 'runner', 'is', 'base', 'my', 'wife', 'and', 'daughter', 'who', 'did', 'thought', 'the', 'movi', 'fell', 'a', 'long', 'way', 'short', 'of', 'the', 'book', 'and', 'i', 'prepar', 'to', 'take', 'their', 'word', 'for', 'it', 'but', 'on', 'it', 'own', 'the', 'movi', 'is', 'good', 'not', 'great', 'but', 'good', 'how', 'accur', 'doe', 'it', 'portray', 'the', 'havoc', 'creat', 'by', 'the', 'soviet', 'invas', 'of', 'afghanistan', 'how', 'convincingli', 'doe', 'it', 'show', 'the', 'intoler', 'taliban', 'regim', 'that', 'follow', 'i', 'rate', 'it', 'on', 'the', 'first', 'and', 'on', 'the', 'second', 'the', 'human', 'stori', 'the', 'who', 'return', 'to', 'the', 'countri', 'to', 'rescu', 'the', 'son', 'of', 'hi', 'childhood', 'playmat', 'is', 'well', 'done', 'but', 'it', 'is', 'on', 'thi', 'count', 'particularli', 'that', 'i', 'told', 'the', 'book', 'wa', 'far', 'more', 'convinc', 'than', 'the', 'movi', 'the', 'most', 'excit', 'part', 'of', 'the', 'film', 'howev', 'the', 'kite', 'contest', 'in', 'kabul', 'and', 'later', 'a', 'in', 'california', 'can', 'not', 'have', 'been', 'equal', 'by', 'the', 'book', 'i', 'wager', 'money', 'on', 'that']

Removing stop words

‘Stop words’ are commonly used words that are unlikely to have any benefit in natural language processing. These include words such as ‘a’, ‘the’, ‘is’.

As before we will define a function and apply it to our DataFrame.

We create a set of words that we will call ‘stops’ (using a set helps to speed up removing stop words).

from nltk.corpus import stopwords
stops = set(stopwords.words("english"))                  

def remove_stops(row):
    my_list = row['stemmed_words']
    meaningful_words = [w for w in my_list if not w in stops]
    return (meaningful_words)

imdb['stem_meaningful'] = imdb.apply(remove_stops, axis=1)

Show the stemmed words, without stop words, from the first record.

print(imdb['stem_meaningful'][0])

Out:
['read', 'novel', 'kite', 'runner', 'base', 'wife', 'daughter', 'thought', 'movi', 'fell', 'long', 'way', 'short', 'book', 'prepar', 'take', 'word', 'movi', 'good', 'great', 'good', 'accur', 'doe', 'portray', 'havoc', 'creat', 'soviet', 'invas', 'afghanistan', 'convincingli', 'doe', 'show', 'intoler', 'taliban', 'regim', 'follow', 'rate', 'first', 'second', 'human', 'stori', 'return', 'countri', 'rescu', 'son', 'hi', 'childhood', 'playmat', 'well', 'done', 'thi', 'count', 'particularli', 'told', 'book', 'wa', 'far', 'convinc', 'movi', 'excit', 'part', 'film', 'howev', 'kite', 'contest', 'kabul', 'later', 'california', 'equal', 'book', 'wager', 'money']

Rejoin words

Now we will rejoin our meaningful stemmed words into a single string.

def rejoin_words(row):
    my_list = row['stem_meaningful']
    joined_words = ( " ".join(my_list))
    return joined_words

imdb['processed'] = imdb.apply(rejoin_words, axis=1)

Save processed data

Now we’ll save our processed data as a csv. We’ll drop the intermediate columns in our Pandas DataFrame.

cols_to_drop = ['Unnamed: 0', 'review', 'words', 'stemmed_words', 'stem_meaningful']
# axis=1 drops columns; errors='ignore' skips any listed column that is not present
# (e.g. 'Unnamed: 0' only appears if the csv was saved with its index)
imdb.drop(cols_to_drop, axis=1, inplace=True, errors='ignore')

imdb.to_csv('imdb_processed.csv', index=False)