101: Pre-processing data: tokenization, stemming, and removal of stop words

Here we will look at three common pre-processing steps in natural language processing:

1) Tokenization: the process of segmenting text into words, clauses or sentences (here we will separate out words and remove punctuation).

2) Stemming: reducing related words to a common stem.

3) Removal of stop words: removal of commonly used words unlikely to be useful for learning.
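
As a quick preview, here is a minimal sketch of all three steps applied to a single sentence (the sentence and variable names are just for illustration, and the NLTK downloads described below are assumed to be in place):

import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

sentence = "The actors were frightening, and the plot frightened me."

# 1) Tokenize, keeping only alphabetic words (drops the comma and full stop)
tokens = [w for w in nltk.word_tokenize(sentence.lower()) if w.isalpha()]

# 2) Reduce each token to its stem
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in tokens]

# 3) Remove stop words such as 'the', 'were' and 'me'
stops = set(stopwords.words('english'))
print([w for w in stemmed if w not in stops])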

We will load up 50,000 examples from the movie review database, IMDb, and use the NLTK library for text pre-processing. The NLTK library comes with a standard Anaconda Python installation (www.anaconda.com), but we will need to use it to download the ‘punkt’ tokenizer models and the ‘stopwords’ corpus of words.

Downloading NLTK data

This command will open the NLTK downloader. You may download everything from the ‘Collections’ tab. Otherwise, for this example you may just download ‘punkt’ from the ‘Models’ tab (the word_tokenize method used below relies on it) and ‘stopwords’ from the ‘Corpora’ tab.

import nltk

# To open the download dialog:
nltk.download()

# To download just the data needed for this example:
nltk.download('punkt')
nltk.download('stopwords')

Load data

If you have not previously loaded and saved the imdb data, run the following, which will load the file from the internet and save it locally to the same location this code is run from.

We will load data into a pandas DataFrame.

import pandas as pd
file_location = 'https://gitlab.com/michaelallen1966/00_python_snippets' +\
'_and_recipes/raw/master/machine_learning/data/IMDb.csv'
imdb = pd.read_csv(file_location)
# Save to current directory
imdb.to_csv('imdb.csv', index=False)

If you have already saved the data locally, load it up into memory:

import pandas as pd

imdb = pd.read_csv('imdb.csv')
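
If you would rather not choose between these two blocks by hand, one possible convenience (a sketch only) is to check for the local copy first and download only when it is missing:

import os
import pandas as pd

if os.path.exists('imdb.csv'):
    imdb = pd.read_csv('imdb.csv')
else:
    file_location = 'https://gitlab.com/michaelallen1966/00_python_snippets' + \
        '_and_recipes/raw/master/machine_learning/data/IMDb.csv'
    imdb = pd.read_csv(file_location)
    imdb.to_csv('imdb.csv', index=False)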

Let’s look at what columns exist in the imdb data:

print(list(imdb))

Out:
['review', 'sentiment']
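
As a quick sanity check we can also look at the shape of the DataFrame; given the 50,000 reviews mentioned above we would expect (50000, 2):

print(imdb.shape)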

We will convert all text to lower case.

imdb['review'] = imdb['review'].str.lower()

We’ll pull out the first review and sentiment to look at the contents. The review is text and the sentiment label is either 0 (negative) or 1 (positive) based on how the reviewer rated it on imdb.

example_review = imdb.iloc[0]
print(example_review['review'])

Out:
i have no read the novel on which "the kite runner" is based. my wife and daughter, who did, thought the movie fell a long way short of the book, and i'm prepared to take their word for it. but, on its own, the movie is good -- not great but good. how accurately does it portray the havoc created by the soviet invasion of afghanistan? how convincingly does it show the intolerant taliban regime that followed? i'd rate it c+ on the first and b+ on the second. the human story, the afghan-american who returned to the country to rescue the son of his childhood playmate, is well done but it is on this count particularly that i'm told the book was far more convincing than the movie. the most exciting part of the film, however -- the kite contests in kabul and, later, a mini-contest in california -- cannot have been equaled by the book. i'd wager money on that.


print(example_review['sentiment'])

Out:
1

Tokenization

We will use the word_tokenize method from NLTK to split the review text into individual words (and you will see that punctuation is also produced as separate ‘words’). Let’s look at our example row.

import nltk
print(nltk.word_tokenize(example_review['review']))

Out:
['i', 'have', 'no', 'read', 'the', 'novel', 'on', 'which', '``', 'the', 'kite', 'runner', "''", 'is', 'based', '.', 'my', 'wife', 'and', 'daughter', ',', 'who', 'did', ',', 'thought', 'the', 'movie', 'fell', 'a', 'long', 'way', 'short', 'of', 'the', 'book', ',', 'and', 'i', "'m", 'prepared', 'to', 'take', 'their', 'word', 'for', 'it', '.', 'but', ',', 'on', 'its', 'own', ',', 'the', 'movie', 'is', 'good', '--', 'not', 'great', 'but', 'good', '.', 'how', 'accurately', 'does', 'it', 'portray', 'the', 'havoc', 'created', 'by', 'the', 'soviet', 'invasion', 'of', 'afghanistan', '?', 'how', 'convincingly', 'does', 'it', 'show', 'the', 'intolerant', 'taliban', 'regime', 'that', 'followed', '?', 'i', "'d", 'rate', 'it', 'c+', 'on', 'the', 'first', 'and', 'b+', 'on', 'the', 'second', '.', 'the', 'human', 'story', ',', 'the', 'afghan-american', 'who', 'returned', 'to', 'the', 'country', 'to', 'rescue', 'the', 'son', 'of', 'his', 'childhood', 'playmate', ',', 'is', 'well', 'done', 'but', 'it', 'is', 'on', 'this', 'count', 'particularly', 'that', 'i', "'m", 'told', 'the', 'book', 'was', 'far', 'more', 'convincing', 'than', 'the', 'movie', '.', 'the', 'most', 'exciting', 'part', 'of', 'the', 'film', ',', 'however', '--', 'the', 'kite', 'contests', 'in', 'kabul', 'and', ',', 'later', ',', 'a', 'mini-contest', 'in', 'california', '--', 'can', 'not', 'have', 'been', 'equaled', 'by', 'the', 'book', '.', 'i', "'d", 'wager', 'money', 'on', 'that', '.']

We will now apply word_tokenize to all records, making a new column in our imdb DataFrame. Each entry will be a list of words. Here we will also strip out non-alphabetic tokens (such as numbers and punctuation) using .isalpha (you could use .isalnum if you wanted to keep numbers in as well).

def identify_tokens(row):
    review = row['review']
    tokens = nltk.word_tokenize(review)
    # keep only alphabetic words (drop punctuation and numbers)
    token_words = [w for w in tokens if w.isalpha()]
    return token_words

imdb['words'] = imdb.apply(identify_tokens, axis=1)
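
As an aside, the same column can be built a little more compactly by applying the tokenizer to the ‘review’ Series directly, rather than row by row across the whole DataFrame; a sketch:

# Equivalent alternative: apply to the Series rather than the DataFrame
imdb['words'] = imdb['review'].apply(
    lambda text: [w for w in nltk.word_tokenize(text) if w.isalpha()])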

Stemming

Stemming reduces related words to a common stem. It is an optional processing step, and it is useful to test accuracy with and without stemming. Let’s look at an example.

from nltk.stem import PorterStemmer
stemming = PorterStemmer()

my_list = ['frightening', 'frightened', 'frightens']

# Use a list comprehension to apply the stemmer to all words in my_list

print([stemming.stem(word) for word in my_list])


Out:
['frighten', 'frighten', 'frighten']
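
NLTK also ships alternative stemmers. If you want to experiment, the Snowball stemmer (sometimes called ‘Porter2’) can be swapped in with one line changed; for many words it produces stems very similar to the Porter stemmer:

from nltk.stem import SnowballStemmer

snowball = SnowballStemmer('english')
print([snowball.stem(word) for word in my_list])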

To apply this to all rows in our imdb DataFrame we will again define a function and apply it to our DataFrame.

def stem_list(row):
    my_list = row['words']
    stemmed_list = [stemming.stem(word) for word in my_list]
    return stemmed_list

imdb['stemmed_words'] = imdb.apply(stem_list, axis=1)

Let’s check our stemmed words (using the pandas DataFrame .iloc method to select the first row).

print(imdb['stemmed_words'].iloc[0])

Out:
['i', 'have', 'no', 'read', 'the', 'novel', 'on', 'which', 'the', 'kite', 'runner', 'is', 'base', 'my', 'wife', 'and', 'daughter', 'who', 'did', 'thought', 'the', 'movi', 'fell', 'a', 'long', 'way', 'short', 'of', 'the', 'book', 'and', 'i', 'prepar', 'to', 'take', 'their', 'word', 'for', 'it', 'but', 'on', 'it', 'own', 'the', 'movi', 'is', 'good', 'not', 'great', 'but', 'good', 'how', 'accur', 'doe', 'it', 'portray', 'the', 'havoc', 'creat', 'by', 'the', 'soviet', 'invas', 'of', 'afghanistan', 'how', 'convincingli', 'doe', 'it', 'show', 'the', 'intoler', 'taliban', 'regim', 'that', 'follow', 'i', 'rate', 'it', 'on', 'the', 'first', 'and', 'on', 'the', 'second', 'the', 'human', 'stori', 'the', 'who', 'return', 'to', 'the', 'countri', 'to', 'rescu', 'the', 'son', 'of', 'hi', 'childhood', 'playmat', 'is', 'well', 'done', 'but', 'it', 'is', 'on', 'thi', 'count', 'particularli', 'that', 'i', 'told', 'the', 'book', 'wa', 'far', 'more', 'convinc', 'than', 'the', 'movi', 'the', 'most', 'excit', 'part', 'of', 'the', 'film', 'howev', 'the', 'kite', 'contest', 'in', 'kabul', 'and', 'later', 'a', 'in', 'california', 'can', 'not', 'have', 'been', 'equal', 'by', 'the', 'book', 'i', 'wager', 'money', 'on', 'that']

Removing stop words

‘Stop words’ are commonly used words that are unlikely to have any benefit in natural language processing. These include words such as ‘a’, ‘the’ and ‘is’.

As before we will define a function and apply it to our DataFrame.

We create a set of words that we will call ‘stops’ (using a set helps to speed up removing stop words).

from nltk.corpus import stopwords
stops = set(stopwords.words("english"))                  

def remove_stops(row):
    my_list = row['stemmed_words']
    meaningful_words = [w for w in my_list if w not in stops]
    return meaningful_words

imdb['stem_meaningful'] = imdb.apply(remove_stops, axis=1)
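
The claim above about sets comes from the fact that a membership test (w in stops) takes roughly constant time on a set but scales with length on a list. If you are curious, a quick optional sketch shows the difference (absolute timings will vary by machine):

import timeit

stops_list = list(stops)

# Time 100,000 membership checks against the set and against the list
print(timeit.timeit(lambda: 'movie' in stops, number=100000))
print(timeit.timeit(lambda: 'movie' in stops_list, number=100000))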

Show the stemmed words, without stop words, from the first record.

print(imdb['stem_meaningful'][0])

Out:
['read', 'novel', 'kite', 'runner', 'base', 'wife', 'daughter', 'thought', 'movi', 'fell', 'long', 'way', 'short', 'book', 'prepar', 'take', 'word', 'movi', 'good', 'great', 'good', 'accur', 'doe', 'portray', 'havoc', 'creat', 'soviet', 'invas', 'afghanistan', 'convincingli', 'doe', 'show', 'intoler', 'taliban', 'regim', 'follow', 'rate', 'first', 'second', 'human', 'stori', 'return', 'countri', 'rescu', 'son', 'hi', 'childhood', 'playmat', 'well', 'done', 'thi', 'count', 'particularli', 'told', 'book', 'wa', 'far', 'convinc', 'movi', 'excit', 'part', 'film', 'howev', 'kite', 'contest', 'kabul', 'later', 'california', 'equal', 'book', 'wager', 'money']
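
Notice that some stemmed stop words survive, such as ‘doe’, ‘hi’, ‘thi’ and ‘wa’: stemming ran first, and the stop word list contains ‘does’, ‘his’, ‘this’ and ‘was’ rather than their stems. If you would rather catch these, one alternative (a sketch only, not used in the rest of this walkthrough; the column name here is made up) is to remove stop words from the raw tokens before stemming:

def remove_stops_then_stem(row):
    # Filter stop words while they are still intact, then stem what is left
    kept = [w for w in row['words'] if w not in stops]
    return [stemming.stem(w) for w in kept]

imdb['meaningful_stemmed'] = imdb.apply(remove_stops_then_stem, axis=1)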

Rejoin words

Now we will rejoin our meaningful stemmed words into a single string.

def rejoin_words(row):
    my_list = row['stem_meaningful']
    joined_words = " ".join(my_list)
    return joined_words

imdb['processed'] = imdb.apply(rejoin_words, axis=1)
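
We can peek at the processed text for the first review; it is simply the stemmed, stop-word-free words joined back into a single string:

print(imdb['processed'].iloc[0])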

Save processed data

Now we’ll save our processed data as a csv. We’ll drop the intermediate columns in our Pandas DataFrame.

cols_to_drop = ['Unnamed: 0', 'review', 'words', 'stemmed_words', 'stem_meaningful']
# Drop columns rather than rows; errors='ignore' skips any listed column
# that is not present in the DataFrame
imdb.drop(columns=cols_to_drop, inplace=True, errors='ignore')

imdb.to_csv('imdb_processed.csv', index=False)
