102: Pre-processing data: tokenization, stemming, and removal of stop words (compressed code)

In the previous code example we went through each step of cleaning text, showing what each step does. Below is compressed code that does the same and can be applied to any list of text strings. Here we import the IMDb data set, extract the review text, clean it, and put the cleaned reviews back into the imdb DataFrame.

import nltk
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

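# If not previously performed:
# nltk.download('stopwords')
# nltk.download('punkt')  # tokenizer data used by nltk.word_tokenize
# (recent NLTK releases may ask for 'punkt_tab' instead)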

stemming = PorterStemmer()
stops = set(stopwords.words("english"))

def apply_cleaning_function_to_list(X):
    """Apply clean_text to each string in the list X; return a new list."""
    return [clean_text(element) for element in X]


def clean_text(raw_text):
    """This function works on a raw text string, and:
        1) changes to lower case
        2) tokenizes (breaks down into words)
        3) removes punctuation and non-word text
        4) finds word stems
        5) removes stop words
        6) rejoins meaningful stem words"""
    
    # Convert to lower case
    text = raw_text.lower()
    
    # Tokenize
    tokens = nltk.word_tokenize(text)
    
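    # Keep only purely alphabetic tokens. This removes punctuation and numbers,
    # and also any token containing a hyphen or symbol (e.g. 'afghan-american').
    # Use .isalnum() instead of .isalpha() to keep numbers as well.
    token_words = [w for w in tokens if w.isalpha()]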
    
    # Stemming
    stemmed_words = [stemming.stem(w) for w in token_words]
    
    # Remove stop words
    meaningful_words = [w for w in stemmed_words if w not in stops]
    
    # Rejoin meaningful stemmed words
    joined_words = " ".join(meaningful_words)
    
    # Return cleaned data
    return joined_words


### APPLY FUNCTIONS TO EXAMPLE DATA

# Load data example
imdb = pd.read_csv('imdb.csv')

# If you do not already have the data locally you may download (and save) by
# uncommenting and running the following lines

# file_location = 'https://gitlab.com/michaelallen1966/00_python_snippets' +\
#     '_and_recipes/raw/master/machine_learning/data/IMDb.csv'
# imdb = pd.read_csv(file_location)
# save to current directory
# imdb.to_csv('imdb.csv', index=False)

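# Truncate data for example (.copy() avoids a pandas SettingWithCopyWarning
# when a new column is added to the truncated DataFrame below)
imdb = imdb.head(100).copy()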

# Get text to clean
text_to_clean = list(imdb['review'])

# Clean text
cleaned_text = apply_cleaning_function_to_list(text_to_clean)

# Show first example
print('Original text:', text_to_clean[0])
print('\nCleaned text:', cleaned_text[0])

# Add cleaned data back into DataFrame
imdb['cleaned_review'] = cleaned_text


OUT:

Original text: I have no read the novel on which "The Kite Runner" is based. My wife and daughter, who did, thought the movie fell a long way short of the book, and I'm prepared to take their word for it. But, on its own, the movie is good -- not great but good. How accurately does it portray the havoc created by the Soviet invasion of Afghanistan? How convincingly does it show the intolerant Taliban regime that followed? I'd rate it C+ on the first and B+ on the second. The human story, the Afghan-American who returned to the country to rescue the son of his childhood playmate, is well done but it is on this count particularly that I'm told the book was far more convincing than the movie. The most exciting part of the film, however -- the kite contests in Kabul and, later, a mini-contest in California -- cannot have been equaled by the book. I'd wager money on that.

Cleaned text: read novel kite runner base wife daughter thought movi fell long way short book prepar take word movi good great good accur doe portray havoc creat soviet invas afghanistan convincingli doe show intoler taliban regim follow rate first second human stori return countri rescu son hi childhood playmat well done thi count particularli told book wa far convinc movi excit part film howev kite contest kabul later california equal book wager money
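
Note that the cleaned text above still contains fragments such as 'thi', 'wa', 'hi' and 'doe'. These appear to be the stems of the stop words 'this', 'was', 'his' and 'does': because stemming runs before the stop-word filter, the stemmed forms no longer match the stop-word list. The sketch below illustrates the effect of the ordering using only the standard library. The stop-word set and the toy_stem function are hypothetical stand-ins (not NLTK's stop-word list or the Porter stemmer), chosen only so the example runs without any downloads.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# NLTK's English stop-word list is much longer.
STOPS = {"this", "was", "his", "does", "and", "it", "the", "is"}


def toy_stem(word):
    """Toy stand-in for a real stemmer: strip one trailing 's' (but not 'ss')."""
    if len(word) > 2 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def clean_stem_then_filter(text):
    """Order used in the code above: stem first, then remove stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    stemmed = [toy_stem(w) for w in words]
    return " ".join(w for w in stemmed if w not in STOPS)


def clean_filter_then_stem(text):
    """Alternative order: remove stop words first, then stem what is left."""
    words = re.findall(r"[a-z]+", text.lower())
    kept = [w for w in words if w not in STOPS]
    return " ".join(toy_stem(w) for w in kept)


sample = "This was his kite and it does fly"
print(clean_stem_then_filter(sample))  # thi wa hi kite doe fly
print(clean_filter_then_stem(sample))  # kite fly
```

In clean_text, the equivalent change would be to filter token_words against stops before calling stemming.stem. Whether that is preferable depends on the application, and it would change the output shown above.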
