108. Converting text to numbers

Machine learning routines work on numbers rather than text, so we frequently have to convert our text to numbers. Below is a function for one of the simplest ways to do this: each word is given an index number (and here we give more frequent words lower index numbers).

This function uses ‘tokenized’ text, that is, text that has been pre-processed into lists of words. Tokenization also usually involves other cleaning steps, such as converting all words to lower case and removing ‘stop words’, that is, words such as ‘the’ that have little value in machine learning. If you need code for tokenization, please see here, though if all you need to do is break a sentence into words then this may be done with:

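import nltk
# Note: word_tokenize relies on NLTK's tokenizer data, which may need
# downloading first with nltk.download()
tokens = nltk.word_tokenize(text)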
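
The cleaning steps mentioned above (lower-casing and stop-word removal) can also be sketched with the standard library alone. This is a minimal illustration rather than a replacement for a proper tokenizer: the function name `simple_tokenize` and the tiny `STOP_WORDS` set are made up for this example, and a real stop-word list would be much longer.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only
STOP_WORDS = {'the', 'a', 'an', 'of', 'is', 'and', 'to'}

def simple_tokenize(sentence, stop_words=STOP_WORDS):
    """Lower-case a sentence, split it into words, and drop stop words."""
    # Keep runs of letters, digits and apostrophes; punctuation is discarded
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [word for word in words if word not in stop_words]

print(simple_tokenize('The cat sat on the mat'))
# ['cat', 'sat', 'on', 'mat']
```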

Here is the function to convert strings of tokenized text:

import nltk
import numpy as np
import pandas as pd

def text_to_numbers(text, cutoff_for_rare_words=1):
    """Function to convert text to numbers. Text must be tokenized so that
    text is presented as lists of words. The index number for a word
    is based on its frequency (words occurring more often have a lower index).
    If a word occurs cutoff_for_rare_words times or fewer,
    then it is given a word index of zero. All rare words will be zero.
    """

    # Flatten list if sublists are present
    if len(text) > 0 and isinstance(text[0], (list, tuple)):
        flat_text = [item for sublist in text for item in sublist]
    else:
        flat_text = text

    # Get word frequency
    fdist = nltk.FreqDist(flat_text)

    # Convert to Pandas dataframe
    df_fdist = pd.DataFrame.from_dict(fdist, orient='index')
    df_fdist.columns = ['Frequency']

    # Sort by word frequency
    df_fdist.sort_values(by=['Frequency'], ascending=False, inplace=True)

    # Add word index
    number_of_words = df_fdist.shape[0]
    df_fdist['word_index'] = list(np.arange(number_of_words)+1)

    # Replace rare words with index zero (assigning through .loc avoids
    # modifying the array returned by .values in place)
    mask = df_fdist['Frequency'] <= cutoff_for_rare_words
    df_fdist.loc[mask, 'word_index'] = 0

    # Convert pandas to dictionary
    word_dict = df_fdist['word_index'].to_dict()

    # Use dictionary to convert words in text to numbers
    text_numbers = []
    for string in text:
        string_numbers = [word_dict[word] for word in string]
        text_numbers.append(string_numbers)

    return text_numbers
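
For comparison, the same idea (more frequent words get lower index numbers, rare words get zero) can be sketched using only the standard library. This is an alternative sketch, not the function above: the names `build_word_index` and `encode` are made up for this example. Keeping the dictionary separate from the conversion step means new text can be converted later, with any unseen word mapped to zero (an assumption made here, as the function above does not handle unseen words).

```python
from collections import Counter

def build_word_index(tokenized_text, cutoff_for_rare_words=1):
    """Map each word to an index; more frequent words get lower indices.
    Words occurring cutoff_for_rare_words times or fewer map to zero."""
    counts = Counter(word for sentence in tokenized_text for word in sentence)
    word_index = {}
    # most_common() orders by count; ties keep their first-seen order
    for rank, (word, count) in enumerate(counts.most_common(), start=1):
        word_index[word] = rank if count > cutoff_for_rare_words else 0
    return word_index

def encode(tokenized_text, word_index):
    """Convert words to indices; words not in the dictionary become zero."""
    return [[word_index.get(word, 0) for word in sentence]
            for sentence in tokenized_text]

training = [['hello', 'world'], ['hello', 'there']]
word_index = build_word_index(training)
print(word_index)                                   # {'hello': 1, 'world': 0, 'there': 0}
print(encode([['hello', 'stranger']], word_index))  # [[1, 0]]
```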

Now let’s see the function in action.

# An example tokenised list

text = [['hello', 'world', 'Michael'],
         ['hello', 'world', 'sam'],
         ['hello', 'universe'],
         ['michael', 'makes', 'a', 'good', 'cup', 'of', 'tea'],
         ['tea', 'is', 'nice'],
         ['michael', 'is', 'nice']]

text_numbers = text_to_numbers(text)
print(text_numbers)


[[1, 2, 0], [1, 2, 0], [1, 0], [3, 0, 0, 0, 0, 0, 4], [4, 5, 6], [3, 5, 6]]
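
Note the zero at the end of the first list: ‘Michael’ (capitalised) is counted separately from ‘michael’, so it occurs only once and is treated as a rare word. The short check below, a sketch using only the standard library, counts the words in the example and shows the effect of converting everything to lower case first.

```python
from collections import Counter

text = [['hello', 'world', 'Michael'],
        ['hello', 'world', 'sam'],
        ['hello', 'universe'],
        ['michael', 'makes', 'a', 'good', 'cup', 'of', 'tea'],
        ['tea', 'is', 'nice'],
        ['michael', 'is', 'nice']]

# Count words exactly as given (case-sensitive)
counts = Counter(word for sentence in text for word in sentence)
print(counts['Michael'], counts['michael'])  # 1 2

# Lower-casing first merges the two spellings into one word
lowered = [[word.lower() for word in sentence] for sentence in text]
lowered_counts = Counter(word for sentence in lowered for word in sentence)
print(lowered_counts['michael'])  # 3
```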
