103: POS (Parts of Speech) tagging – labeling words as nouns, verbs, adjectives, etc.

POS tagging labels each word with its grammatical category (part of speech), which can improve the quality of the information extracted from a piece of text.

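Tag sets vary, but a commonly used universal set is listed below. NLTK's own universal tag set is a slightly smaller variant: it uses CONJ rather than CCONJ and has no AUX or INTJ tags.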

ADJ: adjective
ADP: adposition (prepositions and postpositions)
ADV: adverb
AUX: auxiliary
CCONJ: coordinating conjunction
DET: determiner
INTJ: interjection
NOUN: noun
NUM: numeral
PRT: particle or other function words
PRON: pronoun
VERB: verb
X: other
.: punctuation

Other, more granular tag sets include the one used in the Brown Corpus (a corpus of text in which every word has been tagged). NLTK can convert these more granular tag sets to the universal tag set.

The two most commonly used tagged corpora in NLTK are the Penn Treebank and the Brown Corpus. Both take text from a wide range of sources and tag each word.

Details of the Brown Corpus and Penn Treebank tags may be found here:

An example of tagging from the Brown corpus, and conversion to the universal tag set

import nltk
# Download the Brown corpus and the universal tag mapping if not already present
nltk.download('brown')
nltk.download('universal_tagset')

from nltk.corpus import brown
# Show a set of tagged words from the Brown corpus
print(brown.tagged_words()[20:40])

Out:

[('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.'), ('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD')]

Convert the more granular Brown tags to universal tags:

print(brown.tagged_words(tagset='universal')[20:40])

Out:

[('any', 'DET'), ('irregularities', 'NOUN'), ('took', 'VERB'), ('place', 'NOUN'), ('.', '.'), ('The', 'DET'), ('jury', 'NOUN'), ('further', 'ADV'), ('said', 'VERB'), ('in', 'ADP'), ('term-end', 'NOUN'), ('presentments', 'NOUN'), ('that', 'ADP'), ('the', 'DET'), ('City', 'NOUN'), ('Executive', 'ADJ'), ('Committee', 'NOUN'), (',', '.'), ('which', 'DET'), ('had', 'VERB')]

Details of the Brown Corpus tags may be found here:

https://en.wikipedia.org/wiki/Brown_Corpus

In the above example, the Brown tags NNS (plural noun), NN (singular noun) and NN-TL (singular noun found in a title) are all converted to the universal tag NOUN.
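The conversion is essentially a lookup from each granular tag to a coarser one. The sketch below illustrates the idea with a small, hand-written dictionary built only from the tag pairs visible in the two outputs above. The names BROWN_TO_UNIVERSAL and to_universal are hypothetical and are not part of NLTK, which uses its own complete mapping tables; falling back to 'X' for unknown tags is an assumption made for this example.

```python
# Hypothetical, partial mapping taken from the tag pairs in the example
# output above; NLTK's real mapping covers the full Brown tag set.
BROWN_TO_UNIVERSAL = {
    'DTI': 'DET', 'AT': 'DET', 'WDT': 'DET',
    'NN': 'NOUN', 'NNS': 'NOUN', 'NN-TL': 'NOUN',
    'VBD': 'VERB', 'HVD': 'VERB',
    'JJ-TL': 'ADJ',
    'RBR': 'ADV',
    'IN': 'ADP', 'CS': 'ADP',
    '.': '.', ',': '.',
}

def to_universal(tagged_words, mapping=BROWN_TO_UNIVERSAL):
    """Replace each Brown tag with a universal tag; unknown tags become 'X'."""
    return [(word, mapping.get(tag, 'X')) for word, tag in tagged_words]

brown_sample = [('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'),
                ('place', 'NN'), ('.', '.'), ('The', 'AT'), ('jury', 'NN')]
print(to_universal(brown_sample))
```

Because several granular tags share one universal tag, the conversion loses detail (for example, whether a noun is singular or plural) in exchange for a smaller, simpler tag set.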

Use of tagging to distinguish between different meanings of the same word

Consider the two uses of the word ‘left’ in the sentence below:

text = "I left the hotel to go to the coffee shop which is on the left of the church"

Let’s look at how each use of ‘left’ is tagged:

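# Download tokenizer and tagger data if needed (names vary by NLTK version)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Split text into words
tokens = nltk.word_tokenize(text)

print('Word tags for text:', nltk.pos_tag(tokens, tagset='universal'))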

Out:

Word tags for text: [('I', 'PRON'), ('left', 'VERB'), ('the', 'DET'), ('hotel', 'NOUN'), ('to', 'PRT'), ('go', 'VERB'), ('to', 'PRT'), ('the', 'DET'), ('coffee', 'NOUN'), ('shop', 'NOUN'), ('which', 'DET'), ('is', 'VERB'), ('on', 'ADP'), ('the', 'DET'), ('left', 'NOUN'), ('of', 'ADP'), ('the', 'DET'), ('church', 'NOUN')]

The first use of ‘left’ has been identified as a verb, and the second as a noun.

So POS tagging may be used to enhance simple text-based methods: it provides additional information about each word that takes the word's context into account.
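As a rough sketch of what that might look like, the example below counts words in two ways: by the word alone, and by the word combined with its tag. It reuses the tagged output shown above as a literal so that it runs without NLTK. The variable names and the word_TAG feature format are illustrative choices, not an NLTK convention.

```python
from collections import Counter

# Tagged output copied from the example above
tagged = [('I', 'PRON'), ('left', 'VERB'), ('the', 'DET'), ('hotel', 'NOUN'),
          ('to', 'PRT'), ('go', 'VERB'), ('to', 'PRT'), ('the', 'DET'),
          ('coffee', 'NOUN'), ('shop', 'NOUN'), ('which', 'DET'), ('is', 'VERB'),
          ('on', 'ADP'), ('the', 'DET'), ('left', 'NOUN'), ('of', 'ADP'),
          ('the', 'DET'), ('church', 'NOUN')]

# A plain bag of words merges both uses of 'left' into a single feature
word_counts = Counter(word.lower() for word, _ in tagged)

# Appending the tag keeps the verb and the noun apart
word_tag_counts = Counter(f"{word.lower()}_{tag}" for word, tag in tagged)

print(word_counts['left'])
print(word_tag_counts['left_VERB'], word_tag_counts['left_NOUN'])
```

With word-only counts the two meanings of ‘left’ are indistinguishable, whereas the tagged features give a downstream method (such as a bag-of-words classifier) a way to treat them separately.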
