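This blog post explains how to do part-of-speech tagging with Python, using a simple tagger that gives each word the tag it most often has in the Brown corpus.
The Brown corpus is imported from NLTK so the tagger can be trained on tagged sentences. Counter and defaultdict are imported from the collections module to build the dictionary the tagger uses. If the corpus is not installed yet, it can be downloaded once with nltk.download('brown') and nltk.download('universal_tagset').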
from nltk.corpus import brown
from collections import Counter, defaultdict
A function is defined to do the part-of-speech tagging. An if statement checks whether the word is in the dictionary count_dict. If it is, the function returns the tag that was seen most often for that word. If it is not, the function falls back to the tag NOUN. Because each word always gets its most frequent tag, the tagger ignores context, even though the position of a word in a sentence can change its part of speech. That limits how accurate this tagger can be.
def pos_tag(word):
    if word in count_dict:
        return count_dict[word].most_common(1)[0][0]
    else:
        return "NOUN"
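To make the context limitation concrete, here is a minimal sketch that uses a small hand-written count table instead of the Brown corpus. The counts and the helper name `most_frequent_tag` are made up for illustration; only the lookup logic mirrors the function above.

```python
from collections import Counter

# Hypothetical counts, written by hand for illustration (not taken from Brown).
toy_counts = {
    "book": Counter({"NOUN": 5, "VERB": 2}),
    "a": Counter({"DET": 9}),
    "read": Counter({"VERB": 4}),
}


def most_frequent_tag(word, counts, default="NOUN"):
    """Return the tag seen most often for word, or default for unknown words."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return default


# "book" is a verb in the first phrase and a noun in the second,
# but a per-word lookup gives it the same tag both times.
first = [(w, most_frequent_tag(w, toy_counts)) for w in "book a flight".split()]
second = [(w, most_frequent_tag(w, toy_counts)) for w in "read a book".split()]
print(first)
print(second)
```

In both phrases "book" comes out as NOUN, and the unknown word "flight" gets NOUN only because that is the fallback.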
A variable named sentences contains tagged sentences from the Brown corpus that will be used to train the part of speech tagger. The tagset used for this example is the universal tagset.
sentences = brown.tagged_sents(tagset='universal')
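count_dict is declared as a defaultdict of Counter objects. It maps each word to a Counter that holds how many times each tag was seen for that word.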
count_dict = defaultdict(Counter)
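The following short sketch, using a made-up entry, shows why `defaultdict(Counter)` is convenient here: indexing a missing word creates an empty Counter automatically, while an `in` test does not create anything, so a membership check like the one in `pos_tag` stays reliable.

```python
from collections import Counter, defaultdict

demo = defaultdict(Counter)

# A membership test does not insert the key.
print("jury" in demo)        # False
print(len(demo))             # 0

# Indexing a missing word creates an empty Counter, so += 1 works right away.
demo["jury"]["NOUN"] += 1
demo["jury"]["NOUN"] += 1
print(demo["jury"])          # Counter({'NOUN': 2})
print("jury" in demo)        # True
```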
A nested for loop goes through every (word, tag) pair in every sentence and increments the count for that tag under that word in the dictionary.
for sentence in sentences:
    for word, tag in sentence:
        count_dict[word][tag] += 1
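The same loop can be tried on a toy corpus to see what ends up in the dictionary. The tagged sentences below are hand-written stand-ins in the same (word, tag) format, and the names `toy_sentences` and `toy_count_dict` are illustrative; nothing here is taken from the Brown corpus.

```python
from collections import Counter, defaultdict

# Hand-written sentences in the same shape as brown.tagged_sents(...).
toy_sentences = [
    [("the", "DET"), ("jury", "NOUN"), ("will", "VERB"), ("run", "VERB")],
    [("a", "DET"), ("long", "ADJ"), ("run", "NOUN")],
    [("they", "PRON"), ("run", "VERB")],
]

toy_count_dict = defaultdict(Counter)
for sentence in toy_sentences:
    for word, tag in sentence:
        toy_count_dict[word][tag] += 1

print(toy_count_dict["run"])                 # Counter({'VERB': 2, 'NOUN': 1})
print(toy_count_dict["run"].most_common(1))  # [('VERB', 2)]
```

The word "run" keeps a separate count per tag, and `most_common(1)` picks the tag with the highest count.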
An empty list named result is declared.
result = []
The variable test_sentence will hold the sentence that will be part of speech tagged.
test_sentence = "The grand jury"
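The sentence is lowercased and split on whitespace into tokens before being processed. Note that the words in count_dict keep the capitalization they have in the corpus, so a word that only occurs capitalized in Brown will not be found after lowercasing and will be tagged NOUN by default.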
tokens = test_sentence.lower().split()
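One thing to keep in mind is that `split()` only breaks on whitespace, so punctuation stays attached to the neighbouring word, and a token like `met.` would not be found in the dictionary. A possible workaround, sketched here with a regular expression that is an assumption and not part of the original tagger, is to make punctuation its own token:

```python
import re

text = "The grand jury met."

# Whitespace splitting keeps the period glued to the last word.
print(text.lower().split())   # ['the', 'grand', 'jury', 'met.']

# Runs of word characters, or any single character that is neither
# a word character nor whitespace, become separate tokens.
tokens = re.findall(r"\w+|[^\w\s]", text.lower())
print(tokens)                 # ['the', 'grand', 'jury', 'met', '.']
```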
Each token is tagged, and the token and its tag are appended to result as a pair.
for token in tokens:
    result.append([token, pos_tag(token)])
The result is printed to the screen.
print(result)
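To put a number on how accurate a tagger like this is, its output can be compared with known tags on sentences it was not trained on. The sketch below does that with tiny hand-written data so it runs without the corpus. The function names `train`, `tag_word`, and `accuracy` are illustrative, and the resulting figure says nothing about performance on Brown.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Count how often each tag occurs for each word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return counts


def tag_word(word, counts, default="NOUN"):
    """Most frequent tag for a known word, default tag otherwise."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return default


def accuracy(counts, tagged_sentences):
    """Fraction of tokens whose predicted tag matches the given tag."""
    correct = total = 0
    for sentence in tagged_sentences:
        for word, gold in sentence:
            correct += tag_word(word, counts) == gold
            total += 1
    return correct / total


# Hand-written data for illustration only.
train_data = [
    [("the", "DET"), ("jury", "NOUN"), ("said", "VERB")],
    [("the", "DET"), ("book", "NOUN")],
]
held_out = [
    [("the", "DET"), ("court", "NOUN"), ("said", "VERB")],
    [("book", "VERB"), ("the", "DET"), ("room", "NOUN")],
]

counts = train(train_data)
print(accuracy(counts, held_out))
```

Here 5 of the 6 held-out tokens are tagged correctly; the miss is "book" used as a verb. With the real corpus, the same idea would apply by training on most of the tagged sentences and scoring on the rest.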
The universal tagset that the part-of-speech tagger uses, as provided by NLTK, has the following 12 tags.
| Tag | Meaning |
| --- | --- |
| ADJ | adjective |
| ADP | adposition |
| ADV | adverb |
| CONJ | conjunction |
| DET | determiner |
| NOUN | noun |
| NUM | numeral |
| PRT | particle |
| PRON | pronoun |
| VERB | verb |
| . | punctuation |
| X | other |
This is what the whole source code looks like.
from nltk.corpus import brown
from collections import Counter, defaultdict

def pos_tag(word):
    if word in count_dict:
        return count_dict[word].most_common(1)[0][0]
    else:
        return "NOUN"

sentences = brown.tagged_sents(tagset='universal')
count_dict = defaultdict(Counter)
for sentence in sentences:
    for word, tag in sentence:
        count_dict[word][tag] += 1

result = []
test_sentence = "The grand jury"
tokens = test_sentence.lower().split()
for token in tokens:
    result.append([token, pos_tag(token)])
print(result)
