This blog post will explain how to count noun phrases in a list of sentences with Python.
Three imports are needed. From the nltk library, pos_tag and RegexpParser are imported: pos_tag tags each word with its part of speech, and RegexpParser parses the tagged sentence into chunks. The re library is used to strip punctuation before the sample text is broken down into a list of words, or tokens. From the collections library, Counter is imported; a Counter object will count the noun phrases extracted from the sample text.
import re
from nltk import pos_tag, RegexpParser
from collections import Counter
A list of sentences is declared.
sentences = [
    "The pizza is ready.",
    "There are two toppings on it.",
    "The pizza was delivered."
]
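A grammar rule is declared that parses a tagged sentence into noun phrases. The rule defines a noun phrase (NP) as an optional determiner (DT), followed by any number of adjectives (tags starting with JJ), followed by one or more nouns (tags starting with NN).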
grammar = r"""
NP: {<DT>?<JJ.*>*<NN.*>+}
"""
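To see what this rule matches without running the tagger, here is a small sketch that chunks a sentence tagged by hand. The sentence and its tags are made up for illustration, and the names demo_grammar, demo_parser, hand_tagged and demo_phrases are not part of the code in this post.

```python
from nltk import RegexpParser

# Same rule as in the post: optional determiner, any adjectives, one or more nouns.
demo_grammar = r"""
NP: {<DT>?<JJ.*>*<NN.*>+}
"""
demo_parser = RegexpParser(demo_grammar)

# Hand-written (word, tag) pairs, assumed for illustration only.
hand_tagged = [
    ("the", "DT"), ("hot", "JJ"), ("pizza", "NN"),
    ("is", "VBZ"), ("ready", "JJ"),
]

# Parse the tagged words and keep only the subtrees labeled NP.
demo_tree = demo_parser.parse(hand_tagged)
demo_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in demo_tree.subtrees()
    if subtree.label() == "NP"
]
print(demo_phrases)  # ['the hot pizza']
```

The word "ready" is tagged as an adjective here, but it is left out of the result because the rule requires at least one noun in every chunk.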
A chunk parser is created from the grammar rule. It will be used to extract noun phrases from each tagged sentence.
chunk_parser = RegexpParser(grammar)
An empty list is declared that will hold all noun phrases extracted from the sentences.
noun_phrases = []
The list of sentences is processed in a loop. For each sentence, punctuation is removed, the text is lowercased and split into words, and the words are part of speech tagged. The tagged sentence is then passed to the chunk parser, which returns a tree. Every subtree labeled NP is a noun phrase; its words are joined into a single string and appended to the list.
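for sentence in sentences:
    # Strip punctuation, lowercase, and split the sentence into words.
    words = re.sub(r'[^\w\s]', '', sentence).lower().split()
    # Tag each word with its part of speech.
    tagged = pos_tag(words)
    # Chunk the tagged words with the noun phrase grammar.
    tree = chunk_parser.parse(tagged)
    for subtree in tree.subtrees():
        if subtree.label() == 'NP':
            # Join the words of the chunk back into one string.
            np = " ".join(word for word, pos in subtree.leaves())
            noun_phrases.append(np)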
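The tokenizing step in the loop can be tried on its own. The sketch below wraps it in a helper called tokenize, a hypothetical name that is not used elsewhere in this post. It calls split() with no argument, which collapses runs of whitespace, while split(' ') returns an empty string for every extra space.

```python
import re

def tokenize(sentence):
    """Strip punctuation, lowercase, and split a sentence into words."""
    no_punctuation = re.sub(r"[^\w\s]", "", sentence)
    return no_punctuation.lower().split()

print(tokenize("The pizza is ready."))  # ['the', 'pizza', 'is', 'ready']

# Difference between the two ways of splitting when spaces repeat.
print("two  toppings".split(" "))  # ['two', '', 'toppings']
print("two  toppings".split())     # ['two', 'toppings']
```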
A counter object named counts is declared that will hold the counts for all of the noun phrases.
counts = Counter(noun_phrases)
The counter object is traversed and each noun phrase and its count are printed to the screen.
for phrase, count in counts.items():
    print(phrase, count)
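Counter also provides most_common(), which returns the phrases sorted from most to least frequent. The sketch below uses a hand-written list of phrases instead of the real output of the tagger, since the exact phrases depend on the tags that pos_tag assigns. The names sample_phrases and sample_counts are made up for this example.

```python
from collections import Counter

# Hypothetical noun phrases, written by hand for illustration.
sample_phrases = ["the pizza", "toppings", "the pizza"]
sample_counts = Counter(sample_phrases)

# most_common() yields (phrase, count) pairs, highest count first.
for phrase, count in sample_counts.most_common():
    print(phrase, count)
# the pizza 2
# toppings 1
```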
This is what the whole source code looks like.
import re
from nltk import pos_tag, RegexpParser
from collections import Counter
sentences = [
    "The pizza is ready.",
    "There are two toppings on it.",
    "The pizza was delivered."
]
grammar = r"""
NP: {<DT>?<JJ.*>*<NN.*>+}
"""
chunk_parser = RegexpParser(grammar)
noun_phrases = []
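for sentence in sentences:
    # Strip punctuation, lowercase, and split the sentence into words.
    words = re.sub(r'[^\w\s]', '', sentence).lower().split()
    # Tag each word with its part of speech.
    tagged = pos_tag(words)
    # Chunk the tagged words with the noun phrase grammar.
    tree = chunk_parser.parse(tagged)
    for subtree in tree.subtrees():
        if subtree.label() == 'NP':
            # Join the words of the chunk back into one string.
            np = " ".join(word for word, pos in subtree.leaves())
            noun_phrases.append(np)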
counts = Counter(noun_phrases)
for phrase, count in counts.items():
    print(phrase, count)
