This blog post explains how to extract noun phrases from a sentence with Python.
Two names need to be imported from the nltk library: pos_tag, which tags each word of the sample text with its part of speech, and RegexpParser, which parses the tagged sentence. The re library is also imported; it is used to break the sample text down into a list of words, or tokens.
import re
from nltk import pos_tag, RegexpParser
This is the sample text that will be used for this example.
sample = "i saw the big dog on the hill"
Tokenize the sample text: strip the punctuation, convert everything to lowercase, and split on spaces.
words = re.sub(r'[^\w\s]', '', sample).lower().split(' ')
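To see what the tokenizing expression does, here is a small sketch that uses only the standard library. The second sentence, with punctuation and a double space, is made up for illustration and is not part of the original example. It shows that split(' ') leaves an empty string where two spaces meet, while split() with no argument does not.

```python
import re

sample = "i saw the big dog on the hill"

# Remove every character that is neither a word character nor whitespace,
# then lowercase the result and split it on single spaces.
cleaned = re.sub(r'[^\w\s]', '', sample).lower()
words = cleaned.split(' ')
print(words)

# Hypothetical messier input, not part of the original example.
messy = "I saw the big dog,  on the hill!"
messy_cleaned = re.sub(r'[^\w\s]', '', messy).lower()
with_space_arg = messy_cleaned.split(' ')  # keeps an empty string for the double space
without_arg = messy_cleaned.split()        # splits on any run of whitespace
print(with_space_arg)
print(without_arg)
```

For the sample sentence both forms give the same tokens, so the difference only matters for input with irregular spacing.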
Determine each word's part of speech tag. pos_tag returns a list of (word, tag) tuples.
tagged = pos_tag(words)
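RegexpParser groups the tagged words according to a rule written as a regular expression over part of speech tags. The NP rule below matches an optional determiner, then any number of adjectives, then one or more nouns.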
chunker = RegexpParser("""
NP: {<DT>?<JJ.*>*<NN.*>+}
""")
Declare an empty list to hold noun phrases extracted from the sample.
noun_phrases = []
Parse the tagged words into a tree based on the rule given to the RegexpParser.
tree = chunker.parse(tagged)
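Before traversing the tree, it can help to see which tags the NP rule accepts. The sketch below checks a handful of Penn Treebank tags against the three tag expressions in the rule using Python's re module. It only approximates how RegexpParser reads a tag pattern (it is not NLTK's own matching code), and the tag list is a small hand-picked sample.

```python
import re

# A hand-picked sample of Penn Treebank tags (not the full tag set).
tags = ["DT", "JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS", "VBD", "IN", "PRP"]

# The three tag expressions from the NP rule, without the angle brackets.
expressions = {"DT": "determiner", "JJ.*": "adjective", "NN.*": "noun"}

# For each expression, collect the sample tags it matches in full.
matches = {
    expr: [tag for tag in tags if re.fullmatch(expr, tag)]
    for expr in expressions
}

for expr, role in expressions.items():
    print(f"<{expr}> ({role}): {matches[expr]}")
```

The trailing .* is why the rule also picks up comparative and superlative adjectives and plural or proper nouns, not just plain JJ and NN.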
Traverse the tree and, for each subtree labeled NP, join its words into a noun phrase.
for subtree in tree.subtrees():
    if subtree.label() == 'NP':
        np = " ".join(word for word, pos in subtree.leaves())
        noun_phrases.append(np)
Output results.
for np in noun_phrases:
    print(np)
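The chunking and traversal steps can also be tried on their own, without running the tagger. The sketch below wraps them in a helper, extract_noun_phrases (a name introduced here for illustration), and feeds it tags written by hand. These hand-written tags are an assumption about a reasonable tagging; pos_tag may label some words differently.

```python
from nltk import RegexpParser


def extract_noun_phrases(tagged, grammar="NP: {<DT>?<JJ.*>*<NN.*>+}"):
    """Return the NP chunks found in a list of (word, tag) tuples."""
    tree = RegexpParser(grammar).parse(tagged)
    return [
        " ".join(word for word, pos in subtree.leaves())
        for subtree in tree.subtrees()
        if subtree.label() == "NP"
    ]


# Hand-written tags, assumed for illustration rather than produced by pos_tag.
hand_tagged = [
    ("i", "PRP"), ("saw", "VBD"), ("the", "DT"), ("big", "JJ"),
    ("dog", "NN"), ("on", "IN"), ("the", "DT"), ("hill", "NN"),
]

phrases = extract_noun_phrases(hand_tagged)
print(phrases)
```

With these tags the helper finds two noun phrases, "the big dog" and "the hill". The root of the tree carries a different label, so the NP check skips it.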
This is what the whole source code looks like.
import re
from nltk import pos_tag, RegexpParser
sample = "i saw the big dog on the hill"
words = re.sub(r'[^\w\s]', '', sample).lower().split(' ')
tagged = pos_tag(words)
chunker = RegexpParser("""
NP: {<DT>?<JJ.*>*<NN.*>+}
""")
noun_phrases = []
tree = chunker.parse(tagged)
for subtree in tree.subtrees():
    if subtree.label() == 'NP':
        np = " ".join(word for word, pos in subtree.leaves())
        noun_phrases.append(np)
for np in noun_phrases:
    print(np)
