dedtech.info

Information about computer technology.

Sentence Tokenization With Python

This blog post will explain how to split a string containing several sentences into a list of individual sentences with Python. Regular expressions will be used to accomplish this task. First, the required library has to be imported.

import re

A function named tokenizer will have to be declared. The function takes a string as an argument. In this example, that string contains three sentences that are organized like a paragraph. One sentence ends with an exclamation point, another with a question mark, and the last with a period. The splitting pattern combines two pieces. The first piece is the lookbehind (?<=\.|\?|!), which asserts that the position being matched is immediately preceded by one of the sentence-ending punctuation marks; the period must be escaped as \. because an unescaped . matches any character. The second piece is \s, which matches the whitespace that follows the punctuation mark. The re.split function takes the pattern stored in the variable pattern and the function argument t, splits t at every match, and returns a list of tokenized sentences, which is stored in the variable sentences. Because the punctuation is matched inside a lookbehind, it is not consumed by the split, so each sentence keeps its final punctuation mark.

def tokenizer(t):
    # Split after ., ?, or ! when followed by whitespace; the lookbehind
    # keeps the punctuation attached to each sentence.
    pattern = r'(?<=\.|\?|!)\s'
    sentences = re.split(pattern, t)
    return sentences
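To see why the lookbehind matters, it can be compared with a plain character-class split. This is a small sketch (not part of the original post) using a made-up two-sentence string:

```python
import re

s = "Hi there! All good."

# Lookbehind: the punctuation is only asserted, not consumed,
# so each sentence keeps its final mark.
print(re.split(r'(?<=\.|\?|!)\s', s))   # ['Hi there!', 'All good.']

# Plain split: the punctuation is part of the match and is removed
# along with the space.
print(re.split(r'[.?!]\s', s))          # ['Hi there', 'All good.']
```

Keeping the punctuation inside the lookbehind is what lets the function return complete sentences rather than sentences with their final marks stripped.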

A variable named text is declared that holds three short sentences. The variable r holds the result of calling the tokenizer function with text as its argument.

text = "Hello! How are you doing today? I hope you're having a great day."
r = tokenizer(text)

The tokenized sentences stored in r are printed out.

for s in r:
    print(s)
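The result can also be checked directly without the loop. This sketch repeats the function from above and asserts the expected list:

```python
import re

def tokenizer(t):
    pattern = r'(?<=\.|\?|!)\s'
    return re.split(pattern, t)

text = "Hello! How are you doing today? I hope you're having a great day."

# Each element of the returned list is one sentence, punctuation included.
assert tokenizer(text) == [
    "Hello!",
    "How are you doing today?",
    "I hope you're having a great day.",
]
```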

This is what the whole source code looks like.

import re

def tokenizer(t):
    # Split after ., ?, or ! when followed by whitespace; the lookbehind
    # keeps the punctuation attached to each sentence.
    pattern = r'(?<=\.|\?|!)\s'
    sentences = re.split(pattern, t)
    return sentences

text = "Hello! How are you doing today? I hope you're having a great day."
r = tokenizer(text)

for s in r:
    print(s)
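One caveat worth noting: the pattern has no notion of abbreviations, so any period followed by a space triggers a split. A hypothetical example (not from the original post) shows the behavior:

```python
import re

pattern = r'(?<=\.|\?|!)\s'

# "Dr." ends with a period followed by a space, so the split
# fires in the middle of the first sentence.
print(re.split(pattern, "Dr. Smith arrived. She sat down."))
# ['Dr.', 'Smith arrived.', 'She sat down.']
```

For text where abbreviations are common, a more elaborate pattern or a dedicated sentence tokenizer may be a better fit; for simple paragraphs like the one used here, the regular expression above is enough.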
