Tokenize A String With Python

This blog post explains how to tokenize a string, that is, split it into a list of lowercase words, with Python.

First, Python's built-in regular expression module, re, has to be imported.

import re

Then a string variable is declared.

s = "How's it going?"

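The variable tokens holds the result of the re.sub function with the string methods lower and split chained onto it. The re.sub function replaces every character that matches the pattern [^\w\s], that is, every character that is neither a word character (letter, digit, or underscore) nor whitespace, with an empty string, which strips the punctuation marks from the string. The lower method turns all characters in the string to lowercase. The split method then converts the string into a list of tokens, or words, using a single space as the delimiter. Because the delimiter is exactly one space, consecutive spaces in the input produce empty strings in the list.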

tokens = re.sub(r'[^\w\s]', '', s).lower().split(' ')
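To make the chained call easier to follow, here is a minimal sketch that runs the same three steps one at a time and prints each intermediate result. The variable names no_punct and lowered are only illustrative and are not part of the original code.

```python
import re

s = "How's it going?"

# Step 1: remove every character that is neither a word character nor whitespace.
no_punct = re.sub(r'[^\w\s]', '', s)
print(no_punct)  # Hows it going

# Step 2: convert the remaining text to lowercase.
lowered = no_punct.lower()
print(lowered)   # hows it going

# Step 3: split on single spaces to get the list of tokens.
tokens = lowered.split(' ')
print(tokens)    # ['hows', 'it', 'going']
```

Note that the apostrophe is removed rather than replaced with a space, so "How's" becomes the single token "hows".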

Then the tokens are printed out.

print(tokens)

This is what the whole code looks like.

import re

s = "How's it going?"
tokens = re.sub(r'[^\w\s]', '', s).lower().split(' ')
print(tokens)
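The code above splits on a single space, so input with double spaces or tabs would leave empty or merged tokens in the list. One possible variation, sketched here under the assumption that any run of whitespace should count as one separator, is to call split with no argument. The function name tokenize is a hypothetical helper introduced for this example.

```python
import re

def tokenize(text):
    """Hypothetical helper: strip punctuation, lowercase, split on any whitespace."""
    # split() with no argument treats any run of whitespace as one separator
    # and never returns empty strings.
    return re.sub(r'[^\w\s]', '', text).lower().split()

messy = "How's  it\tgoing?"  # two spaces and a tab

# Original approach: an empty token appears and the tab is not split on.
print(re.sub(r'[^\w\s]', '', messy).lower().split(' '))  # ['hows', '', 'it\tgoing']

# Variation: whitespace runs collapse into single separators.
print(tokenize(messy))  # ['hows', 'it', 'going']
```

For the example string "How's it going?" both versions return the same list, so the difference only matters for less tidy input.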
