This blog post explains how to tokenize a string in Python.
First, import the re module, Python's built-in regular expression library.
import re
Next, assign the string to tokenize to a variable.
s = "How's it going?"
The clean variable holds the result of re.sub, which strips the string of all punctuation. The pattern [^\w\s] matches every character that is neither a word character nor whitespace, and re.sub replaces each match with an empty string. Chaining .lower() onto the result converts the remaining characters to lowercase.
clean = re.sub(r'[^\w\s]', '', s).lower()
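To see what this step produces on its own, here is a minimal sketch of the cleaning stage. Note that the apostrophe in "How's" is simply removed, so the word becomes "hows":

```python
import re

s = "How's it going?"
# [^\w\s] matches any character that is neither a word character
# nor whitespace, so the apostrophe and question mark are dropped.
clean = re.sub(r'[^\w\s]', '', s).lower()
print(clean)  # hows it going
```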
The tokens variable holds the result of clean.split(' '), which splits the cleaned string into a list of tokens using a single space as the delimiter.
tokens = clean.split(' ')
Finally, the tokens are printed out.
print(tokens)
Here is the complete code.
import re
s = "How's it going?"
clean = re.sub(r'[^\w\s]', '', s).lower()
tokens = clean.split(' ')
print(tokens)
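One caveat worth knowing: split(' ') produces empty strings when the input contains consecutive spaces. Calling split() with no argument splits on any run of whitespace and drops empty results, which is often more robust. A small sketch of the difference:

```python
messy = "hows  it going"  # note the double space
print(messy.split(' '))  # ['hows', '', 'it', 'going']
print(messy.split())     # ['hows', 'it', 'going']
```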
