This blog post explains how to tokenize a string in Python.
First, import the re module, Python's built-in regular expression library.
import re
Next, assign the string to tokenize to a variable.
s = "How's it going?"
The clean variable holds the result of re.sub, which strips the string of all punctuation. The pattern [^\w\s] matches every character that is neither a word character nor whitespace, and re.sub replaces each match with an empty string. Chaining .lower() onto the result converts the remaining characters to lowercase.
clean = re.sub(r'[^\w\s]', '', s).lower()
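To see what this step produces on its own, here is a minimal sketch of the cleaning stage. Note that the apostrophe in "How's" is simply removed, so the word becomes "hows":

```python
import re

s = "How's it going?"
# [^\w\s] matches any character that is neither a word character
# nor whitespace, so the apostrophe and question mark are dropped.
clean = re.sub(r'[^\w\s]', '', s).lower()
print(clean)  # hows it going
```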
The tokens variable holds the result of clean.split(' '), which splits the cleaned string into a list of tokens using a single space as the delimiter.
tokens = clean.split(' ')
Finally, the tokens are printed out.
print(tokens)
Here is the complete code.
import re
s = "How's it going?"
clean = re.sub(r'[^\w\s]', '', s).lower()
tokens = clean.split(' ')
print(tokens)
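One caveat worth knowing: split(' ') produces empty strings when the input contains consecutive spaces. Calling split() with no argument splits on any run of whitespace and drops empty results, which is often more robust. A small sketch of the difference:

```python
messy = "hows  it going"  # note the double space
print(messy.split(' '))  # ['hows', '', 'it', 'going']
print(messy.split())     # ['hows', 'it', 'going']
```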
