This blog post explains how to save a simple statistical language model to a CSV file with Python. A language model assigns probabilities to sequences of words and can be used to generate text; a bigram count model is simple enough that a single CSV file can store it.
The csv library is needed to write the model to a CSV file, the re library lets the tokenizer strip punctuation from the text corpus, and defaultdict from collections gives us a counter that treats missing bigrams as zero, which also makes it easy to rebuild the model when reading the CSV file back in.
import csv
import re
from collections import defaultdict
The function below generates n-grams from a list of tokens; with n = 2 it produces bigrams.
def generate_ngrams(tokens, n):
    ngrams = []
    for i in range(len(tokens) - (n - 1)):
        ngrams.append(tokens[i:i + n])
    return ngrams
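To see what it produces, here is a small self-contained example (the definition is repeated so the snippet runs on its own):

```python
def generate_ngrams(tokens, n):
    ngrams = []
    for i in range(len(tokens) - (n - 1)):
        ngrams.append(tokens[i:i + n])
    return ngrams

# Each bigram is a two-element slice of the token list.
print(generate_ngrams(["the", "cat", "sat"], 2))
# → [['the', 'cat'], ['cat', 'sat']]
```

Note that the n-grams overlap: each token (except the first and last) appears in two bigrams.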
The bigram language model will be saved to this file:
filename = "bigram_counts.csv"
Define the text corpus used in this example.
text = "Natural language processing is fun. Language models can be statistical or neural."
Define a defaultdict(int) object to hold the bigram language model.
bigram_counts = defaultdict(int)
Tokenize the text corpus.
clean = re.sub(r'[^\w\s]', '', text).lower()
tokens = clean.split()
Generate bigrams for building the language model.
bigrams = generate_ngrams(tokens, 2)
Build the bigram language model.
for bigram in bigrams:
    bigram_counts[tuple(bigram)] += 1
Save the bigram language model to a CSV file.
with open(filename, mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["word1", "word2", "count"])
    for bigram, count in bigram_counts.items():
        writer.writerow([bigram[0], bigram[1], count])
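The counts can later be loaded back into a defaultdict(int), which is why defaultdict was imported for reading. Here is a minimal sketch; load_bigram_counts is a helper name introduced for this example, not part of the script above, and the snippet writes a tiny model first so it runs on its own:

```python
import csv
from collections import defaultdict

def load_bigram_counts(path):
    # Rebuild the model: one row per bigram, counts parsed back to int.
    counts = defaultdict(int)
    with open(path, newline="", encoding="utf-8") as file:
        reader = csv.reader(file)
        next(reader)  # skip the "word1,word2,count" header row
        for word1, word2, count in reader:
            counts[(word1, word2)] = int(count)
    return counts

# Round trip: save a tiny model, then read it back.
model = {("language", "models"): 2, ("is", "fun"): 1}
with open("bigram_counts.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["word1", "word2", "count"])
    for (w1, w2), c in model.items():
        writer.writerow([w1, w2, c])

loaded = load_bigram_counts("bigram_counts.csv")
print(loaded[("language", "models")])
# → 2
```

Because the result is a defaultdict(int), looking up a bigram that never occurred simply returns 0 instead of raising a KeyError.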
Putting it all together, here is the complete source code.
import csv
import re
from collections import defaultdict
def generate_ngrams(tokens, n):
    ngrams = []
    for i in range(len(tokens) - (n - 1)):
        ngrams.append(tokens[i:i + n])
    return ngrams

filename = "bigram_counts.csv"
text = "Natural language processing is fun. Language models can be statistical or neural."
bigram_counts = defaultdict(int)
clean = re.sub(r'[^\w\s]', '', text).lower()
tokens = clean.split()
bigrams = generate_ngrams(tokens, 2)
for bigram in bigrams:
    bigram_counts[tuple(bigram)] += 1
with open(filename, mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["word1", "word2", "count"])
    for bigram, count in bigram_counts.items():
        writer.writerow([bigram[0], bigram[1], count])
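As mentioned at the start, a bigram model can also generate text. One way is a weighted random walk over the counts: at each step, sample the next word in proportion to how often it followed the current word. The generate_text helper below is my own sketch, not part of the original script:

```python
import random
from collections import defaultdict

def generate_text(bigram_counts, start_word, length=8, seed=0):
    # Weighted random walk over the bigram counts.
    rng = random.Random(seed)
    words = [start_word]
    for _ in range(length - 1):
        candidates = [(w2, c) for (w1, w2), c in bigram_counts.items()
                      if w1 == words[-1]]
        if not candidates:
            break  # dead end: no bigram starts with the current word
        next_words, weights = zip(*candidates)
        words.append(rng.choices(next_words, weights=weights)[0])
    return " ".join(words)

counts = defaultdict(int)
for pair in [("language", "models"), ("models", "can"), ("can", "be")]:
    counts[pair] += 1
print(generate_text(counts, "language"))
# → language models can be
```

With a corpus this small the walk is effectively deterministic, since each word has at most one recorded successor; on a larger corpus the weighted sampling produces varied output.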
