This blog post explains how to load a bigram language model with Python. A bigram language model records, for each word, which words can follow it and how often they do, and it can be used to predict the next word or to generate random text. A simple statistical model like this is easy to store in a CSV file with one row per word pair.
This is what the CSV file looks like; a sketch of how such a file could be produced follows the sample.
word1,word2,count
natural,language,1
language,processing,1
processing,can,1
can,be,2
be,fun,1
fun,language,1
language,models,1
models,can,1
be,statistical,1
statistical,or,1
or,neural,1
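For context, a file like the one above can be produced by counting adjacent word pairs in a corpus. The sketch below is one possible way to do that, assuming simple whitespace tokenization; the corpus string and the csv.writer approach are illustrative choices, not part of the loading program described in this post.
import csv
from collections import Counter

# Illustrative corpus; splitting on whitespace is the simplest possible tokenization.
corpus = "natural language processing can be fun language models can be statistical or neural"
tokens = corpus.split()

# Count each adjacent pair of words (a bigram).
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Write one row per bigram: word1, word2, count.
with open("bigram_model.csv", mode='w', encoding='utf-8', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["word1", "word2", "count"])
    for (word1, word2), count in bigram_counts.items():
        writer.writerow([word1, word2, count])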
The csv module from the standard library is imported so the program can read the CSV file, and defaultdict is imported from collections to hold the language model while it is being built.
import csv
from collections import defaultdict
An empty defaultdict of dictionaries is created to hold the language model extracted from the CSV file. Each key will be a first word, and each value will be a dictionary mapping the words that follow it to their counts.
model = defaultdict(dict)
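As a quick illustration of why defaultdict(dict) is convenient here: assigning to a nested key works even for a word that has not been seen yet, because the inner dictionary is created automatically on first access. A plain dict would raise a KeyError instead.
from collections import defaultdict

model = defaultdict(dict)
model["can"]["be"] = 2   # the inner dict for "can" is created automatically
print(model["can"])      # {'be': 2}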
A with statement opens the CSV file so it is closed automatically afterwards. The csv.DictReader class reads each row as a dictionary keyed by the column names. The rows are traversed and the language model is built, with the count converted from a string to an integer.
with open("bigram_model.csv", mode='r', encoding='utf-8') as file:
rows = csv.DictReader(file)
for row in rows:
word1 = row["word1"]
word2 = row["word2"]
count = row["count"]
model[word1][word2] = count
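After the loop finishes, the model built from the sample file above is equivalent to the following nested dictionary, mapping each first word to its followers and their counts:
{
    "natural": {"language": 1},
    "language": {"processing": 1, "models": 1},
    "processing": {"can": 1},
    "can": {"be": 2},
    "be": {"fun": 1, "statistical": 1},
    "fun": {"language": 1},
    "models": {"can": 1},
    "statistical": {"or": 1},
    "or": {"neural": 1},
}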
The finished language model is converted into a plain dict so that later lookups of unknown words do not silently add empty entries.
model = dict(model)
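The conversion matters because a defaultdict grows whenever a missing key is accessed with square brackets, which is handy while building the model but surprising once it is only being queried. A small illustration of the difference:
from collections import defaultdict

d = defaultdict(dict)
d["missing"]                   # silently creates an empty entry for "missing"
print(len(d))                  # 1

plain = dict(d)
print(plain.get("absent", {})) # {} -- nothing is added to plain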
A search word is chosen to look up in the model.
search_word = "models"
The continuations of the search word are retrieved with model.get, which returns an empty dictionary when the word is not in the model. The result is stored in a variable called next_words rather than next, so the built-in next function is not shadowed.
next_words = model.get(search_word, {})
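Because .get falls back to an empty dictionary, looking up a word that never starts a bigram simply yields no matches instead of raising an error. With the sample data, for example:
print(model.get("neural", {}))   # {} -- "neural" only appears as a second word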
Finally, the matches are printed in alphabetical order, one word and count per line.
for word, count in sorted(next_words.items()):
    print(word, count)
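With the sample file above and "models" as the search word, the loop prints a single continuation:
can 1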
This is what the whole source code looks like.
import csv
from collections import defaultdict

# The outer key is the first word of a bigram; the inner dict maps follower words to counts.
model = defaultdict(dict)

# Read the bigram counts from the CSV file.
with open("bigram_model.csv", mode='r', encoding='utf-8') as file:
    rows = csv.DictReader(file)
    for row in rows:
        word1 = row["word1"]
        word2 = row["word2"]
        count = int(row["count"])  # DictReader returns strings, so convert the count
        model[word1][word2] = count

# Freeze the model as a plain dict so unknown lookups do not add entries.
model = dict(model)

# Look up the words that can follow the search word and print them with their counts.
search_word = "models"
next_words = model.get(search_word, {})
for word, count in sorted(next_words.items()):
    print(word, count)
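As mentioned at the start, a bigram model can also be used to generate random text. The snippet below is a minimal sketch of one way to do that with the model loaded above, sampling each next word with its count as the weight; the generate function name, the starting word, and the length are arbitrary choices for illustration.
import random

def generate(model, start_word, length=8):
    # Repeatedly sample a follower of the current word, weighted by its bigram count.
    words = [start_word]
    current = start_word
    for _ in range(length):
        followers = model.get(current, {})
        if not followers:
            break  # no known continuation for this word
        current = random.choices(list(followers), weights=list(followers.values()))[0]
        words.append(current)
    return " ".join(words)

print(generate(model, "natural"))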
