This blog post explains how to save a simple statistical language model to a JSON file with Python. A language model assigns likelihoods to sequences of words and can be used to generate text; a simple one built from bigram counts is easy to store in a JSON file.
The json module is needed to write the data to a JSON file. The ngrams function from nltk.util extracts bigrams from the tokenized text corpus, and defaultdict from collections makes it easy to count those bigrams.
import json
from nltk.util import ngrams
from collections import defaultdict
Create an empty list that will hold the data written to the JSON file.
data = []
Define the text corpus used by this example.
corpus = "how is it going today how are you hello how is it going"
Tokenize the text corpus.
tokens = corpus.split(' ')
Generate bigrams for building the language model.
bigrams = ngrams(tokens, 2)
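ngrams returns a lazy generator of tuples; for n=2 it pairs each token with its successor, which is equivalent to zipping the token list against itself shifted by one position. A quick pure-Python sketch of that behavior, without nltk:

```python
tokens = ["how", "is", "it", "going"]

# Pair each token with its successor, mirroring ngrams(tokens, 2)
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)  # [('how', 'is'), ('is', 'it'), ('it', 'going')]
```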
Define a defaultdict(int) object to hold the bigram language model.
bigram_counts = defaultdict(int)
Build bigram language model.
for bigram in bigrams:
    bigram_counts[bigram] += 1
Each bigram and its count is appended to the list data as a dictionary.
for bigram, count in bigram_counts.items():
    data.append({"word1": bigram[0], "word2": bigram[1], "count": count})
The contents of the variable data are written to a JSON file.
with open("result.json", 'w') as file:
    json.dump(data, file, indent=1)
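With the example corpus, result.json begins with entries like the following (the order follows the first appearance of each bigram, since Python 3.7+ dictionaries preserve insertion order):

```json
[
 {
  "word1": "how",
  "word2": "is",
  "count": 2
 },
 {
  "word1": "is",
  "word2": "it",
  "count": 2
 }
]
```

The full file contains one such entry for every distinct bigram in the corpus.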
This is what the whole source code looks like.
import json
from nltk.util import ngrams
from collections import defaultdict
data = []
corpus = "how is it going today how are you hello how is it going"
tokens = corpus.split(' ')
bigrams = ngrams(tokens, 2)
bigram_counts = defaultdict(int)
for bigram in bigrams:
    bigram_counts[bigram] += 1
for bigram, count in bigram_counts.items():
    data.append({"word1": bigram[0], "word2": bigram[1], "count": count})
with open("result.json", 'w') as file:
    json.dump(data, file, indent=1)
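To sanity-check the saved file, the entries can be loaded back with json.load and rebuilt into a defaultdict keyed by bigram tuples. A minimal round-trip sketch, using a hypothetical two-entry subset of the data in the same shape the script writes:

```python
import json
from collections import defaultdict

# Hypothetical subset of entries in the shape the script writes
data = [
    {"word1": "how", "word2": "is", "count": 2},
    {"word1": "is", "word2": "it", "count": 2},
]

with open("result.json", 'w') as file:
    json.dump(data, file, indent=1)

# Load the entries back from the JSON file
with open("result.json") as file:
    entries = json.load(file)

# Rebuild the bigram counts keyed by (word1, word2) tuples
bigram_counts = defaultdict(int)
for entry in entries:
    bigram_counts[(entry["word1"], entry["word2"])] = entry["count"]

print(bigram_counts[("how", "is")])  # 2
```

JSON object keys must be strings, which is why each bigram is stored as two separate fields rather than as a tuple key.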
