
Save A Bigram Language Model To A CSV File With Python

This blog post explains how to save a bigram language model to a CSV file with Python. A bigram language model records how often each pair of adjacent words appears in a text corpus; those counts can then be used to estimate word probabilities or to generate random text. Because the model is just a table of word pairs and counts, it is easy to store in a CSV file.

Three imports are needed. The csv module is used to write the model to a CSV file, the re module is used by the tokenizer to strip punctuation from the text corpus, and the defaultdict class from the collections module is used to count the bigrams before they are written to the file.

import csv
import re
from collections import defaultdict

The function below generates n-grams from a list of tokens; with n set to 2 it produces bigrams.

def generate_ngrams(tokens, n):
    # Slide a window of size n over the token list and collect each window.
    ngrams = []
    for i in range(len(tokens) - (n - 1)):
        ngrams.append(tokens[i:i + n])
    return ngrams
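
For example, calling the function with n set to 2 on a short token list (a quick check, separate from the script itself) returns every pair of adjacent tokens:

print(generate_ngrams(["natural", "language", "processing"], 2))
# [['natural', 'language'], ['language', 'processing']]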

Define the filename that the bigram language model will be saved to.

filename = "bigram_counts.csv"

Define the text corpus used by this example.

text = "Natural language processing is fun. Language models can be statistical or neural."

Define a defaultdict(int) object to hold the bigram language model.

bigram_counts = defaultdict(int)
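
A defaultdict(int) returns 0 for any key it has not seen yet, so the counting loop further down can increment a bigram's count without first checking whether the key exists. A quick illustration, separate from the script itself:

counts = defaultdict(int)
counts[("language", "models")] += 1    # no KeyError; the missing key starts at 0
print(counts[("language", "models")])  # prints 1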

Tokenize the text corpus by stripping punctuation with a regular expression, lower-casing the text, and splitting it into words.

clean = re.sub(r'[^\w\s]', '', text).lower()
tokens = clean.split()
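
Printing the tokens is an optional check, but it shows that the punctuation has been stripped and everything is lower case:

print(tokens)
# ['natural', 'language', 'processing', 'is', 'fun', 'language', 'models',
#  'can', 'be', 'statistical', 'or', 'neural']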

Generate bigrams from the token list for building the language model.

bigrams = generate_ngrams(tokens,2)

Build the bigram language model by counting each bigram.

# Lists cannot be dictionary keys, so each bigram is converted to a tuple before counting.
for bigram in bigrams:
    bigram_counts[tuple(bigram)] += 1
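
For this small corpus every bigram occurs exactly once, so the model is just a dictionary mapping each word pair to 1:

print(dict(bigram_counts))
# {('natural', 'language'): 1, ('language', 'processing'): 1, ('processing', 'is'): 1,
#  ('is', 'fun'): 1, ('fun', 'language'): 1, ('language', 'models'): 1, ('models', 'can'): 1,
#  ('can', 'be'): 1, ('be', 'statistical'): 1, ('statistical', 'or'): 1, ('or', 'neural'): 1}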

Save the bigram language model to a CSV file, one row per bigram.

with open(filename, mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["word1", "word2", "count"])
    for bigram, count in bigram_counts.items():
        writer.writerow([bigram[0], bigram[1], count])
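
After the script runs, bigram_counts.csv contains a header row followed by one row per distinct bigram. For this corpus the file should look like this:

word1,word2,count
natural,language,1
language,processing,1
processing,is,1
is,fun,1
fun,language,1
language,models,1
models,can,1
can,be,1
be,statistical,1
statistical,or,1
or,neural,1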

This is what the whole source code looks like.

import csv
import re
from collections import defaultdict

def generate_ngrams(tokens, n):
    # Slide a window of size n over the token list and collect each window.
    ngrams = []
    for i in range(len(tokens) - (n - 1)):
        ngrams.append(tokens[i:i + n])
    return ngrams

filename = "bigram_model.csv"

text = "Natural language processing is fun. Language models can be statistical or neural."

bigram_counts = defaultdict(int)

clean = re.sub(r'[^\w\s]', '', text).lower()
tokens = clean.split()

bigrams = generate_ngrams(tokens,2)

# Lists cannot be dictionary keys, so each bigram is converted to a tuple before counting.
for bigram in bigrams:
    bigram_counts[tuple(bigram)] += 1

with open(filename, mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["word1", "word2", "count"])
    for bigram, count in bigram_counts.items():
        writer.writerow([bigram[0], bigram[1], count])
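
The introduction mentions that a language model can be used to generate random text and that the model can be read back from a CSV file. As a rough sketch of how that could look, the counts can be loaded back into a defaultdict and a next word can be sampled with random.choices, weighted by the counts. The load_bigram_counts and generate helpers below are illustrative additions, not part of the script above.

import csv
import random
from collections import defaultdict

def load_bigram_counts(filename):
    # Read the CSV written above back into a defaultdict keyed by (word1, word2).
    counts = defaultdict(int)
    with open(filename, mode="r", newline="", encoding="utf-8") as file:
        reader = csv.DictReader(file)
        for row in reader:
            counts[(row["word1"], row["word2"])] = int(row["count"])
    return counts

def generate(counts, start, length=6):
    # Start from a seed word and repeatedly sample a next word,
    # weighting each candidate by how often the bigram was counted.
    words = [start]
    for _ in range(length - 1):
        candidates = [(w2, c) for (w1, w2), c in counts.items() if w1 == words[-1]]
        if not candidates:
            break
        next_words, weights = zip(*candidates)
        words.append(random.choices(next_words, weights=weights)[0])
    return " ".join(words)

counts = load_bigram_counts("bigram_counts.csv")
print(generate(counts, "language"))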
