dedtech.info

Information about computer technology.

Generate Random Text From Bigrams With Python

This blog post explains how to generate random text from bigrams with Python.

Four imports are needed: TextBlob from the textblob library to parse and tokenize sentences, ngrams from nltk.util to build bigrams from the tokenized sentences, defaultdict from collections to store the bigram counts and the bigram language model, and random to pick a random value out of a list while the sentence is being generated.

from textblob import TextBlob
from nltk.util import ngrams
from collections import defaultdict
import random

The generate function that builds a sentence from a single starting word is simpler than it looks. It takes the language model, a starting word, and the desired sentence length as parameters. An empty list named sentence is declared, and a for loop runs once for each word of the desired length. On each pass, the current word is appended to sentence, the possible next words are looked up in the language model, and a random value is chosen from that list to become the next word. If a word has no recorded successors, the loop stops early instead of crashing. When the loop finishes, the join function converts the sentence list into a string separated by spaces.

def generate(model, word, num):
    sentence = []
    for _ in range(num):
        sentence.append(word)
        # Look up the possible next words for the current word.
        next_words = model.get(word, {})
        if not next_words:
            break  # dead end: no bigram starts with this word
        word = random.choice(list(next_words))
    return ' '.join(sentence)
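A quick way to sanity-check the function is to call it with a tiny hand-built model (the toy_model below is made up for illustration; a guard is included for words that have no recorded successor). Because every word in the toy model has at most one successor, the output is deterministic despite random.choice.

```python
import random

def generate(model, word, num):
    sentence = []
    for _ in range(num):
        sentence.append(word)
        next_words = model.get(word, {})
        if not next_words:
            break  # dead end: no bigram starts with this word
        word = random.choice(list(next_words))
    return ' '.join(sentence)

# Hand-built model: each word maps to its possible successors and counts.
toy_model = {"is": {"easy": 1}, "easy": {"fun": 1}}
print(generate(toy_model, "is", 5))  # is easy fun
```

Note that the walk stops after three words even though five were requested, because "fun" never appears as the first word of a bigram.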

This is the text corpus used for this example.

text = """TextBlob is very amazing and simple to use.
What a great tool! That tool helps with simple tasks.
Tokenization is easy with that tool.
"""

An empty list is declared to hold the bigrams.

bigrams = []

A TextBlob object is declared to process the text corpus.

blob = TextBlob(text)

Two defaultdict objects are declared to hold bigram counts and the bigram language model.

counts = defaultdict(int)
model = defaultdict(dict)

The sentences are parsed, tokenized, and converted into bigrams.

for sentence in blob.sentences:
    bigrams += list(ngrams(sentence.words, 2))
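To see what a bigram list looks like without installing textblob or nltk, here is an equivalent pure-Python sketch. It pairs each word with its successor the way ngrams(words, 2) does, using a plain whitespace split for tokenization (an assumption for the example; TextBlob's tokenizer also strips punctuation).

```python
def simple_bigrams(words):
    # Pair each word with the word that follows it,
    # mirroring nltk.util.ngrams(words, 2).
    return list(zip(words, words[1:]))

words = "TextBlob is very amazing".split()
print(simple_bigrams(words))
# [('TextBlob', 'is'), ('is', 'very'), ('very', 'amazing')]
```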

The bigrams are counted and stored in a defaultdict object.

for bigram in bigrams:
    counts[bigram] += 1
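With a small made-up bigram list, the counting loop produces a mapping from each bigram tuple to how often it occurred. Because counts is a defaultdict(int), a bigram seen for the first time starts at 0 and is incremented to 1 without any key check.

```python
from collections import defaultdict

bigrams = [("is", "easy"), ("is", "very"), ("is", "easy")]
counts = defaultdict(int)
for bigram in bigrams:
    counts[bigram] += 1

print(dict(counts))
# {('is', 'easy'): 2, ('is', 'very'): 1}
```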

The language model is created by organizing and storing bigrams and their counts in a defaultdict object.

for bigram, count in counts.items():
    word1, word2 = bigram
    model[word1][word2] = count
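Running this loop over some made-up counts shows the nested shape of the model: each first word maps to a dict of the words that can follow it, along with their counts.

```python
from collections import defaultdict

# Made-up bigram counts for illustration.
counts = {("is", "easy"): 2, ("is", "very"): 1, ("easy", "with"): 1}

model = defaultdict(dict)
for (word1, word2), count in counts.items():
    model[word1][word2] = count

print(dict(model))
# {'is': {'easy': 2, 'very': 1}, 'easy': {'with': 1}}
```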

The language model is converted from a defaultdict back to a plain dict, so that indexing a missing word raises a KeyError instead of silently inserting an empty entry.

model = dict(model)
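The conversion matters because a defaultdict grows on every lookup with square brackets, even one that was only meant to read. A small sketch of the difference:

```python
from collections import defaultdict

d = defaultdict(dict)
_ = d["unseen"]           # this lookup silently inserts an empty dict
print("unseen" in d)      # True

plain = dict(d)
print(plain.get("missing"))  # None, and nothing is inserted
```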

This will be the word the generated sentence starts with.

word = "is"

The result of the function is printed to the screen.

print(generate(model, word, 5))

This is what the whole source code looks like.

from textblob import TextBlob
from nltk.util import ngrams
from collections import defaultdict
import random

def generate(model, word, num):
    sentence = []
    for _ in range(num):
        sentence.append(word)
        # Look up the possible next words for the current word.
        next_words = model.get(word, {})
        if not next_words:
            break  # dead end: no bigram starts with this word
        word = random.choice(list(next_words))
    return ' '.join(sentence)

text = """TextBlob is very amazing and simple to use.
What a great tool! That tool helps with simple tasks.
Tokenization is easy with that tool.
"""

bigrams = []

blob = TextBlob(text)

counts = defaultdict(int)
model = defaultdict(dict)

for sentence in blob.sentences:
    bigrams += list(ngrams(sentence.words, 2))

for bigram in bigrams:
    counts[bigram] += 1

for bigram, count in counts.items():
    word1, word2 = bigram
    model[word1][word2] = count

model = dict(model)

word = "is"

print(generate(model, word, 5))
