9.3.4. Random numbers#
As a step toward Markov text generation, next we’ll choose a random sequence of words from word_counter.
But first let’s talk about randomness.
Given the same inputs, most computer programs are deterministic, which means they generate the same outputs every time. Determinism is usually a good thing, since we expect the same calculation to yield the same result. For some applications, though, we want the computer to be unpredictable. Games are one example, but there are more.
Making a program truly nondeterministic turns out to be difficult, but there are ways to fake it. One is to use algorithms that generate pseudorandom numbers. Pseudorandom numbers are not truly random because they are generated by a deterministic computation, but just by looking at the numbers it is all but impossible to distinguish them from random.
The random module provides functions that generate pseudorandom numbers – which I will simply call “random” from here on.
We can import it like this.
import sys
from pathlib import Path
# Find project root by looking for _config.yml
current = Path.cwd()
for parent in [current, *current.parents]:
if (parent / '_config.yml').exists():
project_root = parent
break
else:
project_root = Path.cwd().parent.parent
# Add project root to path
sys.path.insert(0, str(project_root))
# Import shared teaching helpers and cell magics
from shared import thinkpython, diagram, jupyturtle, structshape
from shared.download import download
import random
import unicodedata
# Text-analysis setup shared by the split sections.
data_dir = project_root / 'data'
data_dir.mkdir(parents=True, exist_ok=True)
raw_path = data_dir / 'pg43.txt'
filename = data_dir / 'dr_jekyll.txt'
if not filename.exists():
if not raw_path.exists():
download('https://www.gutenberg.org/cache/epub/43/pg43.txt', str(raw_path))
with open(raw_path, encoding='utf-8') as reader, open(filename, 'w', encoding='utf-8') as writer:
in_body = False
for line in reader:
if line.startswith('***'):
if not in_body:
in_body = True
continue
break
if in_body:
writer.write(line)
def split_line(line):
return line.replace('—', ' ').split()
punc_marks = {}
for line in open(filename, encoding='utf-8'):
for char in line:
category = unicodedata.category(char)
if category.startswith('P'):
punc_marks[char] = 1
punctuation = ''.join(punc_marks)
def clean_word(word):
return word.strip(punctuation).lower()
word_counter = {}
for line in open(filename, encoding='utf-8'):
for word in split_line(line):
word = clean_word(word)
if word:
word_counter[word] = word_counter.get(word, 0) + 1
def second_element(item):
return item[1]
def print_most_common(counter, num=5):
items = sorted(counter.items(), key=second_element, reverse=True)
for key, freq in items[:num]:
print(freq, key, sep='\t')
import random
# this cell initializes the random number generator so it
# generates the same sequence each time the notebook runs.
random.seed(4)
The random module provides a function called choice that chooses an element from a list at random, with every element having the same probability of being chosen.
t = [1, 2, 3]
random.choice(t)
1
If you call the function again, you might get the same element again, or a different one.
for i in range(10):
print(random.choice(t), end=' ')
2 1 3 2 2 1 1 1 1 2
In the long run, we expect to get every element about the same number of times.
If you use choice with a dictionary, you get a KeyError.
%%expect KeyError
random.choice(word_counter)
KeyError: 2250
To choose a random key, you have to put the keys in a list and then call choice.
words = list(word_counter)
random.choice(words)
'tragic'
If we generate a random sequence of words, it doesn’t make much sense.
for i in range(6):
word = random.choice(words)
print(word, end=' ')
withdrew powers turpitude would hitherto relations
Part of the problem is that we are not taking into account that some words are more common than others. The results will be better if we choose words with different “weights”, so that some are chosen more often than others.
If we use the values from word_counter as weights, each word is chosen with a probability that depends on its frequency.
weights = word_counter.values()
The random module provides another function called choices that takes weights as an optional argument.
random.choices(words, weights=weights)
['said']
And it takes another optional argument, k, that specifies the number of words to select.
random_words = random.choices(words, weights=weights, k=6)
random_words
['to', 'was', 'and', 'a', 'truly', 'trust']
The result is a list of strings that we can join into something that’s looks more like a sentence.
' '.join(random_words)
'to was and a truly trust'
### EXERCISE: Weighted Random Selection
import random
fruits = ['apple', 'banana', 'cherry']
weights = [10, 3, 1] # apple is most common
# 1. Use random.choices() to select 6 fruits with the given weights
# 2. Print the result
### Your code starts here:
### Your code ends here.
['banana', 'banana', 'apple', 'apple', 'apple', 'banana']
If you choose words from the book at random, you get a sense of the vocabulary, but a series of random words seldom makes sense because there is no relationship between successive words. For example, in a real sentence you expect an article like “the” to be followed by an adjective or a noun, and probably not a verb or adverb. So the next step is to look at these relationships between words.
9.3.4.1. Bigrams#
Instead of looking at one word at a time, now we’ll look at sequences of two words, which are called bigrams. A sequence of three words is called a trigram, and a sequence with some unspecified number of words is called an n-gram.
Let’s write a program that finds all of the bigrams in the book and the number of times each one appears. To store the results, we’ll use a dictionary where
The keys are tuples of strings that represent bigrams, and
The values are integers that represent frequencies.
Let’s call it bigram_counter.
bigram_counter = {}
The following function takes a list of two strings as a parameter.
First it makes a tuple of the two strings, which can be used as a key in a dictionary.
Then it adds the key to bigram_counter, if it doesn’t exist, or increments the frequency if it does.
def count_bigram(bigram):
key = tuple(bigram)
if key not in bigram_counter:
bigram_counter[key] = 1
else:
bigram_counter[key] += 1
As we go through the book, we have to keep track of each pair of consecutive words. So if we see the sequence “man is not truly one”, we would add the bigrams “man is”, “is not”, “not truly”, and so on.
To keep track of these bigrams, we’ll use a list called window, because it is like a window that slides over the pages of the book, showing only two words at a time.
Initially, window is empty.
window = []
We’ll use the following function to process the words one at a time.
def process_word(word):
window.append(word)
if len(window) == 2:
count_bigram(window)
window.pop(0)
# from collections import defaultdict
# model = defaultdict(list)
# for i in range(len(words) - 1):
# current_word = words[i]
# next_word = words[i + 1]
# model[current_word].append(next_word)
The first time this function is called, it appends the given word to window.
Since there is only one word in the window, we don’t have a bigram yet, so the function ends.
The second time it’s called – and every time thereafter – it appends a second word to window.
Since there are two words in the window, it calls count_bigram to keep track of how many times each bigram appears.
Then it uses pop to remove the first word from the window.
The following program loops through the words in the book and processes them one at a time.
for line in open(filename):
for word in split_line(line):
word = clean_word(word)
process_word(word)
The result is a dictionary that maps from each bigram to the number of times it appears.
We can use print_most_common to see the most common bigrams.
print_most_common(bigram_counter)
178 ('of', 'the')
139 ('in', 'the')
94 ('it', 'was')
80 ('and', 'the')
73 ('to', 'the')
Looking at these results, we can get a sense of which pairs of words are most likely to appear together. We can also use the results to generate random text, like this.
random.seed(42) ### just to make the random choices reproducible
bigrams = list(bigram_counter) ### list of bigrams (tuples of two words)
weights = bigram_counter.values()
random_bigrams = random.choices(bigrams, weights=weights, k=10)
bigrams is a list of the bigrams that appear in the books.
weights is a list of their frequencies, so random_bigrams is a sample where the probability a bigram is selected is proportional to its frequency.
Here are the results.
for pair in random_bigrams:
print(' '.join(pair), end=' ')
seriously amiss a man as the but i the cup but here stain of that the silence after by a
This way of generating text is better than choosing random words, but still doesn’t make a lot of sense.
### EXERCISE: Counting Bigrams
sentence = "the cat sat on the mat the cat"
# 1. Split the sentence into words
# 2. Build a bigram_counts dictionary mapping (word1, word2) tuples to their frequency
# 3. Print each bigram and its count, sorted by frequency (highest first)
### Your code starts here:
### Your code ends here.
2 ('the', 'cat')
1 ('cat', 'sat')
1 ('sat', 'on')
1 ('on', 'the')
1 ('the', 'mat')
1 ('mat', 'the')