Markov analysis

9.3.5. Markov analysis#

We can do better with Markov chain text analysis, which computes, for each word in a text, the list of words that come next. As an example, we’ll analyze these lyrics from the Monty Python song Eric, the Half a Bee:

import sys
from pathlib import Path

# Find project root by looking for _config.yml
current = Path.cwd()
for parent in [current, *current.parents]:
    if (parent / '_config.yml').exists():
        project_root = parent
        break
else:
    project_root = Path.cwd().parent.parent

# Add project root to path
sys.path.insert(0, str(project_root))

# Import shared teaching helpers and cell magics
from shared import thinkpython, diagram, jupyturtle, structshape
from shared.download import download
import random
import unicodedata

# Text-analysis setup shared by the split sections.
data_dir = project_root / 'data'
data_dir.mkdir(parents=True, exist_ok=True)
raw_path = data_dir / 'pg43.txt'
filename = data_dir / 'dr_jekyll.txt'

if not filename.exists():
    if not raw_path.exists():
        download('https://www.gutenberg.org/cache/epub/43/pg43.txt', str(raw_path))

    with open(raw_path, encoding='utf-8') as reader, open(filename, 'w', encoding='utf-8') as writer:
        in_body = False
        for line in reader:
            if line.startswith('***'):
                if not in_body:
                    in_body = True
                    continue
                break
            if in_body:
                writer.write(line)

def split_line(line):
    return line.replace('—', ' ').split()

punc_marks = {}
for line in open(filename, encoding='utf-8'):
    for char in line:
        category = unicodedata.category(char)
        if category.startswith('P'):
            punc_marks[char] = 1

punctuation = ''.join(punc_marks)

def clean_word(word):
    return word.strip(punctuation).lower()

word_counter = {}
for line in open(filename, encoding='utf-8'):
    for word in split_line(line):
        word = clean_word(word)
        if word:
            word_counter[word] = word_counter.get(word, 0) + 1
song = """
Half a bee, philosophically,
Must, ipso facto, half not be.
But half the bee has got to be
Vis a vis, its entity. D'you see?
"""

To store the results, we’ll use a dictionary that maps from each word to the list of words that follow it.

successor_map = {}

As an example, let’s start with the first two words of the song.

first = 'half'
second = 'a'

If the first word is not in successor_map, we have to add a new item that maps from the first word to a list containing the second word.

successor_map[first] = [second]
successor_map
{'half': ['a']}

If the first word is already in the dictionary, we can look it up to get the list of successors we’ve seen so far, and append the new one.

first = 'half'
second = 'not'

successor_map[first].append(second)
successor_map
{'half': ['a', 'not']}

The following function encapsulates these steps.

def add_bigram(bigram):
    first, second = bigram
    
    if first not in successor_map:
        successor_map[first] = [second]
    else:
        successor_map[first].append(second)

If the same bigram appears more that once, the second word is added to the list more than once. In this way, successor_map keeps track of how many times each successor appears.

As we did in the previous section, we’ll use a list called window to store pairs of consecutive words. And we’ll use the following function to process the words one at a time.

def process_word_bigram(word):
    window.append(word)
    
    if len(window) == 2:
        add_bigram(window)
        window.pop(0)

Here’s how we use it to process the words in the song.

successor_map = {}
window = []

for word in song.split():
    word = clean_word(word)
    process_word_bigram(word)

And here are the results.

successor_map
{'half': ['a', 'not', 'the'],
 'a': ['bee', 'vis'],
 'bee': ['philosophically', 'has'],
 'philosophically': ['must'],
 'must': ['ipso'],
 'ipso': ['facto'],
 'facto': ['half'],
 'not': ['be'],
 'be': ['but', 'vis'],
 'but': ['half'],
 'the': ['bee'],
 'has': ['got'],
 'got': ['to'],
 'to': ['be'],
 'vis': ['a', 'its'],
 'its': ['entity'],
 'entity': ["d'you"],
 "d'you": ['see']}

The word 'half' can be followed by 'a', 'not', or 'the'. The word 'a' can be followed by 'bee' or 'vis'. Most of the other words appear only once, so they are followed by only a single word.

Now let’s analyze the book.

successor_map = {}
window = []

for line in open(filename):
    for word in split_line(line):
        word = clean_word(word)
        process_word_bigram(word)
        
        
# from collections import defaultdict
# model = defaultdict(list)

# for i in range(len(words) - 1):
#     current_word = words[i]
#     next_word = words[i + 1]
#     model[current_word].append(next_word)

We can look up any word and find the words that can follow it.

# I used this cell to find a predecessor with a good number of 
# possible successors and at least one repeated word.

def has_duplicates(t):
    return len(set(t)) < len(t)         ### A set is a collection of unique 
                                        ### elements, so if the length of 
                                        # the set is less than the length 
                                        # of the original list, it means 
                                        # there are duplicates in the list.

for key, value in successor_map.items():
    if len(value) == 7 and has_duplicates(value):
        print(key, value)
story ['of', 'of', 'indeed', 'but', 'for', 'of', 'that']
incident ['of', 'of', 'at', 'of', 'of', 'at', 'this']
lanyon’s ['narrative', 'there', 'face', 'manner', 'narrative', 'the', 'condemnation']
common ['it', 'interest', 'friends', 'friends', 'observers', 'but', 'quarry']
relief ['the', 'to', 'when', 'that', 'that', 'of', 'it']
appearance ['of', 'well', 'something', 'he', 'amply', 'of', 'which']
going ['east', 'in', 'to', 'to', 'up', 'to', 'of']
till ['at', 'the', 'i', 'yesterday', 'the', 'that', 'weariness']
walk ['and', 'into', 'with', 'all', 'steadfastly', 'attired', 'with']
sounds ['nothing', 'carried', 'out', 'the', 'of', 'of', 'with']
really ['like', 'damnable', 'can', 'a', 'a', 'not', 'be']
does ['not', 'not', 'indeed', 'not', 'the', 'not', 'not']
reply ['i', 'but', 'whose', 'i', 'some', 'that’s', 'i']
continued ['mr', 'the', 'the', 'the', 'the', 'poole', 'utterson']
seems ['scarcely', 'hardly', 'to', 'she', 'much', 'he', 'to']
walked ['on', 'over', 'some', 'was', 'on', 'with', 'fast']
that’s ['a', 'it', 'talking', 'very', 'not', 'such', 'not']
although ['i', 'a', 'it', 'the', 'we', 'they', 'i']
until ['the', 'the', 'they', 'they', 'the', 'to-morrow', 'i']
disappearance ['or', 'the', 'of', 'of', 'here', 'and', 'but']
step ['into', 'or', 'back', 'natural', 'into', 'of', 'leaping']
wish ['the', 'you', 'to', 'you', 'to', 'i', 'to']
aware ['of', 'of', 'of', 'that', 'jekyll’s', 'that', 'of']
thank ['you', 'you', 'you', 'you', 'you', 'god', 'you']
maid ['servant', 'described', 'fainted', 'had', 'calls', 'had', 'lifted']
besides ['were', 'was', 'for', 'a', 'with', 'with', 'which']
observed ['the', 'utterson', 'with', 'the', 'that', 'that', 'that']
among ['other', 'the', 'the', 'the', 'my', 'my', 'temptations']
successor_map['going']
['east', 'in', 'to', 'to', 'up', 'to', 'of']

In this list of successors, notice that the word 'to' appears three times – the other successors only appear once.

### EXERCISE: Building a Successor Map

text = "alice was nice alice was clever alice was kind"
# 1. Split the text into words
# 2. Build a successor_map where each word maps to the list of words that follow it
# 3. Print the map and verify that 'alice' maps to ['was', 'was', 'was']
### Your code starts here:



### Your code ends here.

Hide code cell source

# Solution
text = "alice was nice alice was clever alice was kind"
words_list = text.split()
smap = {}
for i in range(len(words_list) - 1):
    first, second = words_list[i], words_list[i + 1]
    if first not in smap:
        smap[first] = [second]
    else:
        smap[first].append(second)
print(smap)
# 'alice' → ['was', 'was', 'was'], 'was' → ['nice', 'clever', 'kind']
{'alice': ['was', 'was', 'was'], 'was': ['nice', 'clever', 'kind'], 'nice': ['alice'], 'clever': ['alice']}

9.3.5.1. Generating text#

We can use the results from the previous section to generate new text with the same relationships between consecutive words as in the original. Here’s how it works:

  • Starting with any word that appears in the text, we look up its possible successors and choose one at random.

  • Then, using the chosen word, we look up its possible successors, and choose one at random.

We can repeat this process to generate as many words as we want. As an example, let’s start with the word 'although'. Here are the words that can follow it.

word = 'although'
successors = successor_map[word]
successors
['i', 'a', 'it', 'the', 'we', 'they', 'i']
# this cell initializes the random number generator so it 
# starts at the same point in the sequence each time this
# notebook runs.

random.seed(2)

We can use choice to choose from the list with equal probability.

word = random.choice(successors)
word
'i'

If the same word appears more than once in the list, it is more likely to be selected.

Repeating these steps, we can use the following loop to generate a longer series.

for i in range(10):
    successors = successor_map[word]
    word = random.choice(successors)
    print(word, end=' ')
continue to hesitate and swallowed the smile withered from that 

The result sounds more like a real sentence, but it still doesn’t make much sense.

We can do better using more than one word as a key in successor_map. For example, we can make a dictionary that maps from each bigram – or trigram – to the list of words that come next.

### EXERCISE: Generate Text from Successor Map

import random
random.seed(7)
# Using the successor_map built from Dr. Jekyll and Mr. Hyde:
# 1. Start with the word 'jekyll'
# 2. Generate a sequence of 8 more words by repeatedly looking up successors
#    and choosing one at random with random.choice()
# 3. Print the full generated sequence as a sentence
### Your code starts here:



### Your code ends here.

Hide code cell source

# Solution
import random
random.seed(7)
word = 'jekyll'
result = [word]
for i in range(8):
    if word in successor_map and successor_map[word]:
        word = random.choice(successor_map[word])
        result.append(word)
    else:
        break
print(' '.join(result))
jekyll god cried the situation tell you say is