9.3. Text Analysis
At this point we have covered Python’s core data structures – lists, dictionaries, and tuples – and some algorithms that use them. In this chapter, we’ll use them to explore text analysis and Markov generation:
Text analysis is a way to describe the statistical relationships between the words in a document, like the probability that one word is followed by another, and
Markov generation is a way to generate new text with words and phrases similar to the original text.
These algorithms are similar to parts of a Large Language Model (LLM), which is the key component of a chatbot.
We’ll start by counting the number of times each word appears in a book. Then we’ll look at pairs of words, and make a list of the words that can follow each word. We’ll make a simple version of a Markov generator, and as an exercise, you’ll have a chance to make a more general version.
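To make the idea of "the words that can follow each word" concrete, here is a minimal sketch on a made-up sentence. The names `sample` and `followers` are illustrative assumptions, not the book's code; the chapter develops its own version step by step.

```python
# Hypothetical sample text, not taken from the book
sample = "the cat sat on the mat and the cat slept"
words = sample.split()

# Map each word to the list of words that appear right after it
followers = {}
for first, second in zip(words, words[1:]):
    if first not in followers:
        followers[first] = []
    followers[first].append(second)

print(followers["the"])   # every word seen directly after "the"

# Estimated probability that "the" is followed by "cat" in this sample
prob = followers["the"].count("cat") / len(followers["the"])
print(prob)
```

In this sample, "the" is followed by "cat" twice and "mat" once, so the estimated probability of "cat" after "the" is 2/3.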
9.3.1. Unique words
As a first step toward text analysis, let’s read a book – The Strange Case Of Dr. Jekyll And Mr. Hyde by Robert Louis Stevenson – and count the number of unique words. Instructions for downloading the book are in the notebook for this chapter.
The following cell downloads the book from Project Gutenberg.
# project_root (a pathlib.Path) and download() are assumed to be defined earlier in the notebook
data_dir = project_root / 'data'
data_dir.mkdir(parents=True, exist_ok=True)  # create the data directory if it doesn't exist
raw_path = data_dir / 'pg43.txt'  # raw text file downloaded from Project Gutenberg
clean_path = data_dir / 'dr_jekyll.txt'  # cleaned text file that we will use for analysis

if not raw_path.exists():
    download('https://www.gutenberg.org/cache/epub/43/pg43.txt', str(raw_path))
    print('Downloaded to', raw_path)
else:
    print('Already downloaded:', raw_path)
Already downloaded: /Users/tcn85/workspace/py/data/pg43.txt
The version available from Project Gutenberg includes information about the book at the beginning and license information at the end.
We’ll use clean_file from Chapter 8 to remove this material and write a “clean” file that contains only the text of the book.
def is_special_line(line):
    # Project Gutenberg marks the start and end of the book text with lines beginning '*** '
    return line.strip().startswith('*** ')

def clean_file(input_file, output_file):
    reader = open(input_file, encoding='utf-8')  # open the input file for reading as UTF-8
    writer = open(output_file, 'w')  # open the output file for writing
    # reader and writer are file objects; iterating over reader yields one line at a time

    for line in reader:
        if is_special_line(line):
            break  # found the start marker: stop skipping the header

    for line in reader:
        if is_special_line(line):
            break  # found the end marker: stop before the license text
        writer.write(line)  # copy each line of the book text to the output file

    reader.close()  # close the input file
    writer.close()  # close the output file

    # Alternative body using a with statement, which closes both files automatically:
    # with open(input_file, encoding='utf-8') as reader, open(output_file, 'w') as writer:
    #     for line in reader:
    #         if is_special_line(line):
    #             break
    #     for line in reader:
    #         if is_special_line(line):
    #             break
    #         writer.write(line)
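The two consecutive for loops in clean_file work because a file object is an iterator: the second loop resumes where the first one stopped instead of starting over. Here is a minimal sketch of the same pattern, using a list of made-up lines wrapped in `iter` in place of a real file. The sample lines and the name `kept` are illustrative assumptions.

```python
def is_special_line(line):
    return line.strip().startswith('*** ')

# Made-up lines standing in for a Project Gutenberg file
lines = [
    'Header information\n',
    '*** START OF THE BOOK ***\n',
    'First line of text\n',
    'Second line of text\n',
    '*** END OF THE BOOK ***\n',
    'License information\n',
]

reader = iter(lines)   # behaves like a file object in a for loop
kept = []

for line in reader:    # skip everything up to and including the start marker
    if is_special_line(line):
        break

for line in reader:    # resumes right after the start marker
    if is_special_line(line):
        break          # stop at the end marker
    kept.append(line)

print(kept)
```

If the loops ran over the list itself rather than an iterator, the second loop would start again from the first element and stop immediately at the start marker, so nothing would be kept.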
filename = clean_path  # path to the cleaned text, dr_jekyll.txt
clean_file(raw_path, filename)  # read from pg43.txt, write to dr_jekyll.txt

count = 0  # count lines so we don't print the entire file
for line in open(filename):
    print(line, end='')
    count += 1
    if count > 20:  # stop once 21 lines have been printed
        break
The Strange Case Of Dr. Jekyll And Mr. Hyde
by Robert Louis Stevenson
Contents
STORY OF THE DOOR
SEARCH FOR MR. HYDE
DR. JEKYLL WAS QUITE AT EASE
THE CAREW MURDER CASE
INCIDENT OF THE LETTER
INCIDENT OF DR. LANYON
We’ll use a for loop to read lines from the file and the split method to divide each line into words.
Then, to keep track of unique words, we’ll store each word as a key in a dictionary.
unique_words = {}
for line in open(filename):  # filename is dr_jekyll.txt, the cleaned text
    seq = line.split()
    for word in seq:
        unique_words[word] = 1  # the value is arbitrary; only the keys matter

len(unique_words)
6042
The length of the dictionary is the number of unique words – about 6000 by this way of counting.
But if we inspect them, we’ll see that some are not valid words.
For example, let’s look at the longest words in unique_words.
We can use sorted to sort the words, passing the len function as the key argument so that the words are sorted by length rather than alphabetically.
sorted(unique_words, key=len)[-5:]
['chocolate-coloured',
'superiors—behold!”',
'coolness—frightened',
'gentleman—something',
'pocket-handkerchief.']
The slice index, [-5:], selects the last 5 elements of the sorted list, which are the longest words.
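To see how the key argument and the slice work together, here is a small sketch with a made-up dictionary standing in for unique_words. The names `sample_words`, `by_length`, and `longest_two` are illustrative.

```python
# A small made-up dictionary standing in for unique_words
sample_words = {'pear': 1, 'fig': 1, 'banana': 1, 'kiwi': 1, 'watermelon': 1}

# Iterating over a dictionary yields its keys, so sorted returns a list of keys
by_length = sorted(sample_words, key=len)   # shortest first
print(by_length)        # ['fig', 'pear', 'kiwi', 'banana', 'watermelon']

longest_two = by_length[-2:]   # the last two elements are the two longest
print(longest_two)      # ['banana', 'watermelon']
```

Words of equal length, like 'pear' and 'kiwi', keep their original relative order because sorted is stable.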
The text contains some legitimately long words, like “circumscription”, and the list includes hyphenated words, like “chocolate-coloured”. But some of the longest “words” are actually two words joined by a dash. And other words include punctuation like periods, exclamation points, and quotation marks.
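These artifacts appear because split, called with no arguments, breaks a string only at whitespace, so any dash or punctuation mark stays attached to its neighbors. A minimal sketch with a made-up line (the sentence is an assumption for illustration, not a quotation from the book):

```python
# Made-up line; \u2014 is the long dash character that appears in the book
line = 'He spoke with coolness\u2014frightened too. "Yes," he said.'

tokens = line.split()   # splits on whitespace only
print(tokens)

# Two words joined by a dash come back as a single token
print(tokens[3])

# Punctuation makes otherwise identical words count as different keys
print('said.' == 'said')   # False
```

Counted this way, a word followed by a period and the same word without one would be stored as two different keys in the dictionary.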
So, before we move on, let’s deal with dashes and other punctuation.
### EXERCISE: Counting Unique Words
text = "to be or not to be that is the question to be"
# 1. Build a dictionary where each key is a unique word from 'text'
# 2. Print the total number of unique words
### Your code starts here:
### Your code ends here.
8