Word Frequencies

9.3.3. Word Frequencies

The following loop computes the frequency of each unique word. It is very similar to the value_counts() function in the dictionary section, where we counted the letter frequencies of “brontosaurus” and “mississippi”.

import sys
from pathlib import Path

# Find project root by looking for _config.yml
current = Path.cwd()
for parent in [current, *current.parents]:
    if (parent / '_config.yml').exists():
        project_root = parent
        break
else:
    project_root = Path.cwd().parent.parent

# Add project root to path
sys.path.insert(0, str(project_root))

# Import shared teaching helpers and cell magics
from shared import thinkpython, diagram, jupyturtle, structshape
from shared.download import download

import random
import unicodedata

# Text-analysis setup shared by the split sections.
data_dir = project_root / 'data'
data_dir.mkdir(parents=True, exist_ok=True)
raw_path = data_dir / 'pg43.txt'
filename = data_dir / 'dr_jekyll.txt'

if not filename.exists():
    if not raw_path.exists():
        download('https://www.gutenberg.org/cache/epub/43/pg43.txt', str(raw_path))

    with open(raw_path, encoding='utf-8') as reader, open(filename, 'w', encoding='utf-8') as writer:
        in_body = False
        for line in reader:
            if line.startswith('***'):
                if not in_body:
                    in_body = True
                    continue
                break
            if in_body:
                writer.write(line)

def split_line(line):
    return line.replace('—', ' ').split()

punc_marks = {}
for line in open(filename, encoding='utf-8'):
    for char in line:
        category = unicodedata.category(char)
        if category.startswith('P'):
            punc_marks[char] = 1

punctuation = ''.join(punc_marks)

def clean_word(word):
    return word.strip(punctuation).lower()

word_counter = {}
for line in open(filename, encoding='utf-8'):
    for word in split_line(line):
        word = clean_word(word)
        if word:
            word_counter[word] = word_counter.get(word, 0) + 1

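word_counter = {}
for line in open(filename, encoding='utf-8'):
    for word in split_line(line):
        word = clean_word(word)
        if not word:
            continue    # skip tokens that were only punctuation
        if word not in word_counter:
            word_counter[word] = 1      # first time we see this word
        else:
            word_counter[word] += 1     # seen before: increment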

The first time we see a word, we initialize its frequency to 1. If we see the same word again later, we increment its frequency.
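As a minimal sketch of the same counting pattern on a short string, the if/else can also be collapsed into one line with the dictionary method get, which returns a default value when the key is missing. The sample text and the name counts are illustrative assumptions, not taken from the book.

```python
# Illustrative sample text; not taken from the book.
sample = "to be or not to be"

counts = {}
for word in sample.split():
    # get returns 0 the first time a word is seen, so one line
    # covers both the "new word" and the "seen before" cases.
    counts[word] = counts.get(word, 0) + 1

print(counts)
```

Both versions build the same dictionary; the explicit if/else simply makes the two cases easier to see.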

To see which words appear most often, we can use items to get the key-value pairs from word_counter, and sort them by the second element of each pair, which is the frequency. First we’ll define a function that selects the second element, just like second_element() in the tuple chapter.

def second_element(t):
    return t[1]

Now we can use sorted with two keyword arguments:

key=second_element means the items will be sorted according to the frequencies of the words.
reverse=True means the items will be sorted in reverse order, with the most frequent words first.

items = sorted(word_counter.items(), key=second_element, reverse=True)

Here are the five most frequent words.

for word, freq in items[:5]:
    print(freq, word, sep='\t')

the
and
of
to
i
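To see what the two keyword arguments do on a small scale, here is a sketch with a hypothetical dictionary in which two pairs of words tie. Python's sort is stable, even with reverse=True, so tied words keep the order in which they were inserted. The names counts and pairs are made up for this example.

```python
def second_element(t):
    return t[1]

# Hypothetical counts, chosen so that some words tie.
counts = {'or': 1, 'to': 2, 'not': 1, 'be': 2}

# Sort the (word, count) pairs by count, largest first.
pairs = sorted(counts.items(), key=second_element, reverse=True)
print(pairs)

# An anonymous function can play the same role as second_element.
same_pairs = sorted(counts.items(), key=lambda item: item[1], reverse=True)
print(pairs == same_pairs)
```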

### EXERCISE: Word Frequency Counter

text = "the cat sat on the mat and the cat sat"
# 1. Build a word frequency dictionary (word → count) for 'text'
# 2. Print the top 3 most frequent words with their counts (tab-separated)
### Your code starts here:



### Your code ends here.

the
cat
sat

In the next section, we’ll encapsulate this loop in a function. And we’ll use it to demonstrate a new feature – optional parameters.

9.3.3.1. Optional Parameters

We’ve used built-in functions that take optional parameters. For example, round takes an optional parameter called ndigits that indicates how many decimal places to keep.

round(3.141592653589793, ndigits=3)

3.142

But it’s not just built-in functions – we can write functions with optional parameters, too. For example, the following function takes two parameters, word_counter and num.

def print_most_common(word_counter, num=5):
    items = sorted(word_counter.items(), key=second_element, reverse=True)

    for word, freq in items[:num]:
        print(freq, word, sep='\t')

The second parameter looks like an assignment statement, but it’s not – it’s an optional parameter.

If you call this function with one argument, num gets the default value, which is 5.

print_most_common(word_counter)

the
and
of
to
i

If you call this function with two arguments, the second argument gets assigned to num instead of the default value.

print_most_common(word_counter, 3)

the
and
of

In that case, we would say the optional argument overrides the default value.
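The three ways of calling a function with an optional parameter can be sketched as follows. The function most_common here is a hypothetical variant that returns the pairs instead of printing them, so the results are easy to compare, and its default of 2 is an arbitrary choice for the example.

```python
def most_common(counter, num=2):
    """Return the num most frequent (word, count) pairs."""
    pairs = sorted(counter.items(), key=lambda item: item[1], reverse=True)
    return pairs[:num]

# Hypothetical counts, not taken from the book.
counts = {'to': 2, 'be': 2, 'or': 1, 'not': 1}

default_result = most_common(counts)         # num falls back to 2
positional_result = most_common(counts, 3)   # 3 overrides the default
keyword_result = most_common(counts, num=1)  # same idea, passed by name

print(default_result)
print(positional_result)
print(keyword_result)
```

Passing the optional argument by name, as in the last call, tends to make the call easier to read when a function has several optional parameters.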

If a function has both required and optional parameters, all of the required parameters have to come first, followed by the optional ones.

%%expect SyntaxError
def bad_function(n=5, word_counter):
    return None

  Cell In[13], line 1
    def bad_function(n=5, word_counter):
                          ^
SyntaxError: parameter without a default follows parameter with a default
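Outside a notebook, where the %%expect magic is not available, one way to observe the same error is to hand the definition to the built-in compile function as a string. This is only a sketch: the exact wording of the error message depends on the Python version, so the code checks only that a SyntaxError is raised. The name good_function is hypothetical.

```python
# The invalid definition, stored as a string so this file itself still parses.
bad_source = "def bad_function(n=5, word_counter):\n    return None\n"

try:
    compile(bad_source, "<example>", "exec")
    raised = False
except SyntaxError:
    raised = True

print(raised)

# Swapping the order, with the required parameter first, is accepted.
def good_function(word_counter, n=5):
    return n

print(good_function({}))
```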

### EXERCISE: Function with Optional Parameter

# A separate name keeps the book's word_counter (used again below) intact.
small_counter = {'the': 3, 'cat': 2, 'sat': 2, 'on': 1, 'mat': 1, 'and': 1}

# write a function called print_least_common that takes a word_counter dictionary
# and an optional parameter num (default 3)
# the function should print the num least frequent words in the word_counter,
# one per line, with the frequency and word separated by a tab
### Your code starts here:

### Your code ends here.

on
mat
and

on
mat
and
cat
sat

9.3.3.2. Dictionary Subtraction

Suppose we want to spell-check a book – that is, find a list of words that might be misspelled. One way to do that is to find words in the book that don’t appear in a list of valid words. Here we’ll use the word list words.txt to spell-check Robert Louis Stevenson. We can think of this problem as set subtraction – that is, we want to find all the words from one set (the words in the book) that are not in the other (the words in the list).
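The idea of set subtraction can be sketched directly with Python's built-in set type and its - operator. The two small sets below are made up for illustration. This section works with dictionaries rather than sets, presumably because the dictionary also keeps each word's frequency.

```python
# Hypothetical stand-ins for the words in a book and a list of valid words.
book_words = {'jekyll', 'hyde', 'lawyer', 'door'}
valid = {'lawyer', 'door', 'street'}

# Set difference: everything in book_words that is not in valid.
suspects = book_words - valid
print(sorted(suspects))
```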

The following cell downloads the word list.

(*I am using ../../data as the data folder. Your data folder may be different from mine if you download the notebook and place it in your own project folder. If words.txt is downloaded into the same folder as the notebook, you do not have to specify a data folder at all.)

# Common Python idiom: run the download only when this code is executed
# directly (as the main program), not when it is imported as a module.
if __name__ == '__main__':
    download('https://raw.githubusercontent.com/AllenDowney/ThinkPython/v3/words.txt', '../../data');

We can read the contents of words.txt and split it into a list of strings.

# Reading a file with open(...).read() in a single expression is common,
# but it never closes the file explicitly, which can leak resources.
# A with statement is better practice: it closes the file when its block
# finishes, even if an error occurs.
# word_list = open('../../data/words.txt').read().split()

with open('../../data/words.txt') as f:
    word_list = f.read().split()
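To check that the with statement really closes the file, here is a self-contained sketch that writes a tiny stand-in word file to a temporary directory and reads it back. The file name tiny_words.txt and its three words are assumptions for the example, not the real words.txt.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a tiny stand-in word file, one word per line.
    path = os.path.join(tmp_dir, 'tiny_words.txt')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('aa\naah\naahed\n')

    # Read it back and split it into a list of words.
    with open(path, encoding='utf-8') as f:
        tiny_list = f.read().split()
        still_open = not f.closed   # inside the block the file is open

    closed_after = f.closed         # after the block it has been closed

print(tiny_list, still_open, closed_after)
```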

Then we’ll store the words as keys in a dictionary so we can use the in operator to check quickly whether a word is valid.

valid_words = {}            ### another dictionary to store the valid words from the word list
for word in word_list:
    valid_words[word] = 1
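As a small sketch of why the dictionary helps: the in operator on a list compares against the elements one by one, while on a dictionary it uses a hash lookup on the keys. The names tiny_list and tiny_valid are hypothetical stand-ins, and dict.fromkeys is used here as a compact alternative to the loop above.

```python
# A tiny stand-in for the real word list.
tiny_list = ['aa', 'aah', 'aahed']

# Build a dictionary whose keys are the words; every value is 1.
tiny_valid = dict.fromkeys(tiny_list, 1)

print('aah' in tiny_valid)   # membership test on the keys
print('zzz' in tiny_valid)

# A set offers the same fast membership test when no values are needed.
tiny_set = set(tiny_list)
print('aahed' in tiny_set)
```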

Now, to identify words that appear in the book but not in the word list, we’ll use subtract, which takes two dictionaries as parameters and returns a new dictionary that contains every key from the first that is not in the second, along with its value from the first.

def subtract(d1, d2):
    res = {}
    for key in d1:
        if key not in d2:
            res[key] = d1[key]
    return res

Here’s how we use it.

diff = subtract(word_counter, valid_words)

To get a sample of words that might be misspelled, we can print the most common words in diff.

print_most_common(diff)

The most common “misspelled” words are mostly names and a few single-letter words (Mr. Utterson is Dr. Jekyll’s friend and lawyer).

If we select words that appear only once, they are more likely to be actual misspellings. We can do that by looping through the items and making a list of words with frequency 1.

singletons = []
for word, freq in diff.items():
    if freq == 1:
        singletons.append(word)

Here are the last few elements of the list.

singletons[-5:]

[]

Most of them are valid words that are not in the word list. But 'reindue' appears to be a misspelling of 'reinduce', so at least we found one legitimate error.
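The loop that builds singletons can also be written as a list comprehension. The following sketch uses a made-up dictionary, toy_diff, rather than the real diff, and checks that both forms give the same list.

```python
# Hypothetical frequencies, not taken from the book.
toy_diff = {'alpha': 3, 'beta': 1, 'gamma': 1, 'delta': 2}

# Loop form, as in the text.
loop_result = []
for word, freq in toy_diff.items():
    if freq == 1:
        loop_result.append(word)

# Equivalent list comprehension.
comprehension_result = [word for word, freq in toy_diff.items() if freq == 1]

print(loop_result)
print(comprehension_result)
```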

### EXERCISE: Dictionary Subtraction

d1 = {'apple': 3, 'banana': 1, 'cherry': 4, 'date': 2}
d2 = {'banana': 1, 'cherry': 4}
# Use subtract(d1, d2) to find keys in d1 that are not in d2.
# Print the resulting dictionary.
### Your code starts here:



### Your code ends here.

{'apple': 3, 'date': 2}
