9.3.2. Punctuation

To identify the words in the text, we need to deal with two issues:

  • When a dash appears in a line, we should replace it with a space – then when we use split, the words will be separated.

  • After splitting the words, we can use strip to remove punctuation.

To handle the first issue, we can use the following function, which takes a string, replaces dashes with spaces, splits the string, and returns the resulting list.

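# Setup cell: locate the project, download and trim the text, and
# predefine the helpers that this section goes on to explain.
import sys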
from pathlib import Path

# Find project root by looking for _config.yml
current = Path.cwd()
for parent in [current, *current.parents]:
    if (parent / '_config.yml').exists():
        project_root = parent
        break
else:
    project_root = Path.cwd().parent.parent

# Add project root to path
sys.path.insert(0, str(project_root))

# Import shared teaching helpers and cell magics
from shared import thinkpython, diagram, jupyturtle, structshape
from shared.download import download
import random
import unicodedata

# Text-analysis setup shared by the split sections.
data_dir = project_root / 'data'
data_dir.mkdir(parents=True, exist_ok=True)
raw_path = data_dir / 'pg43.txt'
filename = data_dir / 'dr_jekyll.txt'

if not filename.exists():
    if not raw_path.exists():
        download('https://www.gutenberg.org/cache/epub/43/pg43.txt', str(raw_path))

    with open(raw_path, encoding='utf-8') as reader, open(filename, 'w', encoding='utf-8') as writer:
        in_body = False
        for line in reader:
            if line.startswith('***'):
                if not in_body:
                    in_body = True
                    continue
                break
            if in_body:
                writer.write(line)

def split_line(line):
    return line.replace('—', ' ').split()

punc_marks = {}
for line in open(filename, encoding='utf-8'):
    for char in line:
        category = unicodedata.category(char)
        if category.startswith('P'):
            punc_marks[char] = 1

punctuation = ''.join(punc_marks)

def clean_word(word):
    return word.strip(punctuation).lower()

word_counter = {}
for line in open(filename, encoding='utf-8'):
    for word in split_line(line):
        word = clean_word(word)
        if word:
            word_counter[word] = word_counter.get(word, 0) + 1
def split_line(line):
    return line.replace('—', ' ').split()

Notice that split_line replaces only em dashes, not hyphens. Here’s an example.

split_line('coolness—frightened')
['coolness', 'frightened']
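To make the difference between an em dash and a hyphen concrete, here is a minimal sketch that repeats the definition of split_line so it runs on its own. The sample line is made up for illustration and is not a quotation from the book.

```python
def split_line(line):
    # Replace em dashes (U+2014) with spaces, then split on whitespace.
    return line.replace('—', ' ').split()

# Hypothetical sample containing both a hyphen and an em dash.
sample = 'a pocket-handkerchief—quite clean'
words = split_line(sample)
print(words)
```

The hyphenated word should stay in one piece, while the em dash acts as a word boundary.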

Now, to remove punctuation from the beginning and end of each word, we can use strip, but we need a list of characters that are considered punctuation.

Characters in Python strings are in Unicode, which is an international standard used to represent letters in nearly every alphabet, numbers, symbols, punctuation marks, and more. The unicodedata module provides a category function we can use to tell which characters are punctuation. Given a single character, it returns a two-letter string that identifies the character's general category and subcategory.

import unicodedata

unicodedata.category('A')
'Lu'

The category string of 'A' is 'Lu' – the 'L' means it is a letter and the 'u' means it is uppercase.

The category string of '.' is 'Po' – the 'P' means it is punctuation and the 'o' means its subcategory is “other”.

unicodedata.category('.')
'Po'
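As a further illustration, the following sketch prints the category of a handful of characters. The selection of characters is an assumption made for this example; the point is that every punctuation mark has a category starting with 'P', while letters, digits, and spaces do not.

```python
import unicodedata

# Illustrative sample: letters, a digit, several punctuation marks, a space.
samples = ['A', 'a', '7', '.', '-', '—', '(', ')', '“', '”', '_', ' ']

# Map each character to its two-letter Unicode category.
categories = {char: unicodedata.category(char) for char in samples}
for char, category in categories.items():
    print(repr(char), category)
```

Note that the hyphen and the em dash share the category 'Pd' (dash punctuation), which is why split_line has to single out the em dash explicitly rather than rely on the category.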

We can find the punctuation marks in the book by checking for characters with categories that begin with 'P'. The following loop stores the unique punctuation marks in a dictionary.

punc_marks = {}
for line in open(filename, encoding='utf-8'):  # filename is the cleaned text, dr_jekyll.txt
    for char in line:
        category = unicodedata.category(char)
        if category.startswith('P'):
            punc_marks[char] = 1

To collect the punctuation marks in a single string, which is the form strip expects, we can join the keys of the dictionary.

punctuation = ''.join(punc_marks)
print(punctuation)
.’;,-“”:?—‘!()_
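One reason to scan the text with unicodedata rather than reuse a ready-made list is that the book contains curly quotes and em dashes. The sketch below uses a set instead of a dictionary, and a short made-up sentence instead of the book file; both choices are assumptions made to keep the example self-contained. It compares the result with string.punctuation, which holds ASCII punctuation only.

```python
import string
import unicodedata

# Hypothetical sample text standing in for the book.
text = '“Behold!” he cried—and then, silence.'

# Collect each character whose Unicode category starts with 'P'.
found = {char for char in text if unicodedata.category(char).startswith('P')}

# Punctuation marks that the ASCII-only string.punctuation would miss.
missed = found - set(string.punctuation)
print(sorted(found))
print(sorted(missed))
```

In this sample the curly quotes and the em dash are found by the category test but are absent from string.punctuation.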

Now that we know which characters in the book are punctuation, we can write a function that takes a word, strips punctuation from the beginning and end, and converts it to lower case.

def clean_word(word):
    return word.strip(punctuation).lower()

Here’s an example.

clean_word('“Behold!”')
'behold'

Because strip removes characters from the beginning and end, it leaves hyphenated words alone.

clean_word('pocket-handkerchief')
'pocket-handkerchief'
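The following sketch shows how strip behaves in a few more cases. It hard-codes the punctuation string printed earlier in this section so that it runs without the book file, and the sample words are made up for illustration. The argument to strip is treated as a set of characters, and characters are removed from each end until one outside the set is reached.

```python
# Punctuation string as printed for the book earlier in this section.
punctuation = '.’;,-“”:?—‘!()_'

def clean_word(word):
    # Strip punctuation from both ends, then convert to lowercase.
    return word.strip(punctuation).lower()

# Hypothetical sample words.
examples = ['“Behold!”', '(well-known),', '—!?—', 'Jekyll’s']
cleaned = [clean_word(word) for word in examples]
print(cleaned)
```

Two details are worth noticing: punctuation inside a word, such as a hyphen or an apostrophe, is left alone, and a word that consists only of punctuation becomes an empty string.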

Now here’s a loop that uses split_line and clean_word to identify the unique words in the book.

unique_words2 = {}
for line in open(filename, encoding='utf-8'):
    for word in split_line(line):    # split_line treats em dashes as spaces
        word = clean_word(word)      # strip punctuation and convert to lowercase
        unique_words2[word] = 1

len(unique_words2)
4005

With this stricter definition of what a word is, there are about 4000 unique words. And we can confirm that the list of longest words has been cleaned up.

The argument key=len tells sorted to call len on each word and to order the words by that length, shortest first by default, so the last five elements are the longest words.

sorted(unique_words2, key=len)[-5:]
['circumscription',
 'unimpressionable',
 'fellow-creatures',
 'chocolate-coloured',
 'pocket-handkerchief']
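Here is a small sketch of sorting by length on a made-up word list (the words are an assumption for illustration). It also shows the optional reverse argument of sorted, which puts the longest words first.

```python
# Hypothetical word list.
words = ['hyde', 'a', 'jekyll', 'of', 'utterson', 'dr']

# Sort by length, shortest first; ties keep their original order.
by_length = sorted(words, key=len)

# The last two elements are the two longest words.
longest_two = by_length[-2:]

# Sort by length, longest first.
longest_first = sorted(words, key=len, reverse=True)

print(by_length)
print(longest_two)
print(longest_first)
```

Because sorted is stable, words of equal length, such as 'of' and 'dr', appear in the same relative order as in the input.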

### EXERCISE: Cleaning Words

words = ['"Hello,"', 'world!', 'chocolate-coloured', '"Behold!"', 'pocket-handkerchief']
# Use clean_word() to process each word and print the cleaned version.
# Note: clean_word() strips punctuation from both ends and lowercases.
### Your code starts here:



### Your code ends here.

# Solution
words = ['"Hello,"', 'world!', 'chocolate-coloured', '"Behold!"', 'pocket-handkerchief']
for word in words:
    print(clean_word(word))
"hello,"
world
chocolate-coloured
"behold!"
pocket-handkerchief
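The straight double quotes survive in this output because the punctuation string was built from characters that appear in the book, which uses curly quotes; the straight quote is punctuation by its Unicode category but is not in that string. The sketch below checks this and shows one possible variant, clean_word_extended, which is a hypothetical helper and not part of the original text: it adds the ASCII marks from string.punctuation to the set of stripped characters.

```python
import string
import unicodedata

# Punctuation string as printed for the book earlier in this section.
punctuation = '.’;,-“”:?—‘!()_'

straight = '"'
category = unicodedata.category(straight)   # punctuation by category
in_book_list = straight in punctuation      # but absent from the book's marks
print(category, in_book_list)

# Assumption: extend the book's marks with ASCII punctuation.
extended = punctuation + string.punctuation

def clean_word_extended(word):
    # Hypothetical variant of clean_word that also strips ASCII punctuation.
    return word.strip(extended).lower()

words = ['"Hello,"', 'world!', '"Behold!"', 'pocket-handkerchief']
cleaned = [clean_word_extended(word) for word in words]
print(cleaned)
```

Whether to extend the list depends on the text being analyzed; for this book, the marks found by scanning are sufficient.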

Now let’s see how many times each word is used.