9.2.6. Applications

9.2.6.1. Cleaning Text

Before we can search the text of Dracula, we need to download it from Project Gutenberg and remove the header and footer information.

import sys
from pathlib import Path

# Find project root by looking for _config.yml
current = Path.cwd()
for parent in [current, *current.parents]:
    if (parent / '_config.yml').exists():
        project_root = parent
        break
else:
    project_root = Path.cwd().parent.parent

# Add project root to path
sys.path.insert(0, str(project_root))

# Import shared teaching helpers and cell magics
from shared import thinkpython, diagram, jupyturtle, structshape
from shared.download import download
import re

We’ll download the Dracula text from Project Gutenberg and save it to the data folder. Then we’ll clean the file and save the cleaned version in the same folder. All subsequent analysis will use these files.

from pathlib import Path
from urllib.request import urlretrieve

data_dir = project_root / 'data'
data_dir.mkdir(parents=True, exist_ok=True)

# Download Dracula text to the project data folder
url = 'https://www.gutenberg.org/files/345/345-0.txt'
raw_path = data_dir / 'pg345.txt'
clean_path = data_dir / 'pg345_cleaned.txt'
if not raw_path.exists():
    urlretrieve(url, raw_path)
    print('Downloaded Dracula to', raw_path)
else:
    print('Dracula already downloaded:', raw_path)
Dracula already downloaded: /Users/tcn85/workspace/py/data/pg345.txt
# download('https://www.gutenberg.org/cache/epub/345/pg345.txt');
def clean_file(infile, outfile):
    """Read infile, write to outfile skipping special lines."""
    with open(infile, encoding='utf8') as fin, open(outfile, 'w', encoding='utf8') as fout:
        for line in fin:
            if not is_special_line(line):
                fout.write(line)
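def clean_file(input_file, output_file):
    """Copy only the lines between the first two special lines.

    This redefinition replaces the version above, which dropped the
    marker lines but kept the header and footer text.
    """
    with open(input_file, encoding='utf-8') as reader, \
         open(output_file, 'w', encoding='utf-8') as writer:
        # Skip the header, up to and including the start marker
        for line in reader:
            if is_special_line(line):
                break

        # Copy the body until the end marker is reached
        for line in reader:
            if is_special_line(line):
                break
            writer.write(line)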
def is_special_line(line):
    """Return True if the line marks the start or end of the Gutenberg content."""
    return line.startswith('***')
# def is_special_line(line):
#     return line.strip().startswith('*** ')
# Clean the Dracula text and save to data/pg345_cleaned.txt
clean_file(raw_path, clean_path)
print('Cleaned file saved to', clean_path)
Cleaned file saved to /Users/tcn85/workspace/py/data/pg345_cleaned.txt
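
The two-loop version of clean_file works because both loops read from the same file object, so the second loop resumes where the first one stopped. Here is a minimal sketch of the same idea on an in-memory list. The function name strip_header_footer and the sample lines are made up for illustration and are not part of the book's code.

```python
def strip_header_footer(lines):
    """Return the lines between the first and second '***' marker lines."""
    it = iter(lines)  # one shared iterator, like a single open file

    # Skip everything up to and including the start marker.
    for line in it:
        if line.startswith('***'):
            break

    # Collect lines until the end marker.
    body = []
    for line in it:
        if line.startswith('***'):
            break
        body.append(line)
    return body


# Illustrative sample; the marker wording is an assumption.
sample = [
    'Header text\n',
    '*** START OF THE PROJECT GUTENBERG EBOOK DRACULA ***\n',
    'DRACULA\n',
    'by Bram Stoker\n',
    '*** END OF THE PROJECT GUTENBERG EBOOK DRACULA ***\n',
    'License text\n',
]
body = strip_header_footer(sample)
print(body)
```

If the second loop iterated over the list itself rather than the shared iterator, it would start again from the top and stop at the first marker without collecting anything.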

Putting all that together, here’s a function that loops through the lines in the book until it finds one that matches the given pattern, and returns the Match object.

def find_first(pattern, path=clean_path):
    with open(path, encoding='utf8') as f:
        for line in f:
            result = re.search(pattern, line)
            if result is not None:
                return result

We can use it to find the first mention of a character.

result = find_first('Harker')
result.string
'CHAPTER I. Jonathan Harker’s Journal\n'

For this example, we didn’t have to use regular expressions – we could have done the same thing more easily with the in operator. But regular expressions can do things the in operator cannot.

For example, if the pattern includes the vertical bar character, '|', it can match either the sequence on the left or the sequence on the right. Suppose we want to find the first mention of Mina Murray in the book, but we are not sure whether she is referred to by first name or last. We can use the following pattern, which matches either name.

pattern = 'Mina|Murray'
result = find_first(pattern)
result.string
'CHAPTER V. Letters—Lucy and Mina\n'

We can use a pattern like this to see how many times a character is mentioned by either name. Here’s a function that loops through the book and counts the number of lines that match the given pattern.

def count_matches(pattern, path=clean_path):
    count = 0
    with open(path, encoding='utf8') as f:
        for line in f:
            result = re.search(pattern, line)
            if result is not None:
                count += 1
    return count

Now let’s see how many lines mention Mina by either name.

count_matches('Mina|Murray')
229
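
Note that count_matches counts matching lines, not individual mentions: a line that contains a name twice still adds only one to the count. The following sketch, which uses a few made-up lines rather than the book, compares the two ways of counting. It uses re.findall, which returns a list of all non-overlapping matches in a string.

```python
import re

# Made-up lines for illustration, not taken from the book.
lines = [
    'Mina Murray wrote to Lucy.\n',
    'Lucy replied to Mina, and Mina wrote back.\n',
    'Jonathan was away.\n',
]
pattern = 'Mina|Murray'

# Number of lines with at least one match (what count_matches computes).
line_count = sum(1 for line in lines if re.search(pattern, line))

# Total number of matches across all lines.
occurrence_count = sum(len(re.findall(pattern, line)) for line in lines)

print(line_count, occurrence_count)
```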

The special character '^' matches the beginning of a string, so we can find a line that starts with a given pattern.

result = find_first('^Dracula')
result.string
'Dracula, jumping to his feet, said:--\n'

And the special character '$' matches the end of a string, so we can find a line that ends with a given pattern (ignoring the newline at the end).

result = find_first('Harker$')
result.string
'by five o’clock, we must start off; for it won’t do to leave Mrs. Harker\n'
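
To see the two anchors on their own, here is a small sketch with made-up lines. It assumes nothing beyond the re module: '^Dracula' matches only when the name is at the start of the line, and 'Harker$' matches when the name is at the end, just before the trailing newline.

```python
import re

# Made-up lines for illustration.
lines = [
    'Dracula rose.\n',
    'Count Dracula rose.\n',
    'I wrote to Mrs. Harker\n',
    'Harker wrote to me.\n',
]

starts = [line for line in lines if re.search('^Dracula', line)]
ends = [line for line in lines if re.search('Harker$', line)]

print(starts)
print(ends)
```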
### EXERCISE: Download and Clean Text
# Difficulty: Intermediate
# 1. Use raw_path and clean_path to print whether each file exists
# 2. If clean_path does not exist, run clean_file(raw_path, clean_path)
# 3. Print the size (in bytes) of clean_path
### Your code starts here:



### Your code ends here.


# Solution
print(raw_path.exists(), clean_path.exists())
if not clean_path.exists():
    clean_file(raw_path, clean_path)
print(clean_path.stat().st_size)
True True
855112

9.2.6.2. String substitution

Bram Stoker was born in Ireland, and when Dracula was published in 1897, he was living in England. So we would expect him to use the British spelling of words like “centre” and “colour”. To check, we can use the following pattern, which matches either “centre” or the American spelling “center”.

pattern = 'cent(er|re)'

In this pattern, the parentheses enclose the part of the pattern the vertical bar applies to. So this pattern matches a sequence that starts with 'cent' and ends with either 'er' or 're'.

result = find_first(pattern)
result.string
'horseshoe of the Carpathians, as if it were the centre of some sort of\n'

As expected, he used the British spelling.
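
To see why the parentheses matter, compare the pattern with and without them. Without parentheses, the vertical bar splits the whole pattern, so 'center|re' means "either 'center' or 're'". The test words below are chosen only for illustration.

```python
import re

word = 'there'

# Without parentheses: matches 'center' OR 're', and 'there' contains 're'.
without_group = re.search('center|re', word)

# With parentheses: must start with 'cent', so 'there' does not match.
with_group = re.search('cent(er|re)', word)

print(bool(without_group), bool(with_group))

# Both spellings are still found by the grouped pattern.
print(re.search('cent(er|re)', 'centre').group())
print(re.search('cent(er|re)', 'center').group())
```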

We can also check whether he used the British spelling of “colour”. The following pattern uses the special character '?', which means that the previous character is optional.

pattern = 'colou?r'

This pattern matches either “colour” with the 'u' or “color” without it.

result = find_first(pattern)
line = result.string
line
'undergarment with long double apron, front, and back, of coloured stuff\n'

Again, as expected, he used the British spelling.

Now suppose we want to produce an edition of the book with American spellings. We can use the sub function in the re module, which does string substitution.

re.sub(pattern, 'color', line)
'undergarment with long double apron, front, and back, of colored stuff\n'

The first argument is the pattern we want to find and replace, the second is what we want to replace it with, and the third is the string we want to search. In the result, you can see that “colour” has been replaced with “color”.
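
By default, re.sub replaces every match in the string, not just the first. The optional count argument limits the number of replacements. Here is a short sketch using a made-up sentence.

```python
import re

# Made-up sentence with two occurrences of the British spelling.
text = 'The colour of the sky and the colour of the sea.'

replace_all = re.sub('colou?r', 'color', text)
replace_first = re.sub('colou?r', 'color', text, count=1)

print(replace_all)
print(replace_first)
```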

# I used this function to search for lines to use as examples

def all_matches(pattern, path=clean_path):
    with open(path, encoding='utf8') as f:
        for line in f:
            result = re.search(pattern, line)
            if result:
                print(line.strip())
### e.g., 

all_matches('weather')
weather. As I stood, the driver jumped again into his seat and shook the
weatherworn, was still complete; but it was evidently many a day since
it is a buoy with a bell, which swings in bad weather, and sends in a
am awakened by her moving about the room. Fortunately, the weather is so
learn the weather signs. To-day is a grey day, and the sun as I write is
experienced here, with results both strange and unique. The weather had
kept watch on weather signs from the East Cliff, foretold in an emphatic
_22 July_.--Rough weather last three days, and all hands busy with
weather. Passed Gibralter and out through Straits. All well.
and entering on the Bay of Biscay with wild weather ahead, and yet last
weather influences as we know that the Count can bring to bear; and if
that I am fully armed as there may be wolves; the weather is getting
# Here's the pattern I used (which uses some features we haven't seen)

# names = r'(?<!\.\s)[A-Z][a-zA-Z]+'

# all_matches(names)
### EXERCISE: String Substitution
# Difficulty: Intermediate
sample = "The colour of the city centre changed overnight."
# 1. Replace British spellings with American spellings using regex:
#    colour -> color, centre -> center
# 2. Print the transformed sentence
### Your code starts here:



### Your code ends here.


# Solution
import re

sample = "The colour of the city centre changed overnight."
sample = re.sub(r'colou?r', 'color', sample)
sample = re.sub(r'cent(er|re)', 'center', sample)
print(sample)
The color of the city center changed overnight.

9.2.6.3. re.fullmatch() for Validation

re.fullmatch(pattern, text) succeeds only if the entire string matches the pattern. This is the right tool for validation tasks (IDs, simple emails, phone formats, etc.).

employee_id_pattern = r'EMP-\d{4}'
ids = ['EMP-0001', 'EMP-12', 'AEMP-0001', 'EMP-12345']

for emp_id in ids:
    print(emp_id, bool(re.fullmatch(employee_id_pattern, emp_id)))
EMP-0001 True
EMP-12 False
AEMP-0001 False
EMP-12345 False
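
For comparison, this sketch runs the same pattern through re.search, re.match, and re.fullmatch. re.search accepts a match anywhere in the string and re.match accepts a match at the start, so both let some invalid IDs through. The results dictionary is only a convenience for this example.

```python
import re

pattern = r'EMP-\d{4}'

# Map each ID to (search, match, fullmatch) results.
results = {}
for text in ['EMP-0001', 'AEMP-0001', 'EMP-12345']:
    results[text] = (
        bool(re.search(pattern, text)),
        bool(re.match(pattern, text)),
        bool(re.fullmatch(pattern, text)),
    )
    print(text, results[text])
```

Only re.fullmatch rejects both 'AEMP-0001' (extra character at the start) and 'EMP-12345' (extra digit at the end).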
### EXERCISE: Full String Validation
# Difficulty: Intermediate
codes = ['CS-101', 'MATH-240', 'CS101', 'EE-7']
# A valid course code must be: 2-4 uppercase letters, a dash, then 3 digits.
# 1. Write the regex pattern
# 2. Print each code with True/False using re.fullmatch
### Your code starts here:



### Your code ends here.


# Solution
codes = ['CS-101', 'MATH-240', 'CS101', 'EE-7']
pattern = r'[A-Z]{2,4}-\d{3}'
for code_str in codes:
    print(code_str, bool(re.fullmatch(pattern, code_str)))
CS-101 True
MATH-240 True
CS101 False
EE-7 False

9.2.6.4. Quick Reference

Characters

| Pattern | Meaning | Example match |
|---|---|---|
| . | Any character (except newline) | c.t matches cat, cot |
| \d | Digit | 7 |
| \w | Word character (letter, digit, underscore) | A, x, 9, _ |
| \s | Whitespace (space, tab, newline) | |
| [abc] | Character class: any one of a, b, c | a |
| [^abc] | Negated class: any character except a, b, c | d |

Quantifiers

| Pattern | Meaning | Example |
|---|---|---|
| * | 0 or more | ab* matches a, ab, abb |
| + | 1 or more | ab+ matches ab, abb |
| ? | 0 or 1 (optional) | colou?r matches color, colour |
| {n} | Exactly n | \d{3} matches 123 |
| {n,m} | Between n and m | \d{2,4} matches 12, 123 |
| *? +? | Lazy (match as little as possible) | <.+?> |

Anchors

| Pattern | Meaning |
|---|---|
| ^ | Start of string (or line with re.M) |
| $ | End of string (or line) |
| \b | Word boundary |

Groups

| Syntax | Meaning |
|---|---|
| (...) | Capturing group |
| (?:...) | Non-capturing group |
| (?P<name>...) | Named group |
| (?=...) | Lookahead |
| (?<=...) | Lookbehind |
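
As a brief illustration of the group syntax, the sketch below uses named groups to pull the parts out of a course code and a lookahead to match a number only when a particular word follows it. The group names and sample strings are made up for this example.

```python
import re

# Named groups: each part of the match can be retrieved by name.
m = re.search(r'(?P<prefix>[A-Z]{2,4})-(?P<number>\d{3})',
              'Enroll in CS-101 today')
print(m.group(0), m.group('prefix'), m.group('number'))

# Lookahead: match digits only if ' dollars' follows; the lookahead
# text itself is not part of the match.
amounts = re.findall(r'\d+(?= dollars)', 'costs 30 dollars or 25 euros')
print(amounts)
```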

Flags

| Flag | Shorthand | Meaning |
|---|---|---|
| re.IGNORECASE | re.I | Case-insensitive matching |
| re.MULTILINE | re.M | ^/$ match line start/end |
| re.DOTALL | re.S | . matches newline too |
| re.VERBOSE | re.X | Allow comments/whitespace in pattern |
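
Flags are passed as an extra argument and can be combined with the | operator. The following sketch, with made-up text, shows re.IGNORECASE and re.MULTILINE together, and re.VERBOSE used to annotate a pattern.

```python
import re

text = 'dracula arrived.\nDracula left.\n'

# Without flags, ^ matches only at the start of the whole string
# and the match is case-sensitive.
plain = re.findall('^dracula', text)

# With both flags, ^ matches at the start of each line, ignoring case.
flagged = re.findall('^dracula', text, flags=re.IGNORECASE | re.MULTILINE)
print(plain, flagged)

# re.VERBOSE ignores whitespace in the pattern and allows comments.
course_code = re.compile(r'''
    [A-Z]{2,4}   # department: 2 to 4 uppercase letters
    -            # a literal dash
    \d{3}        # course number: exactly 3 digits
''', re.VERBOSE)
is_valid = bool(course_code.fullmatch('MATH-240'))
print(is_valid)
```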