9.2. Regex

import sys
from pathlib import Path

current = Path.cwd()
for parent in [current, *current.parents]:
    if (parent / '_config.yml').exists():
        project_root = parent  # ← Add project root, not chapters
        break
else:
    project_root = Path.cwd().parent.parent

sys.path.insert(0, str(project_root))

from shared import thinkpython, diagram, jupyturtle
from shared.download import download

# Register as top-level modules so direct imports work in subsequent cells
sys.modules['thinkpython'] = thinkpython
sys.modules['diagram'] = diagram
sys.modules['jupyturtle'] = jupyturtle

String methods let you search and manipulate text, but only to a limited extent. A regular expression (regex) is a sequence of characters that defines a search pattern, which lets you search, match, and manipulate text in ways that go beyond simple string methods. For example:

Task                                    Use
--------------------------------------  ----------------
Check if string starts with 'http'      str.startswith()
Replace all spaces with underscores     str.replace()
Extract all email addresses from text   regex
Validate a phone number format          regex
Find words matching a complex pattern   regex

The rule of thumb: if the pattern is fixed and simple, use string methods. If the pattern is variable or complex, use regex.
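As a rough illustration of this rule of thumb, the sketch below handles the first two tasks from the table with string methods and the email task with a regex. The sample strings are made up, and the email pattern is a deliberately simplified assumption, not a complete validator.

```python
import re

# Hypothetical sample data for illustration only.
url = "http://example.com/a page"
text = "Contact mina@example.com or jonathan.harker@example.org today."

# Fixed, simple patterns: string methods are enough.
starts_with_http = url.startswith("http")
underscored = url.replace(" ", "_")

# Variable pattern: a simplified email regex (an assumption, not a full validator).
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)

print(starts_with_http)  # True
print(underscored)       # http://example.com/a_page
print(emails)            # ['mina@example.com', 'jonathan.harker@example.org']
```

The email addresses differ in length and content, so no single fixed substring could find them all; that is the situation where a regex pays off.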

For example, to search for a fixed pattern in a text, we can use the find() string method, which returns the index of the first occurrence, or -1 if the pattern is not found.

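# Adjacent string literals are joined, so no stray indentation ends up in the text
text = ("I am Dracula; and I bid you welcome, Mr. Harker, "
        "to my house.")
pattern = 'Dracula'
text.find(pattern)  # index of the first match, or -1 if absent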
5
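For comparison, here is a minimal sketch of the same search done with re.search() from Python's re module, alongside the not-found case for both approaches. The pattern r"Mr\. \w+" is an illustrative assumption showing a search that find() cannot express.

```python
import re

text = "I am Dracula; and I bid you welcome, Mr. Harker, to my house."

# str.find() returns an index, or -1 when the substring is absent.
found_at = text.find("Dracula")
missing_at = text.find("Van Helsing")
print(found_at, missing_at)  # 5 -1

# re.search() returns a match object, or None when nothing matches.
match = re.search(r"Dracula", text)
print(match.start(), match.end())  # 5 12
print(re.search(r"Van Helsing", text))  # None

# A variable pattern: "Mr." followed by any word.
title = re.search(r"Mr\. \w+", text)
print(title.group())  # Mr. Harker
```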

9.2.1. Escape Sequences and Raw Strings

Before using the functions in Python's re (regular expression) module, we need to understand regex escapes and raw strings.

Regex patterns use backslashes heavily. In regex, the backslash \ introduces escape sequences, which create special patterns or allow metacharacters to be treated as literal characters.

Common escape sequences include:

Pattern   Meaning                                        Example match
-------   --------------------------------------------   ----------------------------------------
\d        digit                                          7
\w        word character (letters, digits, underscore)   A, x, 9, _ ([a-zA-Z0-9_] for ASCII text)
\s        whitespace character                           space, tab, newline
\.        literal dot                                    .
\$        literal dollar sign                            $
\\        literal backslash                              \
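The escape sequences in the table can be tried out with re.findall(), which returns every non-overlapping match. This is a small sketch on a made-up sample string; the patterns are written as raw strings (r'...'), which are covered in the next subsection.

```python
import re

sample = "Lot 7: $3.50 total"  # hypothetical sample text

digits = re.findall(r"\d", sample)    # each single digit
words = re.findall(r"\w+", sample)    # runs of word characters
spaces = re.findall(r"\s", sample)    # each whitespace character
dots = re.findall(r"\.", sample)      # literal dots only
dollars = re.findall(r"\$", sample)   # literal dollar signs only
slashes = re.findall(r"\\", r"C:\temp")  # literal backslashes

print(digits)       # ['7', '3', '5', '0']
print(words)        # ['Lot', '7', '3', '50', 'total']
print(len(spaces))  # 3
print(dots)         # ['.']
print(dollars)      # ['$']
print(len(slashes)) # 1
```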

9.2.2. Raw Strings

A raw string is a string literal prefixed with r, which tells Python to treat each backslash \ as a literal character rather than as the start of an escape sequence.

  • Prefix a pattern with r to prevent Python from interpreting backslashes before the regex engine sees the pattern; for example, r'\.' reaches the regex engine as \. (a literal dot).

  • Raw strings avoid double escaping and make patterns easier to read.

  • Without raw strings, you often need extra backslashes, like '\\d+'.

regular = "\n"   # newline character
raw      = r"\n" # literally backslash + n (two characters)

print(regular)  # prints a newline
print(raw)      # prints \n
\n
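The points about double escaping can be checked directly. The sketch below compares a raw pattern with its double-escaped equivalent, and uses \b as an example of what goes wrong without the r prefix: in a normal string Python turns \b into a backspace character, while the regex engine expects \b to mean a word boundary. The sample text is made up for illustration.

```python
import re

# The raw string and the double-escaped normal string are the same pattern.
same_pattern = r"\d+" == "\\d+"
print(same_pattern)  # True

# "\b" is one character (backspace); r"\b" is two (backslash + b).
print(len("\b"), len(r"\b"))  # 1 2

text = "cat concatenate cat"  # hypothetical sample text

# Raw string: \b reaches the regex engine as a word boundary.
with_raw = re.findall(r"\bcat\b", text)
# Normal string: the pattern contains backspace characters, so nothing matches.
without_raw = re.findall("\bcat\b", text)

print(with_raw)     # ['cat', 'cat']
print(without_raw)  # []
```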

Use a backslash escape to match regex metacharacters literally as well (for example \., \$, \?).
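A short sketch of why these escapes matter: an unescaped dot matches any character, while \. matches only a dot. The sample strings are invented for illustration. The last lines use re.escape(), a standard-library helper that adds the backslashes for you when the literal text comes from a variable.

```python
import re

text = "Is it $5.00? No, it is $12.50."  # hypothetical sample text

# Escaped metacharacters match themselves.
prices = re.findall(r"\$\d+\.\d+", text)
questions = re.findall(r"\?", text)
print(prices)     # ['$5.00', '$12.50']
print(questions)  # ['?']

# An unescaped dot matches any character except a newline.
loose = re.findall(r"5.0", "5.0 5x0")
strict = re.findall(r"5\.0", "5.0 5x0")
print(loose)   # ['5.0', '5x0']
print(strict)  # ['5.0']

# re.escape() escapes the metacharacters in a literal string.
literal = "$5.00?"
escaped_match = re.search(re.escape(literal), text)
print(escaped_match.group())  # $5.00?
```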

print('\\\\')           # normal string: prints two backslashes
print(r'\\')            # raw string: also two backslashes (same result)
print(r'\\\n')          # raw string: three backslashes + n (four characters)
print(r'\\\n' == '\\\\n')   # False: the raw string is three backslashes + n; the normal string is two backslashes + n
\\
\\
\\\n
False

A raw string tells Python to treat each backslash \ as an ordinary character rather than as the start of an escape sequence, so the pattern reaches the regex engine unchanged.