The re Module

9.2.3. The re Module

Python’s built-in re module provides regex support. The 6 most commonly used regex functions are:

| Function | Description | Sample Syntax | Return |
|---|---|---|---|
| re.search() | Find first match anywhere in the string | re.search(pattern, text) | Match object or None |
| re.match() | Match only at the start of the string | re.match(pattern, text) | Match object or None |
| re.findall() | Find all matches; return as a list | re.findall(pattern, text) | list of strings |
| re.sub() | Find and replace matches | re.sub(pattern, repl, text) | str |
| re.split() | Split string on a pattern | re.split(pattern, text) | list of strings |
| re.fullmatch() | Match the entire string against the pattern | re.fullmatch(pattern, text) | Match object or None |
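The three functions that return a Match object differ only in where the match is allowed to sit. The sketch below contrasts them on one string; the sample text and the variable names are chosen here for illustration and are not part of the original notes.

```python
import re

text = "Count Dracula"

# search() scans the whole string, so it finds 'Dracula' even though
# the string starts with 'Count'.
found_anywhere = re.search("Dracula", text)

# match() only tries at position 0.
at_start_miss = re.match("Dracula", text)   # no match at the start
at_start_hit = re.match("Count", text)

# fullmatch() requires the pattern to account for the entire string.
whole_miss = re.fullmatch("Count", text)    # 'Count' covers only part of it
whole_hit = re.fullmatch(r"Count \w+", text)

print(found_anywhere.span())   # (6, 13)
print(at_start_miss)           # None
print(at_start_hit.group())    # Count
print(whole_miss)              # None
print(whole_hit.group())       # Count Dracula
```

A rough rule of thumb: use search() to locate something inside a longer text, and fullmatch() to validate that a whole string has a given shape.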

import sys
from pathlib import Path

# Find project root by looking for _config.yml
current = Path.cwd()
for parent in [current, *current.parents]:
    if (parent / '_config.yml').exists():
        project_root = parent
        break
else:
    project_root = Path.cwd().parent.parent

# Add project root to path
sys.path.insert(0, str(project_root))

# Import shared teaching helpers and cell magics
from shared import thinkpython, diagram, jupyturtle, structshape
from shared.download import download

9.2.3.1. The Match object

The re.search(), re.match(), and re.fullmatch() functions return a Match object when the pattern matches, and None when it does not.

For example, re.search(pattern, text) scans through text and returns:

  • a Match object for the first location where pattern is found, or

  • None if the pattern is not found anywhere in the string.

A Match object has the following commonly used attributes and methods:

| Attribute / Method | Description | Example |
|---|---|---|
| .group() | Returns the matched substring | m.group() → 'Dracula' |
| .start() | Index where the match begins | m.start() → 5 |
| .end() | Index just past the last matched character | m.end() → 12 |
| .span() | Tuple of (start, end) | m.span() → (5, 12) |
| .string | The original string that was searched | m.string → 'I am Dracula...' |

import re

text = "I am Dracula; and I bid you welcome, Mr. Harker, to my house."
pattern = 'Dracula'

result = re.search(pattern, text)     ### pattern: Dracula; text: the line
result                              ### the Match object
<re.Match object; span=(5, 12), match='Dracula'>

If the pattern appears in the text, search returns a Match object that contains the results of the search.

  1. String: Among other information, it has an attribute named string that contains the text that was searched.

  2. Group: It also provides a method called group that returns the part of the text that matched the pattern.

  3. Span and Start/End: It provides a method called span that returns a tuple (start, end) of the indices where the match begins and ends. The start and end methods return the two values separately; end is the index just past the last matched character, so text[start:end] is the matched substring.

print(result.string)
print(result.group())
print(result.start())
print(result.end())
print(result.span())
I am Dracula; and I bid you welcome, Mr. Harker, to my house.
Dracula
5
12
(5, 12)

Note

.group() returns the matched substring from the text — the portion of the text that the pattern matched against. In simple cases like re.search('Dracula', text), the match equals the pattern string. But with a regex like r'\$[\d.]+', .group() would return something like '$42.99' — the actual text that matched, not the pattern expression itself.

If the pattern doesn’t appear in the text, the return value from search is None. So we can check whether the search was successful by checking whether the result is None.

result = re.search('Count', text)
print(result)

result is None
None
True
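Because a failed search returns None, calling .group() directly on the result raises AttributeError when nothing matched. One common way to guard against that is sketched below; the helper name first_match_text is hypothetical and introduced only for this example.

```python
import re

def first_match_text(pattern, text):
    """Return the first matched substring, or None when nothing matches."""
    m = re.search(pattern, text)
    if m is None:          # no match: there is no Match object to query
        return None
    return m.group()

text = "I am Dracula; and I bid you welcome, Mr. Harker, to my house."
print(first_match_text("Dracula", text))   # Dracula
print(first_match_text("Count", text))     # None
```

A Match object is always truthy, so the shorter test `if m:` behaves the same way as `if m is not None:` here.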
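re.findall() returns every non-overlapping match as a list of strings. Flags such as re.IGNORECASE and re.DOTALL can be passed as an optional third argument:

s = "This is a test of the regular expression system."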
print(re.findall('is', s))  # ['is', 'is']
print(re.findall('is.', s)) # ['is ', 'is ']    ### 'is' followed by any character (space in this case)
print(re.findall('is.?', s)) # ['is ', 'is ']   ### 'is' followed by zero or one character (space in this case)
print(re.findall('is.?', s, re.IGNORECASE)) # ['is ', 'is '] ### same as above, but case-insensitive   
print(re.findall('is.?', s, re.IGNORECASE | re.DOTALL)) # ['is ', 'is ']    ### same as above, but also makes '.' match newline characters (not relevant in this case since there are no newlines)
['is', 'is']
['is ', 'is ']
['is ', 'is ']
['is ', 'is ']
['is ', 'is ']
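In the example above the flags do not change the output, because the text contains no uppercase 'IS' and no newline. The sketch below uses a hypothetical string, chosen only for illustration, in which each flag makes a visible difference.

```python
import re

# Hypothetical sample text: lowercase 'is' sits right before a newline,
# and an uppercase 'IS' appears on the second line.
s = "This\nIS a test"

no_flags = re.findall('is.?', s)
ignore_case = re.findall('is.?', s, re.IGNORECASE)
both_flags = re.findall('is.?', s, re.IGNORECASE | re.DOTALL)

print(no_flags)      # ['is']          '.' does not match the newline
print(ignore_case)   # ['is', 'IS ']   'IS' now matches as well
print(both_flags)    # ['is\n', 'IS '] '.' now also matches the newline
```

Flags are combined with the bitwise OR operator `|`, as in the last call.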

The + in the pattern below means one or more occurrences of the preceding element, here the character class [\d.] (a digit or a literal dot).

import re

text = "The price is $42.99 and $7.50"

# findall — get all matches
print(re.findall(r"\$[\d.]+", text))         # ['$42.99', '$7.50']

# search — first match object
m = re.search(r"\$[\d.]+", text)
print(m.group())                              # '$42.99'
print(m.start(), m.end())                     # position in string

# sub — replace
print(re.sub(r"\$[\d.]+", "PRICE", text))    # 'The price is PRICE and PRICE'

# split
print(re.split(r"\s+", "one  two   three"))  # ['one', 'two', 'three']
['$42.99', '$7.50']
$42.99
13 19
The price is PRICE and PRICE
['one', 'two', 'three']
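To see what the + contributes in the price pattern, it can help to run the same character class with no quantifier and with * instead. This is a small sketch; the second string, bare, is a hypothetical input added only to separate the behavior of + from that of *.

```python
import re

text = "The price is $42.99 and $7.50"

# No quantifier: the class [\d.] matches exactly one character after '$'.
one_char = re.findall(r"\$[\d.]", text)
# '+' requires at least one character from the class and takes as many as it can.
one_or_more = re.findall(r"\$[\d.]+", text)

# Hypothetical second string with a bare '$' to separate '+' from '*'.
bare = "Cost: $ or $5"
plus_on_bare = re.findall(r"\$[\d.]+", bare)   # the bare '$' is skipped
star_on_bare = re.findall(r"\$[\d.]*", bare)   # '*' also accepts zero characters

print(one_char)       # ['$4', '$7']
print(one_or_more)    # ['$42.99', '$7.50']
print(plus_on_bare)   # ['$5']
print(star_on_bare)   # ['$', '$5']
```

Note that [\d.]+ accepts any run of digits and dots, so it does not check that the result is a well-formed price.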
### EXERCISE: Regex Escape Sequences
# Difficulty: Basic
s = "Price: $19.95, code=A_7, spaces here"
# 1. Extract all digit sequences
# 2. Extract all word tokens
# 3. Extract literal '$' and literal '.' matches
### Your code starts here:



### Your code ends here.

import re

# Solution
s = "Price: $19.95, code=A_7, spaces here"
print(re.findall(r'\d+', s))
print(re.findall(r'\w+', s))
print(re.findall(r'\$', s))
print(re.findall(r'\.', s))
['19', '95', '7']
['Price', '19', '95', 'code', 'A_7', 'spaces', 'here']
['$']
['.']
### EXERCISE: The Match Object
# Difficulty: Basic
import re
text = "Customer ID: 4892, Order date: 2024-03-15"
# 1. Use re.search() to find the first 4-digit number in text
# 2. Print the matched string, start index, end index, and span
### Your code starts here:



### Your code ends here.

# Solution
import re
text = "Customer ID: 4892, Order date: 2024-03-15"
m = re.search(r'\d{4}', text)
print(m.group())
print(m.start())
print(m.end())
print(m.span())
4892
13
17
(13, 17)