9.2.3. The re Module
Python’s built-in re module provides regex support. The 6 most commonly used regex functions are:
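| Function | Description | Sample Syntax | Return |
|---|---|---|---|
| re.search() | Find first match anywhere in the string | re.search(pattern, text) | Match object or None |
| re.match() | Match only at the start of the string | re.match(pattern, text) | Match object or None |
| re.findall() | Find all matches; return as a list | re.findall(pattern, text) | List of matches |
| re.sub() | Find and replace matches | re.sub(pattern, replacement, text) | New string |
| re.split() | Split string on a pattern | re.split(pattern, text) | List of strings |
| re.fullmatch() | Match the entire string against the pattern | re.fullmatch(pattern, text) | Match object or None |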
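The three functions that return a Match object differ only in where the pattern is allowed to match. The following is a minimal sketch of that difference; the sample string and pattern are made up for illustration.

```python
import re

text = "order 42 shipped"   # hypothetical sample string
pattern = r"\d+"            # one or more digits

# search: looks for the pattern anywhere in the string
print(re.search(pattern, text))      # a Match object for '42'

# match: succeeds only if the pattern matches starting at index 0
print(re.match(pattern, text))       # None, because the text starts with 'order'

# fullmatch: succeeds only if the whole string matches the pattern
print(re.fullmatch(pattern, text))   # None
print(re.fullmatch(pattern, "42"))   # a Match object for '42'
```

In practice, search is the usual choice for finding something inside a longer text, while fullmatch suits validating that an entire input has the expected form.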
import sys
from pathlib import Path

# Find project root by looking for _config.yml
current = Path.cwd()
for parent in [current, *current.parents]:
    if (parent / '_config.yml').exists():
        project_root = parent
        break
else:
    # No _config.yml found in any parent: fall back to two levels up
    project_root = Path.cwd().parent.parent

# Add project root to path
sys.path.insert(0, str(project_root))

# Import shared teaching helpers and cell magics
from shared import thinkpython, diagram, jupyturtle, structshape
from shared.download import download
9.2.3.1. The Match object
The re.search(), re.match(), and re.fullmatch() functions return a Match object when the pattern matches.
For example, re.search(pattern, text) scans through text and returns a Match object for the first location where pattern is found. If the pattern is not found anywhere in the string, it returns None.
Match: A Match object has the following commonly used attributes and methods:
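| Attribute / Method | Description | Example |
|---|---|---|
| .group() | Returns the matched substring | result.group() |
| .start() | Index where the match begins | result.start() |
| .end() | Index where the match ends (exclusive) | result.end() |
| .span() | Tuple of (start, end) | result.span() |
| .string | The original string that was searched | result.string |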
import re
text = "I am Dracula; and I bid you welcome, Mr. Harker, to my house."
pattern = 'Dracula'
result = re.search(pattern, text) ### pattern: Dracula; text: the line
result ### the Match object
<re.Match object; span=(5, 12), match='Dracula'>
If the pattern appears in the text, search returns a Match object that contains the results of the search.
String: Among other information, it has an attribute named string that contains the text that was searched.
Group: It also provides a method called group that returns the part of the text that matched the pattern.
Span and Start/End: It provides a method called span that returns the indices in the text where the match starts and ends. The start and end methods return each index on its own.
print(result.string)
print(result.group())
print(result.start())
print(result.end())
print(result.span())
I am Dracula; and I bid you welcome, Mr. Harker, to my house.
Dracula
5
12
(5, 12)
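The end index reported by span is exclusive, so the two numbers can be used directly as a slice of the original text. A small sketch reusing the same sentence:

```python
import re

text = "I am Dracula; and I bid you welcome, Mr. Harker, to my house."
result = re.search("Dracula", text)

# span() gives the same two values as start() and end()
start, end = result.span()

# Slicing the text with those indices reproduces the matched substring
print(text[start:end])                     # Dracula
print(text[start:end] == result.group())   # True
```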
Note
.group() returns the matched substring from the text, that is, the portion of the text that the pattern matched. In simple cases like re.search('Dracula', text), the match equals the pattern string. But with a regex like r'\$[\d.]+', .group() would return something like '$42.99': the actual text that matched, not the pattern expression itself.
If the pattern doesn’t appear in the text, the return value from search is None. So we can check whether the search was successful by checking whether the result is None.
result = re.search('Count', text)
print(result)
result is None
None
True
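Because None has no group method, calling result.group() on a failed search raises an AttributeError. One way to guard against that is sketched below; first_match is a hypothetical helper written for this illustration, not part of the re module.

```python
import re

text = "I am Dracula; and I bid you welcome, Mr. Harker, to my house."

def first_match(pattern, text, default=None):
    """Return the matched substring, or `default` when nothing matches.

    Hypothetical helper for illustration; not part of the re module.
    """
    result = re.search(pattern, text)
    if result is None:
        # No match: avoid calling .group() on None
        return default
    return result.group()

print(first_match("Dracula", text))             # Dracula
print(first_match("Count", text, "no match"))   # no match
```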
s = "This is a test of the regular expression system."
print(re.findall('is', s)) # ['is', 'is']
print(re.findall('is.', s)) # ['is ', 'is '] ### 'is' followed by any character (space in this case)
print(re.findall('is.?', s)) # ['is ', 'is '] ### 'is' followed by zero or one character (space in this case)
print(re.findall('is.?', s, re.IGNORECASE)) # ['is ', 'is '] ### same as above, but case-insensitive
print(re.findall('is.?', s, re.IGNORECASE | re.DOTALL)) # ['is ', 'is '] ### same as above, but also makes '.' match newline characters (not relevant in this case since there are no newlines)
['is', 'is']
['is ', 'is ']
['is ', 'is ']
['is ', 'is ']
['is ', 'is ']
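In the string above the flags make no visible difference, because it contains no uppercase 'IS' and no newline. The sketch below uses a made-up string that has both, so the effect of each flag shows up in the results.

```python
import re

s = "IS this\nisland"   # hypothetical sample with uppercase 'IS' and a newline

# No flags: 'IS' is skipped, and '.' cannot match the newline after 'this'
print(re.findall('is.?', s))                              # ['is', 'isl']

# IGNORECASE: the uppercase 'IS' now matches too
print(re.findall('is.?', s, re.IGNORECASE))               # ['IS ', 'is', 'isl']

# IGNORECASE | DOTALL: '.' may also match the newline character
print(re.findall('is.?', s, re.IGNORECASE | re.DOTALL))   # ['IS ', 'is\n', 'isl']
```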
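The + in a pattern means one or more occurrences of the preceding element. In the example below, [\d.]+ matches a run of one or more digits or periods.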
import re
text = "The price is $42.99 and $7.50"
# findall — get all matches
print(re.findall(r"\$[\d.]+", text)) # ['$42.99', '$7.50']
# search — first match object
m = re.search(r"\$[\d.]+", text)
print(m.group()) # '$42.99'
print(m.start(), m.end()) # position in string
# sub — replace
print(re.sub(r"\$[\d.]+", "PRICE", text)) # 'The price is PRICE and PRICE'
# split
print(re.split(r"\s+", "one two three")) # ['one', 'two', 'three']
['$42.99', '$7.50']
$42.99
13 19
The price is PRICE and PRICE
['one', 'two', 'three']
### EXERCISE: Regex Escape Sequences
# Difficulty: Basic
s = "Price: $19.95, code=A_7, spaces here"
# 1. Extract all digit sequences
# 2. Extract all word tokens
# 3. Extract literal '$' and literal '.' matches
### Your code starts here:
### Your code ends here.
['19', '95', '7']
['Price', '19', '95', 'code', 'A_7', 'spaces', 'here']
['$']
['.']
### EXERCISE: The Match Object
# Difficulty: Basic
import re
text = "Customer ID: 4892, Order date: 2024-03-15"
# 1. Use re.search() to find the first 4-digit number in text
# 2. Print the matched string, start index, end index, and span
### Your code starts here:
### Your code ends here.
4892
13
17
(13, 17)