9.2.4. Metacharacters
Metacharacters are characters that carry special meaning inside a regex pattern: instead of matching themselves literally, they instruct the regex engine to do something specific, such as match any character, mark a boundary, or repeat a pattern. Python's re module has 14 of them: . ^ $ * + ? { } [ ] \ | ( ). To use one of them as a regular character, you need to escape it.
| Type | Character | Meaning | Example |
|---|---|---|---|
| Wildcard | . | Matches any character except newline | |
| Anchor | ^ | Start of string | |
| Anchor | $ | End of string | |
| Quantifier | * | 0 or more repetitions | |
| Quantifier | + | 1 or more repetitions | |
| Quantifier | ? | Optional (0 or 1) / makes quantifier lazy | |
| Quantifier | { } | Specific repetition range | |
| Character Class Delimiters | [ ] | Defines a set of allowed characters | |
| Grouping Delimiters | ( ) | Groups patterns and captures matches | |
| Escape | \ | Escapes metacharacters or forms special sequences | |
| Alternation | \| | Logical OR between patterns | |
import sys
from pathlib import Path
# Find project root by looking for _config.yml
current = Path.cwd()
for parent in [current, *current.parents]:
    if (parent / '_config.yml').exists():
        project_root = parent
        break
else:
    project_root = Path.cwd().parent.parent
# Add project root to path
sys.path.insert(0, str(project_root))
# Import shared teaching helpers and cell magics
from shared import thinkpython, diagram, jupyturtle, structshape
from shared.download import download
import re
To match a metacharacter literally, you must escape it with a backslash (for example, \. matches a period). Now let us look at the metacharacters in groups.
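As a minimal sketch of what escaping changes, the example below searches for a literal price. The sample string and variable names are made up for illustration; re.escape() is the standard-library helper that adds the backslashes for you.

```python
import re

price_text = "Total: $3.50 (approx.)"

# Unescaped, "$" anchors the end of the string and "." matches any character,
# so this pattern cannot find the literal text "$3.50".
unescaped = re.findall(r"$3.50", price_text)

# Escaping each metacharacter with a backslash makes it match literally.
escaped = re.findall(r"\$3\.50", price_text)

# re.escape() builds the escaped pattern from a plain string.
auto_escaped = re.findall(re.escape("$3.50"), price_text)

print(unescaped)      # []
print(escaped)        # ['$3.50']
print(auto_escaped)   # ['$3.50']
```

re.escape() is mostly useful when the literal text comes from a variable or user input and you cannot escape it by hand.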
9.2.4.1. Quantifiers
Quantifiers tell the regex engine how many times the preceding character, group, or character class should match.
| Quantifier | Meaning | Example | Matches |
|---|---|---|---|
| * | 0 or more | ab* | a, ab, abb, abbb |
| + | 1 or more | ab+ | ab, abb, abbb (not a) |
| ? | 0 or 1 | ab? | a or ab only |
| {n} | Exactly n | \d{3} | 123, 456 |
| {n,} | n or more | \d{2,} | 12, 123, 1234… |
| {n,m} | Between n and m | \d{2,4} | 12, 123, 1234 |
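A quantifier applies only to the item directly before it, so the same + behaves differently after a character, a group, or a character class. The sketch below uses a made-up sample string, and it borrows the non-capturing group (?:...) from the Groups & Capturing section so that findall() returns whole matches.

```python
import re

text = "ab abab abbb aabb"

# After a single character: "a" followed by one or more "b".
char_scope = re.findall(r"ab+", text)

# After a group: one or more repetitions of the whole unit "ab".
group_scope = re.findall(r"(?:ab)+", text)

# After a character class: one or more characters that are each "a" or "b".
class_scope = re.findall(r"[ab]+", text)

print(char_scope)    # ['ab', 'ab', 'ab', 'abbb', 'abb']
print(group_scope)   # ['ab', 'abab', 'ab', 'ab']
print(class_scope)   # ['ab', 'abab', 'abbb', 'aabb']
```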
By default quantifiers are greedy (match as much as possible). Add ? to make them lazy.
text = "<b>bold</b> and <i>italic</i>"
# Greedy — matches as much as possible
print(re.findall(r"<.+>", text)) # ['<b>bold</b> and <i>italic</i>']
# Lazy — matches as little as possible
print(re.findall(r"<.+?>", text)) # ['<b>', '</b>', '<i>', '</i>']
# Exact and ranged quantifiers
print(re.findall(r"\d{3}", "123 4567 89")) # ['123', '456']
print(re.findall(r"\d{2,4}", "1 12 123 1234")) # ['12', '123', '1234']
['<b>bold</b> and <i>italic</i>']
['<b>', '</b>', '<i>', '</i>']
['123', '456']
['12', '123', '1234']
In <.+>, the + is greedy. It matches as many characters as possible while still allowing the overall pattern to succeed. So it gobbles everything from the first < all the way to the last >.
Adding ? after a quantifier switches it to lazy mode — instead of matching as much as possible, it now matches as little as possible. So <.+?> still needs at least one character (that’s the +), but stops at the earliest > it can find.
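One way to see where each match stops is to look at the match span instead of the matched text. This is a small sketch on the same sample string, using re.search() so that only the first match is inspected.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

greedy = re.search(r"<.+>", text)   # runs to the last ">"
lazy = re.search(r"<.+?>", text)    # stops at the first ">"

print(greedy.span())                # (0, 29)
print(lazy.span())                  # (0, 3)
print(greedy.end() == len(text))    # True: the greedy match covers the whole string
```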
9.2.4.2. Greedy vs Non-greedy
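Quantifiers like * and + are greedy by default. Add ? to make them non-greedy (lazy). The example below first uses the re.IGNORECASE flag so that [a-z] also matches uppercase letters, then compares greedy and non-greedy matching on HTML tags.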
text_block = """Title: Notes
Email: ALICE@example.com
Email: bob@Example.org"""
# IGNORECASE
emails = re.findall(r'[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}', text_block, flags=re.IGNORECASE)
print(emails)
# Greedy vs non-greedy on tags
html = "<b>bold</b><i>italic</i>"
print(re.findall(r'<.*>', html)) # greedy
print(re.findall(r'<.*?>', html)) # non-greedy
['ALICE@example.com', 'bob@Example.org']
['<b>bold</b><i>italic</i>']
['<b>', '</b>', '<i>', '</i>']
### EXERCISE: Flags and Quantifiers
# Difficulty: Challenge
text_block = """Task: clean logs
ERROR: Disk full
error: retry failed
INFO: done"""
# 1. Extract all lines that start with 'error' (case-insensitive) using MULTILINE
# 2. From '<x>1</x><x>2</x>', extract tags non-greedily
### Your code starts here:
### Your code ends here.
# Solution
text_block = """Task: clean logs
ERROR: Disk full
error: retry failed
INFO: done"""
errs = re.findall(r'^error:.*$', text_block, flags=re.IGNORECASE | re.MULTILINE)
print(errs)
print(re.findall(r'<.*?>', '<x>1</x><x>2</x>'))
['ERROR: Disk full', 'error: retry failed']
['<x>', '</x>', '<x>', '</x>']
9.2.4.3. Anchors
Anchors don’t match characters — they match positions in the string.
| Anchor | Meaning |
|---|---|
| ^ | Start of string (or line with re.MULTILINE) |
| $ | End of string (or line) |
| \b | Word boundary |
| \B | Non-word boundary |
# ^ and $
print(re.findall(r"^\w+", "Hello world")) # ['Hello'] — only at start
print(re.findall(r"\w+$", "Hello world")) # ['world'] — only at end
# Word boundary \b
text = "cat catfish concatenate"
print(re.findall(r"\bcat\b", text)) # ['cat'] — whole word only
print(re.findall(r"cat", text)) # ['cat', 'cat', 'cat'] — anywhere
# Multiline
multi = "line1\nline2\nline3"
print(re.findall(r"^\w+", multi, re.MULTILINE)) # ['line1', 'line2', 'line3']
['Hello']
['world']
['cat']
['cat', 'cat', 'cat']
['line1', 'line2', 'line3']
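Because anchors match positions rather than characters, every anchor match is an empty string whose start and end are the same index. The sketch below lists those positions for \b on a short made-up string.

```python
import re

text = "cat catfish"

# Each \b match is zero-width: it sits between a word character and a
# non-word character, or at the edge of the string.
boundaries = [m.span() for m in re.finditer(r"\b", text)]
start_anchor = re.search(r"^", text).span()
end_anchor = re.search(r"$", text).span()

print(boundaries)     # [(0, 0), (3, 3), (4, 4), (11, 11)]
print(start_anchor)   # (0, 0)
print(end_anchor)     # (11, 11)
```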
9.2.4.4. Character Classes
Before writing larger patterns, it helps to know the core building blocks. Character classes match one character from a defined set. They’re written with square brackets [ ].
| Pattern | Matches |
|---|---|
| [aeiou] | any single lowercase vowel |
| [a-z] | any lowercase letter |
| [A-Z] | any uppercase letter |
| [0-9] | any digit |
| [a-zA-Z0-9] | any alphanumeric character |
| [^aeiou] | any character except a lowercase vowel (^ negates the set) |
Shorthand classes (work outside brackets too):
| Pattern | Meaning | Example Match |
|---|---|---|
| . | Any character (except newline) | |
| \d | digit | |
| \w | word char (letter/digit/underscore) | |
| \s | whitespace | space, tab |
Observe the escape sequence '\w'.
import re
s = "This is a regular expression."
print(re.findall(r'\w', s)) ### \w matches any alphanumeric character (letters, digits, and underscore)
print(re.findall(r'\w+', s)) ### + means "one or more occurrences of the preceding pattern"
print(re.findall(r'\w*', s)) ### * means "zero or more occurrences of the preceding pattern"
['T', 'h', 'i', 's', 'i', 's', 'a', 'r', 'e', 'g', 'u', 'l', 'a', 'r', 'e', 'x', 'p', 'r', 'e', 's', 's', 'i', 'o', 'n']
['This', 'is', 'a', 'regular', 'expression']
['This', '', 'is', '', 'a', '', 'regular', '', 'expression', '', '']
\s matches these whitespace characters:
| Character | Name |
|---|---|
| \n | newline |
| \t | tab |
| \r | carriage return |
| ' ' | space |
| \f | form feed |
| \v | vertical tab |
Use raw strings like r'\d+' for regex patterns so backslashes are interpreted correctly.
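To illustrate why the raw-string prefix matters, the sketch below writes the same pattern with and without r. In a normal Python string literal, \b is the backspace character, so the regex engine never sees a word boundary. The variable names are illustrative.

```python
import re

# Normal string literal: "\b" becomes a single backspace character.
plain = "\bcat\b"
# Raw string: the backslash is kept, so the regex engine sees \b (word boundary).
raw = r"\bcat\b"

plain_hits = re.findall(plain, "cat catfish")
raw_hits = re.findall(raw, "cat catfish")

print(len(plain), len(raw))   # 5 7
print(plain_hits)             # []
print(raw_hits)               # ['cat']
```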
import re
text = "Hello World 123! foo_bar"
print(re.findall(r"\d", text)) # individual digits
print(re.findall(r"\d+", text)) # consecutive digits
print(re.findall(r"\w+", text)) # words (incl. underscore)
print(re.findall(r"[A-Z][a-z]+", text)) # capitalized words
print(re.findall(r"[^a-zA-Z\s]+", text)) # non-alpha, non-space
['1', '2', '3']
['123']
['Hello', 'World', '123', 'foo_bar']
['Hello', 'World']
['123!', '_']
sample = "User_42 logged in at 09:30 on 2026-03-11"
print(re.findall(r'\d+', sample)) # all digit runs
print(re.findall(r'[A-Za-z_]+', sample)) # word-like alphabetic tokens
print(re.findall(r'\d{2}:\d{2}', sample)) # HH:MM time
print(re.findall(r'\d{4}-\d{2}-\d{2}', sample)) # YYYY-MM-DD date
['42', '09', '30', '2026', '03', '11']
['User_', 'logged', 'in', 'at', 'on']
['09:30']
['2026-03-11']
### EXERCISE: Regex Syntax Essentials
# Difficulty: Basic
s = "IDs: A12, B7, C999"
# 1. Extract all uppercase letters
# 2. Extract all digit sequences
# 3. Extract letter+digit tokens like A12, B7, C999
### Your code starts here:
### Your code ends here.
['I', 'D', 'A', 'B', 'C']
['12', '7', '999']
['A12', 'B7', 'C999']
9.2.4.5. Groups & Capturing
Parentheses () group part of a pattern into a single unit. A capturing group also saves the matched text so you can extract or reuse it afterward. Use non-capturing groups (?:...) when you need grouping for structure but don’t need to extract the text. Named groups (?P<name>...) let you refer to captured text by name instead of number.
| Syntax | Meaning |
|---|---|
| (...) | Capturing group |
| (?:...) | Non-capturing group |
| (?P<name>...) | Named group |
| a\|b | Alternation (OR) |
# Capturing groups
dates = "2024-01-15 and 2023-12-31"
print(re.findall(r"(\d{4})-(\d{2})-(\d{2})", dates))
# [('2024', '01', '15'), ('2023', '12', '31')]
# Named groups: re.search() returns one match object for the first match only, not all matches
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", dates)
print(m.group('year'), m.group('month'), m.group('day'))
# Alternation
print(re.findall(r"cat|dog", "I have a cat and a dog")) # ['cat', 'dog']
# Using groups in sub()
print(re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\2/\3/\1", dates))
# '01/15/2024 and 12/31/2023'
[('2024', '01', '15'), ('2023', '12', '31')]
2024 01 15
['cat', 'dog']
01/15/2024 and 12/31/2023
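The choice between a capturing and a non-capturing group changes what findall() returns. A short sketch, using a made-up list of file names:

```python
import re

files = "photo.jpg diagram.png notes.txt"

# With a capturing group, findall() returns only the captured part.
captured = re.findall(r"\w+\.(jpg|png)", files)

# With a non-capturing group, findall() returns the whole match.
whole = re.findall(r"\w+\.(?:jpg|png)", files)

print(captured)   # ['jpg', 'png']
print(whole)      # ['photo.jpg', 'diagram.png']
```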
9.2.4.6. Groups and Extraction
Parentheses create capture groups. You can extract parts of a match with .group(1), .group(2), etc.
Named groups can make patterns more readable.
record = "OrderID=4821; Customer=Alice; Total=$39.50"
pattern = r'OrderID=(\d+); Customer=([A-Za-z]+); Total=\$(\d+(?:\.\d{2})?)'
m = re.search(pattern, record)
print(m.group(0)) # full match
print(m.group(1)) # order id
print(m.group(2)) # customer
print(m.group(3)) # total amount
OrderID=4821; Customer=Alice; Total=$39.50
4821
Alice
39.50
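The same record can also be parsed with named groups, which is one way to make the extraction more readable. In the sketch below the group names order_id, customer, and total are illustrative choices, not names from the original pattern; groupdict() is the standard match-object method that returns all named groups as a dictionary.

```python
import re

record = "OrderID=4821; Customer=Alice; Total=$39.50"

# Same pattern as before, with a (hypothetical) name for each capture group.
named_pattern = (
    r"OrderID=(?P<order_id>\d+); "
    r"Customer=(?P<customer>[A-Za-z]+); "
    r"Total=\$(?P<total>\d+(?:\.\d{2})?)"
)

m = re.search(named_pattern, record)
fields = m.groupdict()

print(m.group("customer"))   # Alice
print(fields)                # {'order_id': '4821', 'customer': 'Alice', 'total': '39.50'}
```

Named groups still have numbers, so m.group(1) and m.group("order_id") return the same text.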
### EXERCISE: Capture Groups
# Difficulty: Intermediate
line = "name=Bob,age=27,dept=Sales"
# 1. Use one regex with 3 capture groups to extract name, age, dept
# 2. Print each extracted value on its own line
### Your code starts here:
### Your code ends here.
Bob
27
Sales
9.2.4.7. Alternation (OR)
Use | to match one of multiple patterns.
import re
# pattern using alternation
pattern = r"\.(jpg|png|gif)$"
files = [
    "photo.jpg",
    "diagram.png",
    "animation.gif",
    "document.pdf",
    "archive.zip"
]
for file in files:
    if re.search(pattern, file):
        print(f"{file} -> valid image file")
    else:
        print(f"{file} -> not an image")
photo.jpg -> valid image file
diagram.png -> valid image file
animation.gif -> valid image file
document.pdf -> not an image
archive.zip -> not an image
import re
pattern = r"\b(coffee|tea)\b"
sentences = [
    "I like coffee in the morning.",
    "She prefers tea at night.",
    "He drinks water.",
    "Coffee is my favorite."
]
for sentence in sentences:
    match = re.search(pattern, sentence, re.IGNORECASE)
    if match:
        print(f"Found beverage: {match.group()}")
    else:
        print("No beverage found")
Found beverage: coffee
Found beverage: tea
No beverage found
Found beverage: Coffee
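Both patterns above wrap the alternatives in parentheses. That matters because | has very low precedence: without a group it splits the entire pattern, anchors included. The sketch below compares the two forms on a made-up word list.

```python
import re

words = ["cat", "dog", "catalog", "hotdog", "bulldogs"]

# Without a group, | splits the whole pattern: "^cat" OR "dog$".
loose = [w for w in words if re.search(r"^cat|dog$", w)]

# With a group, both anchors apply to each alternative: exactly "cat" or "dog".
strict = [w for w in words if re.search(r"^(?:cat|dog)$", w)]

print(loose)    # ['cat', 'dog', 'catalog', 'hotdog']
print(strict)   # ['cat', 'dog']
```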