{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a245d84a",
   "metadata": {},
   "source": [
    "# Metacharacters\n",
    "\n",
    "Metacharacters are characters that carry special meaning inside a regex pattern. Instead of matching themselves literally, they instruct the regex engine to do something specific, such as match any character, mark a boundary, or repeat a pattern. Python's `re` module has 14 of them: `. ^ $ * + ? { } [ ] \\ | ( )`. To match one of them as an ordinary character, escape it with a backslash.\n",
    "\n",
    "| Type                       | Character | Meaning                                              | Example             |\n",
    "| -------------------------- | --------- | ---------------------------------------------------- | ------------------- |\n",
    "| Wildcard                   | `.`       | Matches any character except newline                 | `c.t` → cat, cot    |\n",
    "| Anchor                     | `^`       | Start of string                                      | `^Hello`            |\n",
    "| Anchor                     | `$`       | End of string                                        | `end$`              |\n",
    "| Quantifier                 | `*`       | 0 or more repetitions                                | `a*`                |\n",
    "| Quantifier                 | `+`       | 1 or more repetitions                                | `a+`                |\n",
    "| Quantifier                 | `?`       | Optional (0 or 1); after a quantifier, makes it lazy | `colou?r`           |\n",
    "| Quantifier                 | `{}`      | Specific repetition count or range                   | `\\d{3}`             |\n",
    "| Character Class Delimiters | `[]`      | Defines a set of allowed characters                  | `[a-z]`             |\n",
    "| Grouping Delimiters        | `()`      | Groups patterns and captures matches                 | `(cat\\|dog)`        |\n",
    "| Escape                     | `\\`       | Escapes metacharacters or forms special sequences    | `\\d`, `\\w`, `\\.`    |\n",
    "| Alternation                | `\\|`      | Logical OR between patterns                          | `cat\\|dog`          |\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d1adbc55",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "from pathlib import Path\n",
    "\n",
    "# Find project root by looking for _config.yml\n",
    "current = Path.cwd()\n",
    "for parent in [current, *current.parents]:\n",
    "    if (parent / '_config.yml').exists():\n",
    "        project_root = parent\n",
    "        break\n",
    "else:\n",
    "    project_root = Path.cwd().parent.parent\n",
    "\n",
    "# Add project root to path\n",
    "sys.path.insert(0, str(project_root))\n",
    "\n",
    "# Import shared teaching helpers and cell magics\n",
    "from shared import thinkpython, diagram, jupyturtle, structshape\n",
    "from shared.download import download\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a5dcf2aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d07fdcb",
   "metadata": {},
   "source": [
    "To match a metacharacter literally, escape it with a backslash: `\\.` matches a period and `\\$` matches a dollar sign. The sections below go through the metacharacters group by group."
   ]
  },
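  {
   "cell_type": "markdown",
   "id": "b7e41c02",
   "metadata": {},
   "source": [
    "As a minimal sketch of escaping, the snippet below searches a made-up string for prices (`price_text` and the other variable names are illustrative assumptions). It tries the pattern first without escapes, then with `\\$` and `\\.` escaped, and finally uses `re.escape()` to add the backslashes automatically.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "price_text = 'Total: $3.50 (was $4.00)'\n",
    "\n",
    "# Unescaped: '$' is an end-of-string anchor and '.' is a wildcard,\n",
    "# so a digit can never follow the '$' and nothing matches.\n",
    "unescaped = re.findall(r'$\\d.\\d\\d', price_text)\n",
    "print(unescaped)   # []\n",
    "\n",
    "# Escaped: '\\$' and '\\.' match the literal characters.\n",
    "prices = re.findall(r'\\$\\d\\.\\d\\d', price_text)\n",
    "print(prices)      # ['$3.50', '$4.00']\n",
    "\n",
    "# re.escape() backslash-escapes every special character in a string.\n",
    "literal = re.escape('(was $4.00)')\n",
    "found = re.search(literal, price_text).group()\n",
    "print(found)       # (was $4.00)\n",
    "```\n",
    "\n",
    "`re.escape()` is most useful when the literal text comes from user input or a file, where you cannot know in advance which metacharacters it contains."
   ]
  },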
  {
   "cell_type": "markdown",
   "id": "6ce72136",
   "metadata": {},
   "source": [
    "## Quantifiers\n",
    "\n",
    "Quantifiers tell the regex engine how many times the preceding character, group, or character class should match.\n",
    "\n",
    "| Quantifier | Meaning                    | Example    | Matches                     |\n",
    "|------------|----------------------------|------------|-----------------------------|\n",
    "| `*`        | 0 or more                  | `ab*`      | a, ab, abb, abbb, ...       |\n",
    "| `+`        | 1 or more                  | `ab+`      | ab, abb, abbb, ... (not a)  |\n",
    "| `?`        | 0 or 1                     | `ab?`      | a or ab only                |\n",
    "| `{n}`      | Exactly n                  | `\\d{3}`    | 123, 456                    |\n",
    "| `{n,}`     | n or more                  | `\\d{2,}`   | 12, 123, 1234, ...          |\n",
    "| `{n,m}`    | Between n and m, inclusive | `\\d{2,4}`  | 12, 123, 1234               |\n",
    "\n",
    "By default quantifiers are **greedy** (match as much as possible). Add `?` to make them **lazy**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "90c996be",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['<b>bold</b> and <i>italic</i>']\n",
      "['<b>', '</b>', '<i>', '</i>']\n",
      "['123', '456']\n",
      "['12', '123', '1234']\n"
     ]
    }
   ],
   "source": [
    "text = \"<b>bold</b> and <i>italic</i>\"\n",
    "\n",
    "# Greedy — matches as much as possible\n",
    "print(re.findall(r\"<.+>\", text))     # ['<b>bold</b> and <i>italic</i>']\n",
    "\n",
    "# Lazy — matches as little as possible\n",
    "print(re.findall(r\"<.+?>\", text))    # ['<b>', '</b>', '<i>', '</i>']\n",
    "\n",
    "# Exact and ranged quantifiers\n",
    "print(re.findall(r\"\\d{3}\", \"123 4567 89\"))     # ['123', '456']\n",
    "print(re.findall(r\"\\d{2,4}\", \"1 12 123 1234\")) # ['12', '123', '1234']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b76fbc8",
   "metadata": {},
   "source": [
    "In `<.+>`, the `+` is **greedy**. It matches as many characters as possible while still allowing the overall pattern to succeed. So it gobbles everything from the first `<` all the way to the last `>`.\n",
    "\n",
    "Adding `?` after a quantifier switches it to **lazy** mode — instead of matching as much as possible, it now matches **as little as possible**. So `<.+?>` still needs at least one character (that's the `+`), but stops at the earliest `>` it can find."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de7246f4",
   "metadata": {},
   "source": [
    "## Greedy vs Non-greedy\n",
    "\n",
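    "Quantifiers like `*` and `+` are greedy by default; add `?` to make them non-greedy. The next cell shows this with `*`, and also uses the `re.IGNORECASE` flag so that a lowercase character class such as `[a-z]` matches uppercase letters as well."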
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "453800d5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['ALICE@example.com', 'bob@Example.org']\n",
      "['<b>bold</b><i>italic</i>']\n",
      "['<b>', '</b>', '<i>', '</i>']\n"
     ]
    }
   ],
   "source": [
    "text_block = \"\"\"Title: Notes\n",
    "Email: ALICE@example.com\n",
    "Email: bob@Example.org\"\"\"\n",
    "\n",
    "# re.IGNORECASE: the lowercase classes below also match uppercase letters\n",
    "emails = re.findall(r'[a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,}', text_block, flags=re.IGNORECASE)\n",
    "print(emails)\n",
    "\n",
    "# Greedy vs non-greedy on tags\n",
    "html = \"<b>bold</b><i>italic</i>\"\n",
    "print(re.findall(r'<.*>', html))     # greedy\n",
    "print(re.findall(r'<.*?>', html))    # non-greedy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33dbf4fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "### EXERCISE: Flags and Quantifiers\n",
    "# Difficulty: Challenge\n",
    "text_block = \"\"\"Task: clean logs\n",
    "ERROR: Disk full\n",
    "error: retry failed\n",
    "INFO: done\"\"\"\n",
    "# 1. Extract all lines that start with 'error' (case-insensitive) using MULTILINE\n",
    "# 2. From '<x>1</x><x>2</x>', extract tags non-greedily\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "93c34729",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['ERROR: Disk full', 'error: retry failed']\n",
      "['<x>', '</x>', '<x>', '</x>']\n"
     ]
    }
   ],
   "source": [
    "# Solution\n",
    "text_block = \"\"\"Task: clean logs\n",
    "ERROR: Disk full\n",
    "error: retry failed\n",
    "INFO: done\"\"\"\n",
    "\n",
    "errs = re.findall(r'^error:.*$', text_block, flags=re.IGNORECASE | re.MULTILINE)\n",
    "print(errs)\n",
    "\n",
    "print(re.findall(r'<.*?>', '<x>1</x><x>2</x>'))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e80681c",
   "metadata": {},
   "source": [
    "## Anchors\n",
    "\n",
    "Anchors don't match characters — they match positions in the string.\n",
    "\n",
    "| Anchor | Meaning |\n",
    "|---|---|\n",
    "| `^` | Start of string (or of each line with `re.MULTILINE`) |\n",
    "| `$` | End of string or just before a trailing newline (or end of each line with `re.MULTILINE`) |\n",
    "| `\\b` | Word boundary |\n",
    "| `\\B` | Non-word boundary |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "2b48b6ae",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Hello']\n",
      "['world']\n",
      "['cat']\n",
      "['cat', 'cat', 'cat']\n",
      "['line1', 'line2', 'line3']\n"
     ]
    }
   ],
   "source": [
    "# ^ and $\n",
    "print(re.findall(r\"^\\w+\", \"Hello world\"))     # ['Hello'] — only at start\n",
    "print(re.findall(r\"\\w+$\", \"Hello world\"))     # ['world'] — only at end\n",
    "\n",
    "# Word boundary \\b\n",
    "text = \"cat catfish concatenate\"\n",
    "print(re.findall(r\"\\bcat\\b\", text))           # ['cat'] — whole word only\n",
    "print(re.findall(r\"cat\", text))               # ['cat', 'cat', 'cat'] — anywhere\n",
    "\n",
    "# Multiline\n",
    "multi = \"line1\\nline2\\nline3\"\n",
    "print(re.findall(r\"^\\w+\", multi, re.MULTILINE))  # ['line1', 'line2', 'line3']"
   ]
  },
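  {
   "cell_type": "markdown",
   "id": "c93d5f17",
   "metadata": {},
   "source": [
    "The anchor table lists `\\B`, which the cell above does not exercise. As a small sketch reusing the same sample text (the variable names `inside` and `prefix` are illustrative), `\\B` succeeds only where there is no word boundary, so it can pick out `cat` when it is embedded in a longer word:\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "text = 'cat catfish concatenate'\n",
    "\n",
    "# \\Bcat: 'cat' must be preceded by a word character (no boundary before it)\n",
    "inside = re.findall(r'\\Bcat', text)\n",
    "print(inside)   # ['cat'], from 'concatenate'\n",
    "\n",
    "# cat\\B: 'cat' must be followed by a word character (no boundary after it)\n",
    "prefix = re.findall(r'cat\\B', text)\n",
    "print(prefix)   # ['cat', 'cat'], from 'catfish' and 'concatenate'\n",
    "```\n",
    "\n",
    "In both patterns the standalone word `cat` is skipped, because it has a word boundary on each side."
   ]
  },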
  {
   "cell_type": "markdown",
   "id": "dd02837e",
   "metadata": {},
   "source": [
    "## Character Classes\n",
    "\n",
    "Before writing larger patterns, it helps to know the core building blocks. **Character classes** match one character from a defined set. They're written with square brackets `[ ]`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60663719",
   "metadata": {},
   "source": [
    "| Pattern | Matches |\n",
    "| --- | --- |\n",
    "| `[aeiou]` | any single lowercase vowel |\n",
    "| `[a-z]` | any lowercase letter |\n",
    "| `[A-Z]` | any uppercase letter |\n",
    "| `[0-9]` | any digit |\n",
    "| `[a-zA-Z0-9]` | any alphanumeric character |\n",
    "| `[^aeiou]` | any single character that is not a lowercase vowel (a leading `^` negates the set) |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "230aa9bf",
   "metadata": {},
   "source": [
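    "Shorthand classes match common sets without brackets. `\\d`, `\\w`, and `\\s` also work inside brackets, whereas `.` inside brackets is just a literal dot:"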
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d70de733",
   "metadata": {},
   "source": [
    "| Pattern | Meaning | Example Match |\n",
    "|---|---|---|\n",
    "| `.` | any character (except newline) | `a`, `7`, `!` |\n",
    "| `\\d` | digit | `7` |\n",
    "| `\\w` | word character (letter, digit, or underscore) | `A`, `x`, `9`, `_` |\n",
    "| `\\s` | whitespace | space, tab, newline |"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44b5903b",
   "metadata": {},
   "source": [
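    "The next cell applies `\\w` on its own and then with the `+` and `*` quantifiers. Because `\\w*` can match zero characters, the last result also contains empty strings at positions where no word character begins."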
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bbe068a5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['T', 'h', 'i', 's', 'i', 's', 'a', 'r', 'e', 'g', 'u', 'l', 'a', 'r', 'e', 'x', 'p', 'r', 'e', 's', 's', 'i', 'o', 'n']\n",
      "['This', 'is', 'a', 'regular', 'expression']\n",
      "['This', '', 'is', '', 'a', '', 'regular', '', 'expression', '', '']\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "s = \"This is a regular expression.\"\n",
    "print(re.findall(r'\\w', s))     ### \\w matches any alphanumeric character (letters, digits, and underscore)\n",
    "print(re.findall(r'\\w+', s))    ### + means \"one or more occurrences of the preceding pattern\"\n",
    "print(re.findall(r'\\w*', s))    ### * means \"zero or more occurrences of the preceding pattern\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a7268cf",
   "metadata": {},
   "source": [
    "`\\s` matches these ASCII whitespace characters (and, for `str` patterns, other Unicode whitespace as well):\n",
    "\n",
    "| Character | Name |\n",
    "|---|---|\n",
    "| `\\n` | newline |\n",
    "| `\\t` | tab |\n",
    "| `\\r` | carriage return |\n",
    "| ` ` | space |\n",
    "| `\\f` | form feed |\n",
    "| `\\v` | vertical tab |\n",
    "\n",
    "Use raw strings like `r'\\d+'` for regex patterns so backslashes are interpreted correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6dfe973",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['1', '2', '3']\n",
      "['123']\n",
      "['Hello', 'World', '123', 'foo_bar']\n",
      "['Hello', 'World']\n",
      "['123!', '_']\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "text = \"Hello World 123! foo_bar\"\n",
    "\n",
    "print(re.findall(r\"\\d\", text))        # individual digits\n",
    "print(re.findall(r\"\\d+\", text))       # consecutive digits\n",
    "print(re.findall(r\"\\w+\", text))       # words (incl. underscore)\n",
    "print(re.findall(r\"[A-Z][a-z]+\", text))  # capitalized words\n",
    "print(re.findall(r\"[^a-zA-Z\\s]+\", text)) # non-alpha, non-space"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16a6a5c6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['42', '09', '30', '2026', '03', '11']\n",
      "['User_', 'logged', 'in', 'at', 'on']\n",
      "['09:30']\n",
      "['2026-03-11']\n"
     ]
    }
   ],
   "source": [
    "sample = \"User_42 logged in at 09:30 on 2026-03-11\"\n",
    "\n",
    "print(re.findall(r'\\d+', sample))                  # all digit runs\n",
    "print(re.findall(r'[A-Za-z_]+', sample))            # word-like alphabetic tokens\n",
    "print(re.findall(r'\\d{2}:\\d{2}', sample))        # HH:MM time\n",
    "print(re.findall(r'\\d{4}-\\d{2}-\\d{2}', sample))  # YYYY-MM-DD date"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf8a0af3",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: Regex Syntax Essentials\n",
    "# Difficulty: Basic\n",
    "s = \"IDs: A12, B7, C999\"\n",
    "# 1. Extract all uppercase letters\n",
    "# 2. Extract all digit sequences\n",
    "# 3. Extract letter+digit tokens like A12, B7, C999\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30189825",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['I', 'D', 'A', 'B', 'C']\n",
      "['12', '7', '999']\n",
      "['A12', 'B7', 'C999']\n"
     ]
    }
   ],
   "source": [
    "# Solution\n",
    "s = \"IDs: A12, B7, C999\"\n",
    "print(re.findall(r'[A-Z]', s))\n",
    "print(re.findall(r'\\d+', s))\n",
    "print(re.findall(r'[A-Z]\\d+', s))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "161e1a6b",
   "metadata": {},
   "source": [
    "## Groups & Capturing\n",
    "\n",
    "Parentheses `()` group part of a pattern into a single unit. A **capturing group** also saves the matched text so you can extract or reuse it afterward. Use **non-capturing groups** `(?:...)` when you need grouping for structure but don't need to extract the text. **Named groups** `(?P<name>...)` let you refer to captured text by name instead of number.\n",
    "\n",
    "| Syntax | Meaning |\n",
    "|---|---|\n",
    "| `(...)` | Capturing group |\n",
    "| `(?:...)` | Non-capturing group |\n",
    "| `(?P<name>...)` | Named group |\n",
    "| `\\|` | Alternation (OR) |\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "7f845756",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('2024', '01', '15'), ('2023', '12', '31')]\n",
      "2024 01 15\n",
      "['cat', 'dog']\n",
      "01/15/2024 and 12/31/2023\n"
     ]
    }
   ],
   "source": [
    "# Capturing groups\n",
    "dates = \"2024-01-15 and 2023-12-31\"\n",
    "print(re.findall(r\"(\\d{4})-(\\d{2})-(\\d{2})\", dates))\n",
    "# [('2024', '01', '15'), ('2023', '12', '31')]\n",
    "\n",
    "# Named groups: re.search returns a single Match object for the first match only, not all matches\n",
    "m = re.search(r\"(?P<year>\\d{4})-(?P<month>\\d{2})-(?P<day>\\d{2})\", dates)\n",
    "print(m.group('year'), m.group('month'), m.group('day'))\n",
    "\n",
    "# Alternation\n",
    "print(re.findall(r\"cat|dog\", \"I have a cat and a dog\"))  # ['cat', 'dog']\n",
    "\n",
    "# Using groups in sub()\n",
    "print(re.sub(r\"(\\d{4})-(\\d{2})-(\\d{2})\", r\"\\2/\\3/\\1\", dates))\n",
    "# '01/15/2024 and 12/31/2023'"
   ]
  },
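  {
   "cell_type": "markdown",
   "id": "d2a86e40",
   "metadata": {},
   "source": [
    "Non-capturing groups `(?:...)` are described above but not shown on their own. The sketch below (variable names are illustrative) contrasts them with capturing groups in `re.findall`, where the difference is easy to see: when a pattern contains a capturing group, `findall` returns the group's text instead of the whole match.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "dates = '2024-01-15 and 2023-12-31'\n",
    "\n",
    "# Capturing group: findall returns only the captured year.\n",
    "captured = re.findall(r'(\\d{4})-\\d{2}-\\d{2}', dates)\n",
    "print(captured)   # ['2024', '2023']\n",
    "\n",
    "# Non-capturing group: findall returns the whole match.\n",
    "whole = re.findall(r'(?:\\d{4})-\\d{2}-\\d{2}', dates)\n",
    "print(whole)      # ['2024-01-15', '2023-12-31']\n",
    "\n",
    "# Grouping purely for structure: repeat 'ab' as a unit.\n",
    "repeated = re.findall(r'(?:ab)+', 'ab abab ababab')\n",
    "print(repeated)   # ['ab', 'abab', 'ababab']\n",
    "```\n",
    "\n",
    "A reasonable rule of thumb is to capture only what you intend to extract and use `(?:...)` for everything else, so that group numbers stay stable as the pattern grows."
   ]
  },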
  {
   "cell_type": "markdown",
   "id": "b006ebdc",
   "metadata": {},
   "source": [
    "## Groups and Extraction\n",
    "\n",
    "Parentheses create capture groups. On a match object, `.group(0)` returns the full match, and `.group(1)`, `.group(2)`, etc. return the groups numbered by the order of their opening parentheses.\n",
    "Named groups can make patterns more readable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "id": "6e827051",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "OrderID=4821; Customer=Alice; Total=$39.50\n",
      "4821\n",
      "Alice\n",
      "39.50\n"
     ]
    }
   ],
   "source": [
    "record = \"OrderID=4821; Customer=Alice; Total=$39.50\"\n",
    "pattern = r'OrderID=(\\d+); Customer=([A-Za-z]+); Total=\\$(\\d+(?:\\.\\d{2})?)'\n",
    "m = re.search(pattern, record)\n",
    "\n",
    "print(m.group(0))  # full match\n",
    "print(m.group(1))  # order id\n",
    "print(m.group(2))  # customer\n",
    "print(m.group(3))  # total amount"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "id": "13f21af9",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: Capture Groups\n",
    "# Difficulty: Intermediate\n",
    "line = \"name=Bob,age=27,dept=Sales\"\n",
    "# 1. Use one regex with 3 capture groups to extract name, age, dept\n",
    "# 2. Print each extracted value on its own line\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "id": "2224328f",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Bob\n",
      "27\n",
      "Sales\n"
     ]
    }
   ],
   "source": [
    "# Solution\n",
    "line = \"name=Bob,age=27,dept=Sales\"\n",
    "m = re.search(r'name=([A-Za-z]+),age=(\\d+),dept=([A-Za-z]+)', line)\n",
    "print(m.group(1))\n",
    "print(m.group(2))\n",
    "print(m.group(3))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cce69516",
   "metadata": {},
   "source": [
    "## Alternation (OR)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9a02878",
   "metadata": {},
   "source": [
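    "Use `|` to match any one of several alternative patterns. Wrapping the alternatives in parentheses, as the examples below do, limits how far the `|` reaches."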
   ]
  },
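  {
   "cell_type": "markdown",
   "id": "e61b07a9",
   "metadata": {},
   "source": [
    "One detail worth seeing in code is that `|` has the lowest precedence of the regex operators, so without parentheses it splits the entire pattern. The sketch below uses a made-up string (`words`, `loose`, and `tight` are illustrative names) to compare the two forms:\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "words = 'gray grey gra ey'\n",
    "\n",
    "# No parentheses: the alternatives are the whole left side and the\n",
    "# whole right side, that is, 'gra' OR 'ey'.\n",
    "loose = re.findall(r'gra|ey', words)\n",
    "print(loose)   # ['gra', 'ey', 'gra', 'ey']\n",
    "\n",
    "# Non-capturing group: only the vowel alternates.\n",
    "tight = re.findall(r'gr(?:a|e)y', words)\n",
    "print(tight)   # ['gray', 'grey']\n",
    "```\n",
    "\n",
    "The group here is non-capturing so that `re.findall` returns the whole word instead of just the vowel."
   ]
  },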
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "c3e51f4a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "photo.jpg -> valid image file\n",
      "diagram.png -> valid image file\n",
      "animation.gif -> valid image file\n",
      "document.pdf -> not an image\n",
      "archive.zip -> not an image\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "# pattern using alternation\n",
    "pattern = r\"\\.(jpg|png|gif)$\"\n",
    "\n",
    "files = [\n",
    "    \"photo.jpg\",\n",
    "    \"diagram.png\",\n",
    "    \"animation.gif\",\n",
    "    \"document.pdf\",\n",
    "    \"archive.zip\"\n",
    "]\n",
    "\n",
    "for file in files:\n",
    "    if re.search(pattern, file):\n",
    "        print(f\"{file} -> valid image file\")\n",
    "    else:\n",
    "        print(f\"{file} -> not an image\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "ff59c4e8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found beverage: coffee\n",
      "Found beverage: tea\n",
      "No beverage found\n",
      "Found beverage: Coffee\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "pattern = r\"\\b(coffee|tea)\\b\"\n",
    "\n",
    "sentences = [\n",
    "    \"I like coffee in the morning.\",\n",
    "    \"She prefers tea at night.\",\n",
    "    \"He drinks water.\",\n",
    "    \"Coffee is my favorite.\"\n",
    "]\n",
    "\n",
    "for sentence in sentences:\n",
    "    match = re.search(pattern, sentence, re.IGNORECASE)\n",
    "\n",
    "    if match:\n",
    "        print(f\"Found beverage: {match.group()}\")\n",
    "    else:\n",
    "        print(\"No beverage found\")\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}