{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3dab4e57",
   "metadata": {},
   "source": [
    "# The `re` Module\n",
    "\n",
    "Python's built-in `re` module provides regular expression (regex) support. Its six most commonly used functions are:\n",
    "\n",
    "| Function | Description | Sample Syntax | Return |\n",
    "|---|---|---|---|\n",
    "| `re.search()` | Find **first** match anywhere in the string | `re.search(pattern, text)` | **`Match` object** or `None` |\n",
    "| `re.match()` | Match only at the **start** of the string | `re.match(pattern, text)` | **`Match` object** or `None` |\n",
    "| `re.findall()` | Find **all** matches; return as a **list** | `re.findall(pattern, text)` | `list` of strings |\n",
    "| `re.sub()` | Find and replace matches | `re.sub(pattern, repl, text)` | `str` |\n",
    "| `re.split()` | Split string on a pattern | `re.split(pattern, text)` | `list` of strings |\n",
    "| `re.fullmatch()` | Match the **entire** string against the pattern | `re.fullmatch(pattern, text)` | **`Match` object** or `None` |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c7368273",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "from pathlib import Path\n",
    "\n",
    "# Find project root by looking for _config.yml\n",
    "current = Path.cwd()\n",
    "for parent in [current, *current.parents]:\n",
    "    if (parent / '_config.yml').exists():\n",
    "        project_root = parent\n",
    "        break\n",
    "else:\n",
    "    project_root = Path.cwd().parent.parent\n",
    "\n",
    "# Add project root to path\n",
    "sys.path.insert(0, str(project_root))\n",
    "\n",
    "# Import shared teaching helpers and cell magics\n",
    "from shared import thinkpython, diagram, jupyturtle, structshape\n",
    "from shared.download import download\n"
   ]
  },
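  {
   "cell_type": "markdown",
   "id": "a4c91e07",
   "metadata": {},
   "source": [
    "The three functions that return a `Match` object differ only in where the pattern is allowed to match. The following is a minimal sketch of that difference; the string `sample` is a hypothetical example, not taken from the text used elsewhere in this notebook.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "sample = 'Dracula arrives'   # hypothetical sample string\n",
    "\n",
    "# search: the pattern may appear anywhere in the string\n",
    "print(re.search('arrives', sample))        # a Match object\n",
    "\n",
    "# match: the pattern must begin at index 0\n",
    "print(re.match('arrives', sample))         # None\n",
    "print(re.match('Dracula', sample))         # a Match object\n",
    "\n",
    "# fullmatch: the pattern must cover the whole string\n",
    "print(re.fullmatch('Dracula', sample))     # None\n",
    "print(re.fullmatch('Dracula.*', sample))   # a Match object\n",
    "```\n",
    "\n",
    "In this sketch, `re.match()` and `re.fullmatch()` return `None` whenever the text at the required position does not fit the pattern, even though `re.search()` would find the same pattern elsewhere in the string."
   ]
  },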
  {
   "cell_type": "markdown",
   "id": "d1950d45",
   "metadata": {},
   "source": [
    "## The `Match` object\n",
    "\n",
    "The `re.search()`, `re.match()`, and `re.fullmatch()` functions return a `Match` object when the pattern matches.\n",
    "\n",
    "For example, `re.search(pattern, text)` scans through `text` and returns:\n",
    "\n",
    "- a `Match` object for the **first** location where `pattern` is found, or\n",
    "- `None` if the pattern is not found anywhere in the string.\n",
    "\n",
    "A `Match` object has the following commonly used attributes and methods:\n",
    "\n",
    "| Attribute / Method | Description | Example |\n",
    "|---|---|---|\n",
    "| `.group()` | Returns the matched substring | `m.group()` → `'Dracula'` |\n",
    "| `.start()` | Index where the match begins | `m.start()` → `5` |\n",
    "| `.end()` | Index just past the last matched character | `m.end()` → `12` |\n",
    "| `.span()` | Tuple of `(start, end)` | `m.span()` → `(5, 12)` |\n",
    "| `.string` | The original string that was searched | `m.string` → `'I am Dracula...'` |"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "62713517",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<re.Match object; span=(5, 12), match='Dracula'>"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "text = \"I am Dracula; and I bid you welcome, Mr. Harker, to my house.\"\n",
    "pattern = 'Dracula'\n",
    "\n",
    "result = re.search(pattern, text)     ### pattern: Dracula; text: the line\n",
    "result                              ### the Match object"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ecf6690b",
   "metadata": {},
   "source": [
    "If the pattern appears in the text, `search` returns a `Match` object that contains the results of the search:\n",
    "\n",
    "1. String: it has an attribute named `string` that contains the text that was searched.\n",
    "2. Group: it provides a method called `group` that returns the part of the text that **matched** the pattern.\n",
    "3. Span and Start/End: it provides a method called `span` that returns a tuple of the indices where the match starts and ends; `start` and `end` return each index on its own. The end index is one past the last matched character, as in slicing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "6fdd12a9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I am Dracula; and I bid you welcome, Mr. Harker, to my house.\n",
      "Dracula\n",
      "5\n",
      "12\n",
      "(5, 12)\n"
     ]
    }
   ],
   "source": [
    "print(result.string)\n",
    "print(result.group())\n",
    "print(result.start())\n",
    "print(result.end())\n",
    "print(result.span())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d38ad650",
   "metadata": {},
   "source": [
    ":::{note}\n",
    "`.group()` returns the **matched substring from the text**, that is, the portion of the text that the pattern matched. In simple cases like `re.search('Dracula', text)`, the match is identical to the pattern string. But with a regex like `r'\\$[\\d.]+'`, `.group()` returns something like `'$42.99'`: the actual text that matched, not the pattern expression itself.\n",
    ":::"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2fb89ac5",
   "metadata": {},
   "source": [
    "If the pattern doesn't appear in the text, the return value from `search` is `None`. So we can check whether the search was successful by checking whether the result is `None`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "613a304d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "None\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = re.search('Count', text)\n",
    "print(result)\n",
    "\n",
    "result is None"
   ]
  },
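  {
   "cell_type": "markdown",
   "id": "e5b20d3f",
   "metadata": {},
   "source": [
    "Because a failed search returns `None`, calling a method such as `.group()` on the result without checking first raises an `AttributeError`. A common way to guard against this is sketched below; the list of search words is chosen here only for illustration.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "text = 'I am Dracula; and I bid you welcome, Mr. Harker, to my house.'\n",
    "\n",
    "# 'Harker' appears in the text; 'Count' does not\n",
    "for word in ['Harker', 'Count']:\n",
    "    m = re.search(word, text)\n",
    "    if m is None:\n",
    "        print(word, 'not found')\n",
    "    else:\n",
    "        print(word, 'found at', m.span())\n",
    "```\n",
    "\n",
    "The explicit `is None` test mirrors the check in the cell above. Since a `Match` object is always truthy, a plain `if m:` is a frequently used shorthand for the same guard."
   ]
  },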
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "3a3b88a7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['is', 'is']\n",
      "['is ', 'is ']\n",
      "['is ', 'is ']\n",
      "['is ', 'is ']\n",
      "['is ', 'is ']\n"
     ]
    }
   ],
   "source": [
    "s = \"This is a test of the regular expression system.\"\n",
    "print(re.findall('is', s))    # ['is', 'is']\n",
    "print(re.findall('is.', s))   # ['is ', 'is ']: 'is' followed by any one character (a space here)\n",
    "print(re.findall('is.?', s))  # ['is ', 'is ']: 'is' followed by zero or one character (a space here)\n",
    "print(re.findall('is.?', s, re.IGNORECASE))  # ['is ', 'is ']: same as above, but case-insensitive\n",
    "# re.DOTALL also makes '.' match newline characters (no effect here, since s has no newlines)\n",
    "print(re.findall('is.?', s, re.IGNORECASE | re.DOTALL))  # ['is ', 'is ']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ffef1c34",
   "metadata": {},
   "source": [
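    "The `+` in a pattern means one or more occurrences of the preceding element. In `r'\\$[\\d.]+'`, for example, `[\\d.]+` matches one or more digits or periods following a literal `$`."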
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "daa3fba6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['$42.99', '$7.50']\n",
      "$42.99\n",
      "13 19\n",
      "The price is PRICE and PRICE\n",
      "['one', 'two', 'three']\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "text = \"The price is $42.99 and $7.50\"\n",
    "\n",
    "# findall — get all matches\n",
    "print(re.findall(r\"\\$[\\d.]+\", text))         # ['$42.99', '$7.50']\n",
    "\n",
    "# search — first match object\n",
    "m = re.search(r\"\\$[\\d.]+\", text)\n",
    "print(m.group())                              # '$42.99'\n",
    "print(m.start(), m.end())                     # 13 19 (start and end indices)\n",
    "\n",
    "# sub — replace\n",
    "print(re.sub(r\"\\$[\\d.]+\", \"PRICE\", text))    # 'The price is PRICE and PRICE'\n",
    "\n",
    "# split\n",
    "print(re.split(r\"\\s+\", \"one  two   three\"))  # ['one', 'two', 'three']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "96205219",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: Regex Escape Sequences\n",
    "# Difficulty: Basic\n",
    "s = \"Price: $19.95, code=A_7, spaces here\"\n",
    "# 1. Extract all digit sequences\n",
    "# 2. Extract all word tokens\n",
    "# 3. Extract literal '$' and literal '.' matches\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "8b965a41",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['19', '95', '7']\n",
      "['Price', '19', '95', 'code', 'A_7', 'spaces', 'here']\n",
      "['$']\n",
      "['.']\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "# Solution\n",
    "s = \"Price: $19.95, code=A_7, spaces here\"\n",
    "print(re.findall(r'\\d+', s))\n",
    "print(re.findall(r'\\w+', s))\n",
    "print(re.findall(r'\\$', s))\n",
    "print(re.findall(r'\\.', s))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c5d4f62",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: The Match Object\n",
    "# Difficulty: Basic\n",
    "import re\n",
    "text = \"Customer ID: 4892, Order date: 2024-03-15\"\n",
    "# 1. Use re.search() to find the first 4-digit number in text\n",
    "# 2. Print the matched string, start index, end index, and span\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6408257c",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Solution\n",
    "import re\n",
    "text = \"Customer ID: 4892, Order date: 2024-03-15\"\n",
    "m = re.search(r'\\d{4}', text)\n",
    "print(m.group())\n",
    "print(m.start())\n",
    "print(m.end())\n",
    "print(m.span())"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
