{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b5861fdc",
   "metadata": {},
   "source": [
    "# Applications\n",
    "\n",
    "## Cleaning Text\n",
    "\n",
    "Before we can search the text of *Dracula*, we need to download it from Project Gutenberg and remove the header and footer information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a196f4ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "from pathlib import Path\n",
    "\n",
    "# Find project root by looking for _config.yml\n",
    "current = Path.cwd()\n",
    "for parent in [current, *current.parents]:\n",
    "    if (parent / '_config.yml').exists():\n",
    "        project_root = parent\n",
    "        break\n",
    "else:\n",
    "    project_root = Path.cwd().parent.parent\n",
    "\n",
    "# Add project root to path\n",
    "sys.path.insert(0, str(project_root))\n",
    "\n",
    "# Import shared teaching helpers and cell magics\n",
    "from shared import thinkpython, diagram, jupyturtle, structshape\n",
    "from shared.download import download\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27c690d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d1afc4a3",
   "metadata": {},
   "source": [
    "We'll download the text of *Dracula* from Project Gutenberg and save it to the `data` folder. Then we'll clean the file and save the cleaned version in the same folder. All subsequent analysis will read the cleaned file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "id": "68b04060",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dracula already downloaded: /Users/tcn85/workspace/py/data/pg345.txt\n"
     ]
    }
   ],
   "source": [
    "from pathlib import Path\n",
    "from urllib.request import urlretrieve\n",
    "\n",
    "data_dir = project_root / 'data'\n",
    "data_dir.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "# Download Dracula text to the project data folder\n",
    "url = 'https://www.gutenberg.org/files/345/345-0.txt'\n",
    "raw_path = data_dir / 'pg345.txt'\n",
    "clean_path = data_dir / 'pg345_cleaned.txt'\n",
    "if not raw_path.exists():\n",
    "    urlretrieve(url, raw_path)\n",
    "    print('Downloaded Dracula to', raw_path)\n",
    "else:\n",
    "    print('Dracula already downloaded:', raw_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "id": "1dfd4fd3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# download('https://www.gutenberg.org/cache/epub/345/pg345.txt');"
   ]
  },
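  {
   "cell_type": "code",
   "execution_count": 70,
   "id": "f110f8bc",
   "metadata": {},
   "outputs": [],
   "source": [
    "def clean_file(infile, outfile):\n",
    "    \"\"\"Copy infile to outfile, dropping only the marker lines.\n",
    "\n",
    "    The header and footer text are kept; the next cell redefines\n",
    "    clean_file so that they are removed as well.\n",
    "    \"\"\"\n",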
    "    with open(infile, encoding='utf8') as fin, open(outfile, 'w', encoding='utf8') as fout:\n",
    "        for line in fin:\n",
    "            if not is_special_line(line):\n",
    "                fout.write(line)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d759ef63",
   "metadata": {},
   "outputs": [],
   "source": [
    "def clean_file(input_file, output_file):\n",
    "    \"\"\"Copy the lines between the two marker lines to output_file.\"\"\"\n",
    "    with open(input_file, encoding='utf-8') as reader, open(output_file, 'w', encoding='utf-8') as writer:\n",
    "        # Skip the header, up to and including the first marker line\n",
    "        for line in reader:\n",
    "            if is_special_line(line):\n",
    "                break\n",
    "\n",
    "        # Copy the body, stopping at the second marker line\n",
    "        for line in reader:\n",
    "            if is_special_line(line):\n",
    "                break\n",
    "            writer.write(line)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "id": "28577dec",
   "metadata": {},
   "outputs": [],
   "source": [
    "def is_special_line(line):\n",
    "    \"\"\"Return True if the line marks the start or end of the Gutenberg content.\"\"\"\n",
    "    return line.startswith('***')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "id": "d6fb49c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# def is_special_line(line):\n",
    "#     return line.strip().startswith('*** ')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "id": "9f689533",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cleaned file saved to /Users/tcn85/workspace/py/data/pg345_cleaned.txt\n"
     ]
    }
   ],
   "source": [
    "# Clean the Dracula text and save to data/pg345_cleaned.txt\n",
    "clean_file(raw_path, clean_path)\n",
    "print('Cleaned file saved to', clean_path)"
   ]
  },
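  {
   "cell_type": "markdown",
   "id": "3e7a91b0",
   "metadata": {},
   "source": [
    "The second version of `clean_file` relies on the fact that a file object is an iterator: the second `for` loop resumes where the first one stopped. As a minimal sketch of that idea, the cell below applies the same two-loop logic to a few made-up lines held in memory. The helper name `extract_body` and the sample text are assumptions for illustration, and `io.StringIO` stands in for an open file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e7a91b1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "\n",
    "def extract_body(lines):\n",
    "    \"\"\"Return the lines between the first two marker lines (hypothetical helper).\"\"\"\n",
    "    lines = iter(lines)  # make sure both loops share one iterator\n",
    "    body = []\n",
    "    for line in lines:  # skip the header, including the first marker\n",
    "        if line.startswith('***'):\n",
    "            break\n",
    "    for line in lines:  # collect the body until the second marker\n",
    "        if line.startswith('***'):\n",
    "            break\n",
    "        body.append(line)\n",
    "    return body\n",
    "\n",
    "# Made-up sample that imitates the layout of a Project Gutenberg file\n",
    "sample_text = (\n",
    "    'License header\\n'\n",
    "    '*** START OF THE PROJECT GUTENBERG EBOOK ***\\n'\n",
    "    'First line of the book\\n'\n",
    "    'Second line of the book\\n'\n",
    "    '*** END OF THE PROJECT GUTENBERG EBOOK ***\\n'\n",
    "    'License footer\\n'\n",
    ")\n",
    "extract_body(io.StringIO(sample_text))"
   ]
  },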
  {
   "cell_type": "markdown",
   "id": "521a7c5a",
   "metadata": {},
   "source": [
    "Putting all that together, here's a function that loops through the lines in the book until it finds one that matches the given pattern, and returns the `Match` object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "id": "dd3afac4",
   "metadata": {},
   "outputs": [],
   "source": [
    "def find_first(pattern, path=clean_path):\n",
    "    with open(path, encoding='utf8') as f:\n",
    "        for line in f:\n",
    "            result = re.search(pattern, line)\n",
    "            if result is not None:\n",
    "                return result"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52b080be",
   "metadata": {},
   "source": [
    "We can use it to find the first mention of a character."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "id": "a74228d0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'CHAPTER I. Jonathan Harker’s Journal\\n'"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = find_first('Harker')\n",
    "result.string"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "565413cf",
   "metadata": {},
   "source": [
    "For this example, we didn't have to use regular expressions -- we could have done the same thing more easily with the `in` operator.\n",
    "But regular expressions can do things the `in` operator cannot.\n",
    "\n",
    "For example, if the pattern includes the vertical bar character, `'|'`, it can match either the sequence on the left or the sequence on the right.\n",
    "Suppose we want to find the first mention of Mina Murray in the book, but we are not sure whether she is referred to by first name or last.\n",
    "We can use the following pattern, which matches either name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "id": "2d042bed",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'CHAPTER V. Letters—Lucy and Mina\\n'"
      ]
     },
     "execution_count": 77,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pattern = 'Mina|Murray'\n",
    "result = find_first(pattern)\n",
    "result.string"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b62fd404",
   "metadata": {},
   "source": [
    "We can use a pattern like this to see how many times a character is mentioned by either name.\n",
    "Here's a function that loops through the book and counts the number of lines that match the given pattern."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "id": "35dd291b",
   "metadata": {},
   "outputs": [],
   "source": [
    "def count_matches(pattern, path=clean_path):\n",
    "    count = 0\n",
    "    with open(path, encoding='utf8') as f:\n",
    "        for line in f:\n",
    "            result = re.search(pattern, line)\n",
    "            if result is not None:\n",
    "                count += 1\n",
    "    return count"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "adaf5bb1",
   "metadata": {},
   "source": [
    "Now let's count the lines that mention Mina by either name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "id": "585882de",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "229"
      ]
     },
     "execution_count": 79,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "count_matches('Mina|Murray')"
   ]
  },
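  {
   "cell_type": "markdown",
   "id": "5b2c48d0",
   "metadata": {},
   "source": [
    "Because `count_matches` counts lines, a line that mentions a name twice still counts once. As a rough sketch of the difference, the cell below compares a per-line count with a per-occurrence count based on `re.findall`. The lines in `sample_lines` are made up for illustration and do not come from the book."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b2c48d1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Made-up lines, for illustration only\n",
    "sample_lines = [\n",
    "    'Mina wrote to Lucy, and Lucy wrote back to Mina.',\n",
    "    'Miss Murray arrived late.',\n",
    "    'No one else was mentioned.',\n",
    "]\n",
    "name_pattern = 'Mina|Murray'\n",
    "\n",
    "# Number of lines with at least one match (what count_matches measures)\n",
    "line_count = sum(1 for text in sample_lines if re.search(name_pattern, text))\n",
    "\n",
    "# Total number of matches across all lines\n",
    "occurrence_count = sum(len(re.findall(name_pattern, text)) for text in sample_lines)\n",
    "\n",
    "line_count, occurrence_count"
   ]
  },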
  {
   "cell_type": "markdown",
   "id": "81a35c6f",
   "metadata": {},
   "source": [
    "The special character `'^'` matches the beginning of a string, so we can find a line that starts with a given pattern."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "id": "de17fdda",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Dracula, jumping to his feet, said:--\\n'"
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = find_first('^Dracula')\n",
    "result.string"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c7cfb3f",
   "metadata": {},
   "source": [
    "And the special character `'$'` matches the end of a string, so we can find a line that ends with a given pattern (ignoring the newline at the end)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "id": "b7dbd7ef",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'by five o’clock, we must start off; for it won’t do to leave Mrs. Harker\\n'"
      ]
     },
     "execution_count": 81,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = find_first('Harker$')\n",
    "result.string"
   ]
  },
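  {
   "cell_type": "markdown",
   "id": "8d41f6a0",
   "metadata": {},
   "source": [
    "Each line read from the file ends with a newline, yet `'Harker$'` still matches, because `$` matches at the very end of the string and also just before a newline that ends the string. The short sketch below shows this with a made-up string (`anchor_text` is not from the book), along with the `re.MULTILINE` flag, which makes `^` and `$` apply to every line inside a longer string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d41f6a1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Made-up text with three lines, for illustration only\n",
    "anchor_text = 'Dracula rose.\\nThen came Mrs. Harker\\nand Dr. Seward\\n'\n",
    "\n",
    "# Without flags, $ matches only at the end of the string\n",
    "# or just before a newline that ends the string\n",
    "harker_plain = re.search('Harker$', anchor_text)\n",
    "seward_plain = re.search('Seward$', anchor_text)\n",
    "print(harker_plain is not None, seward_plain is not None)\n",
    "\n",
    "# With re.MULTILINE, ^ and $ also match at the start and end of each line\n",
    "harker_multiline = re.search('Harker$', anchor_text, flags=re.MULTILINE)\n",
    "first_words = re.findall('^[A-Za-z]+', anchor_text, flags=re.MULTILINE)\n",
    "print(harker_multiline is not None, first_words)"
   ]
  },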
  {
   "cell_type": "code",
   "execution_count": 82,
   "id": "5f14fc19",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: Download and Clean Text\n",
    "# Difficulty: Intermediate\n",
    "# 1. Use raw_path and clean_path to print whether each file exists\n",
    "# 2. If clean_path does not exist, run clean_file(raw_path, clean_path)\n",
    "# 3. Print the size (in bytes) of clean_path\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "id": "92f55807",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "True True\n",
      "852703\n"
     ]
    }
   ],
   "source": [
    "# Solution\n",
    "print(raw_path.exists(), clean_path.exists())\n",
    "if not clean_path.exists():\n",
    "    clean_file(raw_path, clean_path)\n",
    "print(clean_path.stat().st_size)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f67418ba",
   "metadata": {},
   "source": [
    "## String Substitution\n",
    "\n",
    "Bram Stoker was born in Ireland, and when *Dracula* was published in 1897, he was living in England.\n",
    "So we would expect him to use the British spelling of words like \"centre\" and \"colour\".\n",
    "To check, we can use the following pattern, which matches either \"centre\" or the American spelling \"center\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "id": "a2557856",
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = 'cent(er|re)'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e197ea79",
   "metadata": {},
   "source": [
    "In this pattern, the parentheses enclose the part of the pattern the vertical bar applies to.\n",
    "So this pattern matches a sequence that starts with `'cent'` and ends with either `'er'` or `'re'`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "id": "9912bca3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'horseshoe of the Carpathians, as if it were the centre of some sort of\\n'"
      ]
     },
     "execution_count": 85,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = find_first(pattern)\n",
    "result.string"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "994b3902",
   "metadata": {},
   "source": [
    "As expected, he used the British spelling.\n",
    "\n",
    "We can also check whether he used the British spelling of \"colour\".\n",
    "The following pattern uses the special character `'?'`, which means that the previous character is optional."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "id": "5648ad9c",
   "metadata": {},
   "outputs": [],
   "source": [
    "pattern = 'colou?r'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64633322",
   "metadata": {},
   "source": [
    "This pattern matches either \"colour\" with the `'u'` or \"color\" without it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "id": "2caa4b8c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'undergarment with long double apron, front, and back, of coloured stuff\\n'"
      ]
     },
     "execution_count": 87,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result = find_first(pattern)\n",
    "line = result.string\n",
    "line"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3dbb91ce",
   "metadata": {},
   "source": [
    "Again, as expected, he used the British spelling.\n",
    "\n",
    "Now suppose we want to produce an edition of the book with American spellings.\n",
    "We can use the `sub` function in the `re` module, which does **string substitution**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "id": "c252a3b7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'undergarment with long double apron, front, and back, of colored stuff\\n'"
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.sub(pattern, 'color', line)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2baef97d",
   "metadata": {},
   "source": [
    "The first argument is the pattern we want to find and replace, the second is what we want to replace it with, and the third is the string we want to search.\n",
    "In the result, you can see that \"colour\" has been replaced with \"color\"."
   ]
  },
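  {
   "cell_type": "markdown",
   "id": "a94d07c0",
   "metadata": {},
   "source": [
    "By default, `re.sub` replaces every non-overlapping match in the string, not just the first. The sketch below uses a made-up sentence to show that behavior, the optional `count` argument that limits the number of replacements, and `re.subn`, which also reports how many replacements were made."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a94d07c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Made-up sentence, for illustration only\n",
    "sentence = 'The colour of her dress matched the colour of the sky.'\n",
    "\n",
    "# Every match is replaced by default\n",
    "all_replaced = re.sub('colou?r', 'color', sentence)\n",
    "print(all_replaced)\n",
    "\n",
    "# count=1 replaces only the first match\n",
    "first_replaced = re.sub('colou?r', 'color', sentence, count=1)\n",
    "print(first_replaced)\n",
    "\n",
    "# re.subn returns the new string and the number of replacements\n",
    "new_sentence, num_replaced = re.subn('colou?r', 'color', sentence)\n",
    "print(num_replaced)"
   ]
  },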
  {
   "cell_type": "code",
   "execution_count": 91,
   "id": "35c380ec",
   "metadata": {},
   "outputs": [],
   "source": [
    "# I used this function to search for lines to use as examples\n",
    "\n",
    "def all_matches(pattern, path=clean_path):\n",
    "    with open(path, encoding='utf8') as f:\n",
    "        for line in f:\n",
    "            result = re.search(pattern, line)\n",
    "            if result:\n",
    "                print(line.strip())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "id": "53b797ca",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "weather. As I stood, the driver jumped again into his seat and shook the\n",
      "weatherworn, was still complete; but it was evidently many a day since\n",
      "it is a buoy with a bell, which swings in bad weather, and sends in a\n",
      "am awakened by her moving about the room. Fortunately, the weather is so\n",
      "learn the weather signs. To-day is a grey day, and the sun as I write is\n",
      "experienced here, with results both strange and unique. The weather had\n",
      "kept watch on weather signs from the East Cliff, foretold in an emphatic\n",
      "_22 July_.--Rough weather last three days, and all hands busy with\n",
      "weather. Passed Gibralter and out through Straits. All well.\n",
      "and entering on the Bay of Biscay with wild weather ahead, and yet last\n",
      "weather influences as we know that the Count can bring to bear; and if\n",
      "that I am fully armed as there may be wolves; the weather is getting\n"
     ]
    }
   ],
   "source": [
    "### e.g., \n",
    "\n",
    "all_matches('weather')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "id": "1832203b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Here's the pattern I used (which uses some features we haven't seen)\n",
    "\n",
    "# names = r'(?<!\\.\\s)[A-Z][a-zA-Z]+'\n",
    "\n",
    "# all_matches(names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "id": "656f4723",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: String Substitution\n",
    "# Difficulty: Intermediate\n",
    "sample = \"The colour of the city centre changed overnight.\"\n",
    "# 1. Replace British spellings with American spellings using regex:\n",
    "#    colour -> color, centre -> center\n",
    "# 2. Print the transformed sentence\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "id": "ec5f54c7",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The color of the city center changed overnight.\n"
     ]
    }
   ],
   "source": [
    "# Solution\n",
    "import re\n",
    "\n",
    "sample = \"The colour of the city centre changed overnight.\"\n",
    "sample = re.sub(r'colou?r', 'color', sample)\n",
    "sample = re.sub(r'cent(er|re)', 'center', sample)\n",
    "print(sample)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5fd5be89",
   "metadata": {},
   "source": [
    "## `re.fullmatch()` for Validation\n",
    "\n",
    "`re.fullmatch(pattern, text)` succeeds only if the **entire** string matches the pattern.\n",
    "This is the right tool for validation tasks (IDs, simple emails, phone formats, etc.)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "id": "a50bff32",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "EMP-0001 True\n",
      "EMP-12 False\n",
      "AEMP-0001 False\n",
      "EMP-12345 False\n"
     ]
    }
   ],
   "source": [
    "employee_id_pattern = r'EMP-\\d{4}'\n",
    "ids = ['EMP-0001', 'EMP-12', 'AEMP-0001', 'EMP-12345']\n",
    "\n",
    "for emp_id in ids:\n",
    "    print(emp_id, bool(re.fullmatch(employee_id_pattern, emp_id)))"
   ]
  },
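  {
   "cell_type": "markdown",
   "id": "c60b95e0",
   "metadata": {},
   "source": [
    "To see why `re.fullmatch` suits validation, it can help to compare it with `re.search` and `re.match` on the same inputs. The sketch below reuses the employee ID pattern from the previous cell; the names `id_pattern` and `comparison` are introduced here only for illustration. Each printed row shows the candidate followed by the results of `search`, `match`, and `fullmatch`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c60b95e1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "id_pattern = r'EMP-\\d{4}'\n",
    "\n",
    "comparison = {}\n",
    "for candidate in ['EMP-0001', 'AEMP-0001', 'EMP-12345']:\n",
    "    comparison[candidate] = (\n",
    "        bool(re.search(id_pattern, candidate)),     # pattern found anywhere\n",
    "        bool(re.match(id_pattern, candidate)),      # pattern found at the start\n",
    "        bool(re.fullmatch(id_pattern, candidate)),  # pattern covers the whole string\n",
    "    )\n",
    "    print(candidate, *comparison[candidate])"
   ]
  },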
  {
   "cell_type": "code",
   "execution_count": 95,
   "id": "4f4e253b",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: Full String Validation\n",
    "# Difficulty: Intermediate\n",
    "codes = ['CS-101', 'MATH-240', 'CS101', 'EE-7']\n",
    "# A valid course code must be: 2-4 uppercase letters, a dash, then 3 digits.\n",
    "# 1. Write the regex pattern\n",
    "# 2. Print each code with True/False using re.fullmatch\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "id": "98b2b2e6",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CS-101 True\n",
      "MATH-240 True\n",
      "CS101 False\n",
      "EE-7 False\n"
     ]
    }
   ],
   "source": [
    "# Solution\n",
    "codes = ['CS-101', 'MATH-240', 'CS101', 'EE-7']\n",
    "pattern = r'[A-Z]{2,4}-\\d{3}'\n",
    "for code_str in codes:\n",
    "    print(code_str, bool(re.fullmatch(pattern, code_str)))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62aa23e7",
   "metadata": {},
   "source": [
    "## Quick Reference\n",
    "\n",
    "**Characters**\n",
    "\n",
    "| Pattern | Meaning | Example match |\n",
    "|---|---|---|\n",
    "| `.` | Any character (except newline) | `c.t` → `cat`, `cot` |\n",
    "| `\\d` | Digit | `7` |\n",
    "| `\\w` | Word character (letter, digit, underscore) | `A`, `x`, `9`, `_` |\n",
    "| `\\s` | Whitespace (space, tab, newline) | ` ` |\n",
    "| `[abc]` | Character class: any one of `a`, `b`, `c` | `a` |\n",
    "| `[^abc]` | Negated class: any character except `a`, `b`, `c` | `d` |\n",
    "\n",
    "**Quantifiers**\n",
    "\n",
    "| Pattern | Meaning | Example |\n",
    "|---|---|---|\n",
    "| `*` | 0 or more | `ab*` → `a`, `ab`, `abb` |\n",
    "| `+` | 1 or more | `ab+` → `ab`, `abb` |\n",
    "| `?` | 0 or 1 (optional) | `colou?r` → `color`, `colour` |\n",
    "| `{n}` | Exactly n | `\\d{3}` → `123` |\n",
    "| `{n,m}` | Between n and m (inclusive) | `\\d{2,4}` → `12`, `123`, `1234` |\n",
    "| `*?` `+?` | Lazy (match as little as possible) | `<.+?>` |\n",
    "\n",
    "**Anchors**\n",
    "\n",
    "| Pattern | Meaning |\n",
    "|---|---|\n",
    "| `^` | Start of string (or of each line with `re.M`) |\n",
    "| `$` | End of string or just before a trailing newline (or end of each line with `re.M`) |\n",
    "| `\\b` | Word boundary |\n",
    "\n",
    "**Groups**\n",
    "\n",
    "| Syntax | Meaning |\n",
    "|---|---|\n",
    "| `(...)` | Capturing group |\n",
    "| `(?:...)` | Non-capturing group |\n",
    "| `(?P<name>...)` | Named group |\n",
    "| `(?=...)` | Lookahead |\n",
    "| `(?<=...)` | Lookbehind |\n",
    "\n",
    "**Flags**\n",
    "\n",
    "| Flag | Shorthand | Meaning |\n",
    "|---|---|---|\n",
    "| `re.IGNORECASE` | `re.I` | Case-insensitive matching |\n",
    "| `re.MULTILINE` | `re.M` | `^`/`$` match line start/end |\n",
    "| `re.DOTALL` | `re.S` | `.` matches newline too |\n",
    "| `re.VERBOSE` | `re.X` | Allow comments/whitespace in pattern |"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
