{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "bf89fafa",
   "metadata": {},
   "source": [
    "# Punctuation\n",
    "\n",
    "To identify the words in the text, we need to deal with two issues:\n",
    "\n",
    "* When an em dash (`—`) appears in a line, we should replace it with a space so that `split` separates the words on either side of it.\n",
    "\n",
    "* After splitting the words, we can use `strip` to remove punctuation.\n",
    "\n",
    "To handle the first issue, we can use the following function, which takes a string, replaces dashes with spaces, splits the string, and returns the resulting list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd455119",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "from pathlib import Path\n",
    "\n",
    "# Find project root by looking for _config.yml\n",
    "current = Path.cwd()\n",
    "for parent in [current, *current.parents]:\n",
    "    if (parent / '_config.yml').exists():\n",
    "        project_root = parent\n",
    "        break\n",
    "else:\n",
    "    project_root = Path.cwd().parent.parent\n",
    "\n",
    "# Add project root to path\n",
    "sys.path.insert(0, str(project_root))\n",
    "\n",
    "# Import shared teaching helpers and cell magics\n",
    "from shared import thinkpython, diagram, jupyturtle, structshape\n",
    "from shared.download import download\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "770258b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "import unicodedata\n",
    "\n",
    "# Text-analysis setup shared by the split sections.\n",
    "data_dir = project_root / 'data'\n",
    "data_dir.mkdir(parents=True, exist_ok=True)\n",
    "raw_path = data_dir / 'pg43.txt'\n",
    "filename = data_dir / 'dr_jekyll.txt'\n",
    "\n",
    "if not filename.exists():\n",
    "    if not raw_path.exists():\n",
    "        download('https://www.gutenberg.org/cache/epub/43/pg43.txt', str(raw_path))\n",
    "\n",
    "    with open(raw_path, encoding='utf-8') as reader, open(filename, 'w', encoding='utf-8') as writer:\n",
    "        in_body = False\n",
    "        for line in reader:\n",
    "            if line.startswith('***'):\n",
    "                if not in_body:\n",
    "                    in_body = True\n",
    "                    continue\n",
    "                break\n",
    "            if in_body:\n",
    "                writer.write(line)\n",
    "\n",
    "def split_line(line):\n",
    "    return line.replace('—', ' ').split()\n",
    "\n",
    "punc_marks = {}\n",
    "for line in open(filename, encoding='utf-8'):\n",
    "    for char in line:\n",
    "        category = unicodedata.category(char)\n",
    "        if category.startswith('P'):\n",
    "            punc_marks[char] = 1\n",
    "\n",
    "punctuation = ''.join(punc_marks)\n",
    "\n",
    "def clean_word(word):\n",
    "    return word.strip(punctuation).lower()\n",
    "\n",
    "word_counter = {}\n",
    "for line in open(filename, encoding='utf-8'):\n",
    "    for word in split_line(line):\n",
    "        word = clean_word(word)\n",
    "        if word:\n",
    "            word_counter[word] = word_counter.get(word, 0) + 1\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "ed5f0a43",
   "metadata": {},
   "outputs": [],
   "source": [
    "def split_line(line):\n",
    "    return line.replace('—', ' ').split()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5decdec",
   "metadata": {},
   "source": [
    "Notice that `split_line` only replaces em dashes (`—`), not hyphens (`-`).\n",
    "Here's an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "a9df2aeb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['coolness', 'frightened']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "split_line('coolness—frightened')"
   ]
  },
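  {
   "cell_type": "markdown",
   "id": "7a1c0e51",
   "metadata": {},
   "source": [
    "To see the contrast with hyphens, here is a small self-contained sketch. It repeats the definition of `split_line` and applies it to a made-up phrase (not a quotation from the book) that contains both a hyphen and an em dash.\n",
    "\n",
    "```python\n",
    "def split_line(line):\n",
    "    # Replace em dashes with spaces, then split on whitespace.\n",
    "    return line.replace('—', ' ').split()\n",
    "\n",
    "# Illustrative phrase, not taken from the book.\n",
    "phrase = 'a pocket-handkerchief—quite clean'\n",
    "words = split_line(phrase)\n",
    "print(words)\n",
    "```\n",
    "\n",
    "The hyphenated word stays in one piece, while the em dash becomes a word boundary."
   ]
  },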
  {
   "cell_type": "markdown",
   "id": "0d9eb318",
   "metadata": {},
   "source": [
    "Now, to remove punctuation from the beginning and end of each word, we can use `strip`, but we need a list of characters that are considered punctuation.\n",
    "\n",
    "Characters in Python strings are in Unicode, which is an international standard used to represent letters in nearly every alphabet, numbers, symbols, punctuation marks, and more.\n",
    "The `unicodedata` module provides a `category` function we can use to tell which characters are punctuation.\n",
    "Given a character, it returns a string that identifies the category the character belongs to."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "b138b123",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Lu'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import unicodedata\n",
    "\n",
    "unicodedata.category('A')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "994835ea",
   "metadata": {},
   "source": [
    "The category string of `'A'` is `'Lu'` -- the `'L'` means it is a letter and the `'u'` means it is uppercase.\n",
    "\n",
    "The category string of `'.'` is `'Po'` -- the `'P'` means it is punctuation and the `'o'` means its subcategory is \"other\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "fe65df44",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Po'"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unicodedata.category('.')"
   ]
  },
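  {
   "cell_type": "markdown",
   "id": "3e9d2b64",
   "metadata": {},
   "source": [
    "Punctuation has several subcategories besides \"other\". As a quick check, the following sketch looks up the category of a few characters; the sample characters are chosen only for illustration.\n",
    "\n",
    "```python\n",
    "import unicodedata\n",
    "\n",
    "# A few sample characters, chosen for illustration.\n",
    "samples = ['.', '-', '—', '(', ')', '“', '”', '_', 'a', '7']\n",
    "categories = {char: unicodedata.category(char) for char in samples}\n",
    "\n",
    "for char, category in categories.items():\n",
    "    # Categories that start with 'P' are punctuation.\n",
    "    print(char, category, category.startswith('P'))\n",
    "```\n",
    "\n",
    "The hyphen and the em dash are both in category `'Pd'` (dash punctuation), the opening and closing parentheses are `'Ps'` and `'Pe'`, and the letter and the digit fall outside the `'P'` categories."
   ]
  },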
  {
   "cell_type": "markdown",
   "id": "03773b9b",
   "metadata": {},
   "source": [
    "We can find the punctuation marks in the book by checking for characters with categories that begin with `'P'`.\n",
    "The following loop stores the unique punctuation marks in a dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "b47a87cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "punc_marks = {}\n",
    "for line in open(filename, encoding='utf-8'):   ### filename is dr_jekyll.txt, cleaned text\n",
    "    for char in line:\n",
    "        category = unicodedata.category(char)\n",
    "        if category.startswith('P'):\n",
    "            punc_marks[char] = 1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e6741dfa",
   "metadata": {},
   "source": [
    "To collect the punctuation marks in a form that `strip` accepts, we can join the keys of the dictionary into a string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "348949be",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ".’;,-“”:?—‘!()_\n"
     ]
    }
   ],
   "source": [
    "punctuation = ''.join(punc_marks)\n",
    "print(punctuation)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6af8d5a2",
   "metadata": {},
   "source": [
    "Now that we know which characters in the book are punctuation, we can write a function that takes a word, strips punctuation from the beginning and end, and converts it to lowercase."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "06121901",
   "metadata": {},
   "outputs": [],
   "source": [
    "def clean_word(word):\n",
    "    return word.strip(punctuation).lower()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58a78cb1",
   "metadata": {},
   "source": [
    "Here's an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "881ed9f8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'behold'"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clean_word('“Behold!”')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "314e4fbd",
   "metadata": {},
   "source": [
    "Because `strip` removes characters only from the beginning and end of a string, it leaves hyphenated words alone."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "ab5d2fed",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'pocket-handkerchief'"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clean_word('pocket-handkerchief')"
   ]
  },
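  {
   "cell_type": "markdown",
   "id": "5b0f8c27",
   "metadata": {},
   "source": [
    "The argument to `strip` is treated as a set of characters, not as a prefix or suffix to match. Here is a minimal sketch of that behavior, using a hypothetical `marks` string in place of the `punctuation` string computed from the book.\n",
    "\n",
    "```python\n",
    "# Hypothetical set of marks, standing in for `punctuation`.\n",
    "marks = '.,;:!?-'\n",
    "\n",
    "# Characters are removed from each end until one is not in `marks`.\n",
    "stripped = '--well-known?!'.strip(marks)\n",
    "print(stripped)\n",
    "\n",
    "# Marks in the interior of the string are never touched.\n",
    "inner = '...one.two...'.strip(marks)\n",
    "print(inner)\n",
    "```\n",
    "\n",
    "Stripping stops at the first character on each end that is not in `marks`, so the interior hyphen and period survive."
   ]
  },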
  {
   "cell_type": "markdown",
   "id": "99050f8a",
   "metadata": {},
   "source": [
    "Now here's a loop that uses `split_line` and `clean_word` to identify the unique words in the book."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "2fdfb936",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4005"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unique_words2 = {}\n",
    "for line in open(filename, encoding='utf-8'):\n",
    "    for word in split_line(line):   ### split_line handles the em dash\n",
    "        word = clean_word(word)     ### removes punctuation, lowercases\n",
    "        unique_words2[word] = 1\n",
    "\n",
    "len(unique_words2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "992e5466",
   "metadata": {},
   "source": [
    "With this stricter definition of what a word is, there are about 4000 unique words. And we can confirm that the list of longest words has been cleaned up.\n",
    "\n",
    "`key=len` tells `sorted()` to sort by the length of each word. It calls `len()` on each word and sorts by that number, from shortest to longest by default, so the slice `[-5:]` selects the five longest words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "3104d191",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['circumscription',\n",
       " 'unimpressionable',\n",
       " 'fellow-creatures',\n",
       " 'chocolate-coloured',\n",
       " 'pocket-handkerchief']"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sorted(unique_words2, key=len)[-5:]"
   ]
  },
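  {
   "cell_type": "markdown",
   "id": "9c4a6d13",
   "metadata": {},
   "source": [
    "As a small illustration of `key=len`, here is a sketch with a short, made-up word list. Because `sorted` is stable, words of equal length keep their original relative order.\n",
    "\n",
    "```python\n",
    "# Made-up word list for illustration.\n",
    "sample_words = ['pear', 'fig', 'banana', 'kiwi', 'apple']\n",
    "\n",
    "# Sort from shortest to longest.\n",
    "by_length = sorted(sample_words, key=len)\n",
    "print(by_length)\n",
    "\n",
    "# The last two elements are the two longest words.\n",
    "two_longest = by_length[-2:]\n",
    "print(two_longest)\n",
    "\n",
    "# reverse=True sorts from longest to shortest instead.\n",
    "longest_first = sorted(sample_words, key=len, reverse=True)\n",
    "print(longest_first[:2])\n",
    "```\n",
    "\n",
    "Slicing the end of the ascending list and slicing the start of the descending list select the same words, in opposite order."
   ]
  },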
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "6e9b22ce",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: Cleaning Words\n",
    "\n",
    "words = ['\"Hello,\"', 'world!', 'chocolate-coloured', '\"Behold!\"', 'pocket-handkerchief']\n",
    "# Use clean_word() to process each word and print the cleaned version.\n",
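    "# Note: clean_word() lowercases and strips only the marks in `punctuation` from both ends,\n",
    "# so straight double quotes, which are not among the marks found in the book, stay in place.\n",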
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "66d65b4b",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\"hello,\"\n",
      "world\n",
      "chocolate-coloured\n",
      "\"behold!\"\n",
      "pocket-handkerchief\n"
     ]
    }
   ],
   "source": [
    "# Solution\n",
    "words = ['\"Hello,\"', 'world!', 'chocolate-coloured', '\"Behold!\"', 'pocket-handkerchief']\n",
    "for word in words:\n",
    "    print(clean_word(word))\n"
   ]
  },
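  {
   "cell_type": "markdown",
   "id": "e2d71f08",
   "metadata": {},
   "source": [
    "In the output above, the straight double quotes are not removed: the `punctuation` string printed earlier contains curly quotation marks but no straight double quote, and `strip` stops at the first character it does not recognize. One possible alternative, sketched here with a hypothetical helper named `clean_word_any` (not part of this chapter's code), strips any leading or trailing character whose Unicode category begins with `'P'`.\n",
    "\n",
    "```python\n",
    "import unicodedata\n",
    "\n",
    "def is_punctuation(char):\n",
    "    # True if the character's Unicode category starts with 'P'.\n",
    "    return unicodedata.category(char).startswith('P')\n",
    "\n",
    "def clean_word_any(word):\n",
    "    # Hypothetical variant of clean_word, shown for illustration only.\n",
    "    start, end = 0, len(word)\n",
    "    while start < end and is_punctuation(word[start]):\n",
    "        start += 1\n",
    "    while end > start and is_punctuation(word[end - 1]):\n",
    "        end -= 1\n",
    "    return word[start:end].lower()\n",
    "\n",
    "print(clean_word_any('\"Hello,\"'))\n",
    "print(clean_word_any('pocket-handkerchief'))\n",
    "```\n",
    "\n",
    "This version does not depend on which marks happen to appear in a particular book, at the cost of checking each end of the word one character at a time."
   ]
  },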
  {
   "cell_type": "markdown",
   "id": "8014c330",
   "metadata": {},
   "source": [
    "Now let's see how many times each word is used."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
