{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "59a8621b",
   "metadata": {},
   "source": [
    "# Text Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "9eb172a5",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "from pathlib import Path\n",
    "\n",
    "current = Path.cwd()\n",
    "for parent in [current, *current.parents]:\n",
    "    if (parent / '_config.yml').exists():\n",
    "        project_root = parent  # ← Add project root, not chapters\n",
    "        break\n",
    "else:\n",
    "    project_root = Path.cwd().parent.parent\n",
    "\n",
    "sys.path.insert(0, str(project_root))\n",
    "\n",
    "from shared import thinkpython, diagram, jupyturtle\n",
    "from shared.download import download\n",
    "\n",
    "# Register as top-level modules so direct imports work in subsequent cells\n",
    "sys.modules['thinkpython'] = thinkpython\n",
    "sys.modules['diagram'] = diagram\n",
    "sys.modules['jupyturtle'] = jupyturtle\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1887a7ae",
   "metadata": {},
   "source": [
    "At this point we have covered Python's core data structures -- lists, dictionaries, and tuples -- and some algorithms that use them.\n",
    "In this chapter, we'll use them to explore text analysis and Markov generation:\n",
    "\n",
    "* Text analysis is a way to describe the statistical relationships between the words in a document, like the probability that one word is followed by another, and\n",
    "\n",
    "* Markov generation is a way to generate new text with words and phrases similar to the original text.\n",
    "\n",
    "These algorithms are similar to parts of a Large Language Model (LLM), which is the key component of a chatbot.\n",
    "\n",
    "We'll start by counting the number of times each word appears in a book.\n",
    "Then we'll look at pairs of words, and make a list of the words that can follow each word.\n",
    "We'll make a simple version of a Markov generator, and as an exercise, you'll have a chance to make a more general version."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0e3811b8",
   "metadata": {},
   "source": [
    "## Unique words\n",
    "\n",
    "As a first step toward text analysis, let's read a book -- [*The Strange Case Of Dr. Jekyll And Mr. Hyde*](https://archive.org/stream/thestrangecaseof00043gut/pg43.txt#:~:text=The%20Project%20Gutenberg%20EBook%20of,with%20almost%20no%20restrictions%20whatsoever.) by Robert Louis Stevenson -- and count the number of unique words.\n",
    "Instructions for downloading the book are in the notebook for this chapter."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6567e1bf",
   "metadata": {
    "tags": []
   },
   "source": [
    "The following cell downloads the book from Project Gutenberg."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "4cd1c980",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Already downloaded: /Users/tcn85/workspace/py/data/pg43.txt\n"
     ]
    }
   ],
   "source": [
    "\n",
    "data_dir = project_root / 'data'\n",
    "data_dir.mkdir(parents=True, exist_ok=True) # Create the data directory if it doesn't exist\n",
    "raw_path = data_dir / 'pg43.txt'            ### This is the raw text file downloaded from Project Gutenberg\n",
    "clean_path = data_dir / 'dr_jekyll.txt'     ### This will be the cleaned text file that we will use for analysis\n",
    "\n",
    "if not raw_path.exists():\n",
    "    download('https://www.gutenberg.org/cache/epub/43/pg43.txt', str(raw_path))\n",
    "    print('Downloaded to', raw_path)\n",
    "else:\n",
    "    print('Already downloaded:', raw_path)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5465ab1d",
   "metadata": {
    "tags": []
   },
   "source": [
    "The version available from Project Gutenberg includes information about the book at the beginning and license information at the end.\n",
    "We'll use `clean_file` from Chapter 8 to remove this material and write a \"clean\" file that contains only the text of the book."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "52ebfe94",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def is_special_line(line):\n",
    "    return line.strip().startswith('*** ')  ### This is the marker for the start and end of the actual text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "49cfc352",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def clean_file(input_file, output_file):\n",
    "    # Open both files as UTF-8 so the book's curly quotes and dashes survive on any platform\n",
    "    reader = open(input_file, encoding='utf-8')\n",
    "    writer = open(output_file, 'w', encoding='utf-8')\n",
    "\n",
    "    # A file object yields one line at a time, and the second loop resumes where the first stopped\n",
    "    for line in reader:\n",
    "        if is_special_line(line):   ### First marker: the end of the Project Gutenberg header\n",
    "            break                   ### Stop skipping; the book's text starts on the next line\n",
    "\n",
    "    for line in reader:\n",
    "        if is_special_line(line):   ### Second marker: the start of the license text\n",
    "            break\n",
    "        writer.write(line)          ### Copy each line of the book to the output file\n",
    "\n",
    "    reader.close()                  # Close the input file\n",
    "    writer.close()                  # Close the output file\n",
    "\n",
    "    ### Alternative: a with statement closes both files automatically\n",
    "    # with open(input_file, encoding='utf-8') as reader, open(output_file, 'w', encoding='utf-8') as writer:\n",
    "    #     for line in reader:\n",
    "    #         if is_special_line(line):\n",
    "    #             break\n",
    "    #     for line in reader:\n",
    "    #         if is_special_line(line):\n",
    "    #             break\n",
    "    #         writer.write(line)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "44e53ce6",
   "metadata": {},
   "outputs": [],
   "source": [
    "filename = clean_path   ### dr_jekyll.txt, which will hold the cleaned text of the whole book"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "50d1fafa",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "The Strange Case Of Dr. Jekyll And Mr. Hyde\n",
      "\n",
      "by Robert Louis Stevenson\n",
      "\n",
      "\n",
      "Contents\n",
      "\n",
      "\n",
      " STORY OF THE DOOR\n",
      "\n",
      " SEARCH FOR MR. HYDE\n",
      "\n",
      " DR. JEKYLL WAS QUITE AT EASE\n",
      "\n",
      " THE CAREW MURDER CASE\n",
      "\n",
      " INCIDENT OF THE LETTER\n",
      "\n",
      " INCIDENT OF DR. LANYON\n",
      "\n"
     ]
    }
   ],
   "source": [
    "clean_file(raw_path, filename)  ### read from pg43.txt, \n",
    "                                ### write to dr_jekyll.txt\n",
    "count = 0                       ### count lines so we print only the start of the file\n",
    "for line in open(filename, encoding='utf-8'):\n",
    "    print(line, end='')\n",
    "    count += 1\n",
    "    if count > 20:              ### stop after printing the first 21 lines\n",
    "        break"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc66d7e2",
   "metadata": {},
   "source": [
    "We'll use a `for` loop to read lines from the file and `split` to divide the lines into words.\n",
    "Then, to keep track of unique words, we'll store each word as a key in a dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "16d24028",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6042"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unique_words = {}\n",
    "for line in open(filename, encoding='utf-8'):     ### filename is dr_jekyll.txt, cleaned text\n",
    "    seq = line.split()\n",
    "    for word in seq:\n",
    "        unique_words[word] = 1      ### the value is a placeholder; only the keys matter\n",
    "\n",
    "len(unique_words)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "85171a3a",
   "metadata": {},
   "source": [
    "The length of the dictionary is the number of unique words -- about `6000` by this way of counting.\n",
    "But if we inspect them, we'll see that some are not valid words.\n",
    "\n",
    "For example, let's look at the longest words in `unique_words`.\n",
    "We can use `sorted` to sort the words, passing the `len` function as the `key` keyword argument so the words are sorted by length, shortest first."
   ]
  },
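  {
   "cell_type": "markdown",
   "id": "c4e1a7f0",
   "metadata": {},
   "source": [
    "To see what `key=len` does on a smaller scale, here is a minimal sketch.\n",
    "The list `sample_words` is made up for illustration and is not taken from the book's text.\n",
    "\n",
    "```python\n",
    "# sample_words is a hypothetical list used only for illustration\n",
    "sample_words = ['door', 'a', 'gentleman', 'of', 'street']\n",
    "\n",
    "# key=len tells sorted to compare the words by length, not alphabetically\n",
    "by_length = sorted(sample_words, key=len)\n",
    "\n",
    "# a negative slice selects from the end, so these are the two longest words\n",
    "longest_two = by_length[-2:]\n",
    "print(longest_two)\n",
    "```\n",
    "\n",
    "Because `sorted` returns a new list in ascending order, the longest words end up at the end, which is why a negative slice picks them out."
   ]
  },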
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "1668e6bd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['chocolate-coloured',\n",
       " 'superiors—behold!”',\n",
       " 'coolness—frightened',\n",
       " 'gentleman—something',\n",
       " 'pocket-handkerchief.']"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sorted(unique_words, key=len)[-5:]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "795f5327",
   "metadata": {},
   "source": [
    "The slice index, `[-5:]`, selects the last `5` elements of the sorted list, which are the longest words.\n",
    "\n",
    "The list includes hyphenated words, like \"chocolate-coloured\".\n",
    "But three of the longest \"words\" are actually two words joined by a dash.\n",
    "And some include punctuation like periods, exclamation points, and quotation marks.\n",
    "\n",
    "So, before we move on, let's deal with dashes and other punctuation."
   ]
  },
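  {
   "cell_type": "markdown",
   "id": "d7b25e93",
   "metadata": {},
   "source": [
    "One possible approach, sketched here as an assumption and not as the chapter's definitive method, is to replace each long dash with a space before splitting, then strip punctuation from both ends of each word.\n",
    "The names `split_line`, `clean_word`, `EM_DASH`, and `PUNCTUATION` are hypothetical, and the sketch assumes the dashes in the text are em dashes and the quotation marks are curly quotes.\n",
    "\n",
    "```python\n",
    "import string\n",
    "\n",
    "EM_DASH = chr(0x2014)   # the long dash that joins two words in the text\n",
    "\n",
    "# string.punctuation covers ASCII only, so add the curly quotes (an assumption about this text)\n",
    "PUNCTUATION = string.punctuation + chr(0x201C) + chr(0x201D) + chr(0x2018) + chr(0x2019)\n",
    "\n",
    "def split_line(line):\n",
    "    # replace each em dash with a space so the words on either side separate\n",
    "    return line.replace(EM_DASH, ' ').split()\n",
    "\n",
    "def clean_word(word):\n",
    "    # strip removes punctuation from both ends but leaves internal hyphens alone\n",
    "    return word.strip(PUNCTUATION)\n",
    "\n",
    "sample = 'coolness' + EM_DASH + 'frightened too, I could see that' + EM_DASH + 'but'\n",
    "cleaned = [clean_word(word) for word in split_line(sample)]\n",
    "print(cleaned)\n",
    "```\n",
    "\n",
    "Stripping only the ends keeps a hyphenated word like `pocket-handkerchief` intact while removing its trailing period."
   ]
  },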
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "b389e6d5",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: Counting Unique Words\n",
    "\n",
    "text = \"to be or not to be that is the question to be\"\n",
    "# 1. Build a dictionary where each key is a unique word from 'text'\n",
    "# 2. Print the total number of unique words\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "363b5528",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "8\n"
     ]
    }
   ],
   "source": [
    "# Solution\n",
    "text = \"to be or not to be that is the question to be\"\n",
    "unique = {}\n",
    "for word in text.split():\n",
    "    unique[word] = 1\n",
    "print(len(unique))  # 8\n"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
