{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "7ef40180",
   "metadata": {},
   "source": [
    "# Word Frequencies\n",
    "\n",
    "The following loop computes the frequency of each unique word that is very similar to the {ref}`value_counts() <value_counts>` function in the dictionary section when we count the letter frequency of \"brontosaurus\" and \"mississippi\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e56469d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "from pathlib import Path\n",
    "\n",
    "# Find project root by looking for _config.yml\n",
    "current = Path.cwd()\n",
    "for parent in [current, *current.parents]:\n",
    "    if (parent / '_config.yml').exists():\n",
    "        project_root = parent\n",
    "        break\n",
    "else:\n",
    "    project_root = Path.cwd().parent.parent\n",
    "\n",
    "# Add project root to path\n",
    "sys.path.insert(0, str(project_root))\n",
    "\n",
    "# Import shared teaching helpers and cell magics\n",
    "from shared import thinkpython, diagram, jupyturtle, structshape\n",
    "from shared.download import download\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "55ace77f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "import unicodedata\n",
    "\n",
    "# Text-analysis setup shared by the split sections.\n",
    "data_dir = project_root / 'data'\n",
    "data_dir.mkdir(parents=True, exist_ok=True)\n",
    "raw_path = data_dir / 'pg43.txt'\n",
    "filename = data_dir / 'dr_jekyll.txt'\n",
    "\n",
    "if not filename.exists():\n",
    "    if not raw_path.exists():\n",
    "        download('https://www.gutenberg.org/cache/epub/43/pg43.txt', str(raw_path))\n",
    "\n",
    "    with open(raw_path, encoding='utf-8') as reader, open(filename, 'w', encoding='utf-8') as writer:\n",
    "        in_body = False\n",
    "        for line in reader:\n",
    "            if line.startswith('***'):\n",
    "                if not in_body:\n",
    "                    in_body = True\n",
    "                    continue\n",
    "                break\n",
    "            if in_body:\n",
    "                writer.write(line)\n",
    "\n",
    "def split_line(line):\n",
    "    return line.replace('—', ' ').split()\n",
    "\n",
    "punc_marks = {}\n",
    "for line in open(filename, encoding='utf-8'):\n",
    "    for char in line:\n",
    "        category = unicodedata.category(char)\n",
    "        if category.startswith('P'):\n",
    "            punc_marks[char] = 1\n",
    "\n",
    "punctuation = ''.join(punc_marks)\n",
    "\n",
    "def clean_word(word):\n",
    "    return word.strip(punctuation).lower()\n",
    "\n",
    "word_counter = {}\n",
    "for line in open(filename, encoding='utf-8'):\n",
    "    for word in split_line(line):\n",
    "        word = clean_word(word)\n",
    "        if word:\n",
    "            word_counter[word] = word_counter.get(word, 0) + 1\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "4fba7d1c",
   "metadata": {},
   "outputs": [],
   "source": [
    "word_counter = {}\n",
    "for line in open(filename):\n",
    "    for word in split_line(line):\n",
    "        word = clean_word(word)\n",
    "        if word not in word_counter:\n",
    "            word_counter[word] = 1\n",
    "        else:\n",
    "            word_counter[word] += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd680b81",
   "metadata": {},
   "source": [
    "The first time we see a word, we initialize its frequency to `1`. If we see the same word again later, we increment its frequency.\n",
    "\n",
    "To see which words appear most often, we can use `items` to get the key-value pairs from `word_counter`, and sort them by the second element of the pair, which is the frequency.\n",
    "First we'll define a function that selects the second element just like the tuple chapter {ref}`second_element() <second_element>`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "4be34c95",
   "metadata": {},
   "outputs": [],
   "source": [
    "def second_element(t):\n",
    "    return t[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b15a5bd6",
   "metadata": {},
   "source": [
    "Now we can use `sorted` with two keyword arguments:\n",
    "\n",
    "* `key=second_element` means the items will be sorted according to the frequencies of the words.\n",
    "\n",
    "* `reverse=True` means the items will be sorted in reverse order, with the most frequent words first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "8efe7c4c",
   "metadata": {},
   "outputs": [],
   "source": [
    "items = sorted(word_counter.items(), key=second_element, reverse=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db6812e2",
   "metadata": {},
   "source": [
    "Here are the five most frequent words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "79c17341",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1614\tthe\n",
      "972\tand\n",
      "941\tof\n",
      "640\tto\n",
      "640\ti\n"
     ]
    }
   ],
   "source": [
    "for word, freq in items[:5]:\n",
    "    print(freq, word, sep='\\t')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "e3edd40e",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: Word Frequency Counter\n",
    "\n",
    "text = \"the cat sat on the mat and the cat sat\"\n",
    "# 1. Build a word frequency dictionary (word → count) for 'text'\n",
    "# 2. Print the top 3 most frequent words with their counts (tab-separated)\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "9d80bd53",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3\tthe\n",
      "2\tcat\n",
      "2\tsat\n"
     ]
    }
   ],
   "source": [
    "# Solution\n",
    "text = \"the cat sat on the mat and the cat sat\"\n",
    "counter = {}\n",
    "for word in text.split():\n",
    "    if word not in counter:\n",
    "        counter[word] = 1\n",
    "    else:\n",
    "        counter[word] += 1\n",
    "items = sorted(counter.items(), key=second_element, reverse=True)\n",
    "for word, freq in items[:3]:\n",
    "    print(freq, word, sep='\\t')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "551e81bb",
   "metadata": {},
   "source": [
    "In the next section, we'll encapsulate this loop in a function.\n",
    "And we'll use it to demonstrate a new feature -- optional parameters."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "45243ccc",
   "metadata": {},
   "source": [
    "## Optional Parameters\n",
    "\n",
    "We've used built-in functions that take optional parameters.\n",
    "For example, `round` takes an optional parameters called `ndigits` that indicates how many decimal places to keep."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "838bcb4f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3.142"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "round(3.141592653589793, ndigits=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6ae60945",
   "metadata": {},
   "source": [
    "But it's not just built-in functions -- we can write functions with optional parameters, too.\n",
    "For example, the following function takes two parameters, `word_counter` and `num`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "90c45e7e",
   "metadata": {},
   "outputs": [],
   "source": [
    "def print_most_common(word_counter, num=5):\n",
    "    items = sorted(word_counter.items(), key=second_element, reverse=True)\n",
    "\n",
    "    for word, freq in items[:num]:\n",
    "        print(freq, word, sep='\\t')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78cb1531",
   "metadata": {},
   "source": [
    "The second parameter looks like an assignment statement, but it's not -- it's an optional parameter.\n",
    "\n",
    "If you call this function with one argument, `num` gets the **default value**, which is `5`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "e106be95",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1614\tthe\n",
      "972\tand\n",
      "941\tof\n",
      "640\tto\n",
      "640\ti\n"
     ]
    }
   ],
   "source": [
    "print_most_common(word_counter)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "29753ad6",
   "metadata": {},
   "source": [
    "If you call this function with two arguments, the second argument gets assigned to `num` instead of the default value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "8101a510",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1614\tthe\n",
      "972\tand\n",
      "941\tof\n"
     ]
    }
   ],
   "source": [
    "print_most_common(word_counter, 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9bf907b",
   "metadata": {},
   "source": [
    "In that case, we would say the optional argument **overrides** the default value.\n",
    "\n",
    "If a function has both required and optional parameters, all of the required parameters have to come first, followed by the optional ones."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c046117b",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%expect SyntaxError\n",
    "def bad_function(n=5, word_counter):\n",
    "    return None\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "bd2100a1",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: Function with Optional Parameter\n",
    "\n",
    "word_counter = {'the': 3, 'cat': 2, 'sat': 2, 'on': 1, 'mat': 1, 'and': 1}\n",
    "\n",
    "# write a function called print_least_common that takes a word_counter dictionary\n",
    "# and an optional parameter num (default 3)\n",
    "# the function should print the num least frequent words in the word_counter,\n",
    "# one per line, with the frequency and word separated by a tab\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "695ba623",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\ton\n",
      "1\tmat\n",
      "1\tand\n",
      "\n",
      "1\ton\n",
      "1\tmat\n",
      "1\tand\n",
      "2\tcat\n",
      "2\tsat\n"
     ]
    }
   ],
   "source": [
    "# Solution\n",
    "\n",
    "word_counter = {'the': 3, 'cat': 2, 'sat': 2, 'on': 1, 'mat': 1, 'and': 1}\n",
    "\n",
    "def print_least_common(word_counter, num=3):\n",
    "    items = sorted(word_counter.items(), key=lambda x: x[1])\n",
    "    for word, freq in items[:num]:\n",
    "        print(freq, word, sep='\\t')\n",
    "\n",
    "\n",
    "print_least_common(word_counter)\n",
    "print()\n",
    "print_least_common(word_counter, 5)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3f450df2",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Dictionary Subtraction\n",
    "\n",
    "Suppose we want to spell-check a book -- that is, find a list of words that might be misspelled.\n",
    "One way to do that is to find words in the book that don't appear in a list of valid words. Now we'll use this list to spell-check Robert Louis Stevenson. We can think of this problem as set subtraction -- that is, we want to find all the words from one set (the words in the book) that are not in the other (the words in the list)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a3804d82",
   "metadata": {
    "tags": []
   },
   "source": [
    "The following cell downloads the word list. \n",
    "\n",
    "(*I am using `../../data` as the data folder and your data folder maybe different from mine if you download the notebook and place it in your project folder. In that case you do not have to specify the data folder if you have the `words.txt` downloaded into the same folder as the notebook.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "edd8ff1c",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1614\tthe\n",
      "972\tand\n",
      "941\tof\n",
      "640\tto\n",
      "640\ti\n"
     ]
    }
   ],
   "source": [
    "if __name__ == '__main__':  ### This is a common Python idiom that checks if the script is being run directly (as the main program) rather than imported as a module. If this condition is true, the code inside this block will be executed.\n",
    "    print_most_common(word_counter)\n",
    "    download('https://raw.githubusercontent.com/AllenDowney/ThinkPython/v3/words.txt', '../../data');"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a46556c",
   "metadata": {},
   "source": [
    "We can read the contents of `words.txt` and split it into a list of strings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "67ef3e08",
   "metadata": {},
   "outputs": [],
   "source": [
    "### This is a common way to read a file and split it into a list of words. \n",
    "### However, it does not properly close the file after reading, which can \n",
    "### lead to resource leaks. Using a with statement is a better practice as \n",
    "### it ensures that the file is properly closed after its suite finishes, \n",
    "### even if an error occurs.\n",
    "# word_list = open('../../data/words.txt').read().split()   \n",
    "\n",
    "with open('../../data/words.txt') as f:\n",
    "    word_list = f.read().split()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22becbab",
   "metadata": {},
   "source": [
    "Then we'll store the words as keys in a dictionary so we can use the `in` operator to check quickly whether a word is valid."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "471d58e9",
   "metadata": {},
   "outputs": [],
   "source": [
    "valid_words = {}            ### another dictionary to store the valid words from the word list\n",
    "for word in word_list:\n",
    "    valid_words[word] = 1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "94cc7c61",
   "metadata": {},
   "source": [
    "Now, to identify words that appear in the book but not in the word list, we'll use `subtract`, which takes two dictionaries as parameters and returns a new dictionary that contains all the keys from one that are not in the other."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "4d4c3538",
   "metadata": {},
   "outputs": [],
   "source": [
    "def subtract(d1, d2):\n",
    "    res = {}\n",
    "    for key in d1:\n",
    "        if key not in d2:\n",
    "            res[key] = d1[key]\n",
    "    return res"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e70c63b4",
   "metadata": {},
   "source": [
    "Here's how we use it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "8b42e014",
   "metadata": {},
   "outputs": [],
   "source": [
    "diff = subtract(word_counter, valid_words)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8ada7bd",
   "metadata": {},
   "source": [
    "To get a sample of words that might be misspelled, we can print the most common words in `diff`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "id": "f48be152",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "640\ti\n",
      "628\ta\n",
      "128\tutterson\n",
      "124\tmr\n",
      "98\thyde\n"
     ]
    }
   ],
   "source": [
    "print_most_common(diff)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "deeec418",
   "metadata": {},
   "source": [
    "The most common \"misspelled\" words are mostly names and a few single-letter words (Mr. Utterson is Dr. Jekyll's friend and lawyer).\n",
    "\n",
    "If we select words that only appear once, they are more likely to be actual misspellings.\n",
    "We can do that by looping through the items and making a list of words with frequency `1`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "5716f967",
   "metadata": {},
   "outputs": [],
   "source": [
    "singletons = []\n",
    "for word, freq in diff.items():\n",
    "    if freq == 1:\n",
    "        singletons.append(word)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "98ae9281",
   "metadata": {},
   "source": [
    "Here are the last few elements of the list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "b37219f5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['gesticulated', 'abjection', 'circumscription', 'reindue', 'fearstruck']"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "singletons[-5:]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5040834",
   "metadata": {},
   "source": [
    "Most of them are valid words that are not in the word list.\n",
    "But `'reindue'` appears to be a misspelling of `'reinduce'`, so at least we found one legitimate error."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "91f9488c",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: Dictionary Subtraction\n",
    "\n",
    "d1 = {'apple': 3, 'banana': 1, 'cherry': 4, 'date': 2}\n",
    "d2 = {'banana': 1, 'cherry': 4}\n",
    "# Use subtract(d1, d2) to find keys in d1 that are not in d2.\n",
    "# Print the resulting dictionary.\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "5f9577f4",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'apple': 3, 'date': 2}\n"
     ]
    }
   ],
   "source": [
    "# Solution\n",
    "d1 = {'apple': 3, 'banana': 1, 'cherry': 4, 'date': 2}\n",
    "d2 = {'banana': 1, 'cherry': 4}\n",
    "result = subtract(d1, d2)\n",
    "print(result)   # {'apple': 3, 'date': 2}\n"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
