{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "afcbbe19",
   "metadata": {},
   "source": [
    "# Random numbers\n",
    "\n",
    "As a step toward Markov text generation, next we'll choose a random sequence of words from `word_counter`.\n",
    "But first let's talk about randomness.\n",
    "\n",
    "Given the same inputs, most computer programs are **deterministic**, which means they generate the same outputs every time.\n",
    "Determinism is usually a good thing, since we expect the same calculation to yield the same result.\n",
    "For some applications, though, we want the computer to be unpredictable.\n",
    "Games are one example, but there are more.\n",
    "\n",
    "Making a program truly nondeterministic turns out to be difficult, but there are ways to fake it.\n",
    "One is to use algorithms that generate **pseudorandom** numbers.\n",
    "Pseudorandom numbers are not truly random because they are generated by a deterministic computation, but just by looking at the numbers it is all but impossible to distinguish them from random.\n",
    "\n",
    "The `random` module provides functions that generate pseudorandom numbers -- which I will simply call \"random\" from here on.\n",
    "We can import it like this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e0a1472c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "from pathlib import Path\n",
    "\n",
    "# Find project root by looking for _config.yml\n",
    "current = Path.cwd()\n",
    "for parent in [current, *current.parents]:\n",
    "    if (parent / '_config.yml').exists():\n",
    "        project_root = parent\n",
    "        break\n",
    "else:\n",
    "    project_root = Path.cwd().parent.parent\n",
    "\n",
    "# Add project root to path\n",
    "sys.path.insert(0, str(project_root))\n",
    "\n",
    "# Import shared teaching helpers and cell magics\n",
    "from shared import thinkpython, diagram, jupyturtle, structshape\n",
    "from shared.download import download\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "902702b3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "import unicodedata\n",
    "\n",
    "# Text-analysis setup shared by the split sections.\n",
    "data_dir = project_root / 'data'\n",
    "data_dir.mkdir(parents=True, exist_ok=True)\n",
    "raw_path = data_dir / 'pg43.txt'\n",
    "filename = data_dir / 'dr_jekyll.txt'\n",
    "\n",
    "if not filename.exists():\n",
    "    if not raw_path.exists():\n",
    "        download('https://www.gutenberg.org/cache/epub/43/pg43.txt', str(raw_path))\n",
    "\n",
    "    with open(raw_path, encoding='utf-8') as reader, open(filename, 'w', encoding='utf-8') as writer:\n",
    "        in_body = False\n",
    "        for line in reader:\n",
    "            if line.startswith('***'):\n",
    "                if not in_body:\n",
    "                    in_body = True\n",
    "                    continue\n",
    "                break\n",
    "            if in_body:\n",
    "                writer.write(line)\n",
    "\n",
    "def split_line(line):\n",
    "    return line.replace('—', ' ').split()\n",
    "\n",
    "punc_marks = {}\n",
    "for line in open(filename, encoding='utf-8'):\n",
    "    for char in line:\n",
    "        category = unicodedata.category(char)\n",
    "        if category.startswith('P'):\n",
    "            punc_marks[char] = 1\n",
    "\n",
    "punctuation = ''.join(punc_marks)\n",
    "\n",
    "def clean_word(word):\n",
    "    return word.strip(punctuation).lower()\n",
    "\n",
    "word_counter = {}\n",
    "for line in open(filename, encoding='utf-8'):\n",
    "    for word in split_line(line):\n",
    "        word = clean_word(word)\n",
    "        if word:\n",
    "            word_counter[word] = word_counter.get(word, 0) + 1\n",
    "\n",
    "def second_element(item):\n",
    "    return item[1]\n",
    "\n",
    "def print_most_common(counter, num=5):\n",
    "    items = sorted(counter.items(), key=second_element, reverse=True)\n",
    "    for key, freq in items[:num]:\n",
    "        print(freq, key, sep='\\t')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "id": "75b548a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "import random"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "id": "2bfa31ae",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# this cell initializes the random number generator so it \n",
    "# generates the same sequence each time the notebook runs.\n",
    "\n",
    "random.seed(4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8cbbd7f8",
   "metadata": {},
   "source": [
    "The `random` module provides a function called `choice` that chooses an element from a list at random, with every element having the same probability of being chosen."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "id": "6f5d5c1c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t = [1, 2, 3]\n",
    "random.choice(t)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57c15af2",
   "metadata": {},
   "source": [
    "If you call the function again, you might get the same element again, or a different one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "id": "1445068b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2 1 3 2 2 1 1 1 1 2 "
     ]
    }
   ],
   "source": [
    "for i in range(10):\n",
    "    print(random.choice(t), end=' ')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f0c2572",
   "metadata": {},
   "source": [
    "In the long run, we expect to get every element about the same number of times.\n",
    "\n",
    "If you use `choice` with a dictionary, you get a `KeyError`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4fc47ecd",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%expect KeyError\n",
    "random.choice(word_counter)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "592722f3",
   "metadata": {},
   "source": [
    "To choose a random key, you have to put the keys in a list and then call `choice`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "id": "91ae9d4c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'estimation'"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "words = list(word_counter)\n",
    "random.choice(words)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "172d72f6",
   "metadata": {},
   "source": [
    "If we generate a random sequence of words, it doesn't make much sense."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "8bf595c1",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "withdrew effulgence moral would hitherto seclusion "
     ]
    }
   ],
   "source": [
    "for i in range(6):\n",
    "    word = random.choice(words)\n",
    "    print(word, end=' ')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0e2fbc4",
   "metadata": {},
   "source": [
    "Part of the problem is that we are not taking into account that some words are more common than others.\n",
    "The results will be better if we choose words with different \"**weights**\", so that some are chosen more often than others.\n",
    "\n",
    "If we use the values from `word_counter` as weights, each word is chosen with a probability that depends on its frequency."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "22953b65",
   "metadata": {},
   "outputs": [],
   "source": [
    "weights = word_counter.values()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5098bf93",
   "metadata": {},
   "source": [
    "The `random` module provides another function called `choices` that takes weights as an optional argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "id": "1c7cdf4d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['said']"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "random.choices(words, weights=weights)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a3341e84",
   "metadata": {},
   "source": [
    "And it takes another optional argument, `k`, that specifies the number of words to select."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "id": "a7a3aa42",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['to', 'was', 'and', 'a', 'truly', 'orders']"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "random_words = random.choices(words, weights=weights, k=6)\n",
    "random_words"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e57e6f3d",
   "metadata": {},
   "source": [
    "The result is a list of strings that we can join into something that's looks more like a sentence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "id": "c4286fb3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'to was and a truly orders'"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "' '.join(random_words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "d938781f",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: Weighted Random Selection\n",
    "\n",
    "import random\n",
    "fruits = ['apple', 'banana', 'cherry']\n",
    "weights = [10, 3, 1]   # apple is most common\n",
    "# 1. Use random.choices() to select 6 fruits with the given weights\n",
    "# 2. Print the result\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "635bbe4b",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['banana', 'banana', 'apple', 'apple', 'apple', 'banana']\n"
     ]
    }
   ],
   "source": [
    "# Solution\n",
    "import random\n",
    "fruits = ['apple', 'banana', 'cherry']\n",
    "weights = [10, 3, 1]\n",
    "print(random.choices(fruits, weights=weights, k=6))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c7a35dff",
   "metadata": {},
   "source": [
    "If you choose words from the book at random, you get a sense of the vocabulary, but a series of random words seldom makes sense because there is no relationship between successive words.\n",
    "For example, in a real sentence you expect an article like \"the\" to be followed by an adjective or a noun, and probably not a verb or adverb.\n",
    "So the next step is to look at these relationships between words."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0921dd53",
   "metadata": {},
   "source": [
    "## Bigrams\n",
    "\n",
    "Instead of looking at one word at a time, now we'll look at sequences of two words, which are called **bigrams**.\n",
    "A sequence of three words is called a **trigram**, and a sequence with some unspecified number of words is called an **n-gram**.\n",
    "\n",
    "Let's write a program that finds all of the bigrams in the book and the number of times each one appears.\n",
    "To store the results, we'll use a dictionary where\n",
    "\n",
    "* The keys are tuples of strings that represent bigrams, and \n",
    "\n",
    "* The values are integers that represent frequencies.\n",
    "\n",
    "Let's call it `bigram_counter`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "id": "d8ee02f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "bigram_counter = {}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "33f97a2a",
   "metadata": {},
   "source": [
    "The following function takes a list of two strings as a parameter.\n",
    "First it makes a tuple of the two strings, which can be used as a key in a dictionary.\n",
    "Then it adds the key to `bigram_counter`, if it doesn't exist, or increments the frequency if it does."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "id": "bfdb1de1",
   "metadata": {},
   "outputs": [],
   "source": [
    "def count_bigram(bigram):\n",
    "    key = tuple(bigram)\n",
    "    if key not in bigram_counter:\n",
    "        bigram_counter[key] = 1\n",
    "    else:\n",
    "        bigram_counter[key] += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c30f429",
   "metadata": {},
   "source": [
    "As we go through the book, we have to keep track of each pair of consecutive words.\n",
    "So if we see the sequence \"**man is not truly one**\", we would add the bigrams \"**man is**\", \"**is not**\", \"**not truly**\", and so on.\n",
    "\n",
    "To keep track of these bigrams, we'll use a list called `window`, because it is like a window that slides over the pages of the book, showing only two words at a time.\n",
    "Initially, `window` is empty."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "id": "2e73df79",
   "metadata": {},
   "outputs": [],
   "source": [
    "window = []"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9376558c",
   "metadata": {},
   "source": [
    "We'll use the following function to process the words one at a time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "495ad429",
   "metadata": {},
   "outputs": [],
   "source": [
    "def process_word(word):\n",
    "    window.append(word)\n",
    "    \n",
    "    if len(window) == 2:\n",
    "        count_bigram(window)\n",
    "        window.pop(0)\n",
    "\n",
    "        \n",
    "# from collections import defaultdict\n",
    "# model = defaultdict(list)\n",
    "\n",
    "# for i in range(len(words) - 1):\n",
    "#     current_word = words[i]\n",
    "#     next_word = words[i + 1]\n",
    "#     model[current_word].append(next_word)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56895591",
   "metadata": {},
   "source": [
    "The first time this function is called, it appends the given word to `window`.\n",
    "Since there is only one word in the window, we don't have a bigram yet, so the function ends.\n",
    "\n",
    "The second time it's called -- and every time thereafter -- it appends a second word to `window`.\n",
    "Since there are two words in the window, it calls `count_bigram` to keep track of how many times each bigram appears.\n",
    "Then it uses `pop` to remove the first word from the window.\n",
    "\n",
    "The following program loops through the words in the book and processes them one at a time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "id": "c1224061",
   "metadata": {},
   "outputs": [],
   "source": [
    "for line in open(filename):\n",
    "    for word in split_line(line):\n",
    "        word = clean_word(word)\n",
    "        process_word(word)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20c4627a",
   "metadata": {},
   "source": [
    "The result is a dictionary that maps from each bigram to the number of times it appears.\n",
    "We can use `print_most_common` to see the most common bigrams."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "id": "4296485a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "178\t('of', 'the')\n",
      "139\t('in', 'the')\n",
      "94\t('it', 'was')\n",
      "80\t('and', 'the')\n",
      "73\t('to', 'the')\n"
     ]
    }
   ],
   "source": [
    "print_most_common(bigram_counter)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "757bd309",
   "metadata": {},
   "source": [
    "Looking at these results, we can get a sense of which pairs of words are most likely to appear together.\n",
    "We can also use the results to generate random text, like this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "id": "e03fd803",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "random.seed(42)         ### just to make the random choices reproducible\n",
    "bigrams = list(bigram_counter)          ### list of bigrams (tuples of two words)\n",
    "weights = bigram_counter.values()\n",
    "random_bigrams = random.choices(bigrams, weights=weights, k=10)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eda80407",
   "metadata": {},
   "source": [
    "`bigrams` is a list of the bigrams that appear in the books.\n",
    "`weights` is a list of their frequencies, so `random_bigrams` is a sample where the probability a bigram is selected is proportional to its frequency. \n",
    "\n",
    "Here are the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 109,
   "id": "d6c65d79",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "seriously amiss a man as the but i the cup but here stain of that the silence after by a "
     ]
    }
   ],
   "source": [
    "for pair in random_bigrams:\n",
    "    print(' '.join(pair), end=' ')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f24c3b6",
   "metadata": {},
   "source": [
    "This way of generating text is better than choosing random words, but still doesn't make a lot of sense."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "id": "3e282e3c",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: Counting Bigrams\n",
    "\n",
    "sentence = \"the cat sat on the mat the cat\"\n",
    "# 1. Split the sentence into words\n",
    "# 2. Build a bigram_counts dictionary mapping (word1, word2) tuples to their frequency\n",
    "# 3. Print each bigram and its count, sorted by frequency (highest first)\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "\n",
    "### Your code ends here.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "id": "b92e13c1",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2 ('the', 'cat')\n",
      "1 ('cat', 'sat')\n",
      "1 ('sat', 'on')\n",
      "1 ('on', 'the')\n",
      "1 ('the', 'mat')\n",
      "1 ('mat', 'the')\n"
     ]
    }
   ],
   "source": [
    "# Solution\n",
    "sentence = \"the cat sat on the mat the cat\"\n",
    "words_list = sentence.split()\n",
    "bigram_counts = {}\n",
    "for i in range(len(words_list) - 1):\n",
    "    bigram = (words_list[i], words_list[i + 1])\n",
    "    if bigram not in bigram_counts:\n",
    "        bigram_counts[bigram] = 1\n",
    "    else:\n",
    "        bigram_counts[bigram] += 1\n",
    "for bigram, count in sorted(bigram_counts.items(), key=lambda x: x[1], reverse=True):\n",
    "    print(count, bigram)\n"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}