{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "7ede0460",
   "metadata": {},
   "source": [
    "\n",
    "# Application: Word List\n",
    "\n",
    "Let's apply what we've learned to a real-world task: building and searching a word list.\n",
    "\n",
    "In the previous chapter, we read the file `words.txt` and searched for words with certain properties, like using the letter `e`.\n",
    "But we read the entire file many times, which is not efficient.\n",
    "It is better to read the file once and put the words in a list.\n",
    "The following loop shows how."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bce6fc92",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "from pathlib import Path\n",
    "\n",
    "# Find project root by looking for _config.yml\n",
    "current = Path.cwd()\n",
    "for parent in [current, *current.parents]:\n",
    "    if (parent / '_config.yml').exists():\n",
    "        project_root = parent\n",
    "        break\n",
    "else:\n",
    "    project_root = Path.cwd().parent.parent\n",
    "\n",
    "# Add project root to path\n",
    "sys.path.insert(0, str(project_root))\n",
    "\n",
    "# Import shared teaching helpers and cell magics\n",
    "from shared import thinkpython, diagram, jupyturtle, structshape\n",
    "from shared.download import download\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "id": "afb8c3bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "words_file = project_root / 'data' / 'words.txt'\n",
    "if not words_file.exists():\n",
    "    download('https://raw.githubusercontent.com/AllenDowney/ThinkPython/v3/words.txt', words_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "id": "ec2e7239",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "113783"
      ]
     },
     "execution_count": 95,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_list = []\n",
    "\n",
    "for line in open(words_file, encoding='utf-8'):\n",
    "    word = line.strip()\n",
    "    word_list.append(word)\n",
    "    \n",
    "len(word_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "id": "01fe5d61",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['aa',\n",
       " 'aah',\n",
       " 'aahed',\n",
       " 'aahing',\n",
       " 'aahs',\n",
       " 'aal',\n",
       " 'aalii',\n",
       " 'aaliis',\n",
       " 'aals',\n",
       " 'aardvark']"
      ]
     },
     "execution_count": 96,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_list[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18ee706a",
   "metadata": {},
   "source": [
    "Before the loop, `word_list` is initialized with an empty list.\n",
    "Each time through the loop, the `append` method adds a word to the end.\n",
    "When the loop is done, there are more than 113,000 words in the list.\n",
    "\n",
    "Another way to do the same thing is to use `read` to read the entire file into a string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "id": "d62cf70f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1016511"
      ]
     },
     "execution_count": 97,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string = words_file.read_text(encoding='utf-8')\n",
    "len(string)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46a8329f",
   "metadata": {},
   "source": [
    "The result is a single string with more than a million characters.\n",
    "We can use the `split` method to split it into a list of words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "id": "8b06681f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "113783"
      ]
     },
     "execution_count": 98,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_list = string.split()\n",
    "len(word_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e8cb6ae5",
   "metadata": {},
   "source": [
    "Evaluating the variable `word_list` in Jupyter Notebook will give you the whole list, which is very long, so let us use a for loop to take a look at the first 5 elements:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "id": "7013b629",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "aa\n",
      "aah\n",
      "aahed\n",
      "aahing\n",
      "aahs\n"
     ]
    }
   ],
   "source": [
    "for i in range(5):\n",
    "    print(word_list[i])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1cd1c24c",
   "metadata": {},
   "source": [
    "Or just use slicing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "id": "6278b792",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['aa', 'aah', 'aahed', 'aahing', 'aahs']"
      ]
     },
     "execution_count": 100,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_list[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7fbcb9ca",
   "metadata": {},
   "source": [
    "And we always want to know the data type of our data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "id": "9e8b69c3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'list'>\n"
     ]
    }
   ],
   "source": [
    "print(type(word_list))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d20c3fdb",
   "metadata": {},
   "source": [
    "Now, to check whether a string appears in the list, we can use the `in` operator.\n",
    "For example, `'demotic'` is in the list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "id": "b67f325f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 102,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'demotic' in word_list"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c82017fb",
   "metadata": {},
   "source": [
    "But `'contrafibularities'` is not."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "id": "6334664a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 103,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'contrafibularities' in word_list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "id": "3d62f066",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 104,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\"supercalifragilisticexpialidocious\" in word_list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "id": "c45d528b",
   "metadata": {
    "tags": [
     "thebe-interactive"
    ]
   },
   "outputs": [],
   "source": [
    "### EXERCISE: Word List Application\n",
    "# Difficulty: Challenge\n",
    "# Using word_list from this section:\n",
    "# 1. Print the first 3 words\n",
    "# 2. Count how many words start with \"a\"\n",
    "# 3. Print the average word length (rounded to 2 decimals)\n",
    "# 4. Find and print the longest word among the first 5000 words\n",
    "### Your code starts here:\n",
    "\n",
    "\n",
    "### Your code ends here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "id": "8f67fc7d",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['aa', 'aah', 'aahed']\n",
      "6557\n",
      "7.93\n",
      "anticonservationist\n"
     ]
    }
   ],
   "source": [
    "# Solution\n",
    "print(word_list[:3])\n",
    "count_a = sum(1 for w in word_list if w.startswith('a'))\n",
    "print(count_a)\n",
    "avg_len = sum(len(w) for w in word_list) / len(word_list)\n",
    "print(round(avg_len, 2))\n",
    "longest_5k = max(word_list[:5000], key=len)\n",
    "print(longest_5k)"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": ".venv (3.13.7)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
