9.1.5. Application: Word List
Let’s apply what we’ve learned to a real-world task: building and searching a word list.
In the previous chapter, we read the file words.txt and searched for words with certain properties, like using the letter e.
But we read the entire file many times, which is not efficient.
It is better to read the file once and put the words in a list.
The following code shows how: it first locates the project and downloads words.txt if necessary, then builds the list in a loop.
import sys
from pathlib import Path
# Find project root by looking for _config.yml
current = Path.cwd()
for parent in [current, *current.parents]:
    if (parent / '_config.yml').exists():
        project_root = parent
        break
else:
    # No _config.yml found in any parent: fall back to two levels up
    project_root = Path.cwd().parent.parent
# Add project root to path
sys.path.insert(0, str(project_root))
# Import shared teaching helpers and cell magics
from shared import thinkpython, diagram, jupyturtle, structshape
from shared.download import download
# Download words.txt only if it is not already present
words_file = project_root / 'data' / 'words.txt'
if not words_file.exists():
    download('https://raw.githubusercontent.com/AllenDowney/ThinkPython/v3/words.txt', words_file)
word_list = []
for line in open(words_file, encoding='utf-8'):
    word = line.strip()
    word_list.append(word)
len(word_list)
113783
word_list[:10]
['aa',
'aah',
'aahed',
'aahing',
'aahs',
'aal',
'aalii',
'aaliis',
'aals',
'aardvark']
Before the loop, word_list is initialized with an empty list.
Each time through the loop, the append method adds a word to the end.
When the loop is done, there are more than 113,000 words in the list.
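The same pattern works on any text file with one word per line. Here is a minimal sketch that assumes a small temporary file; the name sample_words.txt and its three words are made up for illustration and stand in for words.txt. It also uses a with statement, which is one common way to make sure the file is closed when the loop finishes.

```python
import tempfile
from pathlib import Path

# Hypothetical sample data standing in for words.txt
sample_text = "aa\naah\naahed\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_file = Path(tmp_dir) / 'sample_words.txt'
    sample_file.write_text(sample_text, encoding='utf-8')

    sample_list = []
    # The with statement closes the file when the block ends
    with open(sample_file, encoding='utf-8') as f:
        for line in f:
            word = line.strip()       # remove the trailing newline
            sample_list.append(word)  # add the word to the end of the list

print(sample_list)
```

Each line read from the file ends with a newline character, which is why the loop calls strip before appending.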
Another way to do the same thing is to use the read_text method to read the entire file into a string.
string = words_file.read_text(encoding='utf-8')
len(string)
1016511
The result is a single string with more than a million characters.
We can use the split method to split it into a list of words.
word_list = string.split()
len(word_list)
113783
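To see why split gives the same result as the loop, here is a small sketch on a made-up three-word string (sample_text is an assumption for illustration, not the real file contents). With no arguments, split breaks the string wherever there is whitespace, and newlines count as whitespace.

```python
# Hypothetical sample standing in for the contents of words.txt
sample_text = "aa\naah\naahed\n"

# With no arguments, split breaks the string at any run of whitespace,
# including newlines, and does not produce empty strings at the ends
words_from_split = sample_text.split()

# The loop-based approach, applied to the same text line by line
words_from_loop = []
for line in sample_text.splitlines():
    words_from_loop.append(line.strip())

print(words_from_split)
print(words_from_split == words_from_loop)
```

Both approaches produce ['aa', 'aah', 'aahed'], so the comparison prints True.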
Evaluating word_list in a Jupyter notebook displays the whole list, which is very long, so let's use a for loop to look at the first 5 elements:
for i in range(5):
    print(word_list[i])
aa
aah
aahed
aahing
aahs
Or we can use a slice, which returns the first 5 elements as a new list.
word_list[:5]
['aa', 'aah', 'aahed', 'aahing', 'aahs']
It is also a good habit to check the type of the data we are working with:
print(type(word_list))
<class 'list'>
Now, to check whether a string appears in the list, we can use the in operator.
For example, 'demotic' is in the list.
'demotic' in word_list
True
But 'contrafibularities' is not.
'contrafibularities' in word_list
False
Neither is 'supercalifragilisticexpialidocious'.
"supercalifragilisticexpialidocious" in word_list
False
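When the right-hand side is a list, in checks whether some element is equal to the string, so the match has to be exact. The sketch below uses a short made-up list (sample_list is an assumption standing in for word_list) to contrast this with the substring check that in performs on a string.

```python
# Hypothetical sample standing in for the full word_list
sample_list = ['aa', 'aah', 'aahed', 'aahing', 'aahs']

# With a list on the right, in compares the string with each element
exact_match = 'aah' in sample_list      # an element equal to 'aah' exists
wrong_case = 'AAH' in sample_list       # the comparison is case-sensitive
partial_word = 'aahe' in sample_list    # part of an element does not count

# With a string on the right, in looks for a substring instead
substring = 'aahe' in 'aahed'

print(exact_match, wrong_case, partial_word, substring)
```

This prints True False False True: only the exact, lowercase word is found in the list, while the same fragment is found inside the string 'aahed'.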
### EXERCISE: Word List Application
# Difficulty: Challenge
# Using word_list from this section:
# 1. Print the first 3 words
# 2. Count how many words start with "a"
# 3. Print the average word length (rounded to 2 decimals)
# 4. Find and print the longest word among the first 5000 words
### Your code starts here:
### Your code ends here.
['aa', 'aah', 'aahed']
6557
7.93
anticonservationist