Analyzing text
- can analyze text files containing entire books
- many literary classics are available as text files since they are in the public domain
- texts used here are from Project Gutenberg
- let’s pull text from Alice in Wonderland and count the number of words
- split() – splits a string wherever it finds any whitespace
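A quick look at how split() behaves when called with no arguments. The sample string is made up for illustration and is not taken from the book file:

```python
# Made-up sample text mixing double spaces, a tab, and newlines.
sample = "Alice was  beginning\tto get\nvery tired\n"

# With no arguments, split() breaks on any run of whitespace and
# ignores leading/trailing whitespace, so no empty strings appear.
words = sample.split()

print(words)
print(len(words))
```

Punctuation stays attached to the neighboring word, which is one reason a count built on split() is only approximate.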
from pathlib import Path
path = Path('alice.txt')
try:
    contents = path.read_text(encoding='utf-8')
except FileNotFoundError:
    print(f"Sorry, the file {path} does not exist.")
else:
    # Count the approximate number of words in the file:
    words = contents.split()
    num_words = len(words)
    print(f"The file {path} has about {num_words} words.")
- move alice.txt to the correct directory so the try block will work
- the string contents contains the entire Alice in Wonderland text as one long string
- use split() method to produce a list of words
- use len() on the list, which approximates the number of words in the text
- print a statement that reports how many words are in the file
- this code is placed in the else block because it only works if the code in the try block was executed successfully
The file alice.txt has about 29594 words.
- the count is slightly high because the text file includes extra information from the publisher, not just the story itself
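One way to bring the count closer to the story itself is to drop everything outside the book's start and end marker lines before splitting. This is only a sketch: the function name strip_boilerplate() is hypothetical, and the default marker strings are an assumption about how the file is laid out, so check them against the actual text file.

```python
def strip_boilerplate(contents, start_marker="*** START OF", end_marker="*** END OF"):
    """Return the text between the two marker lines.

    The default markers are assumptions; if either one is missing,
    the text is returned unchanged.
    """
    start = contents.find(start_marker)
    end = contents.find(end_marker)
    if start == -1 or end == -1 or end < start:
        return contents
    # Skip the rest of the start-marker line itself.
    body_start = contents.find("\n", start)
    if body_start == -1 or body_start > end:
        return contents
    return contents[body_start + 1:end]


# Tiny made-up stand-in for a downloaded book file.
sample = (
    "Publisher info here\n"
    "*** START OF THE BOOK ***\n"
    "Down the rabbit hole\n"
    "*** END OF THE BOOK ***\n"
    "License text\n"
)

print(len(sample.split()))
print(len(strip_boilerplate(sample).split()))
```

In the word-count program, this would be applied to contents just before calling split().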
Working with multiple files
- let’s move the bulk of this program to a function called count_words() before adding more books to analyze
- easier to run the analysis for multiple books this way
word_count.py
from pathlib import Path
def count_words(path):
    """Count the approximate number of words in a file."""
    try:
        contents = path.read_text(encoding='utf-8')
    except FileNotFoundError:
        print(f"Sorry, the file {path} does not exist.")
    else:
        # Count the approximate number of words in the file:
        words = contents.split()
        num_words = len(words)
        print(f"The file {path} has about {num_words} words.")
path = Path('alice.txt')
count_words(path)
- most of the code is the same, only indented and moved into the body of count_words()
- now we write a short loop to count the words in any text we want to analyze
- do this by storing the names of the files in a list and calling count_words() for each file in the list
- we’ll count the words for Alice in Wonderland, Siddhartha, Moby Dick, and Little Women
- we’ll leave siddhartha.txt out of the directory containing word_count.py to see how well our program handles a missing file
from pathlib import Path
def count_words(path):
    --snip--
filenames = ['alice.txt', 'siddhartha.txt', 'moby_dick.txt', 'little_women.txt']
for filename in filenames:
    path = Path(filename)
    count_words(path)
- names of files stored as simple strings
- each string is converted to a Path object before the call to count_words()
- the missing siddhartha.txt file has no effect on the rest of the program’s execution
The file alice.txt has about 29594 words.
Sorry, the file siddhartha.txt does not exist.
The file moby_dick.txt has about 215864 words.
The file little_women.txt has about 189142 words.
- using the try-except block provides two significant advantages
- prevent users from seeing a traceback and let the program continue analyzing the texts it’s able to find
- if we didn’t catch the FileNotFoundError, the user would see a full traceback and the program would stop running after trying to analyze Siddhartha, so Moby Dick and Little Women would never be counted
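The difference can be seen side by side in a self-contained sketch. It uses a temporary directory with two tiny stand-in files (made-up contents, not the real books), and the two helper functions are hypothetical names introduced only for this comparison:

```python
from pathlib import Path
import tempfile


def count_all_unprotected(paths, counts):
    """No try-except: the first missing file raises and ends the loop."""
    for path in paths:
        contents = path.read_text(encoding='utf-8')
        counts[path.name] = len(contents.split())


def count_all_protected(paths, counts):
    """try-except inside the loop: a missing file is reported and skipped."""
    for path in paths:
        try:
            contents = path.read_text(encoding='utf-8')
        except FileNotFoundError:
            print(f"Sorry, the file {path.name} does not exist.")
        else:
            counts[path.name] = len(contents.split())


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / 'alice.txt').write_text("down the rabbit hole", encoding='utf-8')
    (folder / 'moby_dick.txt').write_text("call me Ishmael", encoding='utf-8')
    names = ['alice.txt', 'siddhartha.txt', 'moby_dick.txt']
    paths = [folder / name for name in names]

    unprotected = {}
    try:
        count_all_unprotected(paths, unprotected)
    except FileNotFoundError:
        # Caught here only so the demo can continue; a real script
        # would show a traceback and stop at this point.
        print("Unprotected loop stopped at the missing file.")

    protected = {}
    count_all_protected(paths, protected)

print(unprotected)
print(protected)
```

The unprotected loop never reaches moby_dick.txt, while the protected loop skips the missing file and counts the rest.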
Failing silently
- in the previous example, we informed our users that one of the files was unavailable
- don’t need to report every exception you catch
- sometimes you’ll want the program to fail silently when an exception occurs
- to make a program fail silently, write a try block as usual, but explicitly tell Python to do nothing in the except block
- Python has a pass statement that tells it to do nothing in a block
def count_words(path):
    """Count the approximate number of words in a file."""
    try:
        --snip--
    except FileNotFoundError:
        pass
    else:
        --snip--
- only difference between this listing and the previous one is the pass statement in the except block
- when FileNotFoundError is raised, the code in the except block runs, but nothing happens
- no traceback is produced and there’s no output in response to the error that was raised
- users see the word counts for each file that exists, but they don’t see any indication that a file wasn’t found
The file alice.txt has about 29594 words.
The file moby_dick.txt has about 215864 words.
The file little_women.txt has about 189142 words.
- pass statement also acts as a placeholder
- it’s a reminder that you’re choosing to do nothing at a specific point and you may want to do something there later
- for example, we might decide to write any missing filenames to a file called missing_files.txt
- users wouldn’t see this file, but we’d be able to read the file and deal with any missing texts
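A minimal sketch of that idea, replacing pass with code that appends the missing filename to a log file. The missing_log parameter and the return values are assumptions added for this illustration, and the demo runs in a temporary directory with a made-up stand-in file:

```python
from pathlib import Path
import tempfile


def count_words(path, missing_log):
    """Return the approximate word count, or None if the file is missing.

    Missing filenames are appended to missing_log instead of being
    printed, so the user sees nothing about them.
    """
    try:
        contents = path.read_text(encoding='utf-8')
    except FileNotFoundError:
        # Append mode keeps entries from earlier missing files.
        with missing_log.open('a', encoding='utf-8') as log:
            log.write(f"{path.name}\n")
        return None
    else:
        return len(contents.split())


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / 'alice.txt').write_text("down the rabbit hole", encoding='utf-8')
    missing_log = folder / 'missing_files.txt'

    names = ['alice.txt', 'siddhartha.txt']
    counts = [count_words(folder / name, missing_log) for name in names]
    logged = missing_log.read_text(encoding='utf-8').splitlines()

print(counts)
print(logged)
```

The program still fails silently from the user's point of view, but the log leaves a record that can be reviewed later.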
Deciding which errors to report
- how do you know when to report an error and when to let your program fail silently?
- if users know which texts are supposed to be analyzed, they might appreciate a message informing them why some texts were not analyzed
- if users expect to see some results but don’t know which books are supposed to be analyzed, they might not need to know which texts were unavailable
- Python error-handling structures let you determine the best course of action
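One possible way to keep both options open is to let the caller decide whether a missing file is reported. The report_missing parameter and the return values are assumptions introduced for this sketch and are not part of the original listing:

```python
from pathlib import Path


def count_words(path, report_missing=True):
    """Count the approximate number of words in a file.

    report_missing is a hypothetical flag: when False, a missing
    file is skipped without any message.
    """
    try:
        contents = path.read_text(encoding='utf-8')
    except FileNotFoundError:
        if report_missing:
            print(f"Sorry, the file {path} does not exist.")
        return None
    else:
        num_words = len(contents.split())
        print(f"The file {path} has about {num_words} words.")
        return num_words


# Hypothetical filename, assumed not to exist in the current directory.
missing = Path('no_such_book.txt')
count_words(missing)                        # reports the missing file
count_words(missing, report_missing=False)  # fails silently
```

Users who chose the list of books would likely want the default, while a batch job over an unknown set of files might pass report_missing=False.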
End of study session.