1) The average number of words per sentence over time
Do this two ways:
First, using a simple regex that splits on punctuation. Second, using the Natural Language Toolkit's (NLTK) sentence tokenizer.
2) The average number of unique words per 100 words over time
- Remove common words (see the lecture slide for a list)
Do this two ways:
First, treat every distinct word form as unique, even when two forms share a stem (like "run" and "running").
Second, using the stemmer from the NLTK, so that forms sharing a stem count as one word.
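
A minimal sketch of the first (non-NLTK) variant of each statistic, using only the standard library. It is one possible reading of the tasks, not a reference solution. Assumptions: a sentence ends at a run of ".", "!" or "?", so abbreviations such as "Mr." will be split incorrectly; the sentence statistic is reported as average sentence length in words; the stop list is a tiny placeholder for the one on the lecture slide; and "per 100 words" is measured after common words are removed. The names split_sentences, words_per_sentence, unique_per_100 and COMMON_WORDS are hypothetical.

```python
import re

# Assumption: a sentence ends at a run of . ! or ? followed by whitespace or end of text.
SENTENCE_END = re.compile(r"[.!?]+(?:\s+|$)")
# Assumption: a word is a run of lowercase letters and apostrophes.
WORD = re.compile(r"[a-z']+")
# Placeholder stop list; swap in the list from the lecture slide.
COMMON_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that", "it"}


def split_sentences(text):
    """Split text into sentences with a simple punctuation regex."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def words_per_sentence(text):
    """Average number of words per sentence (0.0 for empty text)."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    total_words = sum(len(WORD.findall(s.lower())) for s in sentences)
    return total_words / len(sentences)


def unique_per_100(text, stem=None):
    """Unique words per 100 words, after dropping common words.

    stem is an optional function mapping a word to its stem; with
    stem=None every distinct word form counts as unique.
    """
    words = [w for w in WORD.findall(text.lower()) if w not in COMMON_WORDS]
    if stem is not None:
        words = [stem(w) for w in words]
    if not words:
        return 0.0
    return 100.0 * len(set(words)) / len(words)
```

For the second variant of the unique-word statistic, a stemmer can be passed in as the stem argument, for example the stem method of an nltk.stem.PorterStemmer instance (this requires NLTK to be installed). The second variant of the sentence statistic would replace split_sentences with NLTK's sentence tokenizer.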
Make sure that you sort the speeches by their date, and write the data into a text file with two columns: date and statistic.
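
One way the sorting and output step might look is sketched below. Assumptions: each speech has already been reduced to a (date, statistic) pair, the date is a datetime.date, the two columns are separated by a tab, and no header row is written. The function names and the three-decimal formatting are illustrative choices only.

```python
from datetime import date


def format_rows(rows):
    """Return one 'date<TAB>statistic' line per row, oldest date first."""
    ordered = sorted(rows, key=lambda pair: pair[0])
    return ["{}\t{:.3f}".format(day.isoformat(), value) for day, value in ordered]


def write_statistic(rows, path):
    """Write the sorted two-column data to a text file."""
    with open(path, "w") as out:
        for line in format_rows(rows):
            out.write(line + "\n")


# Example call with made-up values (not real results):
# write_statistic([(date(2001, 1, 20), 21.5), (date(1993, 1, 20), 19.25)], "stat.txt")
```

ISO-formatted dates (YYYY-MM-DD) keep the file sorted correctly even when it is later sorted as plain text.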
EC) Build a word cloud, as seen in the slides from lectures 11 and 12
Part 2: Collect an interesting web statistic.
Use urlweb and a website of your choice to collect a statistic. Write a one-paragraph description of your statistic at the top of the code file.
Examples include sports game data, weather statistics, name statistics, etc.
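
As a rough illustration of the collection step, the sketch below downloads a page and counts how often a term appears in it. Assumptions: it uses Python's standard urllib.request module rather than the module named above, the statistic (occurrences of a word on a page) is only a stand-in for whatever you choose, and fetch_text and count_occurrences are hypothetical names. Real pages usually need HTML parsing rather than a plain text search.

```python
import re
from urllib.request import urlopen


def fetch_text(url):
    """Download a page and decode it to text (assumes UTF-8 if no charset is given)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def count_occurrences(page_text, term):
    """Count whole-word, case-insensitive occurrences of term in page_text."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, page_text, flags=re.IGNORECASE))


# Example call; the URL is a placeholder for a site of your choice:
# print(count_occurrences(fetch_text("https://example.com/"), "example"))
```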