Assignment:
Question 1. Write a Python program to process an input file called 1865-Lincoln.txt, which is attached in the dropbox. You can use NLTK data and functions for this assignment.
1. Calculate the frequency distribution of the words in the file. Plot a histogram of the 20 most commonly used words in the text file.
2. Open the file and read its contents to generate an output file that contains the lemmatized words of the original content. The output file must be created by replacing every word with its lemmatized form.
3. Tokenize the text file into sentences and calculate the number of words in each sentence, and its entropy. Output the results in the following format.
Sentence (the first 5 words with ...)    # of Words    Entropy
At this second appearing to ...          6             7.233
4. For the text file 1865-Lincoln.txt, conduct part-of-speech tagging using one of the taggers and one of the tagged corpora in the NLTK toolkit. The program should write the tagged text to a text file named by adding "tagged_" before the original text filename.
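As a starting point for parts 1 and 3, here is a minimal sketch of word counting and per-sentence entropy. It rests on several assumptions that the assignment does not state: entropy is taken to be the Shannon entropy of the word distribution inside each sentence, the helper names (tokenize_words, split_sentences, shannon_entropy, sentence_report) are hypothetical, and the two regex tokenizers are standard-library stand-ins so the sketch runs without downloading NLTK data. In a real solution, nltk.word_tokenize and nltk.sent_tokenize would be the natural replacements.

```python
import math
import re
from collections import Counter


def tokenize_words(text):
    """Lowercase word tokens; a regex stand-in for nltk.word_tokenize."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def split_sentences(text):
    """Naive split after '.', '!' or '?'; a stand-in for nltk.sent_tokenize."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def shannon_entropy(words):
    """H = sum(p * log2(1 / p)) over the word distribution of `words`.

    Assumption: probabilities come from the sentence itself, not from
    the whole document.
    """
    counts = Counter(words)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def sentence_report(text):
    """Return (preview, word count, entropy) for every sentence in `text`."""
    rows = []
    for sentence in split_sentences(text):
        words = tokenize_words(sentence)
        if not words:
            continue
        preview = " ".join(sentence.split()[:5]) + " ..."
        rows.append((preview, len(words), shannon_entropy(words)))
    return rows


# Made-up sample text; a real run would read 1865-Lincoln.txt instead.
sample = "The cat saw the dog. The dog ran."

# Part 1: frequency distribution (Counter.most_common gives the top-N words).
top_words = Counter(tokenize_words(sample)).most_common(2)
print(top_words)

# Part 3: one tab-separated row per sentence.
rows = sentence_report(sample)
for preview, n_words, entropy in rows:
    print(f"{preview}\t{n_words}\t{entropy:.3f}")
```

For the real file, most_common(20) would supply the words and counts for the plot. If the intended entropy is instead based on word probabilities from the whole document, only shannon_entropy needs to change.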
Question 2. I am interested in knowing how the climate changes in terms of temperature and precipitation. The U.S. climate data site contains climate data for Denton, Texas since 2009. I would like you to do some calculation and comparison between the data in 2010 and 2017 in order to answer the following two research questions:
RQ1: Is 2017 significantly different from 2010 on temperature and precipitation in the months January-June?
RQ2: Is June 2017 significantly hotter than June 2010?
1. Create two files based on data published at U.S. climate data:
- File A (should be called 2010-Jan-June.txt, or 2010-Jan-June.csv) contains daily weather data from January 1, 2010 to June 30, 2010;
- File B (called 2017-Jan-June.txt, or 2017-Jan-June.csv) contains daily weather data from January 1, 2017 to June 30, 2017;
To find the data, go to the "History" tab of the above page and select the right year and month. The data for that month will then be displayed.
The final format of each result file should look like the following
1-Jan,55,33,0.08
2-Jan,55,33,0.12
1-June,80,56,0.15
The delimiter can be a comma (,) or whitespace. Make sure you round the temperature values so that there are no decimal points.
2. Write a program to calculate the mean, median, and standard deviation of high temperature, low temperature and precipitation of each file, and output the results in the following format:
File name         mean    median    standard deviation
2010-Jan-June.    ----    ----      ----
2017-Jan-June.    ----    ----      ----
3. In order to answer the first research question, we would like to conduct some statistical tests. Take File A and File B and conduct a t-test on TWO RELATED samples (you can use scipy.stats.ttest_rel: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_rel.html) on Temperature High, Temperature Low, and Precipitation to find out whether there is a significant difference between the two years. Report your results as statements in a docstring placed after the program.
4. Describe how you can answer research question 2 and what your answer is.
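The calculations in parts 2 and 3 can be sketched as follows. This is an illustration under stated assumptions, not a reference solution: the helper names (read_weather, summarize, paired_t) are hypothetical, the sample rows are made up apart from the two rows quoted in the format example, and the paired t statistic is computed by hand with the standard library so the sketch runs without SciPy. scipy.stats.ttest_rel, which the assignment suggests, would additionally return the p-value needed to judge significance.

```python
import io
import math
import re
import statistics


def read_weather(handle):
    """Parse lines of date, high, low, precipitation into numeric columns.

    Accepts comma or whitespace as the delimiter and skips any line that
    does not have exactly four fields.
    """
    data = {"high": [], "low": [], "precip": []}
    for line in handle:
        fields = re.split(r"[,\s]+", line.strip())
        if len(fields) != 4:
            continue
        data["high"].append(float(fields[1]))
        data["low"].append(float(fields[2]))
        data["precip"].append(float(fields[3]))
    return data


def summarize(values):
    """Return (mean, median, sample standard deviation)."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.stdev(values),
    )


def paired_t(a, b):
    """t statistic for two related samples: mean(d) / (stdev(d) / sqrt(n))."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


# Made-up stand-ins for 2010-Jan-June.csv and 2017-Jan-June.csv.
file_a = io.StringIO("1-Jan,55,33,0.08\n2-Jan,55,33,0.12\n3-Jan,60,40,0.00\n")
file_b = io.StringIO("1-Jan 58 35 0.00\n2-Jan 61 36 0.10\n3-Jan 62 45 0.30\n")

year_a = read_weather(file_a)
year_b = read_weather(file_b)

for name, data in (("2010-Jan-June", year_a), ("2017-Jan-June", year_b)):
    for column, values in data.items():
        mean, median, stdev = summarize(values)
        print(f"{name}\t{column}\t{mean:.2f}\t{median:.2f}\t{stdev:.2f}")

# Positive t means the second year ran hotter on the paired days.
t_high = paired_t(year_b["high"], year_a["high"])
print(f"paired t (high temperature): {t_high:.3f}")
```

For research question 2, one possible approach is to restrict both files to the June rows and run a one-sided test, since the question asks about "hotter" rather than "different". Recent SciPy releases (1.6 and later) accept alternative="greater" in scipy.stats.ttest_rel for this purpose. Whether pairing by calendar day is appropriate is a modeling choice that the written answer should justify.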
Attachment:- Fellow-Countrymen.rar