SIT742 Modern Data Science 2018 T1
Assignment: General data processing and using big data
This is an individual assignment on the understanding of data science, big data and their applications. It contains written answers and some programming-related tasks based on topics presented during Weeks 1 to 3.
The assignment is broken down into the three tasks below. You can use Google to find the data sources (i.e., websites). For each task, write down your executable Python code and put the collected data in tables to demonstrate your results. You also need to write several paragraphs to explain your comparison and draw a conclusion.
Task 1. Data Acquisition 
Design a web scraping program in Python to collect weather forecast data for a city (e.g., Melbourne) from a website, such as temperature, humidity and weather status (cloudy, sunny, etc.), and store the data in a csv file. Please do this task in both of the following ways:
(1) Collecting data by regular interval sampling. You need to find the best sampling interval in terms of space efficiency and demonstrate using numeric results why it is the best solution.
(2) Collecting data by change detection. Store a new data object only when any of the weather forecast data has changed on the website.
In both cases, you need to record the weather data together with their timestamps. Then compare the two collection methods, conclude which one is optimal and demonstrate this using numeric results. (Please refer to Lecture 2.)
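As a starting point for the comparison, the two collection strategies can be sketched on an in-memory list of readings. This is a minimal sketch, not a complete scraper: the readings are made-up values standing in for scraped data, and the names FIELDS, sample_regular, sample_on_change and to_csv are hypothetical helpers introduced only for illustration.

```python
import csv
import io

FIELDS = ["timestamp", "temperature", "humidity", "status"]

def sample_regular(readings, every):
    """Regular interval sampling: keep every `every`-th reading."""
    return readings[::every]

def sample_on_change(readings):
    """Change detection: keep a reading only if a weather field differs
    from the most recently stored reading (the timestamp is ignored)."""
    kept = []
    for row in readings:
        values = {k: row[k] for k in FIELDS if k != "timestamp"}
        if not kept or values != {k: kept[-1][k] for k in values}:
            kept.append(row)
    return kept

def to_csv(rows):
    """Serialise rows to csv text so the stored size can be measured."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Made-up readings taken every 10 minutes (assumption, not real data).
readings = [
    {"timestamp": "10:00", "temperature": 15, "humidity": 60, "status": "cloudy"},
    {"timestamp": "10:10", "temperature": 15, "humidity": 60, "status": "cloudy"},
    {"timestamp": "10:20", "temperature": 15, "humidity": 60, "status": "cloudy"},
    {"timestamp": "10:30", "temperature": 16, "humidity": 60, "status": "cloudy"},
    {"timestamp": "10:40", "temperature": 16, "humidity": 58, "status": "sunny"},
    {"timestamp": "10:50", "temperature": 16, "humidity": 58, "status": "sunny"},
]

regular = sample_regular(readings, every=2)
changed = sample_on_change(readings)

# Numeric comparison: rows stored and csv size in characters.
print("regular:", len(regular), "rows,", len(to_csv(regular)), "chars")
print("change :", len(changed), "rows,", len(to_csv(changed)), "chars")
```

In this toy data both methods store three rows, but regular sampling at a 20-minute interval misses the change at 10:30, while change detection records it. Your own numeric comparison should be based on the data you actually collect.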
Task 2. Data Integration 
Use the optimal method you demonstrated above to collect weather report data from more than one website, integrate the data from the different sources (websites) and write the integrated data into a csv file. Please demonstrate
(1) how to do schema alignment and
(2) how to determine which value is correct if data from different sources do not agree with each other.
(Please do a survey of the existing techniques and use one to resolve the problem. Lecture 2 provides you with some basic concepts, and you may do a broader search by yourselves.)
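One simple way to approach both points is sketched below, under stated assumptions. The three sources (site_a, site_b, site_c), their field names and their values are hypothetical, and site_b is assumed to report temperature in Fahrenheit. Conflicts are resolved naively with a median for numeric fields and a majority vote for the weather status. This is only one of many techniques you could find in your survey, not a recommended answer.

```python
from collections import Counter
from statistics import median

# Hypothetical mapping from each source's field names to a shared schema.
SCHEMA_MAP = {
    "site_a": {"temp_c": "temperature", "hum": "humidity", "cond": "status"},
    "site_b": {"TempF": "temperature", "Humidity%": "humidity", "Weather": "status"},
    "site_c": {"temperature": "temperature", "rh": "humidity", "summary": "status"},
}

def align(source, record):
    """Schema alignment: rename fields, convert units, normalise labels."""
    mapping = SCHEMA_MAP[source]
    aligned = {mapping[k]: v for k, v in record.items() if k in mapping}
    if source == "site_b":  # assumption: this source reports Fahrenheit
        aligned["temperature"] = round((aligned["temperature"] - 32) * 5 / 9, 1)
    aligned["status"] = aligned["status"].strip().lower()
    return aligned

def resolve(records):
    """Conflict resolution: median for numbers, majority vote for status."""
    return {
        "temperature": median(r["temperature"] for r in records),
        "humidity": median(r["humidity"] for r in records),
        "status": Counter(r["status"] for r in records).most_common(1)[0][0],
    }

# Made-up records for the same city and time from three sources.
raw = {
    "site_a": {"temp_c": 17.0, "hum": 60, "cond": "Cloudy"},
    "site_b": {"TempF": 64.4, "Humidity%": 62, "Weather": "cloudy "},
    "site_c": {"temperature": 21.0, "rh": 60, "summary": "Sunny"},
}

aligned = [align(source, record) for source, record in raw.items()]
merged = resolve(aligned)
print(merged)
```

The median is used here because it is less sensitive than the mean to a single outlying source (site_c in this made-up example). With only two sources, neither a median nor a vote can break a tie, so another criterion, such as source reliability, would be needed.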
Task 3. Missing Data Prediction
Using the data you collected in Tasks 1 and 2, design a method to predict a missing data object. For example, given two consecutive data objects (time, temperature) in your csv file as below:
11:00AM, 15
12:00PM, 17
the user wants to query the temperature at 11:30AM.
(Please do a survey of the existing techniques and use one to resolve the problem. Lecture 3 provides you with some basic concepts, and you may do a broader search by yourselves.)
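Linear interpolation is one of the simplest techniques that applies to the example above. It assumes the temperature changes at a constant rate between the two known data objects, which may not hold for real weather data. The following is a minimal sketch, with parse_time and interpolate as hypothetical helper names and the time format assumed to match the example (e.g., 11:30AM).

```python
from datetime import datetime

def parse_time(text):
    """Parse a time such as '11:30AM' (format assumed from the example)."""
    return datetime.strptime(text, "%I:%M%p")

def interpolate(t0, v0, t1, v1, t_query):
    """Estimate the value at t_query on the straight line
    through the points (t0, v0) and (t1, v1)."""
    a, b, q = parse_time(t0), parse_time(t1), parse_time(t_query)
    if not a < b:
        raise ValueError("t0 must be earlier than t1")
    if not a <= q <= b:
        raise ValueError("t_query must lie between t0 and t1")
    fraction = (q - a) / (b - a)  # timedelta / timedelta gives a float
    return v0 + fraction * (v1 - v0)

estimate = interpolate("11:00AM", 15, "12:00PM", 17, "11:30AM")
print(estimate)  # 16.0
```

Because 11:30AM lies halfway between the two known times, the estimate is the midpoint of 15 and 17. Other techniques found in your survey, such as polynomial or spline interpolation, use more than two neighbouring data objects.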