Project
This project is open-ended. It consists of your own approach to applying data science techniques to a concrete problem. Consider a data-related problem in a field that interests you, and pick a subject you like, so that the project means something to you. You should use public data sources: I mentioned some during the course, and there are suggestions at the end of this document. You can use any data that you like (even scrape it, if you wish).
In short: ask yourself a question related to data, collect and visualize the data, then answer (or say something about) the question you asked. You can use your imagination, or build on examples we used during the semester.
You need to submit a written document together with code, touching on the elements listed below. Your project must incorporate visualizations. Submit it as a PDF (for example, a document saved as PDF), an R presentation (R Markdown / Shiny), or an HTML file. If you have something else in mind, let me know before you start.
Your project should include:
- What is the question you hope to answer?
- What data did you use to answer that question?
- What did you know about the data before you started?
- Why did you choose this topic?
- How did you gather the data?
- How did you preprocess (clean) the data?
- What methods did you use to filter the data, if appropriate?
- What programming language did you use and why?
- How did you model (visualize) the data?
- All the relevant code and output.
- Conclusion (did you answer the question you posed at the beginning?)
The project should be based on what we discussed during the course and does not have to incorporate advanced statistical analysis.
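As a rough illustration of the elements listed above (gather, clean, filter, visualize, conclude), here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the city names, the `bike_trips` column, the numbers, and the helper function names are all made up and stand in for whatever real public dataset and tools you choose. Your own project can use any language and any plotting library.

```python
import csv
import io
from statistics import mean

# Hypothetical toy data standing in for a real public dataset.
# The cities and values are made up for illustration only.
RAW_CSV = """city,year,bike_trips
Springfield,2021,1200
Springfield,2022,1500
Shelbyville,2021,
Shelbyville,2022,900
Ogdenville,2022,n/a
"""


def load_rows(text):
    """Gather: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_rows(rows):
    """Preprocess: drop rows whose trip count is missing or not a number."""
    cleaned = []
    for row in rows:
        value = row["bike_trips"].strip()
        if value.isdigit():
            cleaned.append(
                {
                    "city": row["city"],
                    "year": int(row["year"]),
                    "bike_trips": int(value),
                }
            )
    return cleaned


def filter_year(rows, year):
    """Filter: keep only the rows for one year."""
    return [row for row in rows if row["year"] == year]


def text_bar_chart(rows, width=30):
    """Visualize: a crude text bar chart, scaled to the largest value."""
    top = max(row["bike_trips"] for row in rows)
    lines = []
    for row in rows:
        bar = "#" * round(width * row["bike_trips"] / top)
        lines.append(f"{row['city']:<12} {bar} {row['bike_trips']}")
    return "\n".join(lines)


rows = clean_rows(load_rows(RAW_CSV))
rows_2022 = filter_year(rows, 2022)
print(text_bar_chart(rows_2022))
print("Mean trips in 2022:", mean(row["bike_trips"] for row in rows_2022))
```

In a real project, each step would be accompanied by a short explanation of why you did it (for example, why rows with missing values were dropped rather than filled in), and the text chart would be replaced by a proper plot.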
Examples of public data sources (search for the names online):
- data.gov
- NYC Open Data, OpenData DC, DataLA
- Yelp data
- UN data
- Twitter data
- Rdatasets
- pythonapi
- Quandl
- US Census
- County Health Data
- City Portals