Exercise detail
For our first project, we're going to take a look at SAT scores around the United States. We'll be exploring this data to see what we can learn using the descriptive statistics skills covered this week. Your client, the College Board, is expecting some pretty graphs to add to their presentations this year, so don't let them down!
Goal: A Jupyter notebook that describes your data with visualizations & statistical analysis.
Goal: A five to seven minute presentation targeted to your hypothetical client that highlights your findings.
Requirements
Your work must:
Describe your data
Perform methods of exploratory data analysis, including:
Use Matplotlib to create visualizations
Use NumPy to compute basic summary statistics: mean, median, mode (NumPy has no built-in mode function; derive it from value counts or use `scipy.stats.mode`)
Determine if the dataset appears to follow a normal distribution
Bonus:
Recreate all of your Matplotlib graphs in Seaborn!
Use Tableau (public) to create visualizations!
Create a blog post of at least 500 words (and 1-2 graphics!) describing your data, analysis, and approach. Link to it in your Jupyter notebook.
Using existing features, engineer a new feature
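One of the requirements above is to determine whether the data appears to follow a normal distribution. A rough, hedged way to check this numerically is to compare the mean with the median and to compute skewness and excess kurtosis, both of which are close to zero for normally distributed data. The sketch below uses the Rate column from the data table (excluding the `All` row); the variable names are illustrative, and a histogram should still be inspected alongside these numbers.

```python
import numpy as np

# Rate column for the 50 states and DC, copied from the data table ("All" row excluded).
rate = np.array([
    82, 81, 79, 77, 72, 71, 71, 69, 69, 68,
    67, 65, 65, 63, 60, 57, 56, 55, 54, 53,
    53, 52, 51, 51, 34, 33, 31, 26, 23, 18,
    17, 13, 13, 12, 12, 11, 11, 9, 9, 9,
    8, 8, 8, 7, 6, 6, 5, 5, 4, 4,
    4,
])

mean_rate = rate.mean()
median_rate = np.median(rate)

# Central moments (population form, dividing by n).
centered = rate - mean_rate
m2 = np.mean(centered ** 2)
m3 = np.mean(centered ** 3)
m4 = np.mean(centered ** 4)

skewness = m3 / m2 ** 1.5          # about 0 for a symmetric distribution
excess_kurtosis = m4 / m2 ** 2 - 3  # about 0 for a normal distribution

print("mean:", mean_rate, "median:", median_rate)
print("skewness:", skewness, "excess kurtosis:", excess_kurtosis)
```

A mean noticeably above the median together with positive skewness suggests a right-skewed column rather than a normal one.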
Necessary Deliverables / Submission
Materials must be submitted in a clearly commented Jupyter notebook.
Notebook must be submitted via a GitHub pull request to the instructor's repo (the same way you submit labs).
Presentation must be submitted via Slack (for a PowerPoint file) or shared via a Google Slides link.
Materials must be submitted by 9:00 AM on Friday, June 30.
Starter code
For this project we will be using a Jupyter notebook. This notebook will use matplotlib for plotting and visualizing our data. This type of visualization is handy for prototyping and quick data analysis. We will discuss more advanced data visualizations for disseminating your work.
Open the starter code instructions in a Jupyter notebook.
Dataset
Dataset: SAT Scores
This data, taken from the College Board, gives the mean SAT math and verbal scores and the participation rate for each state and the District of Columbia for the year 2001.
Suggested Ways to Get Started
Read in your dataset.
Try out a few NumPy commands to describe your data.
Write pseudocode before you write actual code. Thinking through the logic of something helps.
Read the docs for whatever technologies you use. Most of the time, there is a tutorial that you can follow, but not always, and learning to read documentation is crucial to your success!
Document everything.
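As a starting point for trying out NumPy commands, here is a minimal sketch that computes the mean, median, and mode of a handful of Rate values taken from the data table. NumPy does not provide a mode function, so the mode is derived from `np.unique` counts; the variable names are illustrative only.

```python
import numpy as np

# Rate values for CT, NJ, MA, NY, NH, RI, and PA, copied from the data table.
rate = np.array([82, 81, 79, 77, 72, 71, 71])

mean_rate = np.mean(rate)
median_rate = np.median(rate)

# Count how often each distinct value occurs and take the most frequent one.
# If several values tie, argmax returns the smallest of them.
values, counts = np.unique(rate, return_counts=True)
mode_rate = values[np.argmax(counts)]

print("mean:", mean_rate)
print("median:", median_rate)
print("mode:", mode_rate)
```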
Useful Resources
How to find the data you need
How to give a good lightning talk
Presentation Structure
5-7 minutes long.
Use PowerPoint or some other visual aid.
Consider the audience. Assume you are presenting to non-technical executives with the College Board (the organization that administers the SATs).
Start with the guiding question/big idea.
Talk about your procedure/methodology (high level, no need to show code unless you found a useful method to share).
Talk about your findings/answers to prompts (include visuals).
Conclude - highlight any next steps, further questions, what you would do with more time, additional data that would be useful.
Be sure to rehearse and time your presentation before class.
Project Feedback + Evaluation
Your instructors will score you using the scale below:
Score | Expectations
----- | ------------
**0** | _Incomplete._
**1** | _Does not meet expectations._
**2** | _Meets expectations, good job!_
**3** | _Exceeds expectations, you wonderful creature, you!_
This will serve as a helpful overall gauge of whether you met the project goals!
STEP 1 STARTER CODE INSTRUCTIONS
Step 1: Open the sat_scores.csv file. Investigate the data, and answer the questions below.
1. What does the data describe?
In [ ]:
## your answer here
2. Does the data look complete? Are there any obvious issues with the observations?
In [ ]:
## your answer here
3. Describe in words what each variable (column) is.
In [ ]:
## your answer here
Step 2: Load the data.
4. Load the data into a list of lists
In [ ]:
5. Print the data
In [ ]:
6. Extract a list of the labels from the data, and remove them from the data.
In [ ]:
7. Create a list of State names extracted from the data. (Hint: use the list of labels to index on the State column)
In [ ]:
8. Print the types of each column
In [ ]:
9. Do any types need to be reassigned? If so, go ahead and do it.
In [ ]:
10. Create a dictionary for each column mapping the State to its respective value for that column.
In [ ]:
11. Create a dictionary with the values for each of the numeric columns
In [ ]:
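The loading steps above could be approached with the standard library's `csv` module. This is one possible sketch, not the expected solution. It assumes the file is comma-separated with a `State,Rate,Verbal,Math` header, and it reads from an inline sample (the first three rows of the data) instead of opening `sat_scores.csv`.

```python
import csv
import io

# Stand-in for open("sat_scores.csv"): header plus the first three data rows.
# The comma delimiter is an assumption about the file's layout.
sample = io.StringIO(
    "State,Rate,Verbal,Math\n"
    "CT,82,509,510\n"
    "NJ,81,499,513\n"
    "MA,79,511,515\n"
)

rows = [row for row in csv.reader(sample)]  # list of lists; every cell is a string
labels, data = rows[0], rows[1:]            # separate the header from the observations

state_idx = labels.index("State")
states = [row[state_idx] for row in data]

# Reassign types: every column except State becomes an int.
data = [
    [cell if i == state_idx else int(cell) for i, cell in enumerate(row)]
    for row in data
]

# One dictionary per column, mapping each state to its value in that column.
by_column = {
    label: {row[state_idx]: row[i] for row in data}
    for i, label in enumerate(labels)
    if i != state_idx
}

# The numeric columns as plain lists, keyed by label.
numeric = {
    label: [row[i] for row in data]
    for i, label in enumerate(labels)
    if i != state_idx
}

print(labels)
print(states)
print(by_column["Math"])
print(numeric["Rate"])
```

To run this against the real file, replace `sample` with a file handle opened via `open("sat_scores.csv", newline="")`.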
Step 3: Describe the data
12. Print the min and max of each column
In [ ]:
13. Write a function using only list comprehensions, no loops, to compute Standard Deviation. Print the Standard Deviation of each numeric column.
In [ ]:
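For the standard deviation exercise, a minimal sketch using only comprehensions might look like the following. It computes the population standard deviation (dividing by n, which matches the default of `np.std`); dividing by n - 1 instead would give the sample standard deviation. The `numeric` dictionary is a hypothetical stand-in holding the first five rows of the data.

```python
import math

def std_dev(values):
    """Population standard deviation, computed with a list comprehension."""
    mean = sum(values) / len(values)
    squared_diffs = [(v - mean) ** 2 for v in values]
    return math.sqrt(sum(squared_diffs) / len(values))

# First five rows of the data table (CT, NJ, MA, NY, NH).
numeric = {
    "Rate": [82, 81, 79, 77, 72],
    "Verbal": [509, 499, 511, 495, 520],
    "Math": [510, 513, 515, 505, 516],
}

# A dict comprehension applies the function to every numeric column.
std_by_column = {label: std_dev(column) for label, column in numeric.items()}
print(std_by_column)
```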
Step 4: Visualize the data
14. Using Matplotlib's pyplot module, plot the distribution of Rate as a histogram.
In [ ]:
15. Plot the Math(s) distribution
In [ ]:
16. Plot the Verbal distribution
In [ ]:
17. What is the typical assumption for data distribution?
In [ ]:
18. Does that distribution hold true for our data?
In [ ]:
19. Plot some scatterplots. BONUS: Use a pyplot figure to present multiple plots at once.
In [ ]:
20. Are there any interesting relationships to note?
In [ ]:
21. Create box plots for each variable.
In [ ]:
BONUS: Using Tableau, create a heat map for each variable using a map of the US.
In [ ]:
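The plotting steps above can be combined in a single pyplot figure. The sketch below is one way to lay out a histogram, a scatterplot, and box plots side by side. It uses only the first ten rows of the data, and the titles, bin count, and figure size are arbitrary choices. The `Agg` backend line is only there so the code runs without a display and can be dropped inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; not needed inside a Jupyter notebook
import matplotlib.pyplot as plt

# First ten rows of the data table (CT through VA).
rate = [82, 81, 79, 77, 72, 71, 71, 69, 69, 68]
verbal = [509, 499, 511, 495, 520, 501, 500, 511, 506, 510]
math_scores = [510, 513, 515, 505, 516, 499, 499, 506, 500, 501]

# One figure holding three plots side by side.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram of the participation rate.
axes[0].hist(rate, bins=5)
axes[0].set_title("Rate distribution")
axes[0].set_xlabel("Rate")
axes[0].set_ylabel("Number of states")

# Scatterplot of Verbal against Math.
axes[1].scatter(verbal, math_scores)
axes[1].set_title("Verbal vs. Math")
axes[1].set_xlabel("Verbal")
axes[1].set_ylabel("Math")

# Box plots for the two score columns.
axes[2].boxplot([verbal, math_scores])
axes[2].set_xticks([1, 2])
axes[2].set_xticklabels(["Verbal", "Math"])
axes[2].set_title("Score box plots")

fig.tight_layout()
```

In a notebook the figure renders inline; elsewhere, `fig.savefig(...)` writes it to a file.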
DATA
State Rate Verbal Math
CT 82 509 510
NJ 81 499 513
MA 79 511 515
NY 77 495 505
NH 72 520 516
RI 71 501 499
PA 71 500 499
VT 69 511 506
ME 69 506 500
VA 68 510 501
DE 67 501 499
MD 65 508 510
NC 65 493 499
GA 63 491 489
IN 60 499 501
SC 57 486 488
DC 56 482 474
OR 55 526 526
FL 54 498 499
WA 53 527 527
TX 53 493 499
HI 52 485 515
AK 51 514 510
CA 51 498 517
AZ 34 523 525
NV 33 509 515
CO 31 539 542
OH 26 534 439
MT 23 539 539
WV 18 527 512
ID 17 543 542
TN 13 562 553
NM 13 551 542
IL 12 576 589
KY 12 550 550
WY 11 547 545
MI 11 561 572
MN 9 580 589
KS 9 577 580
AL 9 559 554
NE 8 562 568
OK 8 567 561
MO 8 577 577
LA 7 564 562
WI 6 584 596
AR 6 562 550
UT 5 575 570
IA 5 593 603
SD 4 577 582
ND 4 592 599
MS 4 566 551
All 45 506 514