DSC 341 Foundations of Data Science Assignment, DePaul University, College of Computing and Digital Media
Problem 1
Answer the following questions and provide details.
1. What is supervised and unsupervised learning? How do they compare?
2. Find at least two real applications of data mining/machine learning from companies, universities, etc. Provide brief information on what they are and how they are useful, and provide links to the applications. NOTE: Each should be a tangible product or application, not just a link to the website of a company making general claims about data mining.
Use the following table to answer Problem 2 and Problem 3. "jogging" is the outcome variable. "weather" and "temperature" are the explanatory variables.
weather | temperature | jogging
sunny | warm | yes
sunny | warm | yes
rainy | cold | no
rainy | warm | yes
rainy | cold | no
rainy | cold | no
sunny | hot | no
rainy | warm | no
rainy | hot | no
sunny | cold | no
sunny | cold | no
sunny | cold | yes
sunny | warm | yes
sunny | hot | no
sunny | hot | yes
rainy | hot | yes
rainy | hot | no
rainy | cold | yes
sunny | cold | yes
sunny | cold | no
sunny | hot | yes
sunny | warm | no
sunny | warm | no
sunny | warm | yes
sunny | hot | yes
Problem 2
Answer the following question using the table given.
1. How does the decision tree algorithm work? Define it in your own words. What is the goal at each split? How can you define impurity?
2. What are the formulae for entropy and the gini index?
3. What is the gini index at the initial state?
4. What is the entropy at the initial state?
5. What is the information gain splitting on temperature as your first split using gini index? Split into each group separately (split into warm, cold, and hot)
6. What is the information gain splitting on weather as your first split using gini index? Split into each group separately.
7. Based on questions 5 and 6, which variable would be chosen for the first split using gini index, weather or temperature?
8. What is the information gain splitting on temperature as your first split using entropy measure? Split into each group separately (split into warm, cold, and hot)
9. What is the information gain splitting on weather as your first split using entropy measure? Split into each group separately.
10. Based on questions 8 and 9, which variable would be chosen for the first split using entropy measure, weather or temperature?
A decision tree was trained on the data given in the table and the following was obtained. Answer the following questions based on the diagram.
11. Based on the decision tree diagram given, what is the most important variable based on the decision tree? Why?
12. Based on the decision tree diagram given, which class would the following observations belong to? Explain how you would solve this question for a and b. The rest you can just give the answer.
a. temperature='warm', weather='sunny'
b. temperature='hot', weather='rainy'
c. temperature='warm', weather='rainy'
d. temperature='cold', weather='sunny'
e. temperature='hot', weather='sunny'
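The impurity questions in Problem 2 can be checked numerically. The following is a minimal Python sketch, assuming the usual definitions (gini index = 1 minus the sum of squared class proportions; entropy = minus the sum of p * log2(p)) and taking "information gain" to mean the parent impurity minus the size-weighted average impurity of the groups produced by the split. The names TABLE, ROWS, gini, entropy, and gain are illustrative and not part of the assignment.

```python
import math
from collections import Counter

# The 25 rows of the jogging table: weather, temperature, jogging.
TABLE = """
sunny warm yes
sunny warm yes
rainy cold no
rainy warm yes
rainy cold no
rainy cold no
sunny hot no
rainy warm no
rainy hot no
sunny cold no
sunny cold no
sunny cold yes
sunny warm yes
sunny hot no
sunny hot yes
rainy hot yes
rainy hot no
rainy cold yes
sunny cold yes
sunny cold no
sunny hot yes
sunny warm no
sunny warm no
sunny warm yes
sunny hot yes
"""
ROWS = [tuple(line.split()) for line in TABLE.split("\n") if line.strip()]


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p) over the classes present."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain(rows, column, impurity):
    """Parent impurity minus the size-weighted impurity of each group.

    column is 0 for weather and 1 for temperature; the last field of
    each row is the outcome (jogging).
    """
    parent = impurity([r[-1] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[column], []).append(r[-1])
    children = sum(len(g) / len(rows) * impurity(g) for g in groups.values())
    return parent - children


if __name__ == "__main__":
    for name, col in (("weather", 0), ("temperature", 1)):
        print(name, round(gain(ROWS, col, gini), 4), round(gain(ROWS, col, entropy), 4))
```

Under these assumptions, the variable with the larger gain is the one a greedy tree would pick for the first split. Other definitions of gain (for example, gain ratio) would need a different gain function.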
Problem 3
Solve Problem 3 using the table given above. You will be using Naïve Bayes algorithm for this question. Show your calculations and write down the related formulae for each question. "jogging" is the outcome variable.
1. What is P(jogging='yes')?
2. What is P(jogging='no')?
3. What is P(weather='sunny')?
4. What is P(temperature='warm')?
5. What is P(weather='sunny' | jogging='yes')?
6. What is P(weather='rainy' | jogging='no')?
7. What is P(jogging='yes' | weather='sunny')?
8. What is P(jogging='no' | temperature='cold')?
9. What is P(jogging='yes' | temperature='warm')?
10. What is P(jogging='no' | temperature='warm')?
11. Which class would the following observation belong to? (weather='sunny', temperature='cold')
12. Which class would the following observation belong to? (weather='sunny', temperature='warm')
13. Which class would the following observation belong to? (weather='rainy', temperature='cold')
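The Naïve Bayes quantities in Problem 3 can be estimated by counting rows of the table. The sketch below is one way to do that in Python, assuming plain relative frequencies with no smoothing, Bayes' theorem for the single-feature posteriors, and the naive independence assumption for classification. The helper names prior, likelihood, posterior, and classify are hypothetical, and exact fractions are used so the hand calculations can be compared directly.

```python
from collections import Counter
from fractions import Fraction

# The 25 rows of the jogging table: weather, temperature, jogging.
TABLE = """
sunny warm yes
sunny warm yes
rainy cold no
rainy warm yes
rainy cold no
rainy cold no
sunny hot no
rainy warm no
rainy hot no
sunny cold no
sunny cold no
sunny cold yes
sunny warm yes
sunny hot no
sunny hot yes
rainy hot yes
rainy hot no
rainy cold yes
sunny cold yes
sunny cold no
sunny hot yes
sunny warm no
sunny warm no
sunny warm yes
sunny hot yes
"""
ROWS = [tuple(line.split()) for line in TABLE.split("\n") if line.strip()]
CLASS_COUNTS = Counter(r[2] for r in ROWS)


def prior(label):
    """P(jogging = label), as a relative frequency."""
    return Fraction(CLASS_COUNTS[label], len(ROWS))


def likelihood(column, value, label):
    """P(feature = value | jogging = label); column 0 is weather, 1 is temperature."""
    matches = sum(1 for r in ROWS if r[column] == value and r[2] == label)
    return Fraction(matches, CLASS_COUNTS[label])


def posterior(column, value, label):
    """P(jogging = label | feature = value) via Bayes' theorem."""
    evidence = Fraction(sum(1 for r in ROWS if r[column] == value), len(ROWS))
    return likelihood(column, value, label) * prior(label) / evidence


def classify(weather, temperature):
    """Return the class with the largest unnormalized score, plus all scores."""
    scores = {
        label: prior(label)
        * likelihood(0, weather, label)
        * likelihood(1, temperature, label)
        for label in CLASS_COUNTS
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    print(classify("sunny", "cold"))
```

The scores returned by classify are not normalized probabilities; only their ordering matters for choosing the class. With smoothing (for example, Laplace), the estimates would differ slightly.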
Problem 4
1. Explain how the K-nearest neighbors (KNN) algorithm works.
2. What do training and testing consist of in KNN?
3. How does the number K affect the behavior of the model? What happens when K is too large, and what happens when K is too small?
4. According to the figure given above, what will the blue point be classified as when the distance measure used is Manhattan distance (L1 distance), for each of the following K values?
a. K=1
b. K=7
c. K=13
5. According to the figure given above, what will the blue point be classified as when the distance measure used is Euclidean distance, for each of the following K values?
a. K=1
b. K=7
c. K=13
6. According to the figure given above, what will the blue point be classified as when the distance measure used is weighted Euclidean distance, for each of the following K values? The weight is w=1/d, where d is the Euclidean distance between the points.
a. K=1
b. K=7
c. K=13
7. Compare questions 4, 5, and 6. Do any classification results change?
8. Can KNN be used for regression? If so, how does the algorithm work for regression?
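The voting schemes compared in Problem 4 can be sketched in a few lines of Python. The figure is not reproduced here, so the training points, the class labels "A" and "B", and the query point below are hypothetical and serve only to show how Manhattan, Euclidean, and 1/d-weighted voting can disagree. The function name knn_classify and its parameters are also illustrative.

```python
import math
from collections import defaultdict

# Hypothetical labeled points (NOT the points from the assignment figure).
TRAIN = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((4, 4), "B"), ((5, 4), "B"), ((4, 5), "B"), ((5, 5), "B"),
]


def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """L2 distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train, query, k, distance=euclidean, weighted=False):
    """Classify query by a vote among its k nearest training points.

    With weighted=True each neighbor votes with weight 1/d; a neighbor
    at distance 0 decides the class outright to avoid dividing by zero.
    Vote ties are broken in favor of the class seen first.
    """
    neighbors = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = distance(point, query)
        if weighted and d == 0:
            return label
        votes[label] += 1.0 / d if weighted else 1.0
    return max(votes, key=votes.get)


if __name__ == "__main__":
    query = (2, 2)
    for k in (1, 7):
        print(k,
              knn_classify(TRAIN, query, k, distance=manhattan),
              knn_classify(TRAIN, query, k),
              knn_classify(TRAIN, query, k, weighted=True))
```

In this toy data, taking all seven points as neighbors lets the larger but more distant class win an unweighted vote, while 1/d weighting favors the closer class. For regression, the same neighbor search would typically be followed by averaging the neighbors' numeric targets instead of voting.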
Problem 5
1. Is k-means a supervised or unsupervised algorithm? Explain.
2. Explain how the k-means algorithm works in your own words.
3. Given the figure below, which is the scatter plot of the dataset with two variables x1 and x2, perform the k-means algorithm with K=2 for 2 iterations. Use the mean of each cluster as its centroid. The initial centroids are (2,3) for Cluster 1 and (3,2) for Cluster 2. In case of a tie, assign the point to Cluster 1. Show your steps. At the end of two iterations, draw the scatter plot with the two clusters in different colors and mark the centroids of the clusters.
x1 | x2
3 | 1
3 | 2
3 | 3
3 | 5
2 | 2
2 | 3
4 | 4
4 | 1
4 | 2
5 | 1
5 | 3
5 | 4
5 | 5
6 | 2
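The two k-means iterations in Problem 5 can be traced with a short Python sketch. It uses the 14 points and the initial centroids given above, sends ties to Cluster 1 as instructed, and assumes Euclidean distance (the assignment does not name a distance measure). Squared distances and exact fractions are used so that ties are detected exactly. The function names are illustrative.

```python
from fractions import Fraction

# The 14 (x1, x2) points from the table.
POINTS = [(3, 1), (3, 2), (3, 3), (3, 5), (2, 2), (2, 3), (4, 4),
          (4, 1), (4, 2), (5, 1), (5, 3), (5, 4), (5, 5), (6, 2)]


def sq_dist(p, c):
    """Squared Euclidean distance; ordering matches Euclidean distance."""
    return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2


def assign(points, c1, c2):
    """Split points between the two centroids; ties go to Cluster 1."""
    cluster1 = [p for p in points if sq_dist(p, c1) <= sq_dist(p, c2)]
    cluster2 = [p for p in points if sq_dist(p, c1) > sq_dist(p, c2)]
    return cluster1, cluster2


def mean(cluster):
    """Centroid as the exact coordinate-wise mean (fails on an empty cluster)."""
    n = len(cluster)
    return (Fraction(sum(p[0] for p in cluster), n),
            Fraction(sum(p[1] for p in cluster), n))


def kmeans(points, c1, c2, iterations=2):
    """Run a fixed number of assign-then-update iterations for K=2."""
    for step in range(1, iterations + 1):
        cluster1, cluster2 = assign(points, c1, c2)
        c1, c2 = mean(cluster1), mean(cluster2)
        print(f"iteration {step}: c1={tuple(map(float, c1))}, c2={tuple(map(float, c2))}")
    return (c1, cluster1), (c2, cluster2)


if __name__ == "__main__":
    kmeans(POINTS, (2, 3), (3, 2))
```

The returned clusters and centroids could then be passed to a plotting library to draw the colored scatter plot; plotting is left out to keep the sketch dependency-free.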
Format your assignment according to the following formatting requirements:
1. The answer should be typed, double spaced, using Times New Roman font (size 12), with one-inch margins on all sides.
2. The response should also include a cover page containing the title of the assignment, the student's name, the course title, and the date. The cover page is not included in the required page length.
3. Also include a reference page. Citations and references should follow APA format. The reference page is not included in the required page length.