Assignment: K-Means Data Analytics
Problem I
Distance is a key notion underlying many data mining algorithms, such as k-nearest neighbor (k-NN). Why can it be a problem to compare customers using regular Euclidean distance when, for example, they are described by age (in years), income (in dollars), and number of credit cards? How can this problem be fixed?
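For reference, the following is a minimal sketch of how the Euclidean distance between two customers would be computed on the three attributes named above. The customer values are hypothetical and made up only for illustration, and the function name euclidean_distance is not part of the assignment. Printing the per-attribute squared differences may help when reasoning about the question.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical customers: (age in years, income in dollars, number of credit cards)
customer_a = (25, 50_000, 2)
customer_b = (60, 51_000, 8)

attributes = ("age", "income", "credit cards")
# Each attribute's contribution to the sum under the square root.
squared_terms = [(x - y) ** 2 for x, y in zip(customer_a, customer_b)]

for name, term in zip(attributes, squared_terms):
    print(f"{name}: squared difference = {term}")
print(f"distance = {euclidean_distance(customer_a, customer_b):.2f}")
```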
Problem II
You currently work for Aperture Science, a small company that sells information technology (IT) products. The lone data scientist at Aperture approaches you one day and proposes to use k-NN estimation to build a model to predict the IT budget of companies to identify potential new clients. They would like your help building and deploying the model.
The only data you have on hand is a sample of companies across the United States, which includes their IT budget for last year, their total revenue last year, their total number of employees last year, and their industry classification. This data will make up your database of potential neighbors. Ultimately, as a first true test of the model, you want to predict the IT budget for Acme Corp., a potential client whose IT budget you do not know (but whose total revenue, number of employees, and industry classification you do know).
i. Given the information above, explain how you could estimate Acme's IT budget using k-NN.
ii. If you chose k=N, the total number of training examples, what would be the effect?
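As a point of reference for the questions above, here is a minimal sketch of k-NN estimation: find the k records closest to the query and average their target values. It is only an illustration under stated assumptions, not a prescribed solution. The records, the feature values, and the names knn_estimate, companies, and acme are all hypothetical; the features are assumed to be numeric and already on comparable scales, and industry classification is left out of the sketch.

```python
import math

def knn_estimate(database, query, k):
    """Average the target of the k records closest to `query`.

    `database` is a list of (features, target) pairs; `query` is a feature
    tuple of the same length. Features are assumed to be numeric and already
    on comparable scales.
    """
    if not 1 <= k <= len(database):
        raise ValueError("k must be between 1 and the number of records")
    # Sort records by Euclidean distance to the query, nearest first.
    by_distance = sorted(database, key=lambda rec: math.dist(rec[0], query))
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / k

# Hypothetical, pre-scaled records: ((revenue, employees), IT budget in dollars)
companies = [
    ((0.10, 0.20), 100_000),
    ((0.15, 0.25), 120_000),
    ((0.80, 0.70), 900_000),
    ((0.85, 0.90), 1_100_000),
]
acme = (0.12, 0.22)  # hypothetical scaled features for the company to estimate

print(knn_estimate(companies, acme, k=2))
```

Calling the function with different values of k on the same data is one way to explore how the choice of k changes the estimate.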
Problem III
After seeing a presentation on the power of data mining and hearing about your past consulting work, you are approached to create a model for a university's admissions office. The members of the admissions review board just heard about a technique called "clustering", and they think it would be a good idea to try this technique on the newest batch of applying students for the upcoming academic year. By sorting applicants into groups, they believe it will be easier to then make application and recruiting decisions.
For confidentiality reasons, you are provided with only a few select attributes for each of the university's roughly 4,000 undergraduate applicants.
Attributes
i. hsgpa - the applicant's high school GPA (out of 4)
ii. sat - the applicant's SAT score (out of 1600)
iii. hsize - the size of the applicant's graduating class
iv. athlete - whether the student participated in high school athletics for at least 1 year
Task
i. Create a cluster model. Use the default k-means algorithm and set the number of clusters (k) to 3. Here is the model URL (you don't have to create it)
In about 5-8 sentences, briefly describe each of the 3 clusters and explain any overarching takeaways about the groups of applicants this university receives.
ii. Based on your findings in part i, what type of supervised learning could help further explore the data and these clusters? Do you believe this would provide any meaningful information, and do you have any concerns about a university making application decisions using these types of models (supervised or unsupervised)? Explain why or why not.
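To make the clustering step in Task i concrete, the following is a minimal sketch of a plain Lloyd-style k-means with k = 3. It is not the tool or the default settings used to build the provided model, whose initialization and scaling may differ. The applicant rows are hypothetical, athlete is assumed to be encoded as 0 or 1, and the names min_max_scale and kmeans are illustrative only.

```python
import math
import random

def min_max_scale(rows):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd-style k-means on a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k randomly chosen points as starting centroids
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the algorithm has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments

# Hypothetical applicants: (hsgpa, sat, hsize, athlete as 0/1)
applicants = [
    (3.9, 1500, 400, 0),
    (3.8, 1450, 350, 0),
    (3.0, 1100, 120, 1),
    (2.9, 1050, 100, 1),
    (3.4, 1250, 600, 0),
    (3.3, 1200, 650, 1),
]

scaled = min_max_scale(applicants)
centroids, assignments = kmeans(scaled, k=3)
print("cluster of each applicant:", assignments)
print("centroids (scaled units):", centroids)
```

Because the starting centroids are chosen at random, a different seed can produce a different grouping, which is worth keeping in mind when describing clusters.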
Format your assignment according to the following formatting requirements:
i) The answer should be typed, using Times New Roman font (size 12), double spaced, with one-inch margins on all sides.
ii) The response should also include a cover page containing the title of the assignment, the student's name, the course title, and the date. The cover page is not included in the required page length.
iii) Also include a reference page. Citations and references must follow APA format. The reference page is not included in the required page length.