Discussion Post I
This Discussion Board is designed to familiarize you with some of the programming features of R as well as to help you explore and understand why R is useful when analyzing big data sets.
Complete the course readings, and focus your discussion on the following:
i. Compare the statistical features of R to its programming features.
ii. In a table format, describe the programming features available in R.
a. Explain how they are useful in analyzing big data sets.
iii. Describe how the analytics of R are suited for big data.
Discussion Post II
Data mining and analytics are used to test hypotheses and detect trends in very large data sets. In statistics, significance depends in part on the sample size.
Focus your discussion on the following:
i. How can supervised learning be used in such large data sets to overcome the problem where everything is significant with statistical analysis?
ii. Discuss the importance of a clear purpose in supervised learning and the use of random sampling.
iii. Discuss the issue of large data sets and their relationships being significant but explaining little variance.
iv. Do large data sets lend themselves better to prediction than smaller data sets?
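
As a starting point for items i and iii, the following minimal sketch shows how a very weak correlation becomes statistically significant once the sample is large enough, while still explaining almost no variance. It is written in Python for illustration only (the course work itself is done in R). The correlation of 0.02 and the sample sizes are hypothetical values, and the p-value uses a normal approximation to the t distribution.

```python
import math

def correlation_test(r, n):
    """Return (t statistic, approximate two-sided p-value, share of variance explained)."""
    # Standard test statistic for a Pearson correlation r from n observations.
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    # Normal approximation to the t distribution; reasonable when n is large.
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return t, p, r * r

# Hypothetical example: the same tiny correlation at three sample sizes.
for n in (100, 10_000, 1_000_000):
    t, p, r2 = correlation_test(0.02, n)
    print(f"n={n:>9,}  t={t:6.2f}  p={p:.3g}  variance explained={r2:.2%}")
```

In this sketch the correlation is far from significant at 100 observations but overwhelmingly significant at one million, even though it explains only 0.04% of the variance in both cases.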
Discussion Post III
This discussion is designed to help you understand regression analysis. Specifically, you will explore the generalized least squares model as well as the procedures in R for linear regression.
After completing the course readings, focus your discussion on the following:
i. Compare the assumptions of the generalized least squares model for regression and correlations.
ii. Cover the issues with transforming variables to make them linear.
iii. Describe the procedures in R for linear regression.
iv. What differences or similarities do you see between your posting and other classmates' postings?
v. What are assumptions of normality, and how are they tested?
Discussion Post IV
Building on last week's exploration of regression analysis, this Discussion Board introduces the locally weighted scatterplot smoothing (LOWESS) method for multiple regression as well as how to run it using R.
After completing the readings for the week, use R and address the following:
i. Is the locally weighted scatterplot smoothing (LOWESS) method, when used for multiple regression in a k-nearest-neighbors-based model, a parametric or nonparametric method?
a. Discuss some of the advantages and disadvantages of LOWESS from a computational standpoint.
ii. Open this GitHub repository.
a. Select a data set from the HTML index (select any data set, but not Old Faithful).
b. Post the name of that data set in the thread so that the other students will know not to select it.
c. If another student has already chosen your selected data set (check the thread), you need to pick another; there are plenty of them. If two of you must have the same data set, you will need to discuss and make sure that each is testing a different model.
iii. For the data set that you selected, examine the variables, and prepare a multiple regression in R.
iv. Show the code and the output, and discuss your results.
v. How does LOWESS compare to the other procedures used in this course?
vi. How important is parametric data for LOWESS?
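
To make the idea behind LOWESS concrete, the following minimal sketch fits a weighted straight line around each target point, using tricube weights over the nearest neighbors. It is written in Python for illustration only (the assignment itself calls for R), handles a single predictor rather than a multiple regression, and omits the robustness iterations of the full method. The function name `lowess_point` and the `frac` parameter are hypothetical choices made for this example.

```python
import math

def lowess_point(x0, xs, ys, frac=0.5):
    """Fitted value at x0 from a locally weighted straight-line fit."""
    n = len(xs)
    k = max(2, math.ceil(frac * n))              # number of neighbors in the window
    dists = sorted(abs(x - x0) for x in xs)
    # Bandwidth: distance to the k-th nearest neighbor, inflated slightly so
    # that neighbor keeps a tiny positive weight and the weights never all vanish.
    h = dists[k - 1] * (1 + 1e-9) + 1e-12
    # Tricube weights: near 1 close to x0, falling to 0 at the edge of the window.
    w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    # Evaluate the weighted least-squares line at x0.
    return my + slope * (x0 - mx)

# Hypothetical data: a noiseless curve sampled at 20 points.
xs = [i / 2 for i in range(20)]
ys = [math.sin(x) for x in xs]
smoothed = [lowess_point(x, xs, ys, frac=0.3) for x in xs]
```

No global equation is estimated here; every fitted value requires its own neighbor search and its own small regression. In this sketch each point sorts all n distances, so smoothing the whole data set costs on the order of n squared times log n operations, which is relevant to the computational question in item i.a.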
Discussion Post V
The topic for this Discussion Board is statistical inference in logistic regression.
After completing the required readings for the week, focus your discussion on the following:
i. Discuss the assumptions that must be met for logistic regression and assumptions for regular regression that do not apply in logistic regression.
ii. Discuss the types of variables that are used in logistic regression and regular regression.
iii. Describe the nature of the variables used in logistic regression.
iv. Compare fitting a model in regression with a continuous dependent variable to determining fit in logistic regression.
Discussion Post VI
This Discussion Board further examines logistic regression. Here, you will discuss how the independent variable (IV) is interpreted in logistic regression and whether logistic regression can be used to predict categorical outcomes.
Complete the assigned readings for the week as well as any additional research to gather the information that you need to discuss the following:
i. In assessing the predictive power of the categorical predictors of a binary outcome, should logistic regression be used? In other words, how is the logistic function used to predict categorical outcomes?
ii. Describe the predictive power of categorical variables on binary outcomes.
iii. Describe the dependent variable in logistic regression.
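
The following minimal sketch illustrates how the logistic function turns a categorical predictor into a predicted probability for a binary outcome. It is written in Python for illustration only, and the counts are hypothetical. It relies on the fact that, with a single two-level predictor, the fitted intercept equals the log odds in the reference group and the fitted coefficient equals the log odds ratio between the groups.

```python
import math

def logistic(z):
    """Map a log-odds value z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(events, total):
    """Log odds of the outcome given a count of events out of a total."""
    return math.log(events / (total - events))

# Hypothetical counts: the outcome occurred 30 times out of 100 in group A
# (the reference group) and 60 times out of 100 in group B.
intercept = log_odds(30, 100)
coefficient = log_odds(60, 100) - intercept   # log odds ratio, B versus A

odds_ratio = math.exp(coefficient)
p_group_a = logistic(intercept)
p_group_b = logistic(intercept + coefficient)
print(f"odds ratio={odds_ratio:.2f}  P(A)={p_group_a:.2f}  P(B)={p_group_b:.2f}")
```

The dependent variable stays binary throughout; what the model predicts is the probability of one of its two values, and the coefficient is read as a change in log odds rather than a change in the outcome itself.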
Discussion Post VII
In this Discussion Board, you will use your research skills to expand your knowledge of Bayesian analysis for social media.
Utilizing the CTU Library, locate an article on the use of Bayesian analysis for social media. After reading the article, address the following:
i. Provide examples of how Bayesian analysis can be used in the context of social media data.
ii. Search for working papers or published papers on analyzing social media.
a. Use the CTU Library for sources.
b. You are also free to provide examples from working papers on the subject of Bayesian analysis and social media analysis.
iii. Summarize how the article that you found uses Bayesian analysis in the context of social media analysis.
iv. Bayesian analysis is becoming increasingly popular. Is social media well suited for Bayesian analysis?
Discussion Post VIII
Using the German credit data that you downloaded for your Individual Project in Unit 6, address the issues of lending that result in a default. You must post an analysis and discuss your findings.
The two outcomes are success (defaulting on the loan) and failure (not defaulting). The explanatory variables in the logistic regression are both the type of loan and the borrowing amount. Address the following:
i. For the k-means classification, use three continuous variables: duration, amount, and installment.
ii. Use cross-validation with k = 5 for the nearest neighbor.
iii. What were your biggest challenges in creating this R program using logistic regression and k-means analysis?
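
The following minimal sketch shows the mechanics of cross-validation with a nearest neighbor classifier, assuming that k = 5 in item ii refers to the number of folds. It is written in Python for illustration only (the assignment itself calls for R), uses one nearest neighbor, and runs on synthetic three-variable data rather than the German credit data. All function names are hypothetical.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Shuffle the row indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nearest_neighbor_predict(train_X, train_y, x):
    """Return the label of the training row closest to x (squared Euclidean distance)."""
    best = min(range(len(train_X)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return train_y[best]

def cross_validated_accuracy(X, y, k=5):
    """Hold out each fold in turn, predict it from the remaining rows, and pool the hits."""
    correct = 0
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct += sum(nearest_neighbor_predict(train_X, train_y, X[i]) == y[i]
                       for i in fold)
    return correct / len(X)

# Synthetic stand-in data: two well-separated groups measured on three variables.
rng = random.Random(1)
X = [[rng.gauss(0, 0.5) for _ in range(3)] for _ in range(20)]
X += [[rng.gauss(5, 0.5) for _ in range(3)] for _ in range(20)]
y = [0] * 20 + [1] * 20
print(f"5-fold accuracy: {cross_validated_accuracy(X, y, k=5):.2f}")
```

Because nearest neighbor methods and k-means both depend on distances, variables on very different scales (such as a loan amount and a duration in months) are usually standardized before being used together.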
Discussion Post IX
This Discussion Board expands upon the uses of Bayesian analysis within the context of creating ensembles.
After completing the required readings, address the following in your primary post:
i. Discuss creating ensembles from different methods such as logistic regression, the nearest neighbor method, classification trees, the Bayesian method, or discriminant analysis.
ii. Discuss the use of randomForest to do analysis.
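
One simple way to combine different methods into an ensemble is majority voting, sketched below in Python for illustration only. The three prediction lists are hypothetical outputs from a logistic regression, a nearest neighbor model, and a classification tree on the same four observations. A random forest applies a similar voting idea to many trees grown on bootstrap samples of the data.

```python
from collections import Counter

def majority_vote(model_predictions):
    """Combine per-model label lists into one label per observation by majority vote."""
    combined = []
    for votes in zip(*model_predictions):
        # most_common(1) returns the label with the highest count for this observation.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions from three different methods on four observations.
logistic_preds = [1, 0, 1, 0]
neighbor_preds = [1, 1, 0, 0]
tree_preds = [0, 1, 1, 0]

ensemble = majority_vote([logistic_preds, neighbor_preds, tree_preds])
print(ensemble)
```

Using an odd number of models with a binary outcome, as here, avoids tied votes.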
Discussion Post X
Now that you are nearing the end of this course, take a moment to summarize the best learning points of the course as well as the areas that you feel contributed less to your learning.
Focus your discussion on the following:
i. Now that you are at the end of studying big data analytics using R, what do you view as the most valuable and the least valuable parts of this course?
ii. Read the observations of your classmates, and comment on their observations with respect to the most valuable and least valuable aspects of the course. Do you see common themes in their comments?
Discussion Post XI
Review and reflect on the knowledge that you have gained from this course.
Based on your review and reflection, write 400-600 words on the following:
i. What were the most compelling topics that you learned in this course?
ii. How did participating in discussions help your understanding of the subject matter? Is anything still unclear that could be clarified?
iii. What approaches could have yielded additional valuable information?
The response must include a reference list. Use 12-point Times New Roman font, double spacing, one-inch margins, and APA style for writing and citations.