Question 1
Box office collections of 150 Bollywood movies were analysed using the variables described in Table 1.
Table 1. Data Dictionary
S.No | Variable | Variable Type | Code in SPSS output
1 | Box office Collection (Y) | Numerical (in crores of rupees) | Box Office Collection
2 | Release Time | Categorical with 4 levels | Releasing_Time_Festival Season, Releasing_Time_Holiday Season, Releasing_Time_Long Weekend, Releasing_Time_Normal_Season
3 | Genre | Categorical with 5 levels | Genre_Action (Action), Genre_Drama (Drama), Genre_Romance (Romance), Genre_Comedy (Comedy), Genre_Others (Other-G)
4 | Movie Content | Categorical with 3 levels | Masala (Masala), Sequel (Sequel), Others (Other_C)
5 | Director Category | Categorical with 3 levels | Director_A, Director_B, Director_O
6 | Lead Actor Category | Categorical with 3 levels | Actor_A, Actor_B, Actor_O
7 | Music Director Category | Categorical with 3 levels | Music_Dir_CAT A, Music_Dir_CAT B, Music_Dir_CAT C
8 | Production House Category | Categorical with 3 levels | Prod_House_CAT A, Prod_House_CAT B, Prod_House_CAT C
9 | Item Song | Binary variable | Item_Song (1 implies that the movie has an item song, 0 otherwise)
10 | Budget | Numerical (in crores of rupees) | Budget
11 | YouTube Views | Numerical | YouTube-V
12 | YouTube Likes | Numerical | YouTube-L
13 | YouTube Dislikes | Numerical | YouTube-D
14 | Budget More than 35 crores | Categorical | Budget_35_Cr (1 if the budget is more than 35 crores, 0 otherwise)
A simple linear regression model was developed between Box office collection and budget. SPSS output of the model is shown in Tables 2-3 and Figures 1-2.
Model 1
Y (Box Office Collection) = β0 + β1 × Budget
Table 2. Model Summary
Model | R | R Square | Adjusted R Square | Std. Error of the Estimate
1 | .650a | | | 72.02261
a. Predictors: (Constant), Budget
b. Dependent Variable: Box_Office_Collection
Table 3. Coefficients (a)
Model | Variable | B (Unstandardized) | Std. Error (Unstandardized) | Beta (Standardized) | T | Sig.
1 | (Constant) | -8.354 | 8.535 | | -.979 | .329
1 | Budget | 2.175 | .210 | .650 | 10.381 | .000
a. Dependent Variable: Box_Office_Collection
Figure 1. Normal P-P plot for Model 1
Figure 2. Residual plot for Model 1
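The blank cells in Table 2 can be derived from the values that are printed. The following is a minimal sketch, assuming n = 150 movies (from the problem statement) and a single predictor; the variable names are illustrative and are not part of the SPSS output.

```python
# Sketch: derive R Square, Adjusted R Square and the t statistic for Model 1
# from the values printed in Tables 2 and 3.
# Assumptions: n = 150 (problem statement), k = 1 predictor (Budget).
n = 150      # number of movies
k = 1        # number of predictors
r = 0.650    # R from Table 2

r_square = r ** 2
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)

b_budget, se_budget = 2.175, 0.210   # Table 3, Budget row
t_budget = b_budget / se_budget      # close to the printed 10.381 (rounding)

print(f"R Square          = {r_square:.4f}")
print(f"Adjusted R Square = {adj_r_square:.4f}")
print(f"t (Budget)        = {t_budget:.3f}")
```

The small gap between the recomputed t statistic and the printed one comes from the rounding of B and Std. Error in Table 3.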
Question 1.1
Which of the following statements are correct (more than one may be correct)? Tick (✓) all correct answers or highlight the correct statements in colour.
1. The model explains 42.25% of variation in box office collection.
2. There are outliers in the model.
3. The residuals do not follow a normal distribution.
4. The model cannot be used since R-square is low.
5. Box office collection increases as the budget increases.
Question 1.2
Mr Chellappa, CEO of Oho Productions (OP), claims that the regression model in Table 3 is incorrect since it has a negative constant value. Comment on whether Mr Chellappa is correct in his assessment of the model.
A second model is developed between ln(Box office collection) and movie release time:
Model 2
ln(Y) = β0 + β1 × Release Time Festival Season + β2 × Release Time Long Weekend + β3 × Release Time Normal Season + ε
The regression output for Model 2 is given in Table 4.
Table 4 Coefficients
Variable | B (Unstandardized) | Std. Error (Unstandardized) | Beta (Standardized) | t | Sig.
(Constant) | 2.685 | .396 | | 6.776 | .000
Releasing_Time_Festival_Season | .727 | .568 | .136 | 1.278 | .203
Releasing_Time Long_Weekend | 1.247 | .588 | .221 | 2.122 | .036
Releasing_Time Normal_Season | .147 | .431 | .041 | .340 | .734
a. Dependent Variable: Ln(Box Office Collection)
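Because the dependent variable in Model 2 is ln(Y), each dummy coefficient is a difference on the log scale relative to the omitted level (Holiday Season, which has no dummy in the model). A minimal sketch of how the coefficients in Table 4 could be read, assuming the usual log-linear interpretation in which exp(B) is the multiplicative effect relative to the base level; the dictionary name and the underscore spelling of the keys are illustrative.

```python
import math

# Coefficients and p-values copied from Table 4.
# Holiday Season is the base level because it has no dummy in Model 2.
table4 = {
    "Releasing_Time_Festival_Season": (0.727, 0.203),
    "Releasing_Time_Long_Weekend": (1.247, 0.036),
    "Releasing_Time_Normal_Season": (0.147, 0.734),
}
alpha = 0.05

ratios = {}
for name, (b, p_value) in table4.items():
    # exp(B): multiplicative effect on Y relative to the base level
    ratios[name] = math.exp(b)
    verdict = "significant" if p_value < alpha else "not significant"
    print(f"{name}: exp(B) = {ratios[name]:.3f} ({verdict} at alpha = {alpha})")
```

A coefficient that is not significant at the chosen α gives no statistical evidence that the level differs from the base level, whatever the size of exp(B).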
Question 1.3
What is the average difference in the box office collection when a movie is released during a holiday season (Releasing_Time_holiday_season) versus movies released during normal season (Releasing_Time_Normal_Season)? Use a significance value of 5%.
Question 1.4
Mr Chellappa of Oho Productions claims that movies released during a long weekend (Releasing_Time_Long_Weekend) earn at least 5 crores more than movies released during the normal season (Releasing_Time_Normal_Season). Check whether this claim is true (use α = 0.05).
A stepwise regression model is developed between ln(Box Office Collection) and all the predictor variables listed in Table 1. The outputs are shown in Tables 5-6.
Table 5 Model Summary
Model | R | R Square | Adjusted R Square | Std. Error of the Estimate
1 | .709a | .503 | .499 | 1.20651
2 | .763b | .581 | .576 | 1.11050
3 | .787c | .620 | .612 | 1.06210
4 | .802d | .643 | .633 | 1.03307
5 | .810e | | | 1.01749
6 | | | |
Table 6. Coefficients in the model (in the order in which it was added to the model)
Model | Variable | B (Unstandardized) | Std. Error (Unstandardized) | Beta (Standardized) | T | Zero-order (direct) correlation | Partial correlation | Part correlation
Step 6 | (Constant) | 3.573 | .249 | | 14.346 | | |
Step 6 | Budget_35_Cr | 1.523 | .207 | .443 | 7.342 | .709 | .525 | .356
Step 6 | Youtube_Views | 1.17×10^-07 | .000 | .242 | 4.426 | .538 | .348 | .214
Step 6 | Prod_House_CAT A | .562 | .185 | .165 | 3.033 | .444 | .247 | .147
Step 6 | Music_Dir_CAT C | -.645 | .199 | -.177 | -3.245 | -.483 | -.263 | -.157
Step 6 | GenreComedy | .456 | .197 | .115 | 2.312 | .006 | .190 | .112
Step 6 | Director_CAT C | -.434 | .203 | -.123 | -2.143 | -.509 | -.177 | -.104
Question 1.5
What is the variation in response variable, ln(Box office collection), explained by the model after adding all 6 variables?
Question 1.6
Which factor has the maximum impact on the box office collection of a movie? What will be your recommendation to a production house based on the variable that has maximum impact on the box office collection?
Question 1.7
Compare the regressions in Model 2 (Table 4) and Model 3 (Tables 5 and 6). None of the variables in Model 2 are statistically significant in Model 3. Can we conclude that the variables in Model 2 have no association relationship with Box Office Collection? Explain clearly.
Question 1.8
Among the variables in Table 6, which variable is not useful for practical application of the model? Clearly state your reasons.
Question 2
Yearly US sales of domestically produced cars were collected for the period 1970-1999, along with data on the following:
PriceIndex - CPI for Transportation
Income - Total Disposable Income in the US (billions of dollars)
Interest - Prime Interest Rate (%) Charged by Banks
 | Year | Sales | PriceIndex | Income | Interest
Year | 1 | | | |
Sales | -0.5453 | 1.0000 | | |
PriceIndex | | -0.6089 | 1.0000 | |
Income | | -0.5033 | | 1.0000 |
Interest | | -0.3842 | | | 1.0000
SPSS was used to carry out Stepwise Regression in order to predict Sales. The summary of the models fitted in the first 2 steps, the ANOVA table, and the Coefficients table are given below.
Model Summary
Model | R | R Square | Adjusted R Square | Std. Error of the Estimate
1 | | | |
2 | .708b | .502 | .465 | 760459.004
a. Predictors: (Constant),
b. Predictors: (Constant), , Interest
c. Dependent Variable: Sales
ANOVA
Model | Source | Sum of Squares | df | Mean Square | F | Sig.
1 | Regression | | | | |
1 | Residual | | | | |
1 | Total | | | | |
2 | Regression | 1.571E13 | | | | .000b
2 | Residual | 1.561E13 | | | |
2 | Total | 3.132E13 | | | |
Coefficients
Model | Variable | B (Unstandardized) | Std. Error (Unstandardized) | Beta (Standardized) | t | Sig.
1 | (Constant) | 9102897.600 | 433224.149 | | 21.012 | .000
1 | | -17258.553 | 4248.694 | -.609 | -4.062 | .000
2 | (Constant) | 1.023E7 | 576427.867 | | 17.740 | .000
2 | | -16873.654 | 3853.737 | -.595 | -4.379 | .000
2 | Interest | -124592.212 | 46820.081 | -.362 | -2.661 | .013
a. Dependent Variable: Sales
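The blank cells in the Model 2 row of the ANOVA table follow from the sums of squares and the degrees of freedom. A minimal sketch, assuming n = 30 yearly observations (1970-1999 inclusive) and k = 2 predictors in Model 2; the variable names are illustrative, and the critical F value still has to be read from an F table for the chosen significance level.

```python
import math

# Sketch: fill in the Model 2 ANOVA row from the printed sums of squares.
# Assumptions: n = 30 years (1970-1999 inclusive), k = 2 predictors.
n = 30
k = 2
ss_regression = 1.571e13
ss_residual = 1.561e13
ss_total = 3.132e13

df_regression = k
df_residual = n - k - 1
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_stat = ms_regression / ms_residual   # overall F statistic for Model 2
r_square = ss_regression / ss_total    # should match the printed .502
std_error = math.sqrt(ms_residual)     # should be close to the printed 760459.004

print(f"df = ({df_regression}, {df_residual})")
print(f"F = {f_stat:.3f}")
print(f"R Square = {r_square:.3f}")
print(f"Std. Error of the Estimate = {std_error:,.0f}")
```

Recovering R Square and the standard error of the estimate from the same numbers is a useful consistency check on the assumed sample size.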
Question 2.1
a) What is the predictor variable used in Model 1? Explain clearly.
b) What proportion of variation in Sales does this predictor variable explain in model 1? Explain clearly.
c) What is the Std. Error of the Estimate for Model 1? Explain clearly.
Question 2.2
a) What is the magnitude of the semipartial (or part) correlation for the variable 'Interest' in Model 2? Explain.
b) Carry out an appropriate test, at 95% confidence level, to determine if Model 2 as a whole is valid (significant). State the null and alternative hypotheses and show all work.
c) Given no change in the other significant explanatory variables, can it be concluded from Model 2 that 'Interest' has a higher impact on 'Sales' than the other variable used in the model? Explain clearly.
Question 2.3: Can it be concluded, at 95% confidence level, that an increase in the 'Interest' rate by 5% decreases yearly Sales by at least 250,000 units? Show all work.
Question 2.4: What can you say about the relationship between 'Interest' and the other predictor variable used in Models 1 and 2? Explain clearly.
Question 2.5: The partial correlations of the excluded variables, after Model 2 was fitted, are 0.184 and ____. Conduct an appropriate test, at 95% confidence level, to determine if one of these excluded variables should be added to the regression model. State the null and alternative hypotheses and show all work.
Question 3:
A large grocery store in the US wishes to understand the key drivers that determine the amount spent per transaction by their customers. Therefore, it obtained a random sample of 4000 transactions with information on the amount spent (Revenue), the product category on which the transaction was made (Product Family), the annual income of the customer (Annual income), the number of children in the household the customer belongs to, and finally whether the customer owns a home or not. A 'snapshot' of part of the data is provided below.
Homeowner | Children | Annual Income | Product Family | Revenue
Y | 2 | $30K - $50K | Food | $27.38
Y | 5 | $70K - $90K | Food | $14.90
N | 2 | $50K - $70K | Food | $5.52
Y | 3 | $30K - $50K | Food | $4.44
Y | 3 | $130K - $150K | Drink | $14.00
Y | 3 | $10K - $30K | Food | $4.37
Y | 2 | $30K - $50K | Food | $13.78
Y | 2 | $150K + | Food | $7.34
Y | 3 | $10K - $30K | Non-Consumable | $2.41
N | 1 | $50K - $70K | Non-Consumable | $8.96
N | 0 | $30K - $50K | Food | $11.82
In order to enable regression analysis, the following indicator (dummy) variables were created:
Own_Hm = 1 (Yes to Homeowner), 0 otherwise
Ann_Inc2 = 1 (Annual Income in the range $30K - $50K), 0 otherwise
Ann_Inc3 = 1 (Annual Income in the range $50K - $70K), 0 otherwise
Ann_Inc4 = 1 (Annual Income in the range $70K - $90K), 0 otherwise
Ann_Inc5 = 1 (Annual Income in the range $90K and above), 0 otherwise
Prod_Fam2 = 1 (Product Family is Drink), 0 otherwise
Prod_Fam3 = 1 (Product Family is Non-Consumable), 0 otherwise
The following outputs were generated using this data
Regression Output 1 (Revenue($) Response Var)
Regression Statistics
Multiple R | 0.0340
R Square | 0.0012
Adjusted R Square | 0.0002
Standard Error | 8.1499
Observations | 4000.0000
ANOVA
 | df | SS | MS | F | Significance F
Regression | 4.0000 | 306.4494 | 76.6123 | 1.1535 | 0.3294
Residual | 3995.0000 | 265348.3672 | 66.4201 | |
Total | 3999.0000 | 265654.8166 | | |
 | Coefficients | Standard Error | t Stat | P-value
Intercept | 12.6841 | 0.2786 | 45.5352 | 0.0000
Ann_Inc2 | 0.2787 | 0.3588 | 0.7766 | 0.4374
Ann_Inc3 | 0.7617 | 0.4106 | 1.8551 | 0.0637
Ann_Inc4 | 0.4524 | 0.4712 | 0.9602 | 0.3370
Ann_Inc5 | -0.0130 | 0.4229 | -0.0307 | 0.9755
Regression Output 2 (Revenue($) Response Var)
Regression Statistics
Multiple R | 0.059
R Square | 0.004
Adjusted R Square | 0.003
Standard Error | 8.137
Observations | 4000.000
ANOVA
 | df | SS | MS | F | Significance F
Regression | 1.000 | 933.834 | 933.834 | 14.103 | 0.000
Residual | 3998.000 | 264720.983 | 66.213 | |
Total | 3999.000 | 265654.817 | | |
 | Coefficients | Standard Error | t Stat | P-value
Intercept | 12.136 | 0.255 | 47.564 | 0.000
Children | 0.326 | 0.087 | 3.755 | 0.000
Regression Output 3 (Revenue($) Response Var)
Regression Statistics
Multiple R | 0.046
R Square | 0.002
Adjusted R Square | 0.002
Standard Error | 8.144
Observations | 4000.000
ANOVA
 | df | SS | MS | F | Significance F
Regression | 2.000 | 550.538 | 275.269 | 4.150 | 0.016
Residual | 3997.000 | 265104.278 | 66.326 | |
Total | 3999.000 | 265654.817 | | |
 | Coefficients | Standard Error | t Stat | P-value
Intercept | 13.192 | 0.152 | 86.958 | 0.000
Prod_Fam2 | -0.975 | 0.458 | -2.131 | 0.033
Prod_Fam3 | -0.743 | 0.332 | -2.240 | 0.025
Regression Output 4 (Revenue($) Response Var)
Regression Statistics
Multiple R | 0.075
R Square | 0.006
Adjusted R Square | 0.005
Standard Error | 8.131
Observations | 4000.000
ANOVA
 | df | SS | MS | F | Significance F
Regression | 3 | 1496.798 | 498.933 | 7.548 | 0
Residual | 3996 | 264158.02 | 66.106 | |
Total | 3999 | 265654.82 | | |
 | Coefficients | Standard Error | t Stat | P-value
Intercept | 12.361 | 0.267 | 46.345 | 0
Children | 0.329 | 0.087 | 3.783 | 0
Prod_Fam2 | -0.992 | 0.457 | -2.171 | 0.03
Prod_Fam3 | -0.747 | 0.331 | -2.256 | 0.024
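The coefficients in Regression Output 4 define a prediction equation in which Food is the base product family, since it has no dummy. A minimal sketch of how a point prediction and an approximate probability could be obtained from it, assuming normally distributed errors with standard deviation equal to the reported Standard Error; the function names are illustrative, and the choice of Regression Output 4 for this purpose is itself an assumption.

```python
from statistics import NormalDist

# Coefficients and standard error copied from Regression Output 4.
INTERCEPT, B_CHILDREN, B_DRINK, B_NONCONS = 12.361, 0.329, -0.992, -0.747
STD_ERROR = 8.131

def predict_revenue(children, product_family):
    """Point prediction of revenue per transaction (Food is the base family)."""
    prod_fam2 = 1 if product_family == "Drink" else 0
    prod_fam3 = 1 if product_family == "Non-Consumable" else 0
    return (INTERCEPT + B_CHILDREN * children
            + B_DRINK * prod_fam2 + B_NONCONS * prod_fam3)

def prob_spend_above(threshold, children, product_family):
    """Approximate P(revenue > threshold), assuming normal errors."""
    mean = predict_revenue(children, product_family)
    return 1 - NormalDist(mu=mean, sigma=STD_ERROR).cdf(threshold)

mean_food_3 = predict_revenue(3, "Food")
p_food_3 = prob_spend_above(10.00, 3, "Food")
print(f"Predicted revenue, 3 children, Food: {mean_food_3:.3f}")
print(f"P(revenue > $10.00): {p_food_3:.3f}")
```

With an R Square below 1%, the spread around the prediction is almost as wide as the spread of Revenue itself, so the probability depends far more on the standard error than on the predictors.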
Regression Output 5 (Revenue($) Response Var)
Child_Fam3 = Children*Prod_Fam3
Regression Statistics
Multiple R | 0.080
R Square | 0.006
Adjusted R Square | 0.006
Standard Error | 8.127
Observations | 4000.000
ANOVA
 | df | SS | MS | F | Sig F
Regression | 3.000 | 1708.947 | 569.649 | 8.624 | 0.000
Residual | 3996.000 | 263945.870 | 66.053 | |
Total | 3999.000 | 265654.817 | | |
 | Coefficients | Standard Error | t Stat | P-value
Intercept | 12.214 | 0.258 | 47.399 | 0.000
Children | 0.393 | 0.090 | 4.379 | 0.000
Prod_Fam2 | -1.010 | 0.455 | -2.218 | 0.027
Child_Fam3 | -0.322 | 0.112 | -2.882 | 0.004
Use the information given above to answer the following questions.
For each question give adequate explanation and support your answer with given information precisely. Wherever required assume α = 0.05 for significance level.
a) Rank the income groups based on average revenue obtained per transaction in the sample data from largest to smallest. Provide precise reasons as to how you obtained this ranking. Is this ranking valid for the population? What is the average revenue per transaction obtained for the income group ($10K-$30K)?
b) The grocery store wishes to estimate the average amount spent per transaction on non-consumables. Provide the most accurate estimate possible. Provide details on how you obtained this estimate.
c) In Regression Output 3, if the base category chosen for product family is drinks (Prod_Fam2), what will be the corresponding prediction equation?
d) Is there a significant difference in the average amount spent per transaction between that on drinks and non-consumables? Why or Why not? Provide precise reasons.
e) The grocery store wishes to target those customers, as well as items on which the amount spent is maximum. Assuming that no customer has more than five children, identify the appropriate customer segment as well as the appropriate product family. Provide precise reasons behind your answer.
f) What is the chance that a customer with 3 children will spend more than $10.00 on food items per transaction? Provide details on your calculations.
g) Does the number of children affect food purchases more than non-consumables? Why or why not? State your reasons precisely. (3 points)
h) Suppose the grocery store has reason to believe that, in addition to the effects of the independent variables considered in Regression Output 4, homeowners spend significantly more on non-consumables than non-homeowners do on any product category. How will you modify the model provided in Regression Output 4? Provide the model in β terms. If you are adding new variables to the model, state the sign you expect each new β to have (positive or negative).
Question 4
Go through the case, "Oakland A" and the spreadsheet supplement (Ref: Moodle/Cases and Materials/Module 3). Does Mark Nobel increase attendance? If so, how much is the increase worth for Oakland? Support your decision through an appropriate regression model.
Question 5
Box office success of Bollywood movies was analysed with a logistic regression model using the following variables. The data dictionary is provided in the following table.
Sl. No | Variable | Variable Type | Code in SPSS output
1 | Box office success (Y) | Categorical | 1 = Success, 0 = Failure
2 | Release Date | Categorical with 4 levels | 1 = Festival Season (FS), 2 = Holiday Season (HS), 3 = Long Weekend (LW), 4 = Other Season (OS)
3 | Genre | Categorical with 5 levels | 1 = Action (Action), 2 = Drama (Drama), 3 = Romance (Romance), 4 = Comedy (Comedy), 5 = Others (Other-G)
4 | Movie Content | Categorical with 3 levels | Masala (Masala), Sequel (Sequel), Others (Other_C)
5 | Director Category | Categorical with 3 levels | Director_A, Director_B, Director_O
6 | Lead Actor Category | Categorical with 3 levels | Actor_A, Actor_B, Actor_O
7 | Item Song | Binary variable | 1 (Movie has an item song), 0 (otherwise)
8 | Budget | Numerical (in crores of rupees) | Budget
9 | YouTube Views | Numerical | YouTube-V
10 | YouTube Likes | Numerical | YouTube-L
11 | YouTube Dislikes | Numerical | YouTube-D
A logistic regression model was developed using Budget as the independent variable and box office success as the dependent variable: ln(Π/(1-Π)) = β0 + β1 × Budget.
The SPSS model-output is shown below (Tables 1-3)
Table 1 Omnibus Tests of Model Coefficients
Step | Test | Chi-square | df | Sig.
Step 1 | Step | 4.000 | 1 | .046
Step 1 | Block | 4.000 | 1 | .046
Step 1 | Model | 4.000 | 1 | .046
Table 2 Classification Table
Step | Observed (SuccessFaliure) | Predicted 0 | Predicted 1 | Percentage Correct
Step 1 | 0 | 2 | 17 | 10.5
Step 1 | 1 | 3 | 41 | 93.2
Step 1 | Overall Percentage | | | 68.3
a. The cut value is .500
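The percentages in a classification table are computed row by row from the observed and predicted counts. A minimal sketch that reproduces the Percentage Correct column of Table 2; the dictionary layout and function name are illustrative.

```python
# Counts from Table 2: outer key = observed class, inner key = predicted class.
confusion = {
    0: {0: 2, 1: 17},
    1: {0: 3, 1: 41},
}

def percent_correct(observed):
    """Share of cases in one observed class that were predicted correctly."""
    row = confusion[observed]
    return 100 * row[observed] / sum(row.values())

total_cases = sum(sum(row.values()) for row in confusion.values())
overall = 100 * (confusion[0][0] + confusion[1][1]) / total_cases

print(f"Observed 0 correct: {percent_correct(0):.1f}%")
print(f"Observed 1 correct: {percent_correct(1):.1f}%")
print(f"Overall correct:    {overall:.1f}%")
```

The same function applies to any later classification table once its counts are substituted into `confusion`.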
Table 3 Variables in the Equation
Step | Variable | B | S.E. | Wald | df | Sig. | Exp(B)
Step 1a | Budget | -.016 | .008 | 3.825 | 1 | .050 | .984
Step 1a | Constant | 1.621 | .503 | 10.395 | 1 | .001 | 5.058
a. Variable(s) entered on step 1: Budget.
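The fitted logit in Table 3 can be converted to a success probability with the logistic function, and the budget at which success and failure are equally likely is the point where the logit equals zero. A minimal sketch, taking the coefficients as printed (three decimals) and Budget in crores of rupees; the function and variable names are illustrative.

```python
import math

# Coefficients copied from Table 3 (Constant and Budget).
B0, B1 = 1.621, -0.016

def success_probability(budget):
    """P(success) from the fitted logit ln(P/(1-P)) = B0 + B1 * budget."""
    z = B0 + B1 * budget
    return 1 / (1 + math.exp(-z))

# The logit is zero, and hence P = 0.5, when budget = -B0 / B1.
break_even_budget = -B0 / B1
p_100 = success_probability(100)

print(f"Budget with P(success) = 0.5: {break_even_budget:.2f} crores")
print(f"P(success) at 100 crores:     {p_100:.4f}")
```

Because B1 is rounded to three decimals, the break-even budget is sensitive to rounding in the printed coefficient.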
Question 5.1
Calculate the budget for which the box office success and failure are equally likely.
Question 5.2
Is there sufficient evidence to conclude that higher budget movies are more likely to fail at the box office?
Question 5.3
A production house is making a movie with 100 crore budget; what is the success probability for this movie?
Question 5.4
Calculate the optimal cut-off probability when the cost of classifying failure at box office (0) as success at the box office (1) is five times costlier than the cost of classifying success (1) as failure (0). Show all calculations.
Step number: 1
Observed Groups and Predicted Probabilities
[Classification plot. Vertical axis: Frequency (0 to 8). Horizontal axis: Predicted probability of membership for 1 (0 to 1). Cases with a predicted probability below .5 are assigned to group 0 and the remaining cases to group 1.]
Predicted Probability is of Membership for 1. The Cut Value is .50
Symbols: 0 - 0; 1 - 1; Each Symbol Represents .5 Cases.
Figure 1. Classification plot for model 1
A second model is developed using the variable "item song"; the SPSS output is shown in Tables 4-5.
Table 4 Classification Table
Step | Observed (SuccessFaliure) | Predicted 0 | Predicted 1 | Percentage Correct
Step 1 | 0 | 11 | 8 | 57.9
Step 1 | 1 | 20 | 24 | 54.5
Step 1 | Overall Percentage | | | 55.6
a. The cut value is .700
Table 5 Variables in the Equation
Step | Variable | B | S.E. | Wald | df | Sig. | Exp(B)
Step 1a | ItemSong | -.501 | .202 | 6.151 | 1 | .013 | .606
Step 1a | Constant | 1.099 | .408 | 7.242 | 1 | .007 | 3.000
a. Variable(s) entered on step 1: ItemSong.
Question 5.5
Calculate the difference in success probabilities for movies with item song and movies without item song.
Question 5.6
Which is the better model (budget as an independent variable vs item song as an independent variable)? Clearly state your reasons.
A stepwise logistic regression model is shown in tables 6 and 7 using significance α = 0.10. 35_Cr_Budget is a derived variable which takes value 1 if the movie budget is more than 35 crores and 0 otherwise.
Table 6 Classification Table
Step | Observed (SuccessFaliure) | Predicted 0 | Predicted 1 | Percentage Correct
Step 1 | 0 | 14 | 5 | 73.7
Step 1 | 1 | 17 | 27 | 61.4
Step 1 | Overall Percentage | | | 65.1
Step 2 | 0 | 14 | 5 | 73.7
Step 2 | 1 | 10 | 34 | 77.3
Step 2 | Overall Percentage | | | 76.2
Step 3 | 0 | 12 | 7 | 63.2
Step 3 | 1 | 9 | 35 | 79.5
Step 3 | Overall Percentage | | | 74.6
Step 4 | 0 | 13 | 6 | 68.4
Step 4 | 1 | 9 | 35 | 79.5
Step 4 | Overall Percentage | | | 76.2
Step 5 | 0 | 15 | 4 | 78.9
Step 5 | 1 | 9 | 35 | 79.5
Step 5 | Overall Percentage | | | 79.4
Step 6 | 0 | 13 | 6 | 68.4
Step 6 | 1 | 10 | 34 | 77.3
Step 6 | Overall Percentage | | | 74.6
a. The cut value is .700
Table 7 Variables in the Equation
Step | Variable | B | S.E. | Wald | df | Sig. | Exp(B)
Step 1a | 35_Cr_Budget | -1.492 | .606 | 6.063 | 1 | .014 | .225
Step 1a | Constant | 1.686 | .487 | 11.998 | 1 | .001 | 5.400
Step 2b | YoutubeL | .000 | .000 | 4.294 | 1 | .038 | 1.000
Step 2b | 35_Cr_Budget | -2.227 | .694 | 10.285 | 1 | .001 | .108
Step 2b | Constant | 1.108 | .550 | 4.055 | 1 | .044 | 3.028
Step 3c | Budget | -.027 | .017 | 2.356 | 1 | .125 | .974
Step 3c | YoutubeL | .000 | .000 | 5.903 | 1 | .015 | 1.000
Step 3c | 35_Cr_Budget | -1.243 | .911 | 1.860 | 1 | .173 | .289
Step 3c | Constant | 1.596 | .624 | 6.554 | 1 | .010 | 4.935
Step 4d | Budget | -.034 | .020 | 2.877 | 1 | .090 | .967
Step 4d | YoutubeL | .000 | .000 | 6.858 | 1 | .009 | 1.000
Step 4d | DirectorA | 1.544 | .890 | 3.008 | 1 | .083 | 4.683
Step 4d | 35_Cr_Budget | -1.621 | .981 | 2.730 | 1 | .098 | .198
Step 4d | Constant | 1.556 | .650 | 5.733 | 1 | .017 | 4.742
Step 5e | Budget | -.032 | .021 | 2.393 | 1 | .122 | .969
Step 5e | YoutubeL | .000 | .000 | 7.067 | 1 | .008 | 1.000
Step 5e | DirectorA | 1.669 | .902 | 3.427 | 1 | .064 | 5.308
Step 5e | ActorA | -1.327 | .934 | 2.019 | 1 | .155 | .265
Step 5e | 35_Cr_Budget | -1.046 | 1.038 | 1.015 | 1 | .314 | .351
Step 5e | Constant | 1.972 | .774 | 6.492 | 1 | .011 | 7.187
Step 6e | Budget | -.043 | .018 | 5.579 | 1 | .018 | .958
Step 6e | YoutubeL | .000 | .000 | 7.370 | 1 | .007 | 1.000
Step 6e | DirectorA | 1.602 | .895 | 3.206 | 1 | .073 | 4.961
Step 6e | ActorA | -1.622 | .862 | 3.543 | 1 | .060 | .197
Step 6e | Constant | 2.132 | .745 | 8.177 | 1 | .004 | 8.429
a. Variable(s) entered on step 1: lt_35_Cr_Budget.
b. Variable(s) entered on step 2: YoutubeL.
c. Variable(s) entered on step 3: Budget.
d. Variable(s) entered on step 4: DirectorA.
e. Variable(s) entered on step 5: ActorA.
Question 5.7
Considering all the information in Tables 1 to 7, which model would you recommend to predict movie success at the box office? Clearly state your reasons.
Question 6
Read the case, "Breaking Barriers - Micro-mortgage analytics". Using the data provided, develop a credit rating model that Shubham can use. (Ref: Moodle/Cases and Materials/Module 3)
Question 7
A Micro-Mortgage company classifies customers into three categories (1, 2 and 3). Category 1 applicants are denied a loan, Category 2 applicants are charged an interest rate of 14% per annum, and Category 3 applicants are charged an interest rate of 18%. The variables considered in the model are shown below:
Sl. No | Variable | Variable Type | Code in SPSS output
1 | Customer Classification | Categorical (3 levels) | 1 = Category 1, 2 = Category 2, 3 = Category 3
2 | Disposable Income | Numerical | DI
3 | Loan to Value ratio | Numerical | LTV
4 | Instalment to Income Ratio | Numerical | IIR
5 | Marital Status | Categorical | MS = 1 (Married), MS = 0 (Unmarried)
6 | Age | Numerical | Age
7 | Old Emi | Categorical | Old Emi = 1 (applicant with old EMI), Old Emi = 0 otherwise
SPSS regression output using category 3 as base category is provided below:
Match_Oa | Variable | B | Standard Error | Wald | Sig. | Exp(B)
1 | Intercept | 1.720 | 0.850 | 4.095 | 0.043 | 5.585
1 | DI | -0.120 | 0.050 | 5.760 | 0.016 | 0.887
1 | LTV | -0.521 | 0.260 | 4.015 | 0.045 | 0.594
1 | IIR | -0.220 | 0.100 | 4.840 | 0.028 | 0.803
1 | MS | 0.850 | 0.620 | 1.880 | 0.170 | 2.340
1 | AGE | -0.340 | 0.240 | 2.007 | 0.157 | 0.712
1 | OLD EMI | -1.120 | 0.390 | 8.247 | 0.004 | 0.326
2 | Intercept | 0.650 | 0.280 | 5.389 | 0.020 | 1.916
2 | DI | -0.580 | 0.270 | 4.615 | 0.032 | 0.560
2 | LTV | 0.960 | 0.850 | 1.276 | 0.259 | 2.612
2 | IIR | 0.540 | 0.320 | 2.848 | 0.092 | 1.716
2 | MS | 0.710 | 0.330 | 4.629 | 0.031 | 2.034
2 | AGE | -0.220 | 0.150 | 2.151 | 0.142 | 0.803
2 | OLD EMI | 0.150 | 0.850 | 0.031 | 0.860 | 1.162
The reference category is 3.
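With category 3 as the reference, each block of coefficients gives the log-odds of that category relative to category 3, and the three probabilities follow from the multinomial logit formula P(k) = exp(z_k) / (1 + Σ exp(z_j)) with P(3) = 1 / (1 + Σ exp(z_j)). A minimal sketch, assuming a 5% significance level for selecting coefficients and leaving the retained coefficients unchanged; the dictionary keys and function name are illustrative.

```python
import math

# Coefficients with Sig. < 0.05 in the table above; category 3 is the reference.
significant = {
    1: {"Intercept": 1.720, "DI": -0.120, "LTV": -0.521,
        "IIR": -0.220, "OLD_EMI": -1.120},
    2: {"Intercept": 0.650, "DI": -0.580, "MS": 0.710},
}

def category_probabilities(applicant):
    """Multinomial logit probabilities for categories 1, 2 and 3."""
    logits = {}
    for category, coefs in significant.items():
        z = coefs["Intercept"]
        for name, b in coefs.items():
            if name != "Intercept":
                z += b * applicant.get(name, 0.0)
        logits[category] = z
    denominator = 1 + sum(math.exp(z) for z in logits.values())
    probs = {c: math.exp(z) / denominator for c, z in logits.items()}
    probs[3] = 1 / denominator   # reference category
    return probs

# Illustrative applicant profile (values as in Question 7.2).
applicant = {"DI": 20, "LTV": 0.5, "IIR": 0.8, "MS": 0, "OLD_EMI": 0}
probs = category_probabilities(applicant)
for category in sorted(probs):
    print(f"P(category {category}) = {probs[category]:.4f}")
```

Dropping the insignificant variables without refitting is itself an approximation, so the output is indicative rather than exact.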
Question 7.1
Comment on whether marital status has a statistically significant effect on the probability of loan denial. Clearly state your reasons.
Question 7.2
What percentage of the applicants with a DI=20, LTV = 0.5, IIR = 0.8, MS = 0 and Old EMI = 0 will be given a loan at 18% interest? Use only statistically significant variables and assume that the changes in the coefficient values are negligible due to dropping of insignificant variables.
Question 8
Read the case, "Fraud analytics at MCA technology solutions - Predicting Earnings Manipulation by Indian Firms". Develop a model using logistic regression and Random Forest to predict fraudulent transactions. (Ref: Moodle/Cases and Materials/Module 3)