here is a procedure you can use when developing a


Here is a procedure you can use when developing a regression model. 

Regression Methodology

1) Collect data.  This will be the most time consuming part of your job.  Be sure to consider dummy variables and time.  Be selective (Remember garbage in = garbage out).

Sources:

Interviews, focus groups

Market Studies

Market Experiments (can be very valuable)

Gvt. Data

Industry trade association data

2) Check for problems with covariance - different explanatory variables may be correlated and fighting each other in the model.  For example - GDP vs. housing starts vs. Income.  Pick the best of all variables that give you the same information.  You can check covariance on Excel.

3) Regress your data first as a linear model

4) Check t-ratios and p-values of your coefficients.  Make sure t-ratios are at least 1.75-2.0 (choice is up to you, decide on your desired confidence level).  P-values tell you exactly what confidence level is.

5) Check R2 .  This tells you percent of variability you have explained with your model.  If it is very low - try adding more explanatory variables and/or a transformation.  For multi-variable can also use corrected R2

6) Check p-value on f for regression.  This tells you confidence that your model is statistically valid.  Note the standard error - the smaller this is the less error you will have in your forecast. SE =standard deviation of the sample mean.

7) Weed out bad variables and develop a model.

8) Try a transformation and see if your diagnostics improve. (A detailed residual analysis is necessary to completely analyze your model - refer to a stats book for this.  Diagnostic Plots like standardized residuals vs fitted values, nscores.  Check for autocorrelation for time series.  Check for heteroscedasticity (variable variance of residuals) - can often reduce with proper transformation.  Check normal probability plots.  Check outliers.  We will not go into this level of detail in this course)

CAVEATS:

1)  R2 will increase as you add more explanatory variables - this is not necessarily a better model.  You are not trying to maximize R2.  You are trying to develop a simple accurate model that predicts demand.

2) collecting more bad data will only improve your t-ratios

3) avoid collecting too many explanatory variables - they might mask the effect of the important ones

Sample Regression Model

Here is a detailed example of the development of a regression model

You open a jewelry store in Naperville.  You make most of your money from diamond sales.  You want to better understand and forecast your diamond demand:

  1. Data collection:  What variables do you think might influence your demand for diamonds?  You develop the following list:

income

 

price

 

wealth

 

GDP

 

stock market

price of diamonds at WalMart

demographics -age

wedding stats

time

 

advertising

season (high in holiday time)

You opened your store on Jan 1, 2002 and have gathered the following data.  

carats/mo

$/carat

$/carat

$/mo

Naperville

$/yr

$B

 

 

Demand

Price

Pwalmart

adv $

Ave Age

Income

GDP

Dow Jones

month

47

1500

1200

200

50

35000

1155

11000

Jan

51

1500

1000

200

49

35350

1260

12000

Feb

40

1500

800

250

48.02

35703.5

1207.5

11500

Mar

121

900

1500

225

47.0596

36060.54

1102.5

10500

Apr

127

900

1600

180

46.11841

36421.14

1080.45

10290

May

121

900

1400

120

45.19604

36785.35

1058.841

10084.2

Jun

117

900

1000

500

44.29212

37153.21

1037.664

9882.516

Jul

73

1200

650

126

43.40628

37524.74

1016.911

9684.866

Aug

89

1200

890

600

42.53815

37899.98

996.5727

9491.168

Sep

77

1200

658

140

41.68739

38278.98

976.6412

9301.345

Oct

92

1200

986

220

40.85364

38661.77

957.1084

9115.318

Nov

92

1200

659

120

40.03657

39048.39

937.9662

8933.012

Dec

62

1500

888

50

39.23584

39438.88

919.2069

8754.352

Jan

77

1500

1020

125

38.45112

39833.26

900.8228

8579.264

Feb

55

1500

325

562

37.6821

40231.6

882.8063

8407.679

Mar

88

1500

1500

300

36.92846

40633.91

865.1502

8239.526

Apr

131

1100

1500

251

36.18989

41040.25

847.8472

8074.735

May

112

1100

800

401

35.46609

41450.66

830.8902

7913.24

Jun

115

1100

800

200

34.75677

41865.16

814.2724

7754.976

Jul

131

1100

1350

150

34.06163

42283.81

797.987

7599.876

Aug

140

1100

1350

687

33.3804

42706.65

782.0272

7447.879

Sep

119

1100

800

200

32.71279

43133.72

766.3867

7298.921

Oct

128

1100

825

500

32.05853

43565.06

751.059

7152.943

Nov

148

950

789

140

31.41736

44000.71

736.0378

7009.884

Dec

156

950

1200

263

30.78902

44440.71

721.317

6869.686

Jan

166

950

1200

241

30.17324

44885.12

706.8907

6732.292

Feb

175

950

1600

523

29.56977

45333.97

692.7529

6597.646

Mar

146

950

600

526

28.97838

45787.31

678.8978

6465.694

Apr

126

1200

600

641

28.39881

46245.18

665.3199

6336.38

May

94

1500

600

320

27.83083

46707.64

652.0135

6209.652

Jun

118

1300

600

265

27.27422

47174.71

638.9732

6085.459

Jul

122

1250

600

145

26.72873

47646.46

626.1937

5963.75

Aug

131

1250

700

285

26.19416

48122.92

613.6699

5844.475

Sep

138

1200

700

521

25.67027

48604.15

601.3965

5727.585

Oct

You select the following variables:

Price

Pwm

adv $

Dec

Feb

mos in business

GDP

Dow Jones

Ave Age

Income

 

 

 

 

 

 

 

 This is your regression output:

SUMMARY OUTPUT

 

 

 

 

 

 

 

 

 

 

 

 

Regression Statistics

 

 

 

 

 

Multiple R

0.999662

 

 

 

 

 

R Square

0.999324

 

 

 

 

 

Adjusted R Square

0.99903

 

 

 

 

 

Standard Error

1.080827

 

 

 

 

 

Observations

34

 

 

 

 

 

 

 

 

 

 

 

 

ANOVA

 

 

 

 

 

 

 

df

SS

MS

F

 

 

Regression

10

39723.51

3972.351

3400.44

 

 

Residual

23

26.86831

1.168188

 

 

 

Total

33

39750.38

 

 

 

 

 

 

 

 

 

 

 

 

Coefficients

Standard Error

t Stat

P-value

Upper 95.0%

 

Intercept

218.594

279.748

0.781396

0.442539

797.2961

 

Price

-0.09904

0.001088

-91.0638

6.28E-31

-0.09679

 

Pwm

0.030112

0.000706

42.67671

2.1E-23

0.031571

 

adv $

0.01156

0.001222

9.457305

2.17E-09

0.014089

 

Dec

9.288621

0.887907

10.46125

3.24E-10

11.12539

 

Feb

8.619078

0.763213

11.29314

7.35E-11

10.1979

 

mos in business

5.951601

8.343354

0.713334

0.482815

23.21112

 

GDP

-0.35999

12383.11

-2.9E-05

0.999977

25616.01

 

Dow Jones

0.03772

1300.226

2.9E-05

0.999977

2689.757

 

Ave Age

2.211563

3.854305

0.57379

0.571679

10.18479

 

Income

-0.00506

0.013574

-0.37298

0.712582

0.023018

 

What do you think??

GDP, Dow Jones and Income look insignificant, but is that reasonable?

The problem is that they are fighting each other.  They each have similar information.  You only need one of these variables.  You need to think about which makes the most sense or test and see which one works best alone in model.

Can check covariance (variables that are correlated with each other).  This is undesirable.  Each explanatory variable should give you new information. 

In excel, it's also under tools, data analysis, covariance.  Highlight data you want to check.

Covariance table:

 

Price

Pwm

adv $

Dec

Dow Jones

Ave Age

Income

Price

43194

 

 

 

 

 

 

Pwm

-26060

117538

 

 

 

 

 

adv $

-4557

-6724

30249

 

 

 

 

Dec

-6

-14

-10

0

 

 

 

Feb

12

10

-10

0

 

 

 

mos in business

-263

-1076

541

0

 

 

 

GDP

6630

17638

-9701

-1

305052

 

-707947

Dow Jones

63141

167984

-92392

-14

2905255

 

 

Ave Age

212

769

-397

0

12203

52

 

Income

-102177

-449231

222270

5

-6742349

-28900

16335056

The higher the number, the greater the covariance.  Income and Dow Jones have strong correlation.  As do Income and GDP and GDP and Dow Jones.  We just need one of these variables.

Let's keep only income as this is logical.  Re-run the regression:

SUMMARY OUTPUT

 

 

 

 

 

 

 

 

 

 

 

 

Regression Statistics

 

 

 

 

 

Multiple R

0.999662

 

 

 

 

 

R Square

0.999324

 

 

 

 

 

Adjusted R Square

0.999108

 

 

 

 

 

Standard Error

1.036802

 

 

 

 

 

Observations

34

 

 

 

 

 

 

 

 

 

 

 

 

ANOVA

 

 

 

 

 

 

 

df

SS

MS

F

 

 

Regression

8

39723.51

4965.439

4619.188

 

 

Residual

25

26.87398

1.074959

 

 

 

Total

33

39750.38

 

 

 

 

 

 

 

 

 

 

 

 

Coefficients

Standard Error

t Stat

P-value

Upper 95.0%

 

Intercept

214.2703

261.6642

0.818875

0.420595

753.1774

 

Price

-0.09905

0.001025

-96.6054

1.08E-33

-0.09694

 

Pwm

0.030119

0.000669

45.02481

1.87E-25

0.031497

 

adv $

0.011552

0.001167

9.900353

3.92E-10

0.013955

 

Dec

9.285746

0.85082

10.91388

5.34E-11

11.03804

 

Feb

8.598333

0.674088

12.75551

1.92E-12

9.986641

 

mos in business

5.805573

7.746753

0.74942

0.460598

21.7603

 

Ave Age

2.122573

3.488305

0.608483

0.548357

9.306867

 

Income

-0.00483

0.012633

-0.38262

0.705234

0.021185

 

Income still looks bad - check covariance again.

 

Price

Pwm

adv $

Dec

Income

Price

43194.2

 

 

 

 

Pwm

-26060

117537.7

 

 

 

adv $

-4556.53

-6723.71

30249.34

 

 

Dec

-6.40138

-13.9689

-9.96021

0.055363

 

Feb

11.72145

9.870242

-9.76384

-0.00519

 

mos in business

-262.5

-1075.62

541.0441

0.029412

 

Ave Age

212.2758

769.0929

-397.053

-0.04747

-28899.6

 

Income

-102177

-449231

222270

4.870607

16335056

Income is varying along with advertising and the negative of price and age (should investigate this).

Remove it:

Remove income

 

 

 

 

 

SUMMARY OUTPUT

 

 

 

 

 

 

 

 

 

 

 

 

Regression Statistics

 

 

 

 

 

Multiple R

0.99966

 

 

 

 

 

R Square

0.99932

 

 

 

 

 

Adjusted R Square

0.999137

 

 

 

 

 

Standard Error

1.019641

0.01945

 

 

 

 

Observations

34

 

 

 

 

 

 

 

 

 

 

 

 

ANOVA

 

 

 

 

 

 

 

df

SS

MS

F

 

 

Regression

7

39723.35

5674.764

5458.249

 

 

Residual

26

27.03136

1.039668

 

 

 

Total

33

39750.38

 

 

 

 

 

 

 

 

 

 

 

 

Coefficients

Standard Error

t Stat

P-value

Upper 95.0%

 

Intercept

114.302

14.08797

8.113449

1.36E-08

143.2603

 

Price

-0.09905

0.001008

-98.2357

5.91E-35

-0.09697

 

Pwm

0.030188

0.000634

47.64864

7.83E-27

0.031491

 

adv $

0.011618

0.001135

10.23668

1.3E-10

0.013951

 

Dec

9.328693

0.829424

11.2472

1.74E-11

11.0336

 

Feb

8.649533

0.649737

13.31237

4.06E-13

9.985087

 

mos in business

2.842606

0.208486

13.63453

2.35E-13

3.271155

 

Ave Age

0.792431

0.283334

2.796804

0.009581

1.374834

 

Looks good.  You might question age - as it doesn't make sense.  Might investigate replacing age with income.  This is the art.  We decide to keep the model as is.  This is our demand equation:

Demand = 114.302-0.09905P+0.030188Pwm+0.011618Adv +9.329*Dec+ 8.65*Feb+2.84*Mos+0.79*Age

The above model looks good, but perhaps a transformation would be even better.  Let's try log-linear.  I take logs of all my original data and place in a new table.

 

ln(Demand)

ln(Price)

ln(Pwm)

ln(mos)

ln(age)

ln(Inc)

ln(adv)

Dec

Feb

3.85

7.31

7.09

0.00

3.91

10.46

5.30

0

0

3.93

7.31

6.91

0.69

3.89

10.47

5.30

0

1

3.69

7.31

6.68

1.10

3.87

10.48

5.52

0

0

4.80

6.80

7.31

1.39

3.85

10.49

5.42

0

0

4.84

6.80

7.38

1.61

3.83

10.50

5.19

0

0

4.80

6.80

7.24

1.79

3.81

10.51

4.79

0

0

4.76

6.80

6.91

1.95

3.79

10.52

6.21

0

0

4.29

7.09

6.48

2.08

3.77

10.53

4.84

0

0

4.49

7.09

6.79

2.20

3.75

10.54

6.40

0

0

4.34

7.09

6.49

2.30

3.73

10.55

4.94

0

0

4.52

7.09

6.89

2.40

3.71

10.56

5.39

0

0

4.52

7.09

6.49

2.48

3.69

10.57

4.79

1

0

4.13

7.31

6.79

2.56

3.67

10.58

3.91

0

0

4.34

7.31

6.93

2.64

3.65

10.59

4.83

0

1

4.01

7.31

5.78

2.71

3.63

10.60

6.33

0

0

4.48

7.31

7.31

2.77

3.61

10.61

5.70

0

0

4.88

7.00

7.31

2.83

3.59

10.62

5.53

0

0

4.72

7.00

6.68

2.89

3.57

10.63

5.99

0

0

4.74

7.00

6.68

2.94

3.55

10.64

5.30

0

0

4.88

7.00

7.21

3.00

3.53

10.65

5.01

0

0

4.94

7.00

7.21

3.04

3.51

10.66

6.53

0

0

4.78

7.00

6.68

3.09

3.49

10.67

5.30

0

0

4.85

7.00

6.72

3.14

3.47

10.68

6.21

0

0

5.00

6.86

6.67

3.18

3.45

10.69

4.94

1

0

5.05

6.86

7.09

3.22

3.43

10.70

5.57

0

0

5.11

6.86

7.09

3.26

3.41

10.71

5.48

0

1

5.16

6.86

7.38

3.30

3.39

10.72

6.26

0

0

4.98

6.86

6.40

3.33

3.37

10.73

6.27

0

0

4.84

7.09

6.40

3.37

3.35

10.74

6.46

0

0

4.54

7.31

6.40

3.40

3.33

10.75

5.77

0

0

4.77

7.17

6.40

3.43

3.31

10.76

5.58

0

0

4.80

7.13

6.40

3.47

3.29

10.77

4.98

0

0

4.88

7.13

6.55

3.50

3.27

10.78

5.65

0

0

4.93

7.09

6.55

3.53

3.25

10.79

6.26

0

0

 

SUMMARY OUTPUT

 

 

 

 

 

 

 

 

 

 

Regression Statistics

 

 

 

 

Multiple R

0.986941

 

 

 

 

R Square

0.974052

 

 

 

 

Adjusted R Square

0.965749

 

 

 

 

Standard Error

0.069079

 

 

 

 

Observations

34

 

 

 

 

 

 

 

 

 

 

ANOVA

 

 

 

 

 

 

df

SS

MS

 

 

Regression

8

4.478337

0.559792

 

 

Residual

25

0.119299

0.004772

 

 

Total

33

4.597636

 

 

 

 

 

 

 

 

 

 

Coefficients

Standard Error

t Stat

Lower 95.0%

Upper 95.0%

Intercept

27.29298

0

65535

27.29298

27.29298

ln(Price)

-1.21469

0.085059

-14.2805

-1.38987

-1.0395

ln(Pwm)

0.258325

0.040836

6.325837

0.17422

0.342429

ln(adv)

0.023881

0.022677

1.053086

-0.02282

0.070585

Dec

0.060605

0.055786

1.086398

-0.05429

0.175498

Feb

0.01679

0.044114

0.380596

-0.07406

0.107644

mos in business

0.115075

0.036956

3.113827

0.038963

0.191188

Ave Age

-1.30596

0

65535

-1.30596

-1.30596

Income

-1.09152

0

65535

-1.09152

-1.09152

 

 

 

 

 

 

 

 

Remove income

 

 

 

 

SUMMARY OUTPUT

 

 

 

 

 

 

 

 

 

 

Regression Statistics

 

 

 

 

Multiple R

0.986941

 

 

 

 

R Square

0.974052

 

 

 

 

Adjusted R Square

0.967066

 

 

 

 

Standard Error

0.067738

 

 

 

 

Observations

34

 

 

 

 

 

 

 

 

 

 

ANOVA

 

 

 

 

 

 

df

SS

MS

 

 

Regression

7

4.478337

0.639762

 

 

Residual

26

0.119299

0.004588

 

 

Total

33

4.597636

 

 

 

 

 

 

 

 

 

 

Coefficients

Standard Error

t Stat

Lower 95.0%

Upper 95.0%

Intercept

13.76921

1.097963

12.54069

11.51231

16.0261

ln(Price)

-1.21469

0.083407

-14.5633

-1.38613

-1.04324

ln(Pwm)

0.258325

0.040043

6.451113

0.176014

0.340635

ln(adv)

0.023881

0.022237

1.073942

-0.02183

0.069589

Dec

0.060605

0.054702

1.107913

-0.05184

0.173047

Feb

0.01679

0.043257

0.388134

-0.07213

0.105707

mos in business

0.115075

0.036239

3.175493

0.040586

0.189565

Ave Age

-0.76836

0.155205

-4.95063

-1.08739

-0.44933

 

 

 

 

 

 

 

 

 

remove Dec, Feb and Adv

 

 

 

 

 

 

 

 

 

SUMMARY OUTPUT

 

 

 

 

 

 

 

 

 

 

Regression Statistics

 

 

 

 

Multiple R

0.971709

 

 

 

 

R Square

0.944219

 

 

 

 

Adjusted R Square

0.938641

 

 

 

 

Standard Error

0.092459

 

 

 

 

Observations

34

 

 

 

 

 

 

 

 

 

 

ANOVA

 

 

 

 

 

 

df

SS

MS

 

 

Regression

3

4.341174

1.447058

 

 

Residual

30

0.256462

0.008549

 

 

Total

33

4.597636

 

 

 

 

 

 

 

 

 

 

Coefficients

Standard Error

t Stat

Lower 95.0%

Upper 95.0%

Intercept

10.67027

0.984992

10.83285

8.65865

12.68189

ln(Price)

-1.18737

0.105218

-11.2848

-1.40226

-0.97249

ln(Pwm)

0.237193

0.051394

4.615168

0.132232

0.342154

mos in business

0.282874

0.020884

13.54498

0.240223

0.325524

No improvement - in fact I had to remove variables that seem intuitive.  I will stick with my linear model (I reality I might also investigate squaring some of the variables).

I will stick with my first linear demand model.

Demand = 114.302-0.09905P+0.030188Pwm+0.011618Adv +9.329*Dec+ 8.65*Feb+2.84*Mos+0.79*Age

Now we can forecast our sales.  These are the types of questions I can answer with my model.

  1. What type of demand loss should I expect if I raise my prices?
  2. What level of demand increase should I anticipate if I spend an additional $100 of advertising?
  3. How will my demand changes as incomes rise?  As my customers age?
  4. If I want to open a similar store in Westmont and forecast the following:

Price at Walmart = $1200/carat

Age = 45

Advertising = $500/month

What should expect to sell in the first 3 months of the year (Jan-Mar) if I open the store in Jan?

Let's answer this last one:

I use the demand equation for forecast sales in each month.  Check this yourself on excel and see if you can replicate these results.

 

 

Jan

Feb

Mar

Total

Ave

Pwm =

1200

95.6766

107.1666

101.3566

304.1998

101.3999

Age=

45

 

 

 

 

 

Adv =

500

 

 

 

 

 

Price

1000

 

 

 

 

 

Solution Preview :

Prepared by a verified Expert
Microeconomics: here is a procedure you can use when developing a
Reference No:- TGS0441842

Now Priced at $100 (50% Discount)

Recommended (90%)

Rated (4.3/5)