Simple Linear Regression Model
Term Project Assignment #9
Presented to Instructor:
Professor
Quantitative Methods II
QUA-2321 – Section II
Prepared by
D.Radfor1
Student #
August 08, 2008
TABLE OF CONTENTS
EXECUTIVE SUMMARY
EXPLANATION OF RAW DATA
1. Description
2. What are the independent and dependent variables?
HYPOTHESIS
1. How the independent variable affects the dependent variable
INTERPRETATION OF THE LINEAR REGRESSION EQUATION
INTERPRETATION OF Syx – Standard Error of Estimate
CORRELATION COEFFICIENT r – EXPLANATION OF r
1. What it means in relation to the data
COEFFICIENT OF DETERMINATION r² – EXPLANATION OF r²
1. What it means in relation to the data
TESTS OF SIGNIFICANCE
1. Is the correlation coefficient significant?
2. Is the overall regression model valid (F-Test)?
3. Is the regression coefficient significant?
FORECAST OF THE LAST YEAR'S DEPENDENT VARIABLE
1. Calculation of a 95% prediction interval for the predicted value
2. Comparison of the predictions with actual behaviour
EVALUATION OF THE REGRESSION MODEL
1. Other variables to convert to a Multiple Linear Regression Equation
SUMMARY AND CONCLUSION
BIBLIOGRAPHY
APPENDICES
1 – Scatter Diagram
2 – Calculation of Linear Regression Equation – Least Squares Method
3 – Calculation of Syx (Standard Error of Estimate)
4 – Calculation of Correlation Coefficient r
4 – Calculation of Coefficient of Determination r²
5 – Procedures and Calculations for Tests of Significance
1. Calculation of a 95% prediction interval for your predicted value
EXECUTIVE SUMMARY
This report presents a sample model of Simple Linear Regression and Correlation Analysis. It examines how an independent variable can affect a dependent variable. The independent variable X is the number of Software Publisher Establishments in Ontario, and the dependent variable Y is the amount of revenue made by these establishments. The analysis uses several statistical measures: the linear regression equation, the standard error of estimate, the correlation coefficient r, and the coefficient of determination r². Tests of significance are also carried out. We then forecast the last year's dependent variable with a 95% prediction interval and weigh the prediction against the actual behaviour of the variable. Finally, we suggest other variables that could be used to create a Multiple Linear Regression Equation.
EXPLANATION OF RAW DATA
1. Description:
This section describes the raw data used in the Simple Linear Regression and Correlation Analysis, which compares how an independent variable can affect a dependent variable. The data come from the Statistics Canada website. The first data series, the independent variable X, is the number of Software Publisher Establishments in Ontario. The second data series, the dependent variable Y, is the amount of revenue made by these establishments.
2. Variables:
The data cover a ten-year period, beginning in 1997 and ending in 2006. The independent variable is X = (Number of Software Publisher Establishments) and the dependent variable is Y = (The amount of revenue made by these establishments, $ in millions).
(TABLE 1)
|Year |Number of Establishments X |Amount of Revenue Y($ in millions) |
|1997 |443 |1,834 |
|1998 |757 |2,330 |
|1999 |927 |3,386.5 |
|2000 |696 |3,141.6 |
|2001 |872 |3,196.6 |
|2002 |835 |3,000.5 |
|2003 |880 |3,330.1 |
|2004 |1343 |3,345.7 |
|2005 |959 |3,228.3 |
|2006 |773 |2,906.2 |
|Totals |8,485 |29,699.5 |
| |∑ X |∑ Y |
Geography=Ontario
North American Industry Classification System (NAICS)=Software publishers
Summary statistics=Number of establishments (units)
Summary statistics=Operating revenue (dollars)
Source: Statistics Canada. Table 354-0005 - Summary statistics for software development and computer services (all establishments), by North American Industry Classification System (NAICS), annual, CANSIM (database), Using E-STAT (distributor).
http://estat.statcan.ca/cgi-win/cnsmcgi.exe?Lang=E&ESTATFile=EStat\English\CII_1_E.htm&RootDir=ESTAT/
(accessed: July 27, 2008)
Hypothesis
In statistical analysis we make a claim, that is, state a hypothesis, collect data, and then use the data to test the claim. We define a statistical hypothesis as “A statement about a population parameter developed for the purpose of testing” (Basic Statistics, 2006, p. 251).
In this sample of Simple Linear Regression, our hypothesis is that the higher the number of Software Publisher Establishments (our independent variable X), the higher the amount of revenue made by these establishments (our dependent variable Y). The analysis will show how our dependent variable (Y) is affected by our independent variable (X). In the general form of the Linear Regression Equation, the sign and size of the b factor determine this effect: a positive b means that Y rises as X rises, and a negative b means that Y falls as X rises.
Interpretation of the Linear Regression Equation
General form of Linear Regression Equation – Formula (12-4)
$$Y' = a + bX, \quad a = 1{,}585.220891, \quad b = 1.631973022$$
$$Y' = 1{,}585.220891 + 1.631973022X$$
Interpretation:
Y’ (read “Y prime”) is the predicted value of the Y variable for a selected X value.
a (= 1,585.220891) is the Y-intercept, the estimated value of Y where the regression line crosses the Y-axis, that is, when X is zero. b (= 1.631973022) is the slope of the line, or the average change in Y’ for each change of one unit in the independent variable X. X is any value of the independent variable that is selected. The linear regression equation gives an estimate of the relationship between the two variables in the population.
The b factor of 1.631973022 means that for each additional Software Publisher Establishment, the predicted revenue for that year increases by about $1.63 million on average.
Because b is a positive number, as the number of Software Publisher Establishments increases, the predicted revenue increases, and as the number of Software Publisher Establishments decreases, the predicted revenue decreases. (Reference: scatter diagram in Appendix 1 and regression calculations in Appendix 2)
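The least squares coefficients quoted above can be reproduced from the Table 1 data. The following is a minimal Python sketch using only the standard library; the names `X`, `Y` and `least_squares` are illustrative and are not part of the original report.

```python
# Table 1 data: number of establishments (X) and revenue in $ millions (Y), 1997-2006.
X = [443, 757, 927, 696, 872, 835, 880, 1343, 959, 773]
Y = [1834, 2330, 3386.5, 3141.6, 3196.6, 3000.5, 3330.1, 3345.7, 3228.3, 2906.2]


def least_squares(xs, ys):
    """Return (a, b) for the line Y' = a + bX using the computational formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope (Formula 12-5) and intercept (Formula 12-6).
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


a, b = least_squares(X, Y)
print(f"Y' = {a:.6f} + {b:.9f} X")
```

Running the sketch should give values that agree with a = 1,585.220891 and b = 1.631973022 to the precision shown in the report.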
Interpretation of Syx (Standard Error of Estimate) – Formula (12-8)
$$S_e = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n - 2}} = 363.1038368$$
The standard error of estimate measures the dispersion about the regression line. It is based on squared deviations from the regression line (between each Y and its predicted value, Y’). The regression line represents all the values of Y’. Se describes how precise the prediction of Y is based on X. Since this Se (= 363.1038368, in $ millions) is large relative to the revenue figures (about 12% of the mean revenue of 2,969.95), the data are fairly widely scattered around the regression line and the regression equation will not provide a very precise estimate of Y. (Reference: calculations in Appendix 3)
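As a cross-check on Formula (12-8), the standard error of estimate can also be computed directly from the residuals, that is, the differences between each actual Y and its predicted Y’. The sketch below assumes the coefficients a and b reported above; the variable names are illustrative.

```python
import math

# Table 1 data: establishments (X) and revenue in $ millions (Y).
X = [443, 757, 927, 696, 872, 835, 880, 1343, 959, 773]
Y = [1834, 2330, 3386.5, 3141.6, 3196.6, 3000.5, 3330.1, 3345.7, 3228.3, 2906.2]

# Coefficients as reported in this report (assumed here, not re-derived).
a, b = 1585.220891, 1.631973022

# Residual = actual Y minus predicted Y' = a + bX.
residuals = [y - (a + b * x) for x, y in zip(X, Y)]
sse = sum(e * e for e in residuals)   # sum of squared errors
n = len(X)
se = math.sqrt(sse / (n - 2))         # n - 2 degrees of freedom

print(f"SSE = {sse:,.2f}, Se = {se:.4f}")
```

The residual form and the shortcut form in Formula (12-8) are algebraically equivalent, so both should give about 363.10.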
Correlation Coefficient r
Explanation of r and what it means in relation to the data
Correlation Coefficient – Formula (12-2)
$$r = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum Y^2 - (\sum Y)^2]}} = 0.735386122$$
First, you will notice that our correlation of 0.735386122 is a positive number. We interpret this as a direct relationship between the number of Software Publisher Establishments and the amount of revenue made by these establishments on a yearly basis. This confirms our reasoning based on the scatter diagram in Appendix 1. The value of 0.735386122 is reasonably close to 1.00, so we can conclude that there is a fairly strong relationship. So, as the number of Software Publisher Establishments increases, the amount of revenue made tends to increase as well. (Reference: calculations in Appendix 4)
Coefficient of Determination r²
Explanation of r² and what it means in relation to the data
$$r = 0.735386122$$
$$r^2 = (0.735386122)^2 = 0.540792749 \text{ or } 54.1\%$$
In the previous section on the correlation coefficient r, we described this relationship as strong. These terms, weak, moderate, and strong, have no precise meaning. A measure with a more easily interpreted meaning is the coefficient of determination. We square the correlation coefficient r, in this case (0.735386122)², which gives our coefficient of determination r² = 0.540792749. From this we can state that 54.1 percent of the variation in the amount of revenue made by these establishments on a yearly basis is explained by the variation in the number of Software Publisher Establishments. (Reference: calculations in Appendix 4)
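The values of r and r² can be checked from the raw data. This sketch uses the deviation-from-the-mean form of the correlation coefficient, which is algebraically equivalent to Formula (12-2); the variable names are illustrative.

```python
import math

# Table 1 data: establishments (X) and revenue in $ millions (Y).
X = [443, 757, 927, 696, 872, 835, 880, 1343, 959, 773]
Y = [1834, 2330, 3386.5, 3141.6, 3196.6, 3000.5, 3330.1, 3345.7, 3228.3, 2906.2]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# Sums of squared deviations and cross-products.
sxx = sum((x - mean_x) ** 2 for x in X)
syy = sum((y - mean_y) ** 2 for y in Y)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))

r = sxy / math.sqrt(sxx * syy)   # correlation coefficient
r2 = r * r                       # coefficient of determination

print(f"r = {r:.6f}, r^2 = {r2:.6f} ({r2:.1%} of the variation in Y explained)")
```

Here `syy` is the total sum of squares, so it should match the Total row of the ANOVA table (2,296,904.425).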
Tests of Significance
1. Is the correlation coefficient significant? (Formula 12-3)
H0: ρ = 0
H1: ρ ≠ 0
This is a two-tailed test at the 0.05 significance level, with df = 10 – 2 = 8 and a critical value of t = 2.306.
The null hypothesis H0 will not be rejected if the computed t value falls in the area between plus 2.306 and minus 2.306.
$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} = 3.069418135$$
The computed t value (= 3.069) falls in the rejection region, outside the area between plus 2.306 and minus 2.306. Therefore, H0 is rejected at the 0.05 significance level, meaning that the correlation in the population is not zero. This indicates that, in the population, there is a correlation between the number of Software Publisher Establishments and the amount of revenue made by these establishments on a yearly basis. (Reference: calculations in Appendix 5)
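The test statistic and decision rule for the correlation coefficient can be sketched in a few lines. The critical value 2.306 is taken from the t table as quoted in the report; it is entered by hand here rather than computed.

```python
import math

r = 0.735386122      # correlation coefficient reported above
n = 10               # number of yearly observations
t_critical = 2.306   # two-tailed, 0.05 level, df = n - 2 = 8 (table value from the report)

# Formula (12-3): t = r * sqrt(n - 2) / sqrt(1 - r^2)
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Decision rule: reject H0 (rho = 0) when t lies outside +/- t_critical.
reject_h0 = abs(t) > t_critical

print(f"t = {t:.4f}, reject H0: {reject_h0}")
```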
2. Is the overall regression model valid? (Formula 13-4)
$$F = \frac{SSR / k}{SSE / (n - k - 1)} = 9.421327669$$
To determine whether our overall regression model is valid, we carry out what is called a “Global Test”. Using the formula above, we test the ability of the independent variable X to explain the behaviour of the dependent variable Y. Basically, this investigates whether it is possible for our independent variable to have a regression coefficient of zero. Could the amount of explained variation, r², occur by chance? Since we have only one independent variable, our null hypothesis is H0: β1 = 0, and the alternative hypothesis is H1: β1 ≠ 0.
Using Appendix G to find the critical value of F, with 1 degree of freedom in the numerator and 8 degrees of freedom in the denominator, we obtain a critical value of F = 5.32.
We do not reject the null hypothesis that the regression coefficient is zero if the computed value of F is less than or equal to 5.32. If our computed value of F is greater than 5.32, we reject H0 and accept the alternative hypothesis, H1.
As we can see, our computed value of F = 9.421327669 is greater than our critical value of F = 5.32. Therefore, the null hypothesis H0 is rejected, and we accept the alternative hypothesis, H1. As well, the p-value of 0.0154 (see the ANOVA table, Table 2) is below the 0.05 significance level, so it is unlikely that H0 is true. We conclude that the regression coefficient is not zero and that our overall regression model is valid. (Reference: calculations in Appendix 5)
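The global F-test can be reproduced from the raw data by splitting the total variation in Y into an explained part (SSR) and an unexplained part (SSE). This is a sketch under the report's setup of one independent variable (k = 1); the critical value 5.32 is the table value quoted above, and the variable names are illustrative.

```python
# Table 1 data: establishments (X) and revenue in $ millions (Y).
X = [443, 757, 927, 696, 872, 835, 880, 1343, 959, 773]
Y = [1834, 2330, 3386.5, 3141.6, 3196.6, 3000.5, 3330.1, 3345.7, 3228.3, 2906.2]

n, k = len(X), 1          # observations and number of independent variables
f_critical = 5.32         # F table value for (1, 8) df at the 0.05 level

mean_x, mean_y = sum(X) / n, sum(Y) / n
sxx = sum((x - mean_x) ** 2 for x in X)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
b = sxy / sxx             # least squares slope
a = mean_y - b * mean_x   # least squares intercept

predicted = [a + b * x for x in X]
ssr = sum((p - mean_y) ** 2 for p in predicted)         # explained variation
sse = sum((y - p) ** 2 for y, p in zip(Y, predicted))   # unexplained variation

f_stat = (ssr / k) / (sse / (n - k - 1))
reject_h0 = f_stat > f_critical

print(f"SSR = {ssr:,.4f}, SSE = {sse:,.4f}, F = {f_stat:.4f}, reject H0: {reject_h0}")
```

With a single independent variable, F equals the square of the t statistic for the correlation (3.0694² ≈ 9.42), which is why both tests lead to the same decision.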
(ANOVA TABLE - TABLE 2)
Regression Analysis
|r² |0.541 |n |10 |
|r |0.735 |k |1 |
|Std. Error |363.104 |Dep. Var. |Y |
ANOVA table
|Source |SS |df |MS |F |p-value |
|Regression |1,242,149.2569 |1 |1,242,149.2569 |9.42 |.0154 |
|Residual |1,054,755.1681 |8 |131,844.396 | | |
|Total |2,296,904.4250 |9 | | | |
Regression output (95% lower and 95% upper give the confidence interval)
|variables |coefficients |std. error |t (df=8) |p-value |95% lower |95% upper |std. coeff. |
|Intercept |1,585.2209 |465.5205 |3.405 |.0093 |511.7287 |2,658.7131 |0.000 |
(CALCULATION TABLE – sums used in the appendices)
|Year |X |Y ($ in millions) |X² |Y² |XY |X – X mean |(X – X mean)² |
|1997 |443 |1,834 |196,249 |3,363,556 |812,462 |-405.5 |164,430.25 |
|1998 |757 |2,330 |573,049 |5,428,900 |1,763,810 |-91.5 |8,372.25 |
|1999 |927 |3,386.5 |859,329 |11,468,382.25 |3,139,285.5 |78.5 |6,162.25 |
|2000 |696 |3,141.6 |484,416 |9,869,650.56 |2,186,553.6 |-152.5 |23,256.25 |
|2001 |872 |3,196.6 |760,384 |10,218,251.56 |2,787,435.2 |23.5 |552.25 |
|2002 |835 |3,000.5 |697,225 |9,003,000.25 |2,505,417.5 |-13.5 |182.25 |
|2003 |880 |3,330.1 |774,400 |11,089,566.01 |2,930,488 |31.5 |992.25 |
|2004 |1343 |3,345.7 |1,803,649 |11,193,708.49 |4,493,275.1 |494.5 |244,530.25 |
|2005 |959 |3,228.3 |919,681 |10,421,920.89 |3,095,939.7 |110.5 |12,210.25 |
|2006 |773 |2,906.2 |597,529 |8,445,998.44 |2,246,492.6 |-75.5 |5,700.25 |
|Totals |8,485 |29,699.5 |7,665,911 |90,502,934.45 |25,961,159.2 |0 |466,388.5 |
| |∑ X |∑ Y |∑ X² |∑ Y² |∑ XY |∑ (X – X mean) |∑ (X – X mean)² |
2. With the regression we have, we can state with 95% confidence that the actual value should fall within the prediction interval. For 2006 (X = 773), the predicted revenue is $2,846.736037 ($ in millions), with a prediction interval from $1,963.684802 ($ in millions) to $3,729.787272 ($ in millions). Since the actual 2006 value of $2,906.2 ($ in millions, Table 1) falls within this interval, the prediction is consistent with the actual behaviour of the variable.
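The comparison above can be reproduced with a short calculation of the 95% prediction interval (Formula 12-10) for 2006, where X = 773. The sketch assumes the coefficients, the standard error of estimate, and the t table value reported elsewhere in this report; `prediction_interval` is an illustrative helper name.

```python
import math

# Number of establishments per year, 1997-2006 (Table 1).
X = [443, 757, 927, 696, 872, 835, 880, 1343, 959, 773]

# Values reported in this report (assumed here, not re-derived).
a, b = 1585.220891, 1.631973022
se = 363.1038368       # standard error of estimate
t_critical = 2.306     # two-tailed, 0.05 level, df = 8


def prediction_interval(x_new):
    """Return (lower, predicted, upper) for an individual Y at x_new."""
    n = len(X)
    mean_x = sum(X) / n
    sxx = sum((x - mean_x) ** 2 for x in X)
    y_hat = a + b * x_new
    margin = t_critical * se * math.sqrt(1 + 1 / n + (x_new - mean_x) ** 2 / sxx)
    return y_hat - margin, y_hat, y_hat + margin


lower, y_hat, upper = prediction_interval(773)
actual_2006 = 2906.2   # actual 2006 revenue from Table 1
inside = lower <= actual_2006 <= upper

print(f"predicted = {y_hat:.2f}, interval = [{lower:.2f}, {upper:.2f}], actual inside: {inside}")
```

The interval is wide (roughly ±883 around the prediction), which reflects the large standard error of estimate and the small sample of ten years.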
EVALUATION OF REGRESSION MODEL
In preparing the regression model, I conclude that, in comparing the number of Software Publisher Establishments (our independent variable X) to the amount of revenue made by these establishments (our dependent variable Y), the more Software Publisher Establishments there are, the more revenue is made per year. The fewer Software Publisher Establishments there are, the less revenue is made per year.
Because the coefficient of correlation is a positive number, the two variables move in the same direction: as one increases or decreases, the other tends to increase or decrease as well.
To turn this simple linear regression equation into a multiple linear regression equation, we could add further independent variables, such as the total expenses of these establishments per year and the total number of employees per year.
SUMMARY AND CONCLUSION
In this study of Simple Linear Regression, we took two variables, the number of Software Publisher Establishments (our independent variable X) and the amount of revenue made by these establishments (our dependent variable Y), and developed numerical measures to express the relationship between them. Is the relationship strong? Yes: through our equations we found a fairly strong, statistically significant relationship between our dependent variable Y (the amount of revenue made by these establishments) and our independent variable X (the number of Software Publisher Establishments).
In examining the meaning and purpose of correlation analysis, we developed a scatter diagram to portray the relationship between the two variables. We also developed a mathematical equation that allowed us to estimate the value of one variable, Y (our dependent variable, the amount of revenue made by these establishments), based on the value of another variable, X (our independent variable, the number of Software Publisher Establishments).
This is our regression analysis. We determined the equation of the line that best fits our data, estimated the value of one variable based on another, measured the error in our estimate, and established a prediction interval for our estimate.
Bibliography
Lind, Marshall, Wathen, Waite. Basic Statistics For Business & Economics, Second Canadian Edition. 2006.
Statistics Canada. Table 354-0005 - Summary statistics for software development and computer services (all establishments), by North American Industry Classification System (NAICS), annual, CANSIM (database), Using E-STAT (distributor). Geography=Ontario; NAICS=Software publishers; Summary statistics=Number of establishments (units).
http://estat.statcan.ca/cgi-win/cnsmcgi.exe?Lang=E&ESTATFile=EStat\English\CII_1_E.htm&RootDir=ESTAT/
(accessed: July 27, 2008)
Statistics Canada. Table 354-0005 - Summary statistics for software development and computer services (all establishments), by North American Industry Classification System (NAICS), annual, CANSIM (database), Using E-STAT (distributor). Geography=Ontario; NAICS=Software publishers; Summary statistics=Operating revenue (dollars).
http://estat.statcan.ca/cgi-win/cnsmcgi.exe?Lang=E&ESTATFile=EStat\English\CII_1_E.htm&RootDir=ESTAT/
(accessed: July 27, 2008)
APPENDICES
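1. SCATTER DIAGRAM
[Figure: scatter diagram of Amount of Revenue (Y, $ in millions) against Number of Establishments (X), 1997 to 2006]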
2. CALCULATION OF REGRESSION – LEAST SQUARES METHOD
General form of Linear Regression Equation – Formula (12-4)
$$Y' = a + bX, \quad a = 1{,}585.220891, \quad b = 1.631973022$$
$$Y' = 1{,}585.220891 + 1.631973022X$$
Slope of the Regression Line – Formula (12-5)
$$b = \frac{n\sum XY - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2} = \frac{10(25{,}961{,}159.2) - (8{,}485)(29{,}699.5)}{10(7{,}665{,}911) - 71{,}995{,}225}$$
$$b = \frac{7{,}611{,}334.5}{4{,}663{,}885} = 1.631973022$$
Y-Intercept – Formula (12-6)
$$a = \frac{\sum Y}{n} - b\,\frac{\sum X}{n} = \frac{29{,}699.5}{10} - 1.631973022 \times \frac{8{,}485}{10}$$
$$a = 2{,}969.95 - 1.631973022 \times 848.5 = 2{,}969.95 - 1{,}384.729109$$
$$a = 1{,}585.220891$$
3. CALCULATION OF Syx – Formula (12-8)
$$S_e = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n - 2}}$$
$$S_e = \sqrt{\frac{90{,}502{,}934.45 - 1{,}585.220891 \times 29{,}699.5 - 1.631973022 \times 25{,}961{,}159.2}{10 - 2}}$$
$$S_e = \sqrt{\frac{90{,}502{,}934.45 - 47{,}080{,}267.85 - 42{,}367{,}911.43}{8}}$$
$$S_e = \sqrt{\frac{1{,}054{,}755.17}{8}} = \sqrt{131{,}844.3963}$$
$$S_e = 363.1038368$$
4. CALCULATION OF CORRELATION COEFFICIENT r
Correlation Coefficient – Formula (12-2)
$$r = \frac{n\sum XY - (\sum X)(\sum Y)}{\sqrt{[n\sum X^2 - (\sum X)^2][n\sum Y^2 - (\sum Y)^2]}}$$
$$r = \frac{10(25{,}961{,}159.2) - (8{,}485)(29{,}699.5)}{\sqrt{[10(7{,}665{,}911) - 71{,}995{,}225][10(90{,}502{,}934.45) - 882{,}060{,}300.3]}}$$
$$r = \frac{259{,}611{,}592 - 252{,}000{,}257.5}{\sqrt{(76{,}659{,}110 - 71{,}995{,}225)(905{,}029{,}344.5 - 882{,}060{,}300.3)}}$$
$$r = \frac{7{,}611{,}334.5}{\sqrt{4{,}663{,}885 \times 22{,}969{,}044.2}} = \frac{7{,}611{,}334.5}{\sqrt{1.071249807 \times 10^{14}}}$$
$$r = \frac{7{,}611{,}334.5}{10{,}350{,}119.84} = 0.735386122$$
Calculation of Coefficient of Determination r²
$$r = 0.735386122$$
$$r^2 = (0.735386122)^2 = 0.540792749 \text{ or } 54.1\%$$
5. PROCEDURES AND CALCULATIONS FOR TESTS OF SIGNIFICANCE
(5-Step Procedure)
1. Is the correlation coefficient significant? (Formula 12-3)
1. H0: ρ = 0
H1: ρ ≠ 0
2. Significance level = 0.05
3. This is a two-tailed test
4. With df = 10 – 2 = 8 and the .05 significance level, the critical value is t = 2.306. The decision rule states that if the computed value of t falls in the area between plus 2.306 and minus 2.306, the null hypothesis is not rejected.
5. The computed t value (= 3.069, calculated below) falls within the rejection region. Thus, H0 is rejected at the .05 significance level. This means that the correlation in the population is not zero. It indicates that there is correlation with respect to the population.
$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} = \frac{0.735386122\sqrt{10 - 2}}{\sqrt{1 - 0.540792749}} = \frac{0.735386122 \times 2.828427125}{\sqrt{0.459207251}}$$
$$t = \frac{2.079986055}{0.677648324} = 3.069418135 \approx 3.069$$
2. Is the overall regression model valid? (Formula 13-4)
1. H0: β1 = 0 (our null hypothesis)
H1: β1 ≠ 0 (our alternative hypothesis)
2. Significance level = 0.05
3. This is a “Global Test”, with a critical value of F = 5.32.
4. Using the .05 significance level, the decision rule states that if the computed value of F is less than or equal to the critical value of 5.32, the null hypothesis is not rejected.
5. As we can see, our computed value of F = 9.421327669 is greater than our critical value of F = 5.32. Therefore, the null hypothesis H0 is rejected, and we accept the alternative hypothesis, H1.
$$F = \frac{SSR / k}{SSE / (n - k - 1)} = \frac{1{,}242{,}149.2569 / 1}{1{,}054{,}755.1681 / (10 - 1 - 1)}$$
$$F = \frac{1{,}242{,}149.2569 / 1}{1{,}054{,}755.1681 / 8} = \frac{1{,}242{,}149.2569}{131{,}844.396}$$
$$F = 9.421327669$$
3. Is the regression coefficient significant? (Formula 13-5)
1. H0: β1 = 0
H1: β1 ≠ 0
2. Significance level = 0.05
3. This is a two-tailed t-test of the individual regression coefficient, with df = 8 and a critical value of t = 2.306.
4. Given the significance level of .05, if the independent variable has a p-value less than the significance level, the null hypothesis is rejected.
5. The computed t value (= 3.069) is greater than the critical value of 2.306, and the p-value of .0154 is less than the .05 significance level. We therefore reject H0 and conclude that the regression coefficient is significant to the regression model.
$$t = \frac{b_1 - 0}{S_{b_1}} = \frac{1.631973022 - 0}{0.5317} = 3.069349298$$
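The report gives the standard error of the slope (0.5317) without showing where it comes from. A common textbook expression for simple regression is Sb1 = Se / √∑(X – X mean)²; the sketch below assumes that expression, which is not stated in the original, and uses the values of Se and b1 reported above.

```python
import math

# Number of establishments per year, 1997-2006 (Table 1).
X = [443, 757, 927, 696, 872, 835, 880, 1343, 959, 773]

se = 363.1038368     # standard error of estimate reported above
b1 = 1.631973022     # slope reported above

mean_x = sum(X) / len(X)
sxx = sum((x - mean_x) ** 2 for x in X)   # sum of squared deviations of X

# Assumed formula (not shown in the report): Sb1 = Se / sqrt(Sxx).
sb1 = se / math.sqrt(sxx)
t = (b1 - 0) / sb1

print(f"Sxx = {sxx:,.1f}, Sb1 = {sb1:.4f}, t = {t:.4f}")
```

Using the unrounded Sb1, the t value should come out slightly higher than the 3.069349298 obtained with the rounded 0.5317, and it should match the t value from the correlation test, as expected when there is only one independent variable.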
1. Calculation of a 95% prediction interval for your predicted value (Formula 12-10)
$$Y' \pm t\,S_e\sqrt{1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{\sum (X - \bar{X})^2}}$$
$$Y' \pm (2.306 \times 363.1038368)\sqrt{1 + \frac{1}{10} + \frac{(773 - 848.5)^2}{466{,}388.5}}$$
$$Y' \pm 837.3174477\sqrt{1 + 0.1 + \frac{(-75.5)^2}{466{,}388.5}} = Y' \pm 837.3174477\sqrt{1 + 0.1 + \frac{5{,}700.25}{466{,}388.5}}$$
$$Y' \pm 837.3174477\sqrt{1 + 0.1 + 0.012222106} = Y' \pm 837.3174477\sqrt{1.112222106}$$
$$Y' \pm 837.3174477 \times 1.054619413 = Y' \pm 883.0512351$$
The predicted value for 2006 (X = 773) is:
$$Y' = a + bX = 1{,}585.220891 + 1.631973022 \times 773$$
$$Y' = 1{,}585.220891 + 1{,}261.515146 = 2{,}846.736037$$
The 95% prediction interval is therefore 2,846.736037 ± 883.0512351:
$$2{,}846.736037 + 883.0512351 = 3{,}729.787272$$
$$2{,}846.736037 - 883.0512351 = 1{,}963.684802$$

