Friday, December 19, 2008

Research Skills: Week 87

Interpreting Regression Analysis generated from Excel

Learn
Data generated by SPSS would be used to interpret Regression Analysis.

Unlearn
There are slight difference between SPSS output vs Excel output.

Relearn
The result from Excel would look like the below.

Regression Statistics
Multiple R: 0.807582
R Square: 0.652189
Adjusted R Square: 0.623205
Standard Error: 6.686129
Observations: 40

The first part of the output is the regression statistics. These are standard statistics which are given by most programs.

Multiple R – in a bivariate regression this is the absolute value of the correlation coefficient r. In a multivariate regression it is the square root of R2
R-Square – a measure of association that gives us an indication of the fit of the model. R Square ranges from 0 (nothing explained by the model) to 1 (a perfect fit). Like correlation it is based a linear fit of the data.
Adjusted R Square – R Square will always increase as you add independent variables to a model. To adjust for this to make it a better indicator of the fit of a model, the adjusted R Square adjusts the figure for the number of independent variables in the model.

Standard Error – This is an overall standard error of the model and is used in calculating the standard error of the coefficients in the model. The standard error is the square root of the MSE, which will be discussed in a later section.
Observations – the number of observations in the model.

The second part of the output is known as the ANOVA Table. ANOVA stands for ANalysis Of VAriance and it contains the breakdown of the Total Sum of Squares (TSS) into a part due to the regression model (RSS) and a part due to the residuals or error term (SSE). It provides an overall significance test for the model based on a sample of data. We will also have to deal with degrees of freedom.

Regression Sum of Squares (RSS) - The sum of squares due to the fit of the model. The degrees of freedom for regression is equal to the number of independent variables in the model and is denoted by k. The Mean Square due to Regression in the next column is equal to the Sum of Squares divided by the degrees of freedom.

Residual or Sum of Squares Error (SSE) - this is the part of the Total Sum of Squares that is unexplained by the model. The degrees of freedom for the SSE is equal the sample size (n) minus 1 minus the degrees of freedom for regression: n - 1 - k. The Mean Square Error (MSE) is equal to the SSE divided by its degrees of freedom.The MSE is the variance of the model.

Total Sum of Squares (TSS) - This is the numerator of the Variance, the total squared deviations of each value from the mean. The degrees of freedom for TSS is n - 1.The next stage is the coefficients. It gives the coefficient for each parameter, including the intercept (the constant).

The standard errors, and the t-values follow (the t-value is the coefficient divided by the standard error). Next comes the p-value associated with the variable, and the confidence intervals of the parameter estimates

F - The F-value is the ratio of two variances. In this case it is the ratio of The Mean Square due to Regression divided by the Mean Square Error. The F- distribution is a probability distribution with two degrees of freedom. The essence of the test is whether the ratio is nearly equal to one in a probability sense. A ratio of one (or close to one based on a sample) would imply that the model was a poor fit and there is no relationship of any of the independent variables with the dependent variable.

Significance F - The significance level associated with the F-value is the level of alpha to reject a null hypothesis that the model is a poor fit (all the coefficients for the independent variables are equal to zero). In most cases we are looking for a significance level of less than .05 in order to conclude that the model has something to offer in terms of explaining the dependent variable.
Note that Excel uses scientific notation, by default, so when it says 2.22E-08 it means, 2.22 * 10-8 . (i.e. 0.0000000222).

The last part of the Excel regression output gives the coefficients estimated by the regression model; a standard error for each coefficient so we can make an inference in a hypothesis test or confidence interval; a t-test based on a null hypothesis that the coefficient is really zero; and the significance level of this test.

The Coefficients - the regression estimates an intercept term and a slope coefficient(s) for the model. Excel assumes these are based on a sample. For this model the intercept is 209.77 and the slope is 68.587. This means that if no catalogs are sent we expect sales of $209.77, and for each catalog shipped we expect additional sales of $68.59. The intercept may not always make sense by itself (it may be out of the range of the data).

Standard Error - the standard error is the standard deviation of the sampling distribution for the regression coefficients. Just like with the mean, it is the spread of a repeated samples of size 1000 used to estimate the same model. Once again we will rely on the statisticians to calculate the standard error. All we need to do is to know how to use the standard error in a hypothesis test or a confidence interval.

t-Statistic - the t-statistic is based on a null hypothesis that the coefficient is really zero. Particularly in the case of the slope coefficient

P-value - The p-value shows the level of significance for the test. It is based on a two-tailed test and shows the probability of finding a sample with an estimated slope given from the model when the real value is zero. A low pvalue (usually less than .05) means we can have confidence in rejecting the null hypothesis and in concluding there is a relationship.
Confidence Intervals - Excel also puts a confidence interval around our estimates. You can specify the level of the confidence interval, but the default is a 95% C.I.

No comments: