In the preceding sections numerous concepts were introduced and illustrated, but the analysis was broken into disjoint pieces by sections. In this section we will go through a complete example of the use of correlation and regression analysis of data from start to finish, touching on all the topics of this chapter in sequence.
Educators are generally convinced that, all other factors being equal, class attendance has a significant bearing on course performance. To investigate the relationship between attendance and performance, an education researcher selects for study a multiple-section introductory statistics course at a large university. Instructors in the course agree to keep an accurate record of attendance throughout one semester. At the end of the semester 26 students are selected at random. For each student in the sample two measurements are taken: x, the number of days the student was absent, and y, the student’s score on the common final exam in the course. The data are summarized in Table 10.4 "Absence and Score Data".
Table 10.4 Absence and Score Data
Absences x | Score y | Absences x | Score y |
---|---|---|---|
2 | 76 | 4 | 41 |
7 | 29 | 5 | 63 |
2 | 96 | 4 | 88 |
7 | 63 | 0 | 98 |
2 | 79 | 1 | 99 |
7 | 71 | 0 | 89 |
0 | 88 | 1 | 96 |
0 | 92 | 3 | 90 |
6 | 55 | 1 | 90 |
6 | 70 | 3 | 68 |
2 | 80 | 1 | 84 |
2 | 75 | 3 | 80 |
1 | 63 | 1 | 78 |
A scatter plot of the data is given in Figure 10.13 "Plot of the Absence and Exam Score Pairs". There is a downward trend in the plot which indicates that on average students with more absences tend to do worse on the final examination.
Figure 10.13 Plot of the Absence and Exam Score Pairs
Both the trend observed in Figure 10.13 "Plot of the Absence and Exam Score Pairs" and the fairly constant width of the apparent band of points in the plot make it reasonable to assume a relationship between x and y of the form
$$y={\mathit{\beta}}_{1}x+{\mathit{\beta}}_{0}+\mathit{\epsilon}$$where ${\mathit{\beta}}_{1}$ and ${\mathit{\beta}}_{0}$ are unknown parameters and ε is a normal random variable with mean zero and unknown standard deviation σ. Note carefully that this model is being proposed for the population of all students taking this course, not just those taking it this semester, and certainly not just those in the sample. The numbers ${\mathit{\beta}}_{1}$, ${\mathit{\beta}}_{0}$, and σ are parameters relating to this large population.
First we perform preliminary computations that will be needed later. The data are processed in Table 10.5 "Processed Absence and Score Data".
Table 10.5 Processed Absence and Score Data
x | y | x^{2} | xy | y^{2} | x | y | x^{2} | xy | y^{2} |
---|---|---|---|---|---|---|---|---|---|
2 | 76 | 4 | 152 | 5776 | 4 | 41 | 16 | 164 | 1681 |
7 | 29 | 49 | 203 | 841 | 5 | 63 | 25 | 315 | 3969 |
2 | 96 | 4 | 192 | 9216 | 4 | 88 | 16 | 352 | 7744 |
7 | 63 | 49 | 441 | 3969 | 0 | 98 | 0 | 0 | 9604 |
2 | 79 | 4 | 158 | 6241 | 1 | 99 | 1 | 99 | 9801 |
7 | 71 | 49 | 497 | 5041 | 0 | 89 | 0 | 0 | 7921 |
0 | 88 | 0 | 0 | 7744 | 1 | 96 | 1 | 96 | 9216 |
0 | 92 | 0 | 0 | 8464 | 3 | 90 | 9 | 270 | 8100 |
6 | 55 | 36 | 330 | 3025 | 1 | 90 | 1 | 90 | 8100 |
6 | 70 | 36 | 420 | 4900 | 3 | 68 | 9 | 204 | 4624 |
2 | 80 | 4 | 160 | 6400 | 1 | 84 | 1 | 84 | 7056 |
2 | 75 | 4 | 150 | 5625 | 3 | 80 | 9 | 240 | 6400 |
1 | 63 | 1 | 63 | 3969 | 1 | 78 | 1 | 78 | 6084 |
Adding up the numbers in each column in Table 10.5 "Processed Absence and Score Data" gives
$$\mathrm{\Sigma}x=71,\text{\hspace{1em}}\mathrm{\Sigma}y=2001,\text{\hspace{1em}}\mathrm{\Sigma}{x}^{2}=329,\text{\hspace{1em}}\mathrm{\Sigma}xy=4758,\text{\hspace{1em}}\text{\hspace{0.17em}and}\text{\hspace{1em}}\mathrm{\Sigma}{y}^{2}=161511.$$Then
$$\begin{array}{rcl}SS_{xx}&=&\Sigma x^{2}-\frac{1}{n}\left(\Sigma x\right)^{2}=329-\frac{1}{26}\left(71\right)^{2}=135.1153846\\ SS_{xy}&=&\Sigma xy-\frac{1}{n}\left(\Sigma x\right)\left(\Sigma y\right)=4758-\frac{1}{26}\left(71\right)\left(2001\right)=-706.2692308\\ SS_{yy}&=&\Sigma y^{2}-\frac{1}{n}\left(\Sigma y\right)^{2}=161511-\frac{1}{26}\left(2001\right)^{2}=7510.961538\end{array}$$and
$$\bar{x}=\frac{\Sigma x}{n}=\frac{71}{26}=2.730769231\quad\text{and}\quad\bar{y}=\frac{\Sigma y}{n}=\frac{2001}{26}=76.96153846$$We begin the actual modelling by finding the least squares regression line, the line that best fits the data. Its slope and y-intercept are
$$\begin{array}{rcl}\hat{\beta}_{1}&=&\frac{SS_{xy}}{SS_{xx}}=\frac{-706.2692308}{135.1153846}=-5.227156278\\ \hat{\beta}_{0}&=&\bar{y}-\hat{\beta}_{1}\bar{x}=76.96153846-\left(-5.227156278\right)\left(2.730769231\right)=91.23569553\end{array}$$Rounding these numbers to two decimal places, the least squares regression line for these data is
$$\hat{y}=-5.23\,x+91.24.$$The goodness of fit of this line to the scatter plot, the sum of its squared errors, is
$$SSE=SS_{yy}-\hat{\beta}_{1}SS_{xy}=7510.961538-\left(-5.227156278\right)\left(-706.2692308\right)=3819.181894$$This number is not particularly informative in itself, but we use it to compute the important statistic
$$s_{\epsilon}=\sqrt{\frac{SSE}{n-2}}=\sqrt{\frac{3819.181894}{24}}=12.61478$$The statistic $s_{\epsilon}$ estimates the standard deviation σ of the normal random variable ε in the model. Its meaning is that among all students with the same number of absences, the standard deviation of their scores on the final exam is about 12.6 points. Such a large value on a 100-point exam means that the final exam scores of each sub-population of students, based on the number of absences, are highly variable.
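The preliminary computations and the fit can be reproduced with a short script. What follows is a minimal Python sketch that is not part of the original text: the variable names are illustrative, and the data pairs are transcribed from Table 10.4.

```python
import math

# (absences x, score y) pairs transcribed from Table 10.4
data = [
    (2, 76), (7, 29), (2, 96), (7, 63), (2, 79), (7, 71), (0, 88),
    (0, 92), (6, 55), (6, 70), (2, 80), (2, 75), (1, 63),
    (4, 41), (5, 63), (4, 88), (0, 98), (1, 99), (0, 89), (1, 96),
    (3, 90), (1, 90), (3, 68), (1, 84), (3, 80), (1, 78),
]
n = len(data)

# Column totals of the processed table
sum_x = sum(x for x, _ in data)
sum_y = sum(y for _, y in data)
sum_x2 = sum(x * x for x, _ in data)
sum_xy = sum(x * y for x, y in data)
sum_y2 = sum(y * y for _, y in data)

# Sums of squares and cross products
ss_xx = sum_x2 - sum_x ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n
ss_yy = sum_y2 - sum_y ** 2 / n

# Sample means
x_bar = sum_x / n
y_bar = sum_y / n

# Least squares slope and intercept
beta1_hat = ss_xy / ss_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Sum of squared errors and sample standard deviation of errors
sse = ss_yy - beta1_hat * ss_xy
s_eps = math.sqrt(sse / (n - 2))

print("sums:", sum_x, sum_y, sum_x2, sum_xy, sum_y2)
print("SS_xx, SS_xy, SS_yy:", round(ss_xx, 4), round(ss_xy, 4), round(ss_yy, 4))
print("slope, intercept:", round(beta1_hat, 2), round(beta0_hat, 2))
print("SSE, s_eps:", round(sse, 3), round(s_eps, 3))
```

Note that the divisor under the square root is n − 2 = 24, the degrees of freedom, not the sample size 26.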
The size and sign of the slope $\hat{\beta}_{1}=-5.23$ indicate that, for every class missed, students tend to score about 5.23 points lower on the final exam on average. Similarly, for every two classes missed students tend to score on average $2\times 5.23=10.46$ fewer points on the final exam, or about a letter grade worse on average.
Since 0 is in the range of x-values in the data set, the y-intercept also has meaning in this problem. It is an estimate of the average grade on the final exam of all students who have perfect attendance. The predicted average of such students is ${\widehat{\mathit{\beta}}}_{0}=91.24.$
Before we use the regression equation further, or perform other analyses, it would be a good idea to examine the utility of the linear regression model. We can do this in two ways: 1) by computing the correlation coefficient r to see how strongly the number of absences x and the score y on the final exam are correlated, and 2) by testing the null hypothesis ${H}_{0}:{\mathit{\beta}}_{1}=0$ (the slope of the population regression line is zero, so x is not a good predictor of y) against the natural alternative ${H}_{a}:{\mathit{\beta}}_{1}<0$ (the slope of the population regression line is negative, so final exam scores y go down as absences x go up).
The correlation coefficient r is
$$r=\frac{SS_{xy}}{\sqrt{SS_{xx}SS_{yy}}}=\frac{-706.2692308}{\sqrt{\left(135.1153846\right)\left(7510.961538\right)}}=-0.7010840977$$a moderate negative correlation.
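As a quick check of this value, the following Python sketch evaluates the formula for r from the three sums of squares computed earlier. The variable names are illustrative and not from the text.

```python
import math

# Sums of squares and cross products from the preliminary computations
ss_xx = 135.1153846
ss_xy = -706.2692308
ss_yy = 7510.961538

# Linear correlation coefficient; it always has the same sign as SS_xy
r = ss_xy / math.sqrt(ss_xx * ss_yy)

print(round(r, 4))  # -0.7011
```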
Turning to the test of hypotheses, let us test at the commonly used 5% level of significance. The test is
$$\begin{array}{l}\quad\,H_{0}:\beta_{1}=0\\ \text{vs.}\,H_{a}:\beta_{1}<0\quad @\,\alpha =0.05\end{array}$$From Figure 12.3 "Critical Values of t", with $df=26-2=24$ degrees of freedom $t_{0.05}=1.711$, so the rejection region is $\left(-\infty ,-1.711\right].$ The value of the standardized test statistic is
$$t=\frac{\hat{\beta}_{1}-B_{0}}{s_{\epsilon}/\sqrt{SS_{xx}}}=\frac{-5.227156278-0}{12.61478/\sqrt{135.1153846}}=-4.817$$which falls in the rejection region. We reject $H_{0}$ in favor of $H_{a}$. The data provide sufficient evidence, at the 5% level of significance, to conclude that $\beta_{1}$ is negative, meaning that as the number of absences increases the average score on the final exam decreases.
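The test can be sketched in Python as follows. This is an illustration under stated assumptions rather than a general-purpose routine: the critical value 1.711 is copied from the text because Python's standard library has no t distribution, and the variable names are hypothetical.

```python
import math

n = 26
ss_xx = 135.1153846
sse = 3819.181894
beta1_hat = -5.227156278
hypothesized_slope = 0.0   # value of beta_1 under H0
t_crit = 1.711             # t_{0.05} with 24 degrees of freedom, from the table

# Sample standard deviation of errors, with n - 2 degrees of freedom
s_eps = math.sqrt(sse / (n - 2))

# Standard error of the slope and the standardized test statistic
se_beta1 = s_eps / math.sqrt(ss_xx)
t_stat = (beta1_hat - hypothesized_slope) / se_beta1

# Left-tailed test: reject H0 when t falls in (-infinity, -t_crit]
reject_h0 = t_stat <= -t_crit

print(round(t_stat, 3), reject_h0)
```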
As already noted, the value $\hat{\beta}_{1}=-5.23$ gives a point estimate of how much one additional absence is reflected in the average score on the final exam. For each additional absence the average drops by about 5.23 points. We can widen this point estimate to a confidence interval for $\beta_{1}.$ At the 95% confidence level, from Figure 12.3 "Critical Values of t" with $df=26-2=24$ degrees of freedom, $t_{\alpha/2}=t_{0.025}=2.064.$ The 95% confidence interval for $\beta_{1}$ based on our sample data is
$$\hat{\beta}_{1}\pm t_{\alpha/2}\frac{s_{\epsilon}}{\sqrt{SS_{xx}}}=-5.23\pm 2.064\,\frac{12.61478}{\sqrt{135.1153846}}=-5.23\pm 2.24$$or $\left(-7.47,-2.99\right).$ We are 95% confident that, among all students who ever take this course, for each additional class missed the average score on the final exam goes down by between 2.99 and 7.47 points.
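The interval for the slope can be checked with a few lines of Python. In this sketch the critical value 2.064 is taken from the text, not computed, and the names are illustrative.

```python
import math

n = 26
ss_xx = 135.1153846
sse = 3819.181894
beta1_hat = -5.227156278
t_crit = 2.064  # t_{0.025} with 24 degrees of freedom, from the table

s_eps = math.sqrt(sse / (n - 2))

# Margin of error: critical value times the standard error of the slope
margin = t_crit * s_eps / math.sqrt(ss_xx)

lower = beta1_hat - margin
upper = beta1_hat + margin

print(round(margin, 2), (round(lower, 2), round(upper, 2)))
```

Because the whole interval lies below zero, it agrees with the one-sided test's conclusion that the slope is negative.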
If we restrict attention to the sub-population of all students who have exactly five absences, say, then using the least squares regression equation $\hat{y}=-5.23x+91.24$ we estimate that the average score on the final exam for those students is
$$\hat{y}=-5.23\left(5\right)+91.24=65.09$$This is also our best guess as to the score on the final exam of any particular student who is absent five times. A 95% confidence interval for the average score on the final exam for all students with five absences is
$$\begin{array}{rcl}\hat{y}_{p}\pm t_{\alpha/2}\,s_{\epsilon}\sqrt{\frac{1}{n}+\frac{\left(x_{p}-\bar{x}\right)^{2}}{SS_{xx}}}&=&65.09\pm \left(2.064\right)\left(12.61478\right)\sqrt{\frac{1}{26}+\frac{\left(5-2.730769231\right)^{2}}{135.1153846}}\\ &=&65.09\pm 26.03691\sqrt{0.0765727299}\\ &=&65.09\pm 7.20\end{array}$$which is the interval $\left(57.89,72.29\right).$ This confidence interval suggests that the true mean score on the final exam for all students who are absent from class exactly five times during the semester is likely to be between 57.89 and 72.29.
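A minimal Python sketch of this confidence interval for the mean response follows. It uses the rounded regression line from the text for the point estimate; the critical value 2.064 is taken from the table, and the variable names are illustrative.

```python
import math

n = 26
x_bar = 71 / 26
ss_xx = 135.1153846
sse = 3819.181894
t_crit = 2.064  # t_{0.025} with 24 degrees of freedom, from the table
x_p = 5         # number of absences of interest

s_eps = math.sqrt(sse / (n - 2))

# Point estimate from the rounded least squares line
y_hat = -5.23 * x_p + 91.24

# Margin of error for the MEAN score at x_p (no extra "1" under the root)
margin = t_crit * s_eps * math.sqrt(1 / n + (x_p - x_bar) ** 2 / ss_xx)

lower = y_hat - margin
upper = y_hat + margin

print(round(y_hat, 2), round(margin, 2), (round(lower, 2), round(upper, 2)))
```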
If a particular student misses exactly five classes during the semester, that student's score on the final exam is predicted with 95% confidence to be in the interval
$$\begin{array}{rcl}\hat{y}_{p}\pm t_{\alpha/2}\,s_{\epsilon}\sqrt{1+\frac{1}{n}+\frac{\left(x_{p}-\bar{x}\right)^{2}}{SS_{xx}}}&=&65.09\pm 26.03691\sqrt{1.0765727299}\\ &=&65.09\pm 27.02\end{array}$$which is the interval $\left(38.07,92.11\right).$ This prediction interval suggests that this individual student’s final exam score is likely to be between 38.07 and 92.11. Whereas the 95% confidence interval for the average score of all students with five absences gave real information, this interval is so wide that it says practically nothing about what the individual student’s final exam score might be. This is an example of the dramatic effect that the presence of the extra summand 1 under the square root sign in the prediction interval can have.
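To see the effect of the extra summand numerically, the sketch below computes the margin of error with and without it. The helper name `margin_of_error` is hypothetical, and the critical value 2.064 is taken from the table.

```python
import math

n = 26
x_bar = 71 / 26
ss_xx = 135.1153846
sse = 3819.181894
t_crit = 2.064  # t_{0.025} with 24 degrees of freedom, from the table
s_eps = math.sqrt(sse / (n - 2))


def margin_of_error(x_p, individual):
    """Margin at x_p; individual=True adds the summand 1 (prediction interval)."""
    extra = 1.0 if individual else 0.0
    return t_crit * s_eps * math.sqrt(extra + 1 / n + (x_p - x_bar) ** 2 / ss_xx)


y_hat = -5.23 * 5 + 91.24  # point estimate from the rounded regression line

mean_margin = margin_of_error(5, individual=False)
pred_margin = margin_of_error(5, individual=True)

pred_lower = y_hat - pred_margin
pred_upper = y_hat + pred_margin

print(round(mean_margin, 2), round(pred_margin, 2))
print((round(pred_lower, 2), round(pred_upper, 2)))
```

The prediction margin is several times the margin for the mean, because the variability of a single score around the line does not shrink as the sample grows.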
Finally, the proportion of the variability in the scores of students on the final exam that is explained by the linear relationship between that score and the number of absences is estimated by the coefficient of determination, r^{2}. Since we have already computed r above we easily find that
$$r^{2}=\left(-0.7010840977\right)^{2}=0.491518912$$or about 49%. Thus although there is a significant correlation between attendance and performance on the final exam, and we can estimate with fair accuracy the average score of students who miss a certain number of classes, nevertheless less than half the total variation of the exam scores in the sample is explained by the number of absences. This should not come as a surprise, since there are many factors besides attendance that bear on student performance on exams.
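The coefficient of determination can be cross-checked in two equivalent ways: by squaring r, and as the fraction of the total variation $SS_{yy}$ that is not left over in $SSE$. The Python sketch below, with illustrative variable names, computes both.

```python
import math

ss_xx = 135.1153846
ss_xy = -706.2692308
ss_yy = 7510.961538
sse = 3819.181894

# Route 1: square the correlation coefficient
r = ss_xy / math.sqrt(ss_xx * ss_yy)
r_squared = r ** 2

# Route 2: proportion of variation explained, (SS_yy - SSE) / SS_yy
explained = (ss_yy - sse) / ss_yy

print(round(r_squared, 4), round(explained, 4))
```

The two values should agree up to rounding in the inputs.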
The exercises in this section are unrelated to those in previous sections.
The data give the amount x of silicofluoride in the water (mg/L) and the amount y of lead in the bloodstream (μg/dL) of ten children in various communities with and without municipal water. Perform a complete analysis of the data, in analogy with the discussion in this section (that is, make a scatter plot, do preliminary computations, find the least squares regression line, find $SSE$, $s_{\epsilon}$, and r, and so on). In the hypothesis test use as the alternative hypothesis $\beta_{1}>0$, and test at the 5% level of significance. Use confidence level 95% for the confidence interval for $\beta_{1}.$ Construct 95% confidence and prediction intervals at $x_{p}=2$ at the end.
$$\begin{array}{cccccc}x& 0.0& 0.0& 1.1& 1.4& 1.6\\ y& 0.3& 0.1& 4.7& 3.2& 5.1\end{array}$$ $$\begin{array}{cccccc}x& 1.7& 2.0& 2.0& 2.2& 2.2\\ y& 7.0& 5.0& 6.1& 8.6& 9.5\end{array}$$
The table gives the weight x (thousands of pounds) and available heat energy y (million BTU) of a standard cord of various species of wood typically used for heating. Perform a complete analysis of the data, in analogy with the discussion in this section (that is, make a scatter plot, do preliminary computations, find the least squares regression line, find $SSE$, $s_{\epsilon}$, and r, and so on). In the hypothesis test use as the alternative hypothesis $\beta_{1}>0$, and test at the 5% level of significance. Use confidence level 95% for the confidence interval for $\beta_{1}.$ Construct 95% confidence and prediction intervals at $x_{p}=5$ at the end.
$$\begin{array}{cccccc}x& 3.37& 3.50& 4.29& 4.00& 4.64\\ y& 23.6& 17.5& 20.1& 21.6& 28.1\end{array}$$ $$\begin{array}{cccccc}x& 4.99& 4.94& 5.48& 3.26& 4.16\\ y& 25.3& 27.0& 30.7& 18.9& 20.7\end{array}$$
Large Data Sets 3 and 3A list the shoe sizes and heights of 174 customers entering a shoe store. The gender of the customer is not indicated in Large Data Set 3. However, men’s and women’s shoes are not measured on the same scale; for example, a size 8 shoe for men is not the same size as a size 8 shoe for women. Thus it would not be meaningful to apply regression analysis to Large Data Set 3. Nevertheless, compute the scatter diagrams, with shoe size as the independent variable (x) and height as the dependent variable (y), for (i) just the data on men, (ii) just the data on women, and (iii) the full mixed data set with both men and women. Does the third, invalid scatter diagram look markedly different from the other two?
http://www.gone.2012books.lardbucket.org/sites/all/files/data3.xls
http://www.gone.2012books.lardbucket.org/sites/all/files/data3A.xls
Separate out from Large Data Set 3A just the data on men and do a complete analysis, with shoe size as the independent variable (x) and height as the dependent variable (y). Use $\alpha =0.05$ and ${x}_{p}=10$ whenever appropriate.
http://www.gone.2012books.lardbucket.org/sites/all/files/data3A.xls
Separate out from Large Data Set 3A just the data on women and do a complete analysis, with shoe size as the independent variable (x) and height as the dependent variable (y). Use $\alpha =0.05$ and ${x}_{p}=10$ whenever appropriate.
http://www.gone.2012books.lardbucket.org/sites/all/files/data3A.xls
$\Sigma x=14.2$, $\Sigma y=49.6$, $\Sigma xy=91.73$, $\Sigma x^{2}=26.3$, $\Sigma y^{2}=333.86.$
$S{S}_{xx}=6.136$, $S{S}_{xy}=21.298$, $S{S}_{yy}=87.844.$
$\bar{x}=1.42$, $\bar{y}=4.96.$
${\widehat{\mathit{\beta}}}_{1}=3.47$, ${\widehat{\mathit{\beta}}}_{0}=0.03.$
$SSE=13.92.$
${s}_{\mathit{\epsilon}}=1.32.$
r = 0.9174, r^{2} = 0.8416.
$df=8$, T = 6.518.
The 95% confidence interval for $\beta_{1}$ is: $\left(2.24,4.70\right).$
At $x_{p}=2$, the 95% confidence interval for $E\left(y\right)$ is $\left(5.77,8.17\right).$
At $x_{p}=2$, the 95% prediction interval for y is $\left(3.73,10.21\right).$
The positively correlated trend seems less profound than that in each of the previous plots.
The regression line: $\hat{y}=3.3426x+138.7692.$ Coefficient of Correlation: r = 0.9431. Coefficient of Determination: r^{2} = 0.8894. $SSE=283.2473.$ $s_{\epsilon}=1.9305.$ A 95% confidence interval for $\beta_{1}$: $\left(3.0733,3.6120\right).$ Test Statistic for $H_{0}:\beta_{1}=0$: T = 24.7209. At $x_{p}=10$, $\hat{y}=172.1956$; a 95% confidence interval for the mean value of y is: $\left(171.5577,172.8335\right)$; and a 95% prediction interval for an individual value of y is: $\left(168.2974,176.0938\right).$