
IT 223 -- Mar 11, 2026

Review Exercises

  1. A company wants to evaluate a new training program for its sales team. They test 6 employees before the training and again after the training to see if there is a significant improvement in their scores. Here are the before and after scores:
    Before: 71 80 75 85 93 69
    After:  75 82 73 89 96 75
    
    What kind of t-test should you use? Conduct the test by hand using a calculator or R to do the arithmetic.  Then use the R t.test function to verify your calculations.
    Answer: Here are the five steps of the paired sample t-test:
    1. First compute the differences after - before:
      > setwd("c:/workspace")
      > before <- scan()
      1: 71 80 75 85 93 69
      7:
      Read 6 items
      > after <- scan()
      1: 75 82 73 89 96 75
      7:
      Read 6 items
      > diff <- after - before
      > diff
      [1] 4 2 -2 4 3 6
      > mean(diff)
      [1] 2.833333
      > sd(diff)
      [1] 2.71416
      
      State the null and alternative hypotheses:
            H0: μ_diff = 0      H1: μ_diff ≠ 0
    2. Compute the test statistic:
      t = (mean(diff) - μ) / (sd(diff)/√n) = (2.833333 - 0) / (2.71416/√6) = 2.557042
    3. Write down a 0.95 confidence interval using the t-table. Use the column 0.025 upper tail probability and the row n - 1 = 6 - 1 = 5 degrees of freedom: I = (-2.571, 2.571).
    4. t ∈ I, so we accept the null hypothesis; we do not have enough evidence to reject it.
    5. Let R compute the p-value. Recall that if p < 0.05, reject H0; if p ≥ 0.05, accept H0. Here is the R output using the t.test function:
      > t.test(after, before, paired=TRUE)
      
              Paired t-test
      
      data:  after and before
      t = 2.557, df = 5, p-value = 0.05083
      alternative hypothesis: true mean difference is not equal to 0
      95 percent confidence interval:
       -0.01500332  5.68166999
      sample estimates:
      mean difference 
             2.833333 
      
      Notice that the t statistic and degrees of freedom match our calculations. Also, the p-value = 0.05083 > 0.05, so accept H0.
      Here is the output from the one-sample t-test using diff as the statistic:
      > t.test(diff, mu=0)
      
              One Sample t-test
      
      data:  diff
      t = 2.557, df = 5, p-value = 0.05083
      alternative hypothesis: true mean is not equal to 0
      95 percent confidence interval:
       -0.01500332  5.68166999
      sample estimates:
      mean of x 
       2.833333 
      
      We obtain the same result as we did with the paired t-test.
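The hand computation in step 2 can also be cross-checked outside of R. Here is a minimal sketch in Python (standard library only; the differences are copied from the exercise) that reproduces the paired t statistic:

```python
import math
from statistics import mean, stdev

# Differences (after - before) from the exercise
diff = [4, 2, -2, 4, 3, 6]

n = len(diff)
d_bar = mean(diff)    # sample mean of the differences, about 2.833333
s_d = stdev(diff)     # sample standard deviation of the differences, about 2.71416

# Paired t statistic: t = (d_bar - mu0) / (s_d / sqrt(n)), with mu0 = 0 under H0
t = (d_bar - 0) / (s_d / math.sqrt(n))
print(round(t, 6))    # about 2.557042, matching the R output
```

This is the same computation the one-sample t.test(diff, mu=0) call performs internally.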
  2. The blood alcohol level for a random sample of college students is tested after they drink a few beers. The data file beer-bac.txt contains two columns: (a) the number of beers (beers) consumed and (b) their blood alcohol levels (bac) after they drink the beers. Use R to obtain the regression model:
    > model1 <- lm(bac ~ beers, data=df1)
    
    Analyze the regression model and graphs of the resulting data.
    1. Create the scatter plot of bac vs. beers.
      > setwd("c:/workspace")
      > df1 <- read.csv("beer-bac.txt")
      > df1
         beers   bac
       1     5 0.100
       2     2 0.030
       3     9 0.190
       4     8 0.120
       5     3 0.040
       6     7 0.095
       7     3 0.070
       8     5 0.060
       9     3 0.020
      10     5 0.050
      11     4 0.070
      12     6 0.100
      13     5 0.085
      14     7 0.090
      15     1 0.010
      16     4 0.050
      > plot(df1$beers, df1$bac, xlab="Number of Beers", 
      + ylab="Blood Alcohol Concentration")
      
      The scatterplot shows a positive linear relationship between the independent variable (beers) and the dependent variable (bac). It also shows that the data form an ellipse-shaped bivariate normal point cloud.
    2. Find the linear regression equation for predicting bac from beers. Answer:
      > model1 <- lm(bac ~ beers, data=df1)
      > summary(model1)
      
      Call:
      lm(formula = bac ~ beers, data = df1)
      
      Residuals:
            Min        1Q    Median        3Q       Max 
      -0.027118 -0.017350  0.001773  0.008623  0.041027 
      
      Coefficients:
                   Estimate Std. Error  t value Pr(>|t|) 
      (Intercept) -0.012701   0.012638   -1.005    0.332 
      beers        0.017964   0.002402    7.480 2.97e-06 ***
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
      
      Residual standard error: 0.02044 on 14 degrees of freedom
      Multiple R-squared: 0.7998, Adjusted R-squared: 0.7855 
      F-statistic: 55.94 on 1 and 14 DF, p-value: 2.969e-06
      
      The regression equation is
            bac = 0.017964 * beers - 0.012701
    3. Find the R-squared value for this equation. Interpret it. Answer:
      Multiple R-squared: 0.7998. This means that about 80% of the variability of the dependent variable (bac) is due to the variability of the independent variable (beers).
    4. Create the boxplot of the residuals.
      > r <- resid(model1)
      > boxplot(r, xlab="Residuals", main="Boxplot of Residuals")
      
      The boxplot shows no outliers.
    5. Create the scatterplot of the residuals vs. the predicted values. Interpret it. Answer:
      > p <- predict(model1)
      > plot(p, r, xlab="Predicted Values", ylab="Residuals", 
      + main="Residual Plot")
      
      The residuals are fairly unbiased and homoscedastic.
    6. Create the normal plot of the residuals. Interpret it. Answer:
      > qqnorm(r)
      > qqline(r, col="red")
      
      The residuals in the normal plot are fairly close to a straight line, so they are approximately normally distributed.
    7. Assume the model y = ax + b. Perform a t-test of the null hypothesis that the true value of the slope a is 0.
      Answer: Look at the p-value for testing that the slope coefficient a is zero. Because the corresponding p-value (2.97e-06) is very small, and certainly less than 0.05, we reject the null hypothesis that a = 0.
    8. Assume the model y = ax + b. Perform a t-test of the null hypothesis that the true value of the intercept b is 0.
      Answer: The p-value for testing whether b is 0 (0.332) is greater than 0.05, so we do not have enough evidence to reject the null hypothesis that b = 0.
    9. For this example, if the number of beers consumed is 4, what is the predicted blood alcohol level?
      Answer: substitute 4 for the variable beers in the regression equation.
             bac = 0.017964 * beers - 0.012701 = 0.017964 * 4 - 0.012701 = 0.059155.
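The slope, intercept, R-squared, and prediction for this exercise can all be recomputed by hand from the usual least-squares formulas. Here is a minimal cross-check sketch in Python (standard library only; the data is copied from the df1 listing above):

```python
from statistics import mean

# Data copied from the beer-bac.txt listing above
beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.100, 0.030, 0.190, 0.120, 0.040, 0.095, 0.070, 0.060,
       0.020, 0.050, 0.070, 0.100, 0.085, 0.090, 0.010, 0.050]

x_bar, y_bar = mean(beers), mean(bac)
sxx = sum((x - x_bar) ** 2 for x in beers)   # sum of squares for beers
syy = sum((y - y_bar) ** 2 for y in bac)     # sum of squares for bac
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(beers, bac))

slope = sxy / sxx                    # about 0.017964
intercept = y_bar - slope * x_bar    # about -0.012701
r_squared = sxy ** 2 / (sxx * syy)   # about 0.7998

# Predicted bac for 4 beers, as in step 9
pred = slope * 4 + intercept         # about 0.059155
```

The values agree with the coefficients, Multiple R-squared, and prediction in the summary(model1) output.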
  3. Look at the dataset multi-reg-sales.txt, which contains three columns: TV Advertising (tv), Radio Advertising (radio), and Sales (sales). The units for all three variables are thousands of dollars.
    1. Use R to find the multiple regression equation for predicting sales from TV and radio advertising:
      model2 = lm(sales~tv+radio, data=df2)
      
      > # Answer:
      > setwd("c:/workspace")
      > df2 <- read.csv("multi-reg-sales.txt")
      > df2
           tv radio sales
      1 230.1  37.8  22.1
      2  44.5  39.3  10.4
      3  17.2  45.9   9.3
      4 151.5  41.3  18.5
      5 180.8  10.8  12.9
      6   8.7  48.9   7.2
      7  57.5  32.8  11.8
      8 120.2  19.6  14.8
      > model2 <- lm(sales~tv+radio, data=df2)
      > summary(model2)
      
      Call:
      lm(formula = sales ~ tv + radio, data = df2)
      
      Residuals:
             1        2        3        4        5        6        7        8 
      -0.11933  0.10117 -0.09692  0.94653 -2.49597 -2.04224  1.51978  2.18698 
      
      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)   
      (Intercept)  2.14721    3.00413   0.715  0.50674   
      tv           0.06531    0.01041   6.275  0.00151 **
      radio        0.13347    0.06474   2.062  0.09421 . 
      ---
      Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
      
      Residual standard error: 1.92 on 5 degrees of freedom
      Multiple R-squared: 0.8921, Adjusted R-squared: 0.8489 
      F-statistic: 20.67 on 2 and 5 DF, p-value: 0.003826
      
    2. Use R to create the residual plot and normal plot of residuals. Answer:
      > p <- predict(model2)
      > r <- resid(model2)
      > plot(p, r, xlab="Predicted Values", ylab="Residuals",
      + main="Residual Plot")
      > qqnorm(r)
      > qqline(r)
      
      The residual plot shows that the residuals are fairly unbiased and homoscedastic, as much as can be determined from a small dataset. The normal plot shows that the residuals are fairly normally distributed because the points generally follow a straight line.
    3. Use the resulting regression equation to predict sales when tv=80 and radio=30. Answer:
             sales = 0.06531 * tv + 0.13347 * radio + 2.14721
                   = 0.06531 * 80 + 0.13347 * 30 + 2.14721
                   = 11.3761
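As a quick check of this arithmetic, here is a minimal Python sketch; the coefficient values are copied from the summary(model2) output above, and predict_sales is just a hypothetical helper name for plugging into the fitted equation:

```python
# Coefficients copied from the summary(model2) output above
intercept = 2.14721
b_tv = 0.06531
b_radio = 0.13347

def predict_sales(tv, radio):
    """Predicted sales (thousands of dollars) from the fitted regression equation."""
    return intercept + b_tv * tv + b_radio * radio

sales = predict_sales(80, 30)
print(round(sales, 4))   # 11.3761
```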

Review for Final Exam