Sept 14, 2016

IT 403 -- Sept 14, 2016

Review Problems

What is the proper filename for project submissions?
Ans: proj1-smithx.doc, where you replace Smith by your last name.
Match these names with the definitions below:

Cotes, de Moivre, Fisher, Galton, Gauss, Graunt, Pascal, Quetelet, Tukey
1. First stated the Central Limit Theorem
2. Tried to find the "ideal man" that nature was trying to produce
3. Was the first demographer
4. Coined the term "exploratory data analysis"
5. Father of modern statistics
6. First described the least squares method for fitting a line to data
7. First studied the theory of errors in astronomy
8. First applied the theory of probability to gambling
9. First introduced the concept of correlation for a bivariate dataset
Ans: Cotes: 7; de Moivre: 1; Fisher: 5; Galton: 9; Gauss: 6; Graunt: 3; Pascal: 8; Quetelet: 2; Tukey: 4.
What do the terms inner fence and outer fence mean?
Ans: The two inner fences are located at Q1 - 1.5 x IQR and Q3 + 1.5 x IQR. The two outer fences are located at Q1 - 3.0 x IQR and Q3 + 3.0 x IQR. Extreme outliers are located to the outside of the outer fences. Mild outliers are located between the inner and outer fences.
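The fence rules can be sketched in a few lines of Python. This is only an illustration: the function name classify_outlier and the quartile values are hypothetical, and treating a value that lies exactly on a fence as a non-outlier is an assumption.

```python
def classify_outlier(value, q1, q3):
    """Label a value 'none', 'mild', or 'extreme' using the inner and outer fences."""
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr
    # Assumption: a value exactly on a fence is not counted as beyond it.
    if value < outer_low or value > outer_high:
        return "extreme"
    if value < inner_low or value > inner_high:
        return "mild"
    return "none"


# Hypothetical quartiles: Q1 = 10 and Q3 = 20, so IQR = 10.
# Inner fences are at -5 and 35; outer fences are at -20 and 50.
print(classify_outlier(25, 10, 20))  # inside the inner fences
print(classify_outlier(40, 10, 20))  # between the inner and outer fences
print(classify_outlier(60, 10, 20))  # outside the outer fences
```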
Why are outliers important?
Ans: Outliers could represent erroneous data, in which case they must be corrected or omitted. If correct, they might be the most important data points in the dataset. In business they can significantly affect the bottom line; in science, they might be the key to a scientific breakthrough.
Draw the histogram in each case. Note: [a,b) denotes an interval that is closed on the left (includes a) and open on the right (does not include b).

Caution: what does it mean for histograms (b) and (c) to have bins of different widths?

(a)

Bin	Count
[0,1)	1
[1,2)	3
[2,3)	5
[3,4]	1

(b)

Bin	Count
[0,1)	3
[1,2)	5
[2,4]	2

(c)

Bin	Count
[0,1)	2
[1,2)	4
[2,2.5)	3
[2.5,3]	1
Compute the median for each histogram in the preceding problem by using interpolation in the bar that contains the median. Ans: Problems 5 and 6.
Compute the interquartile range of histogram (c) above by using interpolation in the bars that contain Q1 (25th percentile) and Q3 (75th percentile).
Ans: Problem 7.
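As a rough check on the interpolation method, here is a minimal Python sketch that locates a percentile inside a histogram bar. The function name histogram_percentile is hypothetical, and the sketch assumes that observations are spread evenly within each bin. It is not the worked solution referenced above.

```python
def histogram_percentile(edges, counts, p):
    """Estimate the p-th percentile (0 to 100) of binned data by interpolation.

    edges has one more entry than counts. Observations are assumed to be
    spread evenly across each bin.
    """
    total = sum(counts)
    target = total * p / 100.0  # number of observations below the percentile
    cumulative = 0
    for i, count in enumerate(counts):
        if count > 0 and cumulative + count >= target:
            fraction = (target - cumulative) / count  # share of this bar needed
            return edges[i] + fraction * (edges[i + 1] - edges[i])
        cumulative += count
    return edges[-1]


hist_a = ([0, 1, 2, 3, 4], [1, 3, 5, 1])
hist_b = ([0, 1, 2, 4], [3, 5, 2])
hist_c = ([0, 1, 2, 2.5, 3], [2, 4, 3, 1])

for name, (edges, counts) in [("a", hist_a), ("b", hist_b), ("c", hist_c)]:
    print(name, "median:", histogram_percentile(edges, counts, 50))

q1 = histogram_percentile(*hist_c, 25)
q3 = histogram_percentile(*hist_c, 75)
print("c IQR:", q3 - q1)
```

For histogram (a), for example, the median is the 5th of 10 observations; 4 observations lie below 2, so the median sits 1/5 of the way into the bar over [2,3).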
In Histogram (b), use interpolation to estimate the percentage of observations in the interval [0.5, 3.0).
Ans: The rectangle over [0.5, 1.0) has half the area of the rectangle over [0, 1), which is 30%, so it contributes 15%. The rectangle over [1, 2) lies entirely inside the interval and contributes 50%. The rectangle over [2, 3.0) has half the area of the rectangle over [2, 4], which is 20%, so it contributes 10%. Therefore the estimated percentage is 15% + 50% + 10% = 75%.
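The same area argument can be written as a short Python sketch. The function name fraction_in_interval is hypothetical, and the code assumes observations are spread evenly within each bin, so each bar contributes in proportion to the part of its width that overlaps the interval.

```python
def fraction_in_interval(edges, counts, low, high):
    """Estimate the fraction of observations in [low, high) from binned data."""
    total = sum(counts)
    covered = 0.0
    for i, count in enumerate(counts):
        left, right = edges[i], edges[i + 1]
        # Width of the part of this bin that lies inside [low, high).
        overlap = max(0.0, min(high, right) - max(low, left))
        covered += count * overlap / (right - left)
    return covered / total


# Bins [0,1), [1,2), [2,4] with counts 3, 5, 2.
print(fraction_in_interval([0, 1, 2, 4], [3, 5, 2], 0.5, 3.0))  # 0.75
```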
Draw the histogram without bar lines of
1. the incomes of all persons in the U. S.
  Ans: A skewed histogram with a peak at about 35 or 40 thousand, but with a long right tail that extends all the way past 1 billion.
2. the GPAs of all students at DePaul. Ans: A bell-shaped histogram with peak around 3.0. There may be a secondary peak around 2.0, representing students who have just come off academic probation. The height of the histogram can only be nonzero in the range from 0 to 4.
3. the number of years of schooling of all persons in the U. S. Ans: A bell-shaped peak around 12 years (most people finish high school; fewer people attend college).
4. the IQs of all persons in the U.S. Ans: A bell-shaped curve with center at 100 and spread 15.
What do the initials SPSS mean? Ans: Statistical Package for the Social Sciences
How do you accomplish the following in SPSS?
1. Create a new dataset.
  Ans: Select New >> Data. Then type the data into the Data View.
2. Change a variable name.
  Ans: Change the variable name in the Name column of the Variable View.
3. Add a label to a variable.
  Ans: Enter the label in the Label column in the Variable View.
4. Import a dataset from an Excel file.
  Ans: Select Import >> Data. Set the filetype to .xls and open the desired Excel file. Then indicate the worksheet you want to use and whether the variable names are in the first row.
5. Print a dataset.
  Ans: Select Analyze >> Reports >> Case Summaries. Select the variables that you want to print.
6. Obtain Q0, Q1, Q2, Q3, and Q4 for a dataset.
  Ans: Select Analyze >> Descriptive Statistics >> Explore... Click the Statistics button and check the Percentiles button.
7. Obtain a histogram and a boxplot.
  Ans: In addition to the steps in the answer to item 6, click the Plots button and select Histogram. The stem-and-leaf plot is not needed.

Descriptive Statistics

If a histogram is bell-shaped or normal, it can be described by its center μ and spread σ.
For a bell-shaped histogram, if the sample size n is large, the statistics x̄ (sample mean) and SD⁺ (sample standard deviation) are good estimates of μ and σ.
x̄ and SD⁺ form a parsimonious description of a bell-shaped histogram.
Estimators of the Center: Mean, Median, Trimmed Mean, M-estimators
Estimators of the Spread: SD, SD⁺, MAD

Practice Problems

What happens to x̄ and Q2 for a dataset
1. if every observation is increased by 7?
  Ans: Both x̄ and Q2 are increased by 7.
  x̄_new = (1/n)((x₁ + 7) + ... + (x_n + 7))
  = (1/n)(x₁ + ... + x_n) + (1/n)(7 + ... + 7)
  = x̄ + (1/n)(7n) = x̄ + 7
2. if every observation is multiplied by 3?
  Ans: Both x̄ and Q2 are multiplied by 3.
  x̄_new = (1/n)(3 x₁ + ... + 3 x_n)
  = (1/n) 3 (x₁ + ... + x_n)
  = 3 (1/n)(x₁ + ... + x_n) = 3 x̄
3. if the largest observation is increased by 1000?
  Ans: The mean is increased by 1000 / n; the median is unchanged if n ≥ 3.
  (1/n)(x₁ + ... + (x_n + 1000)) = x̄ + 1000 / n
What happens to SD for a dataset
1. if every observation is increased by 7?
  Ans: The SD is unchanged because the spread is unchanged.
2. if every observation is multiplied by 3?
  Ans: The SD is multiplied by 3 because the spread is multiplied by 3.
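These shift and scale rules are easy to check numerically. The sketch below uses Python's standard statistics module on a small hypothetical dataset (the numbers are made up for illustration and are not from the Datasets Page). Here statistics.pstdev is the SD that divides by n.

```python
import math
import statistics

data = [2, 4, 4, 5, 10]  # hypothetical observations


def summarize(values):
    """Return (mean, median, SD), where SD divides by n."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.pstdev(values))


mean, median, sd = summarize(data)
shift_mean, shift_median, shift_sd = summarize([x + 7 for x in data])
scale_mean, scale_median, scale_sd = summarize([3 * x for x in data])

# Increase only the largest observation by 1000.
bumped = sorted(data)
bumped[-1] += 1000
bump_mean, bump_median, _ = summarize(bumped)

print(shift_mean - mean, shift_median - median)  # both shift by 7
print(math.isclose(shift_sd, sd))                # SD unchanged by a shift
print(scale_mean / mean, scale_median / median)  # both are multiplied by 3
print(math.isclose(scale_sd, 3 * sd))            # SD is multiplied by 3
print(bump_mean - mean, bump_median - median)    # 1000 / n and 0
```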
Show that the mean is the center of gravity of the dataset.
Ans: In class we balanced a cardboard histogram on a pencil and showed that the center of gravity is the point on the x-axis where the histogram balances (does not tip to the left or right). Here is the algebraic demonstration: m is the point where the histogram balances, and xᵢ - m is the turning moment of the ith observation, which tries to tip the histogram to the left or right. A negative moment tries to tip the histogram to the left; a positive moment tries to tip the histogram to the right. We want the moments to sum to zero so that the histogram balances.
(x₁ - m) + ... + (x_n - m) = 0
(1/n)[(x₁ - m) + ... + (x_n - m)] = 0
(1/n)(x₁ + ... + x_n) - (1/n) n m = 0
x̄ - m = 0, so m = x̄.
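A small numerical illustration of the balancing argument, using a hypothetical four-point dataset and a hypothetical helper total_moment: the moments sum to zero only when the balance point is the mean.

```python
import statistics


def total_moment(data, m):
    """Sum of the turning moments (x - m) about the candidate balance point m."""
    return sum(x - m for x in data)


data = [20, 10, 15, 15]  # hypothetical observations
mean = statistics.mean(data)

print(total_moment(data, mean))  # 0: the dataset balances at the mean
print(total_moment(data, 12))    # positive: tips to the right
print(total_moment(data, 18))    # negative: tips to the left
```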
Compute the 20%-trimmed mean of this dataset:
1 7 4 6 94 5 5 7 3 6
Ans: Trimming 10% of the observations off the bottom and 10% off the top means omitting 1 and 94. The average of the remaining eight observations is 43 / 8 = 5.375.
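A minimal Python sketch of this calculation follows. The function name trimmed_mean is hypothetical. It follows the convention used in the answer above, where a 20%-trimmed mean drops 10% from each end, and it assumes that a fractional number of observations to drop is rounded down.

```python
def trimmed_mean(data, total_trim_percent):
    """Drop half of total_trim_percent from each end of the sorted data, then average."""
    ordered = sorted(data)
    # Observations dropped per end; rounding down is an assumption.
    k = int(len(ordered) * total_trim_percent / 200)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


data = [1, 7, 4, 6, 94, 5, 5, 7, 3, 6]
print(trimmed_mean(data, 20))  # 5.375
print(sum(data) / len(data))   # the untrimmed mean, pulled up by 94
```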
Without doing any calculations, compute the SD of this dataset:

4 4 4 4 4
Ans: The mean is 4, so every deviation is 0 and the average of the squared deviations is 0. SD = sqrt(ave of squared deviations) = 0.
Without doing any calculations, compute the SD of this dataset:
0 0 0 0 10 10 10 10
Ans: The average is 5, so the deviations consist of four -5s and four 5s. The squared deviations are all 25, so the average squared deviation is also 25. SD = sqrt(25) = 5.
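Both of these answers can be confirmed with a direct translation of the definition SD = sqrt(average of squared deviations). The function name sd is a hypothetical choice for this sketch; it divides by n, not n - 1.

```python
import math


def sd(data):
    """SD as the square root of the average squared deviation (divides by n)."""
    mean = sum(data) / len(data)
    squared_deviations = [(x - mean) ** 2 for x in data]
    return math.sqrt(sum(squared_deviations) / len(data))


print(sd([4, 4, 4, 4, 4]))                # 0.0
print(sd([0, 0, 0, 0, 10, 10, 10, 10]))   # 5.0
```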
Use SPSS to compute SD⁺ of the hypothetical exam scores, which is Dataset 1 on the Datasets Page.
Ans: Select Analyze >> Descriptive Statistics >> Descriptives. Make sure that the mean and SD are selected.
If SD = 6.94 and the sample size n = 23, what is SD⁺?
Ans: Define SS = (x₁ - x̄)² + ... + (x₂₃ - x̄)². Then SD = 6.94 = sqrt(SS / 23). Solving for SS gives SS = 23 x 6.94² = 1107.76. Then SD⁺ = sqrt(SS / (n-1)) = sqrt(1107.76 / 22) = 7.10.
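Because SD and SD⁺ are built from the same sum of squares, the conversion collapses to SD⁺ = SD x sqrt(n / (n-1)). The Python sketch below checks the arithmetic; the function name sd_plus_from_sd is hypothetical.

```python
import math


def sd_plus_from_sd(sd, n):
    """Convert SD (divides by n) to SD+ (divides by n - 1)."""
    sum_of_squares = n * sd ** 2  # SS = n * SD^2
    return math.sqrt(sum_of_squares / (n - 1))


result = sd_plus_from_sd(6.94, 23)
print(round(result, 2))  # 7.1
# Same value from the shortcut SD * sqrt(n / (n - 1)).
print(math.isclose(result, 6.94 * math.sqrt(23 / 22)))
```

In Python's standard statistics module, pstdev divides by n (SD) and stdev divides by n - 1 (SD⁺).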
Compute the MAD of this dataset:
20 10 15 15
Ans: x̄ = 15, so MAD = (|20-15| + |10-15| + |15-15| + |15-15|) / 4 = (5 + 5 + 0 + 0) / 4 = 2.5. The absolute value of x, denoted by |x|, eliminates the sign from x.
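A short Python sketch of the calculation, taking MAD to be the mean absolute deviation about the mean as in the formula above. The function name is hypothetical. Note that elsewhere the abbreviation MAD is also commonly used for the median absolute deviation, which is a different statistic.

```python
def mean_absolute_deviation(data):
    """Average of the absolute deviations from the mean."""
    mean = sum(data) / len(data)
    return sum(abs(x - mean) for x in data) / len(data)


print(mean_absolute_deviation([20, 10, 15, 15]))
```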

Comparison of Mean and Median

The mean of a dataset is its center of gravity.
Find the center of gravity of a histogram cut out of cardboard.
The median divides a dataset in half.
If a histogram cut out of cardboard were cut at the median line, both of the resulting pieces would weigh the same.
The mean is affected more by changes in outliers than the median is.
The mean is pulled in the direction of the long tail of a skewed histogram, relative to the median.

Practice Problems

Compute the mean of each of these histograms by using a weighted average of the midpoints of each rectangle weighted by the proportion of observations represented by that rectangle.

(a)

Bin	Count
[0,1)	1
[1,2)	3
[2,3)	5
[3,4]	1

(b)

Bin	Count
[0,1)	3
[1,2)	5
[2,4]	2

(c)

Bin	Count
[0,1)	2
[1,2)	4
[2,2.5)	3
[2.5,3]	1

Ans: Compute the weighted average (x₁ w₁ + ... + x_n w_n) / (w₁ + ... + w_n), where xᵢ is the midpoint of the ith bin and wᵢ is the number or proportion of observations in the ith bin.

Ans for (a):  
  0.5 x 10 + 1.5 x 30 + 2.5 x 50 + 3.5 x 10   210
  ----------------------------------------- = --- = 2.1
              10 + 30 + 50 + 10               100

Ans for (b):  
0.5 x 30 + 1.5 x 50 + 3.0 x 20    150
------------------------------ =  --- = 1.5
         30 + 50 + 20             100

Ans for (c):  
  0.5 x 20 + 1.5 x 40 + 2.25 x 30 + 2.75 x 10   165
  ------------------------------------------- = --- = 1.65.
               20 + 40 + 30 + 10                100
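The weighted-average calculation can be sketched in Python as follows. The function name histogram_mean is hypothetical; it weights each bin midpoint by the bin count, which gives the same result as weighting by percentages.

```python
def histogram_mean(edges, counts):
    """Weighted average of bin midpoints, weighted by the bin counts."""
    midpoints = [(edges[i] + edges[i + 1]) / 2 for i in range(len(counts))]
    weighted_sum = sum(m * c for m, c in zip(midpoints, counts))
    return weighted_sum / sum(counts)


histograms = {
    "a": ([0, 1, 2, 3, 4], [1, 3, 5, 1]),
    "b": ([0, 1, 2, 4], [3, 5, 2]),
    "c": ([0, 1, 2, 2.5, 3], [2, 4, 3, 1]),
}
for name, (edges, counts) in histograms.items():
    print(name, round(histogram_mean(edges, counts), 2))
```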

The Ideal Measurement Model

No measurement is perfect.
Every measurement involves some random error and systematic bias.
The ideal measurement model assumes that a set of measurements has no systematic bias, and that the random errors are independent with the same standard deviation everywhere (homoscedastic).
More details on the Ideal Measurement Model.

Analyze the NBS-10 Dataset

Use SPSS to obtain the following for the NIST-10 Dataset on the Datasets Page.
1. x̄ and SD⁺
2. Histograms with three different bin widths
3. Outliers using the boxplot
4. z-scores
5. Outliers using z-scores
6. Standard error of the average
7. Boxplot after removing outliers (according to the first boxplot).
Warning: do not automatically delete the outliers from the dataset. They may be the most important observations in the dataset.
Here is a story (that might be an urban legend) about deleting outliers. Climatologists were studying the ozone levels in the upper atmosphere at the South Pole. In the 1960s, it was common for engineers to routinely delete outliers from a dataset, suspecting that they were bad observations. One of the data analysts was therefore deleting outliers from the ozone observations dataset. Because of this, the hole in the ozone layer at the South Pole was discovered several years later than it should have been.
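Outside of SPSS, the z-score and standard error steps in the list above can be sketched in Python. Everything specific here is an assumption made for illustration: the measurements are made up (they are not the dataset from the Datasets Page), the cutoff of 2 for flagging a z-score is arbitrary, and the standard error of the average is taken as SD⁺ / sqrt(n). Flagged values are only listed, not deleted.

```python
import math
import statistics


def z_scores(data):
    """Standardize each observation: z = (x - mean) / SD+."""
    mean = statistics.mean(data)
    sd_plus = statistics.stdev(data)  # statistics.stdev divides by n - 1
    return [(x - mean) / sd_plus for x in data]


def standard_error(data):
    """Standard error of the average, taken here as SD+ / sqrt(n)."""
    return statistics.stdev(data) / math.sqrt(len(data))


# Hypothetical measurements, NOT the dataset from the Datasets Page.
measurements = [10, 12, 11, 9, 13, 11, 10, 12, 11, 30]
CUTOFF = 2.0  # assumed threshold for flagging a z-score as an outlier

flagged = [x for x, z in zip(measurements, z_scores(measurements))
           if abs(z) > CUTOFF]
print(flagged)
print(round(standard_error(measurements), 3))
```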
