IT 223 -- Jan 21, 2016

Review Exercises

What percentage of observations are in the interval [200, 350) for this density histogram? The label for the vertical axis is percent per horizontal unit.

0.5 +
    |
0.4 +           +--------+
    |           |        |
0.3 +           |        |
    |           |        |
0.2 +     +-----+        |
    |     |     |        |
0.1 +     |     |        +-----------+
    |     |     |        |           |
0.0 +-----+-----+-----+--+--+-----+--+--+
      0  100   200   300   400   500   600
                Horizontal Units

The percentage of observations in a bin is represented by the area of the histogram bar. For the bar over the interval [200, 350), the area is

(350 - 200) * 0.4 = 150 * 0.4 = 60%

The vertical units of the histogram are percent per horizontal unit, so multiplying the bar height by the bar width (in horizontal units) gives a percentage.
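
The area calculation can be sketched in R. The helper name binPercent and the small data vector x are made up for illustration; they are not part of the course materials. The second half relies on the fact that R's hist() stores density as a proportion (not a percent) per horizontal unit, so the bar areas add up to 1 rather than 100.

```r
# Percentage of observations in one bin = bar width * bar height,
# where the height is measured in percent per horizontal unit.
binPercent <- function(left, right, height) {
  (right - left) * height
}

pct <- binPercent(200, 350, 0.4)   # the bar over [200, 350)
print(pct)

# hist() reports density as proportion per horizontal unit,
# so width * density summed over all bars is 1.
x <- c(12, 15, 21, 22, 22, 30, 34, 41, 47, 58)   # made-up data
h <- hist(x, plot = FALSE)
totalArea <- sum(h$density * diff(h$breaks))
print(totalArea)
```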

What is a parsimonious description of a histogram?
Answer: it means the histogram can be described succinctly, with very few descriptors. A normal histogram can be described with only two descriptors: the sample mean and the sample standard deviation. We can't use fewer than two descriptors, because both the center and the spread of the histogram must be described.

What does the R operator $ do?
Answer: it selects a column out of a dataframe and returns it as a vector.
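
A short sketch of the $ operator, using a small dataframe built only for this illustration (the variable names df and heights are arbitrary):

```r
# Small dataframe for illustration.
df <- data.frame(Height = c(1.56, 1.78, 1.65),
                 Weight = c(61, 84, 51))

heights <- df$Height        # select the Height column
print(heights)
print(is.vector(heights))   # the result is a plain vector, not a dataframe

# Because the result is a vector, it can be passed straight to functions.
print(mean(df$Weight))
```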

How do you create a dataframe from a CSV file?
Answer: Create the file ht-wt.txt in the directory C:/workspace. Then use these R statements:

> setwd("C:/workspace")
> getwd( )
[1] "C:/workspace"
> htWtDf <- read.csv("ht-wt.txt")
> print(htWtDf)
   Name Height Weight
1 Susan   1.56     61
2 David   1.78     84
3 Julie   1.65     51
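
The notes do not show the contents of ht-wt.txt. The sketch below assumes it holds a header row followed by one comma-separated row per person, and writes those lines to a temporary file so the example runs without C:/workspace. The variable names csvLines and path are illustrative only.

```r
# Assumed file contents: header row, then one row per person.
csvLines <- c("Name,Height,Weight",
              "Susan,1.56,61",
              "David,1.78,84",
              "Julie,1.65,51")

# Write the lines to a temporary file instead of C:/workspace/ht-wt.txt.
path <- tempfile(fileext = ".txt")
writeLines(csvLines, path)

# read.csv treats the first line as column names by default.
htWtDf <- read.csv(path)
print(htWtDf)
str(htWtDf)   # shows the type R chose for each column
```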

How do you create a dataframe named htWtDf from these R vectors without using a CSV file?

n <- c("Susan", "David", "Julie")
h <- c(1.56, 1.78, 1.65)
w <- c(61, 84, 51)

Answer:

> htWtDf <- data.frame(Name=n, Height=h, Weight=w)
> print(htWtDf)
   Name Height Weight
1 Susan   1.56     61
2 David   1.78     84
3 Julie   1.65     51

You can also input the vectors directly into the dataframe without creating variables for them:

htWtDf <- data.frame(
    Name=  c("Susan", "David", "Julie"),
    Height=c(1.56, 1.78, 1.65), 
    Weight=c(61, 84, 51))
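
As a quick check, the two ways of building the dataframe can be compared with identical(). The names fromVars and direct are chosen here only to keep the two results apart.

```r
n <- c("Susan", "David", "Julie")
h <- c(1.56, 1.78, 1.65)
w <- c(61, 84, 51)

# Built from existing vectors.
fromVars <- data.frame(Name = n, Height = h, Weight = w)

# Built with the vectors written inline.
direct <- data.frame(
    Name   = c("Susan", "David", "Julie"),
    Height = c(1.56, 1.78, 1.65),
    Weight = c(61, 84, 51))

print(identical(fromVars, direct))
print(dim(direct))   # number of rows, number of columns
```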

The Standard Deviation

The standard deviation is used to estimate the spread of the data.
Estimators of the Spread of a Dataset: SD, SD+, MAD
Practice Problems
1. Without doing any calculations, compute the SD of this dataset:
  4 4 4 4 4
  Ans: The mean is 4, so every deviation is 0 and the average of the squared deviations is 0. SD = sqrt(ave of squared deviations) = sqrt(0) = 0.
2. Without doing any calculations, compute the SD of this dataset:
  0 0 0 0 10 10 10 10
  Ans: The average is 5, so the deviations consist of four -5s and four 5s. The squared deviations are all 25, so the average squared deviation is also 25. SD = sqrt(25) = 5.
3. Use R to compute SD+ of the hypothetical exam scores.
  Ans: if x is a data vector, use sd(x) to compute SD+.
4. Compute the MAD of this dataset:
  20 10 15 15
  Ans: The mean is 15, so MAD = (|20-15| + |10-15| + |15-15| + |15-15|) / 4 = 10 / 4 = 2.5. The absolute value of x, denoted by |x|, eliminates the sign from x.
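
The three spread estimators can be sketched in R as follows. The function names popSd and meanAbsDev are made up for this example. The sketch assumes SD divides by n (the average of the squared deviations), SD+ is what sd() returns (divisor n - 1), and MAD is the mean absolute deviation from the mean. Note that R's built-in mad() computes a scaled median absolute deviation, which is a different quantity.

```r
# SD: square root of the average squared deviation (divisor n).
popSd <- function(x) sqrt(mean((x - mean(x))^2))

# MAD as used in these notes: mean absolute deviation from the mean.
meanAbsDev <- function(x) mean(abs(x - mean(x)))

print(popSd(c(4, 4, 4, 4, 4)))        # all values equal, so no spread
x <- c(0, 0, 0, 0, 10, 10, 10, 10)
print(popSd(x))                       # every deviation is -5 or 5
print(meanAbsDev(c(20, 10, 15, 15)))  # (5 + 5 + 0 + 0) / 4

# SD+ (R's sd) uses divisor n - 1, so it is slightly larger than SD.
n <- length(x)
print(sd(x))
print(popSd(x) * sqrt(n / (n - 1)))   # same value as sd(x)
```

These calls give 0, 5, and 2.5 for the three hand-computed problems.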

z-scores

Another name for z-scores is standardized scores.
Consider the data vector x with elements x_1, ..., x_n. Let x̄ be the sample mean and SD+ the sample standard deviation. The z-scores are defined as
z_i = (x_i - x̄) / SD+
Compute z-scores using R like this:
```
z <- (x - mean(x)) / sd(x)
```
A z-score indicates how many standard deviations the observation is from the sample mean.
z-scores can be used to define outliers:
An extreme outlier is an observation that has a z-score greater than 3.0 or less than -3.0.
A mild outlier is an observation whose z-score is between 2.0 and 3.0 or between -3.0 and -2.0.
The outliers defined by z-scores are usually different from the outliers defined by the boxplot.
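
The outlier rules can be turned into a small R helper. This is a sketch: the function name classifyOutlier and the data vector are invented for illustration, and treating the boundary values exactly 2.0 and 3.0 as mild is an assumption, since the definition does not say which side they fall on.

```r
# Made-up data with one unusually large value at the end.
x <- c(10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 12, 11, 10, 12, 30)
z <- (x - mean(x)) / sd(x)

# Label each z-score using the rules above
# (assumption: |z| exactly 2 or 3 counts as mild).
classifyOutlier <- function(z) {
  a <- abs(z)
  ifelse(a > 3, "extreme", ifelse(a >= 2, "mild", "none"))
}

labels <- classifyOutlier(z)
print(data.frame(x = x, z = round(z, 2), label = labels))

# scale() performs the same standardization (it also divides by sd).
print(all.equal(as.vector(scale(x)), z))
```

In this made-up vector only the value 30 is flagged, and it is an extreme outlier.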

Practice Problem: compute the z-scores to classify the outliers of the PaperThickness Dataset. Answer:

> setwd("C:/workspace")
> t <- read.csv("paper-thickness.txt")$Thickness
> z <- (t - mean(t)) / sd(t)
> print(z)
[1] 0.56121951 0.28684553 0.28684553 0.01247154 0.28684553 0.28684553
[7] 0.01247154 0.28684553 1.10996747 -0.26190244 -1.08502438 -0.81065040
[13] -0.53627642 -2.18252031 -0.81065040 2.20746340 0.56121951 1.38434145
[19] -1.35939837 -1.35939837 0.56121951 0.56121951

There are two mild outliers: -2.18252031 at index 14 and 2.20746340 at index 16. There are no extreme outliers.
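
Rather than scanning the printed output by eye, which() can report the indices. The sketch below copies the z-scores from the output above so that it runs without the data file. The inclusive boundaries at 2.0 and 3.0 are an assumption.

```r
# z-scores copied from the printed output above.
z <- c( 0.56121951,  0.28684553,  0.28684553,  0.01247154,  0.28684553,
        0.28684553,  0.01247154,  0.28684553,  1.10996747, -0.26190244,
       -1.08502438, -0.81065040, -0.53627642, -2.18252031, -0.81065040,
        2.20746340,  0.56121951,  1.38434145, -1.35939837, -1.35939837,
        0.56121951,  0.56121951)

mild    <- which(abs(z) >= 2 & abs(z) <= 3)   # indices of mild outliers
extreme <- which(abs(z) > 3)                  # indices of extreme outliers

print(mild)
print(z[mild])
print(extreme)   # integer(0) means no observations qualify
```

For these values, which() returns indices 14 and 16 as mild outliers and finds no extreme outliers.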

Project 2a

Look at the Project 2a Description.

Here is an example using the PaperThickness dataset that displays a histogram of the data with three different bin widths by explicitly setting the breakpoints for two of the histograms. The first histogram uses the default breakpoints.

setwd("C:/workspace")
t <- read.csv("paper-thickness.txt")$Thickness
hist(t)
# Look at the default histogram before
# trying to set the breakpoints of the
# next two histograms
hist(t, breaks=c(0.09, 0.10, 0.11, 0.12))
hist(t, breaks=seq(0.09, 0.12, 0.002))

# Remember, seq(s, e, b) creates a sequence
# that starts at s, ends at e, and increases
# by b until it reaches e.
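
Before plotting, it can help to inspect the breakpoints and the bin counts. The sketch below uses made-up thickness values (not the PaperThickness data) and plot=FALSE, which makes hist() return the bin information without drawing. The breakpoints must span the whole range of the data; otherwise hist() stops with an error.

```r
# Breakpoints from 0.09 to 0.12 in steps of 0.002.
breaks <- seq(0.09, 0.12, 0.002)
print(breaks)
print(length(breaks))   # one more breakpoint than there are bins

# Made-up thickness values, all inside [0.09, 0.12].
thick <- c(0.095, 0.101, 0.103, 0.105, 0.105,
           0.107, 0.107, 0.109, 0.111, 0.117)

# plot = FALSE returns the histogram object instead of drawing it.
h <- hist(thick, breaks = breaks, plot = FALSE)
print(h$counts)

# Every observation lands in exactly one bin.
print(sum(h$counts) == length(thick))
```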

The Normal Distribution