17. Case Study: Working Through a HW Problem

We look at a sample homework problem and the R commands necessary to explore the problem. It is assumed that you are familiar will all of the commands discussed throughout this tutorial.

17.1. Problem Statement

This problem comes from the 5th edition of Moore and McCabe’s Introduction to the Practice of Statistics and can be found on pp. 466-467. The data consists of the emissions of three different pollutants from 46 different engines. A copy of the data we use here is available. The problem examined here is different from that given in the book but is motivated by the discussion in the book.

In the following examples we will look at the carbon monoxide data which is one of the columns of this data set. First we will transform the data so that it is close to being normally distributed. We will then find the confidence interval for the mean and then perform a significance test to evaluate whether or not the data is away from a fixed standard. Finally, we will find the power of the test to detect a fixed difference from that standard. We will assume that a confidence level of 95% is used throughout.

17.2. Transforming the Data

We first begin a basic examination of the data. A copy of the data file can be found at table_7_3.csv.The first step is to read in the file and get a summary of the center and spread of the data. In this instance we will focus only on the carbon monoxide data.

> engine <- read.csv(file="table_7_3.csv",sep=",",head=TRUE)
> names(engine)
[1] "en"  "hc"  "co"  "nox"
> summary(engine)
       en              hc               co              nox
 Min.   : 1.00   Min.   :0.3400   Min.   : 1.850   Min.   :0.490
 1st Qu.:12.75   1st Qu.:0.4375   1st Qu.: 4.388   1st Qu.:1.110
 Median :24.50   Median :0.5100   Median : 5.905   Median :1.315
 Mean   :24.00   Mean   :0.5502   Mean   : 7.879   Mean   :1.340
 3rd Qu.:35.25   3rd Qu.:0.6025   3rd Qu.:10.015   3rd Qu.:1.495
 Max.   :46.00   Max.   :1.1000   Max.   :23.530   Max.   :2.940

At first glance the carbon monoxide data appears to be skewed. The spread between the third quartile and the max is five times the spread between the min and the first quartile. A boxplot is show in Figure 1. showing that the data appears to be skewed. This is further confirmed in the histogram which is shown in Figure 2. Finally, a normal qq plot is given in Figure 3. The data does not appear to be normal.

> qqnorm(engine$co,main="Carbon Monoxide")
> qqline(engine$co)
> boxplot(engine$co,main="Carbon Monoxide")
> hist(engine$co,main="Carbon Monoxide")
> qqnorm(engine$co,main="Carbon Monoxide")
> qqline(engine$co)
>
Boxplot of the  CO data.

Figure 1.

Boxplot of the Carbon Monoxide Data.
Histogram of the CO data.

Figure 2.

Histogram of the Carbon Monoxide Data.
QQPlot of the CO data

Figure 3.

Normal QQ Plot of the Carbon Monoxide Data.

We next see if the data can be transformed to something that is closer to being normally distributed. We examine the logarithm of the data. First, the boxplot of the log of the data appears to be more evenly distributed as shown in Figure 4. Also, the histogram appears to be centered and closer to normal in Figure 5. Finally, the normal qq plot is shown in in Figure 6. It shows that the data is more consistent with what we would expect from normal data.

> lengine <- log(engine$co)
> boxplot(lengine,main="Carbon Monoxide")
> hist(lengine,main="Carbon Monoxide")
> qqnorm(lengine,main="QQ Plot for the Log of the Carbon Monoxide")
> qqline(lengine)
>
Boxplot of the log of the CO data.

Figure 4.

Boxplot of the Logarithm of the Carbon Monoxide Data.
Histogram of the logarithm of the Carbon Monoxide Data.

Figure 5.

Histogram of the Logarithm of the Carbon Monoxide Data.
QQPlot of the log of the CO data

Figure 6.

Normal QQ Plot of the Logarithm of the Carbon Monoxide Data.

There is strong evidence that the logarithm of the carbon monoxide data more closely resembles a normal distribution then does the raw carbon monoxide data. For that reason all of the analysis that follows will be for the logarithm of the data and will make use of the new list “lengine.”

17.3. The Confidence Interval

We now find the confidence interval for the carbon monoxide data. As stated above, we will work with the logarithm of the data because it appears to be closer to a normal distribution. This data is stored in the list called “lengine.” Since we do not know the true standard deviation we will use the sample standard deviation and will use a t-distribution.

We first find the sample mean, the sample standard deviation, and the number of observations:

> m <- mean(lengine)
> s <- sd(lengine)
> n <- length(lengine)
> m
[1] 1.883678
> s
[1] 0.5983851
> n
[1] 48

Now we find the standard error:

> se <- s/sqrt(n)
> se
[1] 0.08636945

Finally, the margin of error is found based on a 95% confidence level which can then be used to define the confidence interval:

> error <- se*qt(0.975,df=n-1)
> error
[1] 0.1737529
> left <- m-error
> right <- m + error
> left
[1] 1.709925
> right
[1] 2.057431

The 95% confidence interval is between 1.71 and 2.06. Keep in mind that this is for the logarithm so the 95% confidence interval for the original data can be found by “undoing” the logarithm:

> exp(left)
[1] 5.528548
> exp(right)
[1] 7.82584

So the 95% confidence interval for the carbon monoxide is between 5.53 and 7.83.

17.4. Test of Significance

We now perform a test of significance. Here we suppose that ideally the engines should have a mean level of 5.4 and do a two-sided hypothesis test. Here we assume that the true mean is labeled \(\mu_x\) and state the hypothesis test:

\[ \begin{align}\begin{aligned}H_o: \mu_x & = & 5.4,\\H_a: \mu_x & \neq & 5.4,\end{aligned}\end{align} \]

To perform the hypothesis test we first assume that the null hypothesis is true and find the confidence interval around the assumed mean. Fortunately, we can use the values from the previous step:

> lNull <- log(5.4) - error
> rNull <- log(5.4) + error
> lNull
[1] 1.512646
> rNull
[1] 1.860152
> m
[1] 1.883678

The sample mean lies outside of the assumed confidence interval so we can reject the null hypothesis. There is a low probability that we would have obtained our sample mean if the true mean really were 5.4.

Another way to approach the problem would be to calculate the actual p-value for the sample mean that was found. Since the sample mean is greater than 5.4 it can be found with the following code:

> 2*(1-pt((m-log(5.4))/se,df=n-1))
[1] 0.02692539

Since the p-value is 2.7% which is less than 5% we can reject the null hypothesis.

Note that there is yet another way to do this. The function t.test will do a lot of this work for us.

> t.test(lengine,mu = log(5.4),alternative = "two.sided")

        One Sample t-test

data:  lengine
t = 2.2841, df = 47, p-value = 0.02693
alternative hypothesis: true mean is not equal to 1.686399
95 percent confidence interval:
 1.709925 2.057431
sample estimates:
mean of x
 1.883678

More information and a more complete list of the options for this command can be found using the help command:

> help(t.test)

17.5. The Power of the test

We now find the power of the test. To find the power we need to set a level for the mean and then find the probability that we would accept the null hypothesis if the mean is really at the prescribed level. Here we will find the power to detect a difference if the level were 7. Three different methods are examined. The first is a method that some books advise to use if you do not have a non-central t-test available. The second does make use of the non-central t-test. Finally, the third method makes use of a customized R command.

We first find the probability of accepting the null hypothesis if the level really were 7. We assume that the true mean is 7 and then find the probability that a sample mean would fall within the confidence interval if the null hypothesis were true. Keep in mind that we have to transform the level of 7 by taking its logarithm. Also keep in mind that this is a two-sided test:

> tLeft <- (lNull-log(7))/(s/sqrt(n))
> tRight <- (rNull-log(7))/(s/sqrt(n))
> p <- pt(tRight,df=n-1) - pt(tLeft,df=n-1)
> p
[1] 0.1629119
> 1-p
[1] 0.8370881

So the probability of making a type II error is approximately 16.3%, and the probability of detecting a difference if the level really is 7 is approximately 83.7%.

Another way to find the power is to use a non-centrality parameter. This is the method that many books advise over the previous method. The idea is that you give it the critical t-values associated with your test and also provide a parameter that indicates how the mean is shifted.

> t <- qt(0.975,df=n-1)
> shift <- (log(5.4)-log(7))/(s/sqrt(n))
> pt(t,df=n-1,ncp=shift)-pt(-t,df=n-1,ncp=shift)
[1] 0.1628579
> 1-(pt(t,df=n-1,ncp=shift)-pt(-t,df=n-1,ncp=shift))
[1] 0.8371421

Again, we see that the power of the test is approximately 83.7%. Note that this result is slightly off from the previous answer. This approach is often recommended over the previous approach.

The final approach we examine allows us to do all the calculations in one step. It makes use of the non-centrality parameter as in the previous example, but all of the commands are done for us.

> power.t.test(n=n,delta=log(7)-log(5.4),sd=s,sig.level=0.05,
              type="one.sample",alternative="two.sided",strict = TRUE)

     One-sample t test power calculation

              n = 48
          delta = 0.2595112
             sd = 0.5983851
      sig.level = 0.05
          power = 0.8371421
    alternative = two.sided

This is a powerful command that can do much more than just calculate the power of a test. For example it can also be used to calculate the number of observations necessary to achieve a given power. For more information check out the help page, help(power.t.test).