12. Two Way Tables
Here we look at some examples of how to work with two way tables. We assume that you can enter data and understand the different data types.
12.1. Creating a Table from Data
We first look at how to create a table from raw data. Here we use a fictitious data set, smoker.csv. This data set was created only to be used as an example, and the numbers were chosen to match an example from a textbook, p. 629 of the 4th edition of Moore and McCabe’s Introduction to the Practice of Statistics. You should look at the data set in a spreadsheet to see how it is entered. The rows are ordered in a way that makes it easier to see what information is in the data.
The idea is that 356 people have been polled on their smoking status (Smoke) and their socioeconomic status (SES). For each person it was determined whether they are a current smoker, a former smoker, or have never smoked. Also, for each person their socioeconomic status was determined (low, middle, or high). The data file contains only two columns, and we ask R to read them both as factors. (Before version 4.0, R converted text columns to factors by default; in later versions the stringsAsFactors argument must be set explicitly.)
> smokerData <- read.csv(file='smoker.csv',sep=',',header=TRUE,stringsAsFactors=TRUE)
> summary(smokerData)
     Smoke         SES
 current:116   High  :211
 former :141   Low   : 93
 never  : 99   Middle: 52
You can create a two way table of occurrences using the table command and the two columns in the data frame:
> smoke <- table(smokerData$Smoke,smokerData$SES)
> smoke
        High Low Middle
current   51  43     22
former    92  28     21
never     68  22      9
In this example, there are 51 people who are current smokers and are in the high SES group. Note that the table command treats the two vectors it is given as factors, and vectors that are not already factors are converted. (More information on this is available in the chapter on data types.)
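Because the columns of the table follow the order of the factor levels, the alphabetical default (High, Low, Middle) is not always the most readable. The sketch below uses a small made-up data frame (the name toy and its seven rows are hypothetical and are not the contents of smoker.csv) to show one way to put the levels in a natural order before calling table:

```r
# Hypothetical miniature data set, invented for illustration only
toy <- data.frame(
  Smoke = c("current", "never", "former", "current", "never", "former", "never"),
  SES   = c("Low", "High", "High", "Low", "Middle", "High", "High"),
  stringsAsFactors = TRUE
)

# Reorder the SES levels so the columns read Low, Middle, High
toy$SES <- factor(toy$SES, levels = c("Low", "Middle", "High"))

# Cross-tabulate the two factors
toyTable <- table(toy$Smoke, toy$SES)
print(toyTable)
```

The same factor call could be applied to smokerData$SES before building the table, assuming the file uses exactly the labels Low, Middle, and High.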
12.2. Creating a Table Directly
Sometimes you are given data that has already been tabulated, and you would like to create a table object from the counts. Here we examine how to create the table directly. Unfortunately, this is not as direct a method as might be desired. Here we create a matrix of numbers, specify the row and column names, and then convert it to a table.
In the example below we will create a table identical to the one given above. In that example we have 3 columns, and the numbers are specified by going across each row from top to bottom. We need to specify the data and the number of columns:
> smoke <- matrix(c(51,43,22,92,28,21,68,22,9),ncol=3,byrow=TRUE)
> colnames(smoke) <- c("High","Low","Middle")
> rownames(smoke) <- c("current","former","never")
> smoke <- as.table(smoke)
> smoke
        High Low Middle
current   51  43     22
former    92  28     21
never     68  22      9
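As a possible shortcut, the row and column names can be supplied in the same call that builds the matrix, through the dimnames argument. This is only a sketch of an alternative; the dimension labels Smoke and SES are chosen here to match the column names of the data file:

```r
# Build the counts row by row and name both dimensions in one step
smoke <- matrix(
  c(51, 43, 22,
    92, 28, 21,
    68, 22,  9),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    Smoke = c("current", "former", "never"),
    SES   = c("High", "Low", "Middle")
  )
)

# Convert the matrix to a table, as before
smoke <- as.table(smoke)
print(smoke)
```

Naming the dimensions has the side effect that the printed table and the mosaic plot are labelled with Smoke and SES.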
12.3. Tools For Working With Tables
Here we look at some of the commands available to help look at the information in a table in different ways. We assume that the data has been entered using one of the methods above, and that the table is called “smoke.” First, there are a couple of ways to get graphical views of the data:
> barplot(smoke,legend=T,beside=T,main='Smoking Status by SES')
> plot(smoke,main="Smoking Status By Socioeconomic Status")
There are a number of ways to get the marginal distributions using the margin.table command. If you just give the command the table it calculates the total number of observations. You can also calculate the marginal distributions across the rows or columns based on the one optional argument:
> margin.table(smoke)
[1] 356
> margin.table(smoke,1)
current  former   never
    116     141      99
> margin.table(smoke,2)
  High    Low Middle
   211     93     52
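If it is more convenient to see the totals next to the counts, the addmargins command appends them to the table itself. The following is a minimal sketch that rebuilds the table from section 12.2 so that it can be run on its own:

```r
# Recreate the table of counts from section 12.2
smoke <- matrix(c(51, 43, 22, 92, 28, 21, 68, 22, 9), ncol = 3, byrow = TRUE)
colnames(smoke) <- c("High", "Low", "Middle")
rownames(smoke) <- c("current", "former", "never")
smoke <- as.table(smoke)

# Append a "Sum" row and a "Sum" column holding the marginal totals
withTotals <- addmargins(smoke)
print(withTotals)

# rowSums and colSums give the same margins as margin.table
print(rowSums(smoke))
print(colSums(smoke))
```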
Combining these commands you can get the proportions:
> smoke/margin.table(smoke)
              High        Low     Middle
current 0.14325843 0.12078652 0.06179775
former  0.25842697 0.07865169 0.05898876
never   0.19101124 0.06179775 0.02528090
> margin.table(smoke,1)/margin.table(smoke)
  current    former     never
0.3258427 0.3960674 0.2780899
> margin.table(smoke,2)/margin.table(smoke)
     High       Low    Middle
0.5926966 0.2612360 0.1460674
That is a little cumbersome. Fortunately, there is a better way to get the proportions using the prop.table command. With no optional argument it gives each cell as a proportion of the grand total. With the optional argument set to 1 each row sums to one, and with it set to 2 each column sums to one:
> prop.table(smoke)
              High        Low     Middle
current 0.14325843 0.12078652 0.06179775
former  0.25842697 0.07865169 0.05898876
never   0.19101124 0.06179775 0.02528090
> prop.table(smoke,1)
             High       Low    Middle
current 0.4396552 0.3706897 0.1896552
former  0.6524823 0.1985816 0.1489362
never   0.6868687 0.2222222 0.0909091
> prop.table(smoke,2)
             High       Low    Middle
current 0.2417062 0.4623656 0.4230769
former  0.4360190 0.3010753 0.4038462
never   0.3222749 0.2365591 0.1730769
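A quick way to check which margin a call to prop.table conditions on is to add the proportions back up. The sketch below rebuilds the table so it runs on its own, confirms that the rows (or columns) sum to one, and rounds the row proportions to percentages. The variable names rowProps, colProps, and rowPercent are just illustrative:

```r
# Recreate the table of counts from section 12.2
smoke <- matrix(c(51, 43, 22, 92, 28, 21, 68, 22, 9), ncol = 3, byrow = TRUE)
colnames(smoke) <- c("High", "Low", "Middle")
rownames(smoke) <- c("current", "former", "never")
smoke <- as.table(smoke)

# Proportions within each smoking status (each row sums to 1)
rowProps <- prop.table(smoke, 1)
print(rowSums(rowProps))

# Proportions within each SES group (each column sums to 1)
colProps <- prop.table(smoke, 2)
print(colSums(colProps))

# The same row proportions expressed as percentages, one decimal place
rowPercent <- round(100 * rowProps, 1)
print(rowPercent)
```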
If you want to do a chi-squared test to determine whether smoking status and SES are independent (that is, whether the proportions differ between groups), there is an easy way to do this. If we want to test at the 5% significance level we need only look at a summary of the table:
> summary(smoke)
Number of cases in table: 356
Number of factors: 2
Test for independence of all factors:
Chisq = 18.51, df = 4, p-value = 0.0009808
Since the p-value is less than 5% we can reject the null hypothesis of independence at the 5% significance level and can say that the proportions vary.
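The same test is also available through the chisq.test command, which returns an object holding the statistic, the degrees of freedom, the p-value, and the table of expected counts. The following sketch rebuilds the table so that it runs on its own; the name result is arbitrary:

```r
# Recreate the table of counts from section 12.2
smoke <- matrix(c(51, 43, 22, 92, 28, 21, 68, 22, 9), ncol = 3, byrow = TRUE)
colnames(smoke) <- c("High", "Low", "Middle")
rownames(smoke) <- c("current", "former", "never")
smoke <- as.table(smoke)

# Pearson chi-squared test of independence between rows and columns
result <- chisq.test(smoke)
print(result)

# Individual pieces of the result
print(result$statistic)  # chi-squared statistic
print(result$parameter)  # degrees of freedom
print(result$p.value)    # p-value
print(result$expected)   # expected counts under independence
```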
Of course, there is a hard way to do this. This is not for the faint of heart and involves some linear algebra which we will not describe. If you wish to calculate the table of expected values then you need to multiply the vectors of the margins and divide by the total number of observations:
> expected <- as.array(margin.table(smoke,1)) %*% t(as.array(margin.table(smoke,2))) / margin.table(smoke)
> expected
            High      Low   Middle
current 68.75281 30.30337 16.94382
former  83.57022 36.83427 20.59551
never   58.67697 25.86236 14.46067
(The “t” function takes the transpose of the array.)
The result is an array that can be directly compared to the existing table. We need the square of the difference between the two tables divided by the expected values. The sum of all these values is the chi-squared statistic:
> chi <- sum((expected - as.array(smoke))^2/expected)
> chi
[1] 18.50974
We can then get the p-value for this statistic. A table with 3 rows and 3 columns has (3-1)×(3-1) = 4 degrees of freedom:
> 1-pchisq(chi,df=4)
[1] 0.0009808236
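For comparison, the same hand calculation can be written with the outer command, which forms every product of a row total with a column total and so avoids the explicit transpose. This is a sketch of an equivalent route, not a replacement for the steps above; the names rowTotals, colTotals, degreesOfFreedom, and pValue are illustrative:

```r
# Recreate the table of counts from section 12.2
smoke <- matrix(c(51, 43, 22, 92, 28, 21, 68, 22, 9), ncol = 3, byrow = TRUE)
colnames(smoke) <- c("High", "Low", "Middle")
rownames(smoke) <- c("current", "former", "never")
smoke <- as.table(smoke)

# Expected count in each cell: (row total * column total) / grand total
rowTotals <- rowSums(smoke)
colTotals <- colSums(smoke)
expected <- outer(rowTotals, colTotals) / sum(smoke)

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi <- sum((smoke - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
degreesOfFreedom <- (nrow(smoke) - 1) * (ncol(smoke) - 1)

# Upper tail probability, equivalent to 1 - pchisq(chi, df)
pValue <- pchisq(chi, df = degreesOfFreedom, lower.tail = FALSE)
print(c(chi = chi, df = degreesOfFreedom, p = pValue))
```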
12.4. Graphical Views of Tables
The plot command will automatically produce a mosaic plot if its primary argument is a table. Alternatively, you can call the mosaicplot command directly.
> smokerData <- read.csv(file='smoker.csv',sep=',',header=T)
> smoke <- table(smokerData$Smoke,smokerData$SES)
> mosaicplot(smoke)
> help(mosaicplot)
The mosaicplot command takes many of the same arguments as the other plotting commands for annotating a plot:
> mosaicplot(smoke,main="Smokers",xlab="Status",ylab="Economic Class")
If you wish to change which variable determines the primary proportion, you can use the sort option. It takes a permutation of the dimension numbers and controls which variable is split first, and therefore whether the width or the height is used for the first proportional length:
> mosaicplot(smoke,main="Smokers",xlab="Status",ylab="Economic Class")
> mosaicplot(smoke,sort=c(2,1))
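Finally, if you wish to switch the direction of the splits (vertical versus horizontal) you can use the dir option, which gives one direction for each dimension of the table. The default alternates directions beginning with a vertical split, so c("h","v") reverses it:
> mosaicplot(smoke,main="Smokers",xlab="Status",ylab="Economic Class")
> mosaicplot(smoke,dir=c("h","v"))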
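The mosaicplot command also accepts color and shade arguments. The sketch below rebuilds the table so it can be run on its own and draws two versions: one with a grey for each SES group, and one shaded by the residuals from the independence model, which can help to show which cells drive the chi-squared result. The particular greys and titles are arbitrary choices:

```r
# Recreate the table of counts from section 12.2
smoke <- matrix(c(51, 43, 22, 92, 28, 21, 68, 22, 9), ncol = 3, byrow = TRUE)
colnames(smoke) <- c("High", "Low", "Middle")
rownames(smoke) <- c("current", "former", "never")
smoke <- as.table(smoke)

# One fill color per SES group
mosaicplot(smoke, main = "Smokers", xlab = "Status", ylab = "Economic Class",
           color = c("grey80", "grey50", "grey20"))

# Shade each cell according to its residual under independence
mosaicplot(smoke, main = "Smokers (shaded by residual)",
           xlab = "Status", ylab = "Economic Class", shade = TRUE)
```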