Abstract
This review introduces methods for investigating relationships between two qualitative (categorical) variables. The χ^{2} test of association is described, together with the modifications needed for small samples. The test for trend, in which at least one of the variables is ordinal, is also outlined. Risk measurement is discussed. The calculation of confidence intervals for proportions and for differences between proportions is described. Situations in which samples are matched are considered.
Keywords:
χ^{2} test of association; Fisher's exact test; McNemar's test; odds ratio; risk ratio; Yates' correction
Introduction
In the previous statistics reviews most of the procedures discussed are appropriate for quantitative measurements. However, qualitative, or categorical, data are frequently collected in medical investigations. For example, variables assessed might include sex, blood group, classification of disease, or whether the patient survived. Categorical variables may also comprise grouped quantitative variables, for example age could be grouped into 'under 20 years', '20–50 years' and 'over 50 years'. Some categorical variables may be ordinal, that is the data arising can be ordered. Age group is an example of an ordinal categorical variable.
When using categorical variables in an investigation, the data can be summarized in the form of frequencies, or counts, of patients in each category. If we are interested in the relationship between two variables, then the frequencies can be presented in a two-way, or contingency, table. For example, Table 1 comprises the numbers of patients in a two-way classification according to site of central venous cannula and infectious complications. Interest here is in whether there is any relationship, or association, between the site of cannulation and the incidence of infectious complications. The question could also be phrased in terms of proportions, for example whether the proportions of patients in the three groups determined by site of central venous cannula differ according to type of infectious complication.
Table 1. Numbers of patients classified by site of central venous cannula and infectious complication
χ^{2} test of association
In order to test whether there is an association between two categorical variables, we calculate the number of individuals we would get in each cell of the contingency table if the proportions in each category of one variable remained the same regardless of the categories of the other variable. These values are the frequencies we would expect under the null hypothesis that there is no association between the variables, and they are called the expected frequencies. For the data in Table 1, the proportions of patients in the sample with cannulae sited at the internal jugular, subclavian and femoral veins are 934/1706, 524/1706, 248/1706, respectively. There are 1305 patients with no infectious complications. So the frequency we would expect in the internal jugular site category is 1305 × (934/1706) = 714.5. Similarly for the subclavian and femoral sites we would expect frequencies of 1305 × (524/1706) = 400.8 and 1305 × (248/1706) = 189.7.
We repeat these calculations for the patients with infections at the exit site and with bacteraemia/septicaemia to obtain the following:
Exit site: 245 × (934/1706) = 134.1, 245 × (524/1706) = 75.3, 245 × (248/1706) = 35.6
Bacteraemia/septicaemia: 156 × (934/1706) = 85.4, 156 × (524/1706) = 47.9, 156 × (248/1706) = 22.7
We thus obtain a table of expected frequencies (Table 2). Note that 1305 × (934/1706) is the same as 934 × (1305/1706), and so equally we could have worded the argument in terms of proportions of patients in each of the infectious complications categories remaining constant for each central line site. In each case, the calculation is conditional on the sizes of the row and column totals and on the total sample size.
Table 2. Numbers of patients expected in each classification if there were no association between site of central venous cannula and infectious complication
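The expected frequencies depend only on the marginal totals, so they can be reproduced from the row and column totals quoted above. The following is a minimal sketch; the function name `expected_frequencies` and the dictionary layout are illustrative choices, not part of the original analysis.

```python
# Marginal totals quoted in the text for Table 1.
site_totals = {"internal jugular": 934, "subclavian": 524, "femoral": 248}
complication_totals = {
    "none": 1305,
    "exit site": 245,
    "bacteraemia/septicaemia": 156,
}
n_total = sum(site_totals.values())


def expected_frequencies(row_totals, col_totals):
    """Expected count per cell: row total x column total / grand total."""
    n = sum(row_totals.values())
    return {
        (row, col): row_total * col_total / n
        for row, row_total in row_totals.items()
        for col, col_total in col_totals.items()
    }


expected = expected_frequencies(site_totals, complication_totals)
for (site, complication), value in expected.items():
    print(f"{site:17s} {complication:24s} {value:7.1f}")
```

Because each expected count is symmetric in the row and column totals, the same values are obtained whichever variable is treated as defining the groups.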
The test of association involves calculating the differences between the observed and expected frequencies. If the differences are large, then this suggests that there is an association between one variable and the other. The difference for each cell of the table is scaled according to the expected frequency in the cell. The calculated test statistic for a table with r rows and c columns is given by:
$$\chi^{2} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^{2}}{E_{ij}}$$
where O_{ij} is the observed frequency and E_{ij} is the expected frequency in the cell in row i and column j. If the null hypothesis of no association is true, then the calculated test statistic approximately follows a χ^{2} distribution with (r − 1) × (c − 1) degrees of freedom (where r is the number of rows and c the number of columns). This approximation can be used to obtain a P value.
For the data in Table 1, the test statistic is:
1.134 + 2.380 + 1.314 + 6.279 + 21.531 + 2.052 + 2.484 + 14.069 + 0.020 = 51.26
Comparing this value with a χ^{2} distribution with (3 − 1) × (3 − 1) = 4 degrees of freedom, a P value of less than 0.001 is obtained either by using a statistical package or referring to a χ^{2} table (such as Table 3), in which 51.26 being greater than 18.47 leads to the conclusion that P < 0.001. Thus, there is a probability of less than 0.001 of obtaining frequencies like the ones observed if there were no association between site of central venous line and infectious complication. This suggests that there is an association between site of central venous line and infectious complication.
Table 3. Percentage points of the χ^{2} distribution produced on a spreadsheet
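As a rough check on the table lookup, the upper tail probability of a χ^{2} distribution with an even number of degrees of freedom has a simple closed form, so it can be evaluated without a statistics package. The sketch below is illustrative only; the function name is an assumption, and a statistical library would normally be used instead.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """P(X >= x) for a chi-squared variable with an even number of df.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!,
    which only holds when df is a positive even integer.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2.0
    terms = (half**k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * sum(terms)


# Test statistic from the text, and the 0.001 critical value for 4 df.
p_observed = chi2_upper_tail_even_df(51.26, 4)
p_at_critical = chi2_upper_tail_even_df(18.47, 4)
print(f"P for 51.26 on 4 df: {p_observed:.2e}")
print(f"P for 18.47 on 4 df: {p_at_critical:.4f}")
```

The second value should be close to 0.001, which is why 18.47 appears as the 0.1% point for 4 degrees of freedom.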
Residuals
The χ^{2} test indicates whether there is an association between two categorical variables. However, unlike the correlation coefficient between two quantitative variables (see Statistics review 7 [1]), it does not in itself give an indication of the strength of the association. In order to describe the association more fully, it is necessary to identify the cells that have large differences between the observed and expected frequencies. These differences are referred to as residuals, and they can be standardized and adjusted to follow a Normal distribution with mean 0 and standard deviation 1 [2]. The adjusted standardized residuals, d_{ij}, are given by:
$$d_{ij} = \frac{O_{ij} - E_{ij}}{\sqrt{E_{ij}\,(1 - n_{i.}/N)\,(1 - n_{.j}/N)}}$$
where n_{i.} is the total frequency for row i, n_{.j} is the total frequency for column j, and N is the overall total frequency. In the example, the adjusted standardized residual for those with cannulae sited at the internal jugular and no infectious complications is calculated as:
$$d = \frac{686 - 714.5}{\sqrt{714.5\,(1 - 934/1706)\,(1 - 1305/1706)}} = -3.3$$
Table 4 shows the adjusted standardized residuals for each cell. The larger the absolute value of the residual, the larger the difference between the observed and expected frequencies, and therefore the more significant the association between the two variables. Subclavian site/no infectious complication has the largest residual, being 6.2. Because it is positive there are more individuals than expected with no infectious complications where the subclavian central line site was used. As these residuals follow a Normal distribution with mean 0 and standard deviation 1, all absolute values over 2 are significant (see Statistics review 2 [3]). The association between femoral site/no infectious complication is also significant, but because the residual is negative there are fewer individuals than expected in this cell. When the subclavian central line site was used infectious complications appear to be less likely than when the other two sites were used.
Table 4. The adjusted standardized residuals
Two by two tables
The use of the χ^{2} distribution in tests of association is an approximation that depends on the expected frequencies being reasonably large. When the relationship between two categorical variables, each with only two categories, is being investigated, variations on the χ^{2} test of association are often calculated as well as, or instead of, the usual test in order to improve the approximation. Table 5 comprises data on patients with acute myocardial infarction who took part in a trial of intravenous nitrate (see Statistics review 3 [4]). A total of 50 patients were randomly allocated to the treatment group and 45 to the control group. The table shows the numbers of patients who died and survived in each group. The χ^{2} test gives a test statistic of 3.209 with 1 degree of freedom and a P value of 0.073. This suggests there is not enough evidence to indicate an association between treatment and survival.
Table 5. Data on patients with acute myocardial infarction who took part in a trial of intravenous nitrate
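The test statistic and P value for the two by two table can be reproduced from the four cell counts given in the text (3 and 8 deaths, 47 and 37 survivors). This is a minimal sketch using only the standard library; the function name `chi2_test_2x2` is illustrative, and the P value relies on the fact that a χ^{2} variable with 1 degree of freedom is the square of a standard Normal variable.

```python
import math


def chi2_test_2x2(table):
    """Return (statistic, p_value) for a 2x2 table [[a, b], [c, d]]."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (table[i][j] - expected) ** 2 / expected
    # With 1 df, P(chi2 >= x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Rows: died, survived. Columns: treatment, control.
table5 = [[3, 8], [47, 37]]
statistic, p_value = chi2_test_2x2(table5)
print(f"chi-squared = {statistic:.3f}, P = {p_value:.3f}")
```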
Fisher's exact test
The exact P value for a two by two table can be calculated by considering all the tables with the same row and column totals as the original but which are as or more extreme in their departure from the null hypothesis. In the case of Table 5, we consider all the tables in which three or fewer patients receiving the treatment died, given in Table 6(i)–(iv). The exact probabilities of obtaining each of these tables under the null hypothesis of no association or independence between treatment and survival are obtained as follows.
To calculate the probability of obtaining a particular table, we consider the total number of possible tables with the given marginal totals, and the number of ways we could have obtained the particular cell frequencies in the table in question. The number of ways the row totals of 11 and 84 could have been obtained given 95 patients altogether is denoted by _{95}C_{11} and is equal to 95!/11!84!, where 95! ('95 factorial') is the product of 95 and all the integers lower than itself down to 1. Similarly the number of ways the column totals of 50 and 45 could have been obtained is given by _{95}C_{50} = 95!/50!45!. Assuming independence, the total number of possible tables with the given marginal totals is:
$${}_{95}C_{11} \times {}_{95}C_{50} = \frac{95!}{11!\,84!} \times \frac{95!}{50!\,45!}$$
The number of ways Table 5 (Table 6[i]) could have been obtained is given by considering the number of ways each cell frequency could have arisen. There are _{95}C_{3} ways of obtaining the three patients in the first cell. The eight patients in the next cell can be obtained in _{92}C_{8} ways from the 95 − 3 = 92 remaining patients. The remaining cells can be obtained in _{84}C_{47} and _{37}C_{37} (= 1) ways. Therefore, the number of ways of obtaining Table 6(i) under the null hypothesis is:
$${}_{95}C_{3} \times {}_{92}C_{8} \times {}_{84}C_{47} \times {}_{37}C_{37} = \frac{95!}{3!\,8!\,47!\,37!}$$
Therefore the probability of obtaining Table 6(i) is:
$$\frac{95!}{3!\,8!\,47!\,37!} \div \left(\frac{95!}{11!\,84!} \times \frac{95!}{50!\,45!}\right) = \frac{11!\,84!\,50!\,45!}{95!\,3!\,8!\,47!\,37!} = 0.0541$$
Therefore the total probability of obtaining the four tables given in Table 6 is:
$$0.0541 + 0.0139 + 0.0020 + 0.0001 = 0.0701$$
This probability is usually doubled to give a two-sided P value of 0.140. There is quite a large discrepancy in this case between the χ^{2} test and Fisher's exact test.
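The factorial expression for the probability of a single table is the hypergeometric probability, so the exact calculation can be sketched with binomial coefficients. The code below assumes the margins described in the text (11 deaths, 50 treated, 95 patients) and sums over the tables with three or fewer deaths in the treatment group; the function and variable names are illustrative.

```python
from math import comb


def table_probability(k, deaths, treated, total):
    """Probability that exactly k of the deaths occur in the treated
    group when all row and column totals are held fixed."""
    return (
        comb(deaths, k)
        * comb(total - deaths, treated - k)
        / comb(total, treated)
    )


deaths, treated, total = 11, 50, 95
observed_deaths_treated = 3

# Tables as or more extreme than the one observed: 3, 2, 1 or 0 deaths.
table_probs = [
    table_probability(k, deaths, treated, total)
    for k in range(observed_deaths_treated + 1)
]
one_sided = sum(table_probs)
two_sided = 2 * one_sided  # doubling convention described in the text

print(f"P(observed table) = {table_probs[observed_deaths_treated]:.4f}")
print(f"one-sided P = {one_sided:.4f}, two-sided P = {two_sided:.3f}")
```

Doubling the one-sided probability is only one of several conventions for a two-sided exact P value, so a statistical package may report a slightly different figure.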
Yates' continuity correction
In using the χ^{2} distribution in the test of association, a continuous probability distribution is being used to approximate discrete probabilities. A correction, attributable to Yates, can be applied to the frequencies to make the test closer to the exact test. To apply Yates' correction for continuity we increase the smallest frequency in the table by 0.5 and adjust the other frequencies accordingly to keep the row and column totals the same. Applying this correction to the data given in Table 5 gives Table 7.
Table 7. Adjusted frequencies for Yates' correction
The χ^{2} test using these adjusted figures gives a test statistic of 2.162 with a P value of 0.141, which is close to the P value for Fisher's exact test.
For large samples the three tests – χ^{2}, Fisher's and Yates' – give very similar results, but for smaller samples Fisher's test and Yates' correction give more conservative results than the χ^{2} test; that is, the P values are larger, and we are less likely to conclude that there is an association between the variables. There is some controversy about which method is preferable for smaller samples, but Bland [5] recommends the use of Fisher's or Yates' test for a more cautious approach.
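Yates' correction can be sketched in code by moving each observed frequency 0.5 towards its expected value. For the Table 5 counts this is equivalent to adding 0.5 to the smallest frequency and adjusting the others to preserve the margins. The function name and the choice to implement the correction on the differences rather than on the table itself are assumptions made for illustration.

```python
import math


def yates_chi2_2x2(table):
    """Continuity-corrected chi-squared test for a 2x2 table.

    Each |observed - expected| is reduced by 0.5 (never below zero),
    which leaves the row and column totals unchanged.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            gap = max(abs(table[i][j] - expected) - 0.5, 0.0)
            statistic += gap**2 / expected
    # With 1 df, P(chi2 >= x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Rows: died, survived. Columns: treatment, control.
table5 = [[3, 8], [47, 37]]
corrected_statistic, corrected_p = yates_chi2_2x2(table5)
print(f"corrected chi-squared = {corrected_statistic:.3f}, P = {corrected_p:.3f}")
```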
Test for trend
Table 8 comprises the numbers of patients in a two-way classification according to AVPU classification (voice and pain responsive categories combined) and subsequent survival or death of 1306 patients attending an accident and emergency unit. (AVPU is a system for assessing level of consciousness: A = alert, V = voice responsiveness, P = pain responsive and U = unresponsive.) The χ^{2} test of association gives a test statistic of 19.38 with 2 degrees of freedom and a P value of less than 0.001, suggesting that there is an association between survival and AVPU classification.
Table 8. Number of patients according to AVPU and survival
Because the categories of AVPU have a natural ordering, it is appropriate to ask whether there is a trend in the proportion dying over the levels of AVPU. This can be tested by carrying out similar calculations to those used in regression for testing the gradient of a line (see Statistics review 7 [1]). Suppose the variable 'survival' is regarded as the y variable taking two values, 1 and 2 (survived and died), and AVPU as the x variable taking three values, 1, 2 and 3. We then have six pairs of x, y values, each occurring the number of times equal to the frequency in the table; for example, we have 1110 occurrences of the point (1,1).
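Following the lines of the test of the gradient in regression, with some fairly minor modifications and using large sample approximations, we obtain a χ^{2} statistic with 1 degree of freedom given by [5]:
$$\chi^{2}_{\mathrm{trend}} = \frac{N\left(\sum xy - \frac{\sum x \sum y}{N}\right)^{2}}{\left(\sum x^{2} - \frac{(\sum x)^{2}}{N}\right)\left(\sum y^{2} - \frac{(\sum y)^{2}}{N}\right)}$$
where the sums are taken over all N pairs of x, y values.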
For the data in Table 8, we obtain a test statistic of 19.33 with 1 degree of freedom and a P value of less than 0.001. Therefore, the trend is highly significant. The difference between the χ^{2} test statistic for trend and the χ^{2} test statistic in the original test is 19.38 − 19.33 = 0.05 with 2 − 1 = 1 degree of freedom, which provides a test of the departure from the trend. This departure is far from significant and suggests that the association between survival and AVPU classification can be explained almost entirely by the trend.
Some computer packages give the trend test, or a variation. The trend test described above is sometimes called the Cochran–Armitage test, and a common variation is the Mantel–Haenszel trend test.
Measurement of risk
Another application of a two by two contingency table is to examine the association between a disease and a possible risk factor. The risk for developing the disease if exposed to the risk factor can be calculated from the table. A basic measurement of risk is the probability of an individual developing a disease if they have been exposed to a risk factor (i.e. the relative frequency or proportion of those exposed to the risk factor that develop the disease). For example, in the study into early goal-directed therapy in the treatment of severe sepsis and septic shock conducted by Rivers and co-workers [6], one of the outcomes measured was in-hospital mortality. Of the 263 patients who were randomly allocated either to early goal-directed therapy or to standard therapy, 236 completed the therapy period with the outcomes shown in Table 9.
Table 9. Outcomes of the study conducted by Rivers and co-workers
From the table it can be seen that the proportion of patients receiving early goal-directed therapy who died is 38/117 = 32.5%, and so this is the risk for death with early goal-directed therapy. The risk for death on the standard therapy is 59/119 = 49.6%.
Another measurement of the association between a disease and possible risk factor is the odds. This is the ratio of the number of those exposed to the risk factor who develop the disease to the number of those exposed who do not develop the disease. This is best illustrated by a simple example. If a bag contains 8 red balls and 2 green balls, then the probability (risk) of drawing a red ball is 8/10 whereas the odds of drawing a red ball is 8/2. As can be seen, the measurement of odds, unlike risk, is not confined to the range 0–1. In the study conducted by Rivers and co-workers [6] the odds of death with early goal-directed therapy is 38/79 = 0.48, and on the standard therapy it is 59/60 = 0.98.
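The distinction between risk and odds can be made concrete with two one-line functions. The sketch below uses the counts quoted in the text for the bag of balls and for the two therapy groups; the function names are illustrative.

```python
def risk(events, non_events):
    """Proportion of individuals who experience the event."""
    return events / (events + non_events)


def odds(events, non_events):
    """Ratio of those who experience the event to those who do not."""
    return events / non_events


# Bag example: 8 red balls and 2 green balls.
print(f"red ball: risk = {risk(8, 2):.2f}, odds = {odds(8, 2):.2f}")

# (died, survived) in each arm, as quoted in the text.
arms = {"early goal-directed": (38, 79), "standard": (59, 60)}
for name, (died, survived) in arms.items():
    print(f"{name}: risk = {risk(died, survived):.3f}, "
          f"odds = {odds(died, survived):.2f}")
```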
Confidence interval for a proportion
As the measurement of risk is simply a proportion, the confidence interval for the population measurement of risk can be calculated as for any proportion. If the number of individuals in a random sample of size n who experience a particular outcome is r, then r/n is the sample proportion, p. For large samples the distribution of p can be considered to be approximately Normal, with a standard error of [2]:
$$\mathrm{se}(p) = \sqrt{\frac{p(1 - p)}{n}}$$
The 95% confidence interval for the true population proportion is given by p − 1.96 × standard error to p + 1.96 × standard error, which is:
$$p \pm 1.96\sqrt{\frac{p(1 - p)}{n}}$$
where p is the sample proportion and n is the sample size. The sample proportion is the risk and the sample size is the total number exposed to the risk factor.
For the study conducted by Rivers and co-workers [6] the 95% confidence interval for the risk for death on early goal-directed therapy is 0.325 ± 1.96(0.325[1 − 0.325]/117)^{0.5} or (24.0%, 41.0%), and on the standard therapy it is (40.6%, 58.6%). The interpretation of a confidence interval is described in Statistics review 2 [3]; here it indicates that, for those on early goal-directed therapy, the true population risk for death is likely to be between 24.0% and 41.0%, and that for the standard therapy between 40.6% and 58.6%.
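The large-sample interval for a single proportion can be sketched as follows, using the deaths and group sizes quoted in the text. The function name `proportion_ci` and the default multiplier of 1.96 are illustrative choices.

```python
import math


def proportion_ci(r, n, z=1.96):
    """Large-sample (Normal approximation) interval for a proportion."""
    p = r / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se


egdt_low, egdt_high = proportion_ci(38, 117)
std_low, std_high = proportion_ci(59, 119)
print(f"early goal-directed: {egdt_low:.1%} to {egdt_high:.1%}")
print(f"standard therapy:    {std_low:.1%} to {std_high:.1%}")
```

This approximation is only appropriate for reasonably large samples; for small samples or proportions near 0 or 1 an exact or score-based interval would usually be preferred.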
Comparing risks
To assess the importance of the risk factor, it is necessary to compare the risk for developing a disease in the exposed group with the risk in the non-exposed group. In the study by Rivers and co-workers [6] the risk for death on the early goal-directed therapy is 32.5%, whereas on the standard therapy it is 49.6%. A comparison between the two risks can be made by examining either their ratio or the difference between them.
Risk ratio
The risk ratio measures the increased risk for developing a disease when having been exposed to a risk factor compared with not having been exposed to the risk factor. It is given by RR = risk for the exposed/risk for the unexposed, and it is often referred to as the relative risk. The interpretation of a relative risk is described in Statistics review 6 [7]. For the Rivers study the relative risk = 0.325/0.496 = 0.66, which indicates that a patient on the early goal-directed therapy is 34% less likely to die than a patient on the standard therapy.
The calculation of the 95% confidence interval for the relative risk [8] will be covered in a future review, but it can usefully be interpreted here. For the Rivers study the 95% confidence interval for the population relative risk is 0.48 to 0.90. Because the whole interval lies below 1.0, it indicates that patients on the early goal-directed therapy have a significantly decreased risk for dying as compared with those on the standard therapy.
Odds ratio
When quantifying the risk for developing a disease, the ratio of the odds can also be used as a measurement of comparison between those exposed and not exposed to a risk factor. It is given by OR = odds for the exposed/odds for the unexposed, and is referred to as the odds ratio. The interpretation of odds ratio is described in Statistics review 3 [4]. For the Rivers study the odds ratio = 0.48/0.98 = 0.49, again indicating that those on the early goal-directed therapy have a reduced risk for dying as compared with those on the standard therapy. This will be covered fully in a future review.
The calculation of the 95% confidence interval for the odds ratio [2] will also be covered in a future review but, as with relative risk, it can usefully be interpreted here. For the Rivers example the 95% confidence interval for the odds ratio is 0.29 to 0.83. This can be interpreted in the same way as the 95% confidence interval for the relative risk, indicating that those receiving early goal-directed therapy have a reduced risk for dying.
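The review defers the confidence interval calculations to a later article. As a hedged illustration only, the sketch below uses the common large-sample method of building the interval on the log scale and back-transforming; this choice of method is an assumption, not something stated in the text, although it gives intervals that agree with those quoted above. Function names are illustrative.

```python
import math


def _log_scale_ci(estimate, se_log, z=1.96):
    """Interval built on the log scale and transformed back."""
    log_estimate = math.log(estimate)
    return (
        estimate,
        math.exp(log_estimate - z * se_log),
        math.exp(log_estimate + z * se_log),
    )


def risk_ratio(d1, n1, d0, n0):
    """Risk ratio for group 1 versus group 0 (d = events, n = group size)."""
    rr = (d1 / n1) / (d0 / n0)
    se_log = math.sqrt(1 / d1 - 1 / n1 + 1 / d0 - 1 / n0)
    return _log_scale_ci(rr, se_log)


def odds_ratio(d1, s1, d0, s0):
    """Odds ratio for group 1 versus group 0 (d = events, s = non-events)."""
    ratio = (d1 / s1) / (d0 / s0)
    se_log = math.sqrt(1 / d1 + 1 / s1 + 1 / d0 + 1 / s0)
    return _log_scale_ci(ratio, se_log)


rr, rr_low, rr_high = risk_ratio(38, 117, 59, 119)
or_, or_low, or_high = odds_ratio(38, 79, 59, 60)
print(f"RR = {rr:.2f} (95% CI {rr_low:.2f} to {rr_high:.2f})")
print(f"OR = {or_:.2f} (95% CI {or_low:.2f} to {or_high:.2f})")
```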
Difference between two proportions
Confidence interval
For the Rivers study, instead of examining the ratio of the risks (the relative risk) we can obtain a confidence interval and carry out a significance test of the difference between the risks. The proportion of those on early goal-directed therapy who died is p_{1} = 38/117 = 0.325 and the proportion of those on standard therapy who died is p_{2} = 59/119 = 0.496. A confidence interval for the difference between the true population proportions is given by:
(p_{1} − p_{2}) − 1.96 × se(p_{1} − p_{2}) to (p_{1} − p_{2}) + 1.96 × se(p_{1} − p_{2})
where se(p_{1} − p_{2}) is the standard error of p_{1} − p_{2} and, with n_{1} = 117 and n_{2} = 119 denoting the two group sizes, is calculated as:
$$\mathrm{se}(p_{1} - p_{2}) = \sqrt{\frac{p_{1}(1 - p_{1})}{n_{1}} + \frac{p_{2}(1 - p_{2})}{n_{2}}} = \sqrt{\frac{0.325 \times 0.675}{117} + \frac{0.496 \times 0.504}{119}} = 0.063$$
Thus, the required confidence interval is −0.171 − 1.96 × 0.063 to −0.171 + 1.96 × 0.063; that is −0.295 to −0.047. Therefore, the difference between the true proportions is likely to be between −0.295 and −0.047, and the risk for those on early goal-directed therapy is less than the risk for those on standard therapy.
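The interval for the difference between two independent proportions can be sketched directly from the counts in the text. The function name and return layout are illustrative.

```python
import math


def difference_ci(r1, n1, r2, n2, z=1.96):
    """Large-sample interval for p1 - p2 from two independent samples."""
    p1, p2 = r1 / n1, r2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, se, diff - z * se, diff + z * se


# Deaths and group sizes: early goal-directed versus standard therapy.
diff, se, low, high = difference_ci(38, 117, 59, 119)
print(f"difference = {diff:.3f}, se = {se:.3f}")
print(f"95% CI: {low:.3f} to {high:.3f}")
```

Both limits are negative, which is what indicates a lower risk in the first group.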
Hypothesis test
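We can also carry out a hypothesis test of the null hypothesis that the difference between the proportions is 0. This follows similar lines to the calculation of the confidence interval, but under the null hypothesis the standard error of the difference in proportions is given by:
$$\mathrm{se}(p_{1} - p_{2}) = \sqrt{p(1 - p)\left(\frac{1}{n_{1}} + \frac{1}{n_{2}}\right)}$$
where p is a pooled estimate of the proportion obtained from both samples [5], with r_{1} and r_{2} denoting the numbers who died in each group:
$$p = \frac{r_{1} + r_{2}}{n_{1} + n_{2}} = \frac{38 + 59}{117 + 119} = 0.411$$
So:
$$\mathrm{se}(p_{1} - p_{2}) = \sqrt{0.411 \times 0.589 \times \left(\frac{1}{117} + \frac{1}{119}\right)} = 0.064$$
The test statistic is then:
$$z = \frac{p_{1} - p_{2}}{\mathrm{se}(p_{1} - p_{2})} = \frac{0.325 - 0.496}{0.064} = -2.67$$
Comparing this value with a standard Normal distribution gives P = 0.008, again suggesting that there is a difference between the two population proportions. In fact, the test described is equivalent to the χ^{2} test of association on the two by two table. The χ^{2} test gives a test statistic of 7.13, which is equal to (−2.67)^{2} and has the same P value of 0.008. Again, this suggests that there is a difference between the risks for those receiving early goal-directed therapy and those receiving standard therapy.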
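The pooled test can be sketched from the counts in the text, and squaring the resulting z statistic gives the uncorrected χ^{2} statistic for the same two by two table, which illustrates the equivalence of the two tests. The function name is illustrative, and the two-sided P value is obtained from the complementary error function rather than from tables.

```python
import math


def two_proportion_z_test(r1, n1, r2, n2):
    """z test of H0: p1 = p2 using the pooled standard error."""
    p1, p2 = r1 / n1, r2 / n2
    pooled = (r1 + r2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided P value from the standard Normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p_value = two_proportion_z_test(38, 117, 59, 119)
chi2_equivalent = z**2
print(f"z = {z:.2f}, z squared = {chi2_equivalent:.2f}, P = {p_value:.4f}")
```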
Matched samples
Matched pair designs, as discussed in Statistics review 5 [9], can also be used when the outcome is categorical. For example, when comparing two tests to determine a particular condition, the same individuals can be used for each test.
McNemar's test
In this situation, because the χ^{2} test does not take pairing into consideration, a more appropriate test, attributed to McNemar, can be used when comparing these correlated proportions.
For example, in the comparison of two diagnostic tests used in the determination of Helicobacter pylori, the breath test and the Oxoid test, both tests were carried out in 84 patients and the presence or absence of H. pylori was recorded for each patient. The results are shown in Table 10, which indicates that there were 72 concordant pairs (in which the tests agree) and 12 discordant pairs (in which the tests disagree). The null hypothesis for this test is that there is no difference in the proportions showing positive by each test. If this were true then the frequencies for the two categories of discordant pairs should be equal [5]. The test involves calculating the difference between the number of discordant pairs in each category and scaling this difference by the total number of discordant pairs. The test statistic is given by:
$$\chi^{2} = \frac{(b - c)^{2}}{b + c}$$
Table 10. The results of two tests to determine the presence of Helicobacter pylori
where b and c are the frequencies in the two categories of discordant pairs (as shown in Table 10). The calculated test statistic is compared with a χ^{2} distribution with 1 degree of freedom to obtain a P value. For the example b = 8 and c = 4, therefore the test statistic is calculated as 1.33. Comparing this with a χ^{2} distribution gives a P value greater than 0.10, indicating no significant difference in the proportion of positive determinations of H. pylori using the breath and the Oxoid tests.
The test can also be carried out with a continuity correction attributed to Yates [5], in a similar way to that described above for the χ^{2} test of association. The test statistic is then given by:
$$\chi^{2} = \frac{(|b - c| - 1)^{2}}{b + c}$$
and again is compared with a χ^{2} distribution with 1 degree of freedom. For the example, the calculated test statistic including the continuity correction is 0.75, giving a P value greater than 0.25.
As with non-paired proportions a confidence interval for the difference can be calculated. For large samples the difference between the paired proportions can be approximated to a Normal distribution. The difference between the proportions can be calculated from the discordant pairs [8], so the difference is given by (b − c)/n, where n is the total number of pairs, and the standard error of the difference by (b + c)^{0.5}/n.
For the example where b = 8, c = 4 and n = 84, the difference is calculated as 0.048 and the standard error as 0.041. The approximate 95% confidence interval is therefore 0.048 ± 1.96 × 0.041 giving −0.033 to 0.129. As this spans 0, it again indicates that there is no significant difference in the proportion of positive determinations of H. pylori using the breath and the Oxoid tests.
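McNemar's test and the paired confidence interval use only the discordant counts and the number of pairs, so they can be sketched in a few lines. The function names below are illustrative, and the P value uses the same 1 degree of freedom identity as for any χ^{2} statistic with 1 degree of freedom.

```python
import math


def mcnemar_statistic(b, c, continuity=False):
    """McNemar's chi-squared statistic from the discordant counts."""
    gap = abs(b - c)
    if continuity:
        gap = max(gap - 1, 0)  # continuity-corrected form
    return gap**2 / (b + c)


def chi2_p_value_1df(x):
    """Upper tail probability for chi-squared with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


def paired_difference_ci(b, c, n, z=1.96):
    """Large-sample interval for the difference in paired proportions."""
    diff = (b - c) / n
    se = math.sqrt(b + c) / n
    return diff, se, diff - z * se, diff + z * se


b, c, n = 8, 4, 84
plain = mcnemar_statistic(b, c)
corrected = mcnemar_statistic(b, c, continuity=True)
diff, se, low, high = paired_difference_ci(b, c, n)
print(f"McNemar = {plain:.2f} (P = {chi2_p_value_1df(plain):.2f})")
print(f"corrected = {corrected:.2f} (P = {chi2_p_value_1df(corrected):.2f})")
print(f"difference = {diff:.3f}, 95% CI {low:.3f} to {high:.3f}")
```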
Limitations
For a χ^{2} test of association, a recommendation on sample size that is commonly used and attributed to Cochran [5] is that no cell in the table should have an expected frequency of less than one, and no more than 20% of the cells should have an expected frequency of less than five. If the expected frequencies are too small then it may be possible to combine categories where it makes sense to do so.
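Cochran's recommendation is easy to check once the expected frequencies are available. The sketch below applies it to the expected frequencies quoted earlier for Table 2; the function name `meets_cochran_rule` is illustrative.

```python
def meets_cochran_rule(expected_counts):
    """Check the rule of thumb for the chi-squared test of association:
    no expected frequency below 1, and at most 20% of cells below 5."""
    cells = list(expected_counts)
    none_below_one = all(value >= 1 for value in cells)
    share_below_five = sum(value < 5 for value in cells) / len(cells)
    return none_below_one and share_below_five <= 0.20


# Expected frequencies quoted in the text for Table 2.
table2_expected = [714.5, 400.8, 189.7, 134.1, 75.3, 35.6, 85.4, 47.9, 22.7]
print(meets_cochran_rule(table2_expected))
```

A table that fails the check would be a candidate for combining categories or for an exact test.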
For two by two tables, Yates' correction or Fisher's exact test can be used when the samples are small. Fisher's exact test can also be used for larger tables but the computation can become impossibly lengthy.
In the trend test the individual cell sizes are not important but the overall sample size should be at least 30.
The analyses of proportions and risks described above assume large samples, with similar requirements to those of the χ^{2} test of association [8].
The sample size requirement often specified for McNemar's test and confidence interval is that the number of discordant pairs should be at least 10 [8].
Conclusion
The χ^{2} test of association and other related tests can be used in the analysis of the relationship between categorical variables. Care needs to be taken to ensure that the sample size is adequate.
Competing interests
None declared.
Box
This article is the eighth in an ongoing, educational review series on medical statistics in critical care.
Previous articles have covered 'presenting and summarizing data', 'samples and populations', 'hypothesis testing and P values', 'sample size calculations', 'comparison of means', 'nonparametric methods' and 'correlation and regression'.
Future topics to be covered include:
Chi-squared and Fisher's exact tests
Analysis of variance
Further nonparametric tests: Kruskal–Wallis and Friedman
Measures of disease: PR/OR
Survival data: Kaplan–Meier curves and log rank tests
ROC curves
Multiple logistic regression.
If there is a medical statistics topic you would like explained, contact us at editorial@ccforum.com.
Abbreviations
AVPU: A = alert, V = voice responsiveness, P = pain responsive and U = unresponsive
References

1. Bewick V, Cheek L, Ball J: Statistics review 7: Correlation and regression. Crit Care 2003, 7:451-459.
2. Everitt BS: The Analysis of Contingency Tables. 2nd edition. London, UK: Chapman & Hall; 1992.
3. Whitley E, Ball J: Statistics review 2: Samples and populations. Crit Care 2002, 6:143-148.
4. Whitley E, Ball J: Statistics review 3: Hypothesis testing and P values. Crit Care 2002, 6:222-225.
5. Bland M: An Introduction to Medical Statistics. 3rd edition. Oxford, UK: Oxford University Press; 2001.
6. Rivers E, Nguyen B, Havstad S, Ressler J, Muzzin A, Knoblich B, Peterson E, Tomlanovich M, Early Goal-Directed Therapy Collaborative Group: Early goal-directed therapy in the treatment of severe sepsis and septic shock. N Engl J Med 2001, 345:1368-1377.
7. Whitley E, Ball J: Statistics review 6: Nonparametric methods. Crit Care 2002, 6:509-513.
8. Kirkwood BR, Sterne JAC: Essential Medical Statistics. 2nd edition. Oxford, UK: Blackwell Science Ltd; 2003.
9. Whitley E, Ball J: Statistics review 5: Comparison of means. Crit Care 2002, 6:424-428.