Lecture 12
Contingency Analysis

ABD 3e Chapter 9

Chris Merkord

Learning Objectives

Construct and interpret contingency tables.
Compute and interpret risk, relative risk (RR), odds, and odds ratio (OR).
Select RR vs. OR based on study design (cohort vs. case-control).
State and test hypotheses of independence using the chi-square test.
Calculate expected counts and the chi-square statistic.
Make decisions using \(P\)-values or critical values.
Identify when alternative tests (e.g., Fisher’s exact test) are appropriate.

Contingency Analysis: Relationships Between Categorical Variables

We now move from one categorical variable to two categorical variables

Examples:

- Treatment (Aspirin vs Placebo) × Cancer (Yes/No)
- Sex (Men/Women) × Survival (Yes/No)
- Smoking (Yes/No) × Lung Disease (Yes/No)

Core Question:

Does the distribution of one variable depend on the other?

If yes → association
If no → independence

Example: Does knowing a person’s sex tell you anything about the chance they survived the Titanic?

Two Complementary Goals: Estimation and Hypothesis Testing Answer Different Questions

1️⃣ Estimation

How strong is the relationship?

Relative Risk (RR)
Odds Ratio (OR)

These quantify the magnitude of association.

Primarily defined for 2×2 tables (binary exposure × binary outcome)

2️⃣ Hypothesis Testing

Is the relationship statistically detectable?

Chi-square contingency test

This evaluates evidence for association.

Applicable to any r × c contingency table (two categorical variables with two or more categories each)

What is a Contingency Table?

A contingency table summarizes the joint distribution of two categorical variables
Rows = categories of one variable
Columns = categories of the second variable
Each cell = a count (frequency)
Margins show row and column totals
Allows us to:
- Compare conditional proportions
- Assess association vs independence
- Compute relative risk and odds ratios

	Men	Women	Row total
Survived	338	316	654
Died	1329	109	1438
Column total	1667	425	2092

Figure 1: Contingency table summarizing the Titanic data from Whitlock & Schluter.
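As a rough illustration of "comparing conditional proportions", the sketch below computes the proportion of survivors within each sex from the Titanic counts above. The dictionary layout and the helper name `survival_proportion` are illustrative choices, not part of the textbook.

```python
# Titanic counts from the contingency table above, grouped by sex.
observed = {
    "Men": {"Survived": 338, "Died": 1329},
    "Women": {"Survived": 316, "Died": 109},
}


def survival_proportion(counts):
    """Proportion surviving within one group (a conditional proportion)."""
    return counts["Survived"] / (counts["Survived"] + counts["Died"])


survival_by_sex = {sex: survival_proportion(c) for sex, c in observed.items()}

for sex, p in survival_by_sex.items():
    print(f"Pr(Survived | {sex}) = {p:.3f}")
```

The two proportions differ substantially (about 0.20 for men and 0.74 for women), which is the pattern expected when the two variables are associated.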

Example Study: Aspirin and Cancer Risk

A large randomized study investigated whether regular aspirin use reduces cancer risk.

Two categorical variables:

Treatment: Aspirin, Placebo
Cancer Outcome: Cancer, No Cancer

Each participant falls into exactly one cell of a 2 × 2 contingency table.

Our goal:

Does cancer risk differ between the aspirin and placebo groups?

Conditional Probability (Risk)

A risk is a conditional probability:

\[ \text{Risk} = \operatorname{Pr}(\text{Cancer} \mid \text{Group}) \]

We compute risk as a proportion by dividing:

\[ p=\frac{\text{number with cancer in group}}{\text{total in group}} \]

Example: Risk of Cancer for each Group

Risk of cancer in the aspirin group:

\[ \operatorname{Pr}(\text{Cancer} \mid \text{Aspirin}) \]

\[ \hat{p}_1=\frac{1438}{1438+18496}=0.0721 \]

Risk of cancer in the placebo group:

\[ \operatorname{Pr}(\text{Cancer} \mid \text{Placebo}) \]

\[ \hat{p}_2=\frac{1427}{1427+18515}=0.0716 \]

Relative Risk (RR)

Now we compare risks across groups.

Relative risk of aspirin:

\[ RR = \frac{ \operatorname{Pr}(\text{Cancer} \mid \text{Aspirin}) }{ \operatorname{Pr}(\text{Cancer} \mid \text{Placebo}) } \]

\[ \hat{RR}=\frac{\hat{p}_1}{\hat{p}_2} \]

\[ =\frac{0.0721}{0.0716}=1.007 \]

Interpretation:

\(RR = 1\) → same risk in both groups
\(RR > 1\) → higher risk in aspirin group
\(RR < 1\) → lower risk in aspirin group

Relative risk measures the magnitude of association between treatment and outcome.
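The risk and relative risk calculations can be sketched in a few lines, using the counts from the risk formulas above. Variable and function names are illustrative. Note that dividing the unrounded risks gives about 1.008, while the 1.007 on the slide comes from dividing risks already rounded to four decimals.

```python
# Counts from the aspirin study, as used in the risk formulas above.
cancer = {"Aspirin": 1438, "Placebo": 1427}
no_cancer = {"Aspirin": 18496, "Placebo": 18515}


def risk(group):
    """Estimated Pr(Cancer | group): cases divided by total in the group."""
    return cancer[group] / (cancer[group] + no_cancer[group])


p1 = risk("Aspirin")
p2 = risk("Placebo")
relative_risk = p1 / p2

print(f"risk (aspirin) = {p1:.4f}")
print(f"risk (placebo) = {p2:.4f}")
print(f"relative risk  = {relative_risk:.3f}")
```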

Odds: A Different Way to Express Probability

Risk compares outcome to total.
Odds compare outcome to non-outcome.
Odds are not bounded between 0 and 1.
When to use: 2 variables, each with 2 categories

Risk (probability):

\[ \operatorname{Pr}(\text{Cancer}) \]

Odds:

\[ \frac{\operatorname{Pr}(\text{Cancer})} {\operatorname{Pr}(\text{No Cancer})} \]

\[ =\frac{\operatorname{Pr}(\text{Cancer})} {1 - \operatorname{Pr}(\text{Cancer})} \]

Odds of developing cancer while taking aspirin

Success

\[ \hat{p}_1=\frac{1438}{1438+18496}=0.0721 \]

Failure

\[ 1-\hat{p}_1=1-0.0721=0.9279 \]

Odds of success

\[ \hat{O}_1=\frac{\hat{p}_1}{1-\hat{p}_1}=\frac{0.0721}{0.9279}=0.0777 \]

Odds of developing cancer while taking placebo

Success

\[ \hat{p}_2=\frac{1427}{1427+18515}=0.0716 \]

Failure

\[ 1-\hat{p}_2=1-0.0716=0.9284 \]

Odds of success

\[ \hat{O}_2=\frac{\hat{p}_2}{1-\hat{p}_2}=\frac{0.0716}{0.9284}=0.0771 \]

Odds Ratio (OR)

The odds ratio compares two odds:

\[ \hat{OR} = \frac{\text{Odds}(\text{Cancer} \mid \text{Aspirin})} {\text{Odds}(\text{Cancer} \mid \text{Placebo})} \]

Interpretation:

\(OR = 1\) → same odds in both groups
\(OR > 1\) → higher odds in numerator group
\(OR < 1\) → lower odds in numerator group

For the aspirin study:

\[ \hat{OR} = \frac{0.0777}{0.0771} = 1.008 \]

The odds of cancer are essentially the same in the aspirin and placebo groups.
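A minimal sketch of the odds and odds ratio calculations, starting from the same counts. The function name `odds` is an illustrative choice. Working from unrounded values gives an odds ratio near 1.009, slightly different from the slide's 1.008, which uses odds rounded to four decimals.

```python
def odds(p):
    """Convert a probability p into odds p / (1 - p)."""
    return p / (1 - p)


# Estimated risks of cancer in each group.
p1 = 1438 / (1438 + 18496)  # aspirin
p2 = 1427 / (1427 + 18515)  # placebo

odds_aspirin = odds(p1)
odds_placebo = odds(p2)
odds_ratio = odds_aspirin / odds_placebo

print(f"odds (aspirin) = {odds_aspirin:.4f}")
print(f"odds (placebo) = {odds_placebo:.4f}")
print(f"odds ratio     = {odds_ratio:.4f}")
```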

Odds ratio shortcut

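The following is a shortcut formula where \(a\), \(b\), \(c\), and \(d\) refer to the observed frequencies in the cells of the 2 × 2 contingency table (\(a\) and \(b\) are the counts with the outcome in the two groups; \(c\) and \(d\) are the corresponding counts without the outcome):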

\[ \hat{OR}=\frac{a/c}{b/d}=\frac{ad}{bc} \]
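Why the shortcut works: taking \(a\) and \(c\) as the cancer and no-cancer counts in the aspirin group, and \(b\) and \(d\) as the same counts in the placebo group (the cell order used in the standard error calculation), the group total cancels out of each odds:

\[ \hat{O}_1=\frac{\hat{p}_1}{1-\hat{p}_1}=\frac{a/(a+c)}{c/(a+c)}=\frac{a}{c}, \qquad \hat{O}_2=\frac{b/(b+d)}{d/(b+d)}=\frac{b}{d} \]

\[ \hat{OR}=\frac{\hat{O}_1}{\hat{O}_2}=\frac{a/c}{b/d}=\frac{ad}{bc}=\frac{1438 \times 18515}{1427 \times 18496}\approx 1.009 \]

This agrees with the ratio of the two odds computed step by step, up to rounding of the intermediate values.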

Standard error for odds ratio

The sampling distribution for the odds ratio is highly skewed, so we convert the odds ratio to its natural log, \(\operatorname{ln}(\hat{OR})\)

\[ \operatorname{SE}[\operatorname{ln}(\hat{OR})]=\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}} \]

This is called the standard error of the log-odds ratio

For the aspirin example:

\[ \operatorname{SE}[\operatorname{ln}(\hat{OR})]=\sqrt{\frac{1}{1438}+\frac{1}{1427}+\frac{1}{18496}+\frac{1}{18515}}=0.03878 \]

Confidence interval for odds ratio

We can approximate a confidence interval for the log-odds ratio:

\[ \operatorname{ln}(\hat{OR}) \pm Z \times \operatorname{SE}[\operatorname{ln}(\hat{OR})] \]

where \(Z=1.96\) for a 95% CI and \(Z=2.58\) for a 99% CI

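To get the CI for the odds ratio itself, take the antilog of the lower and upper bounds of the log-scale interval:

\[ e^x < \operatorname{OR} < e^y \]

where \(x\) and \(y\) are the lower and upper bounds of the CI for \(\operatorname{ln}(\operatorname{OR})\)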
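The interval can be worked through for the aspirin example as a sketch. It assumes the cell labels \(a\), \(b\) are cancer counts (aspirin, placebo) and \(c\), \(d\) are no-cancer counts (aspirin, placebo), matching the order in the standard error calculation; variable names are illustrative.

```python
import math

# Cell counts: a, b = cancer (aspirin, placebo); c, d = no cancer (aspirin, placebo).
a, b = 1438, 1427
c, d = 18496, 18515

odds_ratio = (a * d) / (b * c)
log_or = math.log(odds_ratio)

# Standard error of the log-odds ratio.
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# 95% CI on the log scale, then back-transformed with the antilog.
z = 1.96
lower = math.exp(log_or - z * se_log_or)
upper = math.exp(log_or + z * se_log_or)

print(f"OR = {odds_ratio:.3f}, SE[ln(OR)] = {se_log_or:.5f}")
print(f"95% CI: {lower:.2f} < OR < {upper:.2f}")
```

The resulting interval, roughly 0.93 to 1.09, includes 1, so these data are consistent with no association between aspirin and cancer.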

Odds ratio interpretation

If OR=1 there is no association between exposure and outcome
If the 95% CI includes 1, the association is not statistically significant at \(\alpha=0.05\)

Relative Risk vs. Odds Ratio

Relative Risk (RR)

Requires estimating risk (probability)
Risk requires a meaningful denominator:
- number with outcome
- total number at risk

Best used when

Cohort studies
Randomized experiments
The total population at risk is known

Odds Ratio (OR)

Does not require estimating population risk
Can be computed even when totals at risk are unknown

Best used when

Case-control studies
Logistic regression

If the outcome is rare, \(OR \approx RR\)
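The rare-outcome approximation follows from writing the odds ratio in terms of the two risks, \(p_1\) and \(p_2\):

\[ OR=\frac{p_1/(1-p_1)}{p_2/(1-p_2)}=\frac{p_1}{p_2}\times\frac{1-p_2}{1-p_1}=RR\times\frac{1-p_2}{1-p_1} \]

When both risks are small, \(1-p_1\) and \(1-p_2\) are both close to 1, so the second factor is close to 1. In the aspirin study:

\[ \frac{1-\hat{p}_2}{1-\hat{p}_1}=\frac{0.9284}{0.9279}\approx 1.0005 \]

so the odds ratio and relative risk are nearly identical.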

Case-Control Study

Investigates associations between an exposure and an outcome.
Start with individuals who already have the outcome (cases)
Select a comparison group without the outcome (controls)
Look backward to determine prior exposure status
The total population at risk is unknown, so risk cannot be estimated directly. Use odds ratio instead.

Cohort Study

Follows individuals forward in time to assess whether an exposure is associated with an outcome.
Begin with individuals classified by exposure status (Exposed, Unexposed)
Follow both groups over time
Record who develops the outcome
The total number at risk in each group is known, so risk can be estimated directly.
We can compute relative risk.
Types: Prospective or Retrospective (see diagram)

The \(\chi^2\) contingency test

The \(\chi^2\) contingency test is the most commonly used test of association between two categorical variables.

It tests the goodness of fit to the data of the null model of independence of variables.
RR and OR allow us to estimate magnitude of association, but do not test whether an association may be caused by chance alone.

Example: Consider the life cycle of the trematode E. californiensis

Trematode worms Euhaplorchis californiensis use three hosts during their life cycle
Mature worms in birds lay eggs that pass out in bird’s feces
Horn snails Cerithidea californica eat the eggs
Eggs hatch and grow to another life stage in the snail, sterilizing it
California killifish Fundulus parvipinnis eat the snail
Parasite develops to the next life stage and encysts in the brain
Birds eat infected killifish, worm matures in the bird

Research on fish behavior

Researchers have observed that infected fish spend excessive time near the water surface
They may be more vulnerable to bird predation, which would benefit the worm
Lafferty and Morris (1996) tested whether bird predation varies with severity of infection
Fish placed into outdoor pens open to bird predation, with fish of varying infection intensity:
- highly infected
- lightly infected
- not infected

Figure: Stylized depiction of the experimental fish pens (highly infected, lightly infected, and uninfected) used to test whether bird predation varies with parasite infection intensity (after Lafferty & Morris 1996).

Observed frequencies of fish eaten or not eaten by birds according to trematode infection level

Table 1. Observed Frequencies.

	Not Infected	Lightly Infected	Highly Infected
Eaten by birds	1	10	37
Not eaten by birds	49	35	9

Question:

Is being eaten by birds (outcome) independent from infection level?

Steps to hypothesis testing

State hypotheses
Calculate test statistic ( \(\chi^2\) )
1. Calculate row, column, and grand totals
2. Calculate expected proportions assuming independence
3. Calculate expected frequencies assuming independence
4. Calculate \(\chi^2\) from the differences between observed and expected frequencies
Calculate \(P\)-value
Decide whether to reject \(H_0\) (2 ways):
- \(P<\alpha\)
- \(\chi^2_{df}>\chi^2_{crit}\)

STEP 1: State the hypotheses

\(H_0\) : Parasite infection level and being eaten are independent

\(H_A\) : Parasite infection level and being eaten are not independent

STEP 2: Calculate the test statistic ( \(\chi^2\) )

Goal: calculate the \(\chi^2\) from the data to see how different the observed frequencies are from the expected frequencies

Start with: contingency table of observed frequencies

Next step: 2a. Calculate row, column, and grand totals

Table 1. Observed Frequencies.

	Not Infected	Lightly Infected	Highly Infected
Eaten by birds	1	10	37
Not eaten by birds	49	35	9

STEP 2A: Calculate row, column, and grand totals

Sum the values in each row to get row totals
Sum the values in each column to get column totals
Sum the row or column totals to get the grand total

Table 1. Observed Frequencies.

	Not Infected	Lightly Infected	Highly Infected	Row total
Eaten by birds	1	10	37	48
Not eaten by birds	49	35	9	93
Column Total	50	45	46	141
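The totals in Step 2A can be sketched as follows, storing the observed table as a list of rows. The variable names are illustrative.

```python
# Observed frequencies: columns are not, lightly, and highly infected.
observed = [
    [1, 10, 37],   # eaten by birds
    [49, 35, 9],   # not eaten by birds
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]  # zip(*) walks down the columns
grand_total = sum(row_totals)

print("row totals:   ", row_totals)
print("column totals:", col_totals)
print("grand total:  ", grand_total)
```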

STEP 2B: Calculate expected proportions assuming independence

Goal: calculate the values in table 2
First, calculate the marginal values assuming independence
- For each column or row, divide frequency by grand total to get expected proportion

Table 1. Observed Frequencies.

	Not Infected	Lightly Infected	Highly Infected	Row total
Eaten by birds	1	10	37	48
Not eaten by birds	49	35	9	93
Column Total	50	45	46	141

Table 2. Expected Proportions.

	Not Infected	Lightly Infected	Highly Infected	Proportion
Eaten by birds
Not eaten by birds
Proportion

Example 1: Probability of being not infected

\[ \hat{\operatorname{Pr}}[\text{Not infected}]= \]

\[ \frac{50}{141}= \]

\[ 0.3546 \]

Table 1. Observed Frequencies.

	Not Infected	Lightly Infected	Highly Infected	Row total
Eaten by birds	1	10	37	48
Not eaten by birds	49	35	9	93
Column Total	50	45	46	141

Table 2. Expected Proportions.

	Not Infected	Lightly Infected	Highly Infected	Proportion
Eaten by birds
Not eaten by birds
Proportion	0.3546

Example 2: Probability of being eaten by birds

\[ \hat{\operatorname{Pr}}[\text{Eaten by birds}]= \]

\[ \frac{48}{141}= \]

\[ 0.3404 \]

Table 1. Observed Frequencies.

	Not Infected	Lightly Infected	Highly Infected	Row total
Eaten by birds	1	10	37	48
Not eaten by birds	49	35	9	93
Column Total	50	45	46	141

Table 2. Expected Proportions.

	Not Infected	Prop.
Eaten by birds		0.3404
Not eaten by birds
Proportion	0.3546

Repeat for all columns and rows

Table 1. Observed Frequencies.

	Not Infected	Lightly Infected	Highly Infected	Row total
Eaten by birds	1	10	37	48
Not eaten by birds	49	35	9	93
Column Total	50	45	46	141

Table 2. Expected Proportions.

	Not Infected	Lightly Infected	Highly Infected	Prop.
Eaten by birds				0.3404
Not eaten by birds				0.6596
Proportion	0.3546	0.3192	0.3262

Calculate cell proportions by multiplying marginal proportions

Use multiplication rule:

If two events are independent (null hypothesis), probability of both occurring is probability of one times probability of the other

\[ \operatorname{Pr}[\text{not infected and eaten}]\\ =\operatorname{Pr}[\text{not infected}]\times\operatorname{Pr}[\text{eaten}]\\ =0.3546 \times 0.3404 =0.1207 \]

Table 1. Observed Frequencies.

	Not Infected	Lightly Infected	Highly Infected	Row total
Eaten by birds	1	10	37	48
Not eaten by birds	49	35	9	93
Column Total	50	45	46	141

Table 2. Expected Proportions.

	Not Infected	Lightly Infected	Highly Infected	Prop.
Eaten by birds	0.1207			0.3404
Not eaten by birds				0.6596
Proportion	0.3546	0.3192	0.3262

Repeat for each cell proportion

Use multiplication rule:

If two events are independent (null hypothesis), probability of both occurring is probability of one times probability of the other

Table 1. Observed Frequencies.

	Not Infected	Lightly Infected	Highly Infected	Row total
Eaten by birds	1	10	37	48
Not eaten by birds	49	35	9	93
Column Total	50	45	46	141

Table 2. Expected Proportions.

	Not Infected	Lightly Infected	Highly Infected	Prop.
Eaten by birds	0.1207	0.1087	0.1110	0.3404
Not eaten by birds	0.2339	0.2105	0.2152	0.6596
Proportion	0.3546	0.3192	0.3262

STEP 2C: Calculate expected frequencies assuming independence

\[ \operatorname{Expected}[\text{not infected and eaten}]= \\\operatorname{Pr}[\text{not infected and eaten}]\times G= \\0.1207 \times 141 = \\17.0 \]

Repeat for each cell

Table 3. Expected Frequencies

	Not Infected
Eaten by birds	17.0
Not eaten by birds
		141

Table 2. Expected Proportions.

	Not Infected	Lightly Infected	Highly Infected
Eaten by birds	0.1207	0.1087	0.1110
Not eaten by birds	0.2339	0.2105	0.2152
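Steps 2B and 2C can be combined in a short sketch. Multiplying the two marginal proportions and then multiplying by the grand total is algebraically the same as (row total × column total) / grand total, which is what the code below uses. Variable names are illustrative.

```python
observed = [
    [1, 10, 37],   # eaten by birds
    [49, 35, 9],   # not eaten by birds
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency under independence for each cell:
# (row total / G) * (column total / G) * G = row total * column total / G
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for row in expected:
    print([round(e, 1) for e in row])
```

The expected frequencies need not be whole numbers, and they sum to the same grand total as the observed frequencies.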

STEP 2D: Calculate \(\chi^2\) test statistic

Use the tables for Observed and Expected frequencies to calculate the test statistic

\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

Where:

\(r\) is the number of rows
\(c\) is the number of columns
\(O_{ij}\) is the observed frequency in row \(i\) and column \(j\)
\(E_{ij}\) is the expected frequency in row \(i\) and column \(j\)

Table 3. Expected Frequencies

	Not Infected	Lightly Infected	Highly Infected
Eaten by birds	17.0	15.3	15.7
Not eaten by birds	33.0	29.7	30.3

Table 1. Observed Frequencies.

	Not Infected	Lightly Infected	Highly Infected
Eaten by birds	1	10	37
Not eaten by birds	49	35	9

STEP 2D: Calculate \(\chi^2\) test statistic

Use the tables for Observed and Expected frequencies to calculate the test statistic

\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

\[ = \frac{(1-17)^2}{17} + \frac{(49-33)^2}{33} + \\ \frac{(10-15.3)^2}{15.3} + \frac{(35-29.7)^2}{29.7} + \\ \frac{(37-15.7)^2}{15.7} + \frac{(9-30.3)^2}{30.3} \]

\[ = 69.5 \]

Table 3. Expected Frequencies

	Not Infected	Lightly Infected	Highly Infected
Eaten by birds	17.0	15.3	15.7
Not eaten by birds	33.0	29.7	30.3

Table 1. Observed Frequencies.

	Not Infected	Lightly Infected	Highly Infected
Eaten by birds	1	10	37
Not eaten by birds	49	35	9
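The full calculation of the test statistic can be sketched as a single function; the name `chi_square_statistic` is an illustrative choice. Because the code keeps unrounded expected frequencies, it gives about 69.8, slightly above the 69.5 obtained from expected frequencies rounded to one decimal.

```python
def chi_square_statistic(observed):
    """Sum of (O - E)^2 / E over all cells, with E computed under independence."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            chi2 += (o - e) ** 2 / e
    return chi2


observed = [
    [1, 10, 37],   # eaten by birds
    [49, 35, 9],   # not eaten by birds
]

chi2 = chi_square_statistic(observed)
print(f"chi-square = {chi2:.1f}")
```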

STEPS 3-4: \(P\)-value and Interpretation

Decide a priori (beforehand) on a significance level, e.g. \(\alpha=0.05\), then…

Method 1 - exact P-value

Calculate exact \(P\)-value using computer
If \(P<\alpha\) then reject \(H_0\)

Method 2 - statistical table

Calculate degrees of freedom

\(\operatorname{df}=(r-1)(c-1)\)
Look up critical value in a table
If \(\chi^2_{df}>\text{critical value}\) then reject \(H_0\)

How to look up a critical value in a chi-square table

See any chi-square distribution table
Know your \(df\) and \(\alpha\)
For example, at \(df=2\) and \(\alpha=0.05\), the critical value is \(\chi^2=5.991\)

Decision rule: reject \(H_0\) if \(\chi^2_{df}>\text{critical value}\)

\[ \chi^2 = 69.5 \]

\[ \text{critical value} = 5.991 \]

Because \(69.5 > 5.991\), reject \(H_0\) and conclude that parasite infection level and being eaten are not independent
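Both decision methods can be checked with a short sketch. It relies on a special case: when \(df=2\), the chi-square distribution has the closed-form upper tail \(\operatorname{Pr}[\chi^2 > x]=e^{-x/2}\), so the critical value is \(-2\ln(\alpha)\). For other degrees of freedom a statistical library or table would be needed. Variable names are illustrative.

```python
import math

chi2 = 69.5          # test statistic from Step 2D
rows, cols = 2, 3    # dimensions of the contingency table
alpha = 0.05

df = (rows - 1) * (cols - 1)
if df != 2:
    raise ValueError("the closed-form shortcut below is valid only for df = 2")

# Method 1: P-value from the upper tail of the chi-square distribution (df = 2).
p_value = math.exp(-chi2 / 2)

# Method 2: critical value at the chosen alpha (df = 2).
critical_value = -2 * math.log(alpha)

reject_h0 = p_value < alpha
print(f"df = {df}, P = {p_value:.2e}, critical value = {critical_value:.3f}")
print("reject H0" if reject_h0 else "fail to reject H0")
```

The two methods always lead to the same decision, because \(P<\alpha\) exactly when the test statistic exceeds the critical value.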

Wrapping Up: Contingency Analysis

Chi-square test of independence compares observed vs. expected frequencies
- Expected counts computed from row and column totals
When assumptions are strained
- Yates correction: continuity adjustment sometimes recommended for 2 × 2 tables
- Fisher’s exact test: preferred when expected counts are small (any expected count < 5)
Alternative framework
- G-test (likelihood ratio test)
  - Based on likelihood ratios rather than squared deviations
  - Extends more naturally to complex models and multiple explanatory variables
  - Still sensitive to small expected counts
Choose the method based on sample size, expected counts, and study design
Always interpret results in the biological context of the association
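One way to check whether expected counts are a concern is to list the cells whose expected frequency falls below a threshold. This is a sketch only; the function name `small_expected_cells` and the default threshold of 5 (taken from the rule of thumb above) are illustrative.

```python
def small_expected_cells(observed, threshold=5.0):
    """Return (row, column, expected) for each cell with expected count below threshold."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    flagged = []
    for i, r in enumerate(row_totals):
        for j, c in enumerate(col_totals):
            expected = r * c / grand_total
            if expected < threshold:
                flagged.append((i, j, expected))
    return flagged


fish = [
    [1, 10, 37],   # eaten by birds
    [49, 35, 9],   # not eaten by birds
]

print(small_expected_cells(fish))
```

For the fish predation data no cell is flagged (the smallest expected frequency is about 15.3), even though one observed count is 1. The rule of thumb concerns expected counts, not observed counts.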
