Lecture 16
Handling Violations of Assumptions

ABD 3e Chapter 13

Chris Merkord

Learning Objectives

  • Identify when data deviate from normality
  • Interpret histograms and normal quantile (Q–Q) plots
  • Describe when it is reasonable to ignore assumption violations
  • Explain the purpose of data transformations
  • Recognize common transformations (e.g., log, square root)
  • Describe nonparametric alternatives to \(t\)-tests
  • Choose an appropriate approach when assumptions are violated

The animals haven’t read the books

  • Many methods assume approximate normality of the data, especially for small sample sizes.

  • Often frequency distributions aren’t normal, and variances aren’t equal

  • What to do?

Bright orange sea star with patterned arms resting on a textured coral reef surface.
Figure 1: Sea star on a coral reef, illustrating natural biological variation that may not follow ideal statistical assumptions.

Options when assumptions are violated

  1. Use a method that is robust to the violation
    • Many tests (e.g., \(t\)-tests) tolerate moderate departures from normality, especially with larger sample sizes
  2. Transform the data
    • Apply a mathematical transformation (e.g., log, square root) to better meet assumptions
    • Can improve symmetry and variance
  3. Use a nonparametric method
    • Makes fewer assumptions about the distribution
    • Often based on ranks rather than raw values
  4. Use resampling methods
    • Permutation tests or bootstrap methods
    • Do not rely on normality assumptions
  5. Use a method designed for that type of data
    • Counts (e.g., number of individuals)
    • Proportions (e.g., survival rate)
    • Binary outcomes (0/1)
    • These often require different statistical approaches than methods for normal data
    • Discussed later in lecture on multiple regression

Detecting deviations from normality

Graphical methods (eyeball test)

  • Plot the data

  • Histograms of variables

  • Split into categories if you have them

  • Make a normal quantile plot

Figure 2: Histograms of data, all sampled from a perfect normal distribution using a computer. None looks exactly like a normal distribution, but none is so far off that we should give up on the assumption of normality.

Examples of samples drawn from non-normally-distributed populations

  • Not all data follow a symmetric, bell-shaped distribution

  • Common deviations:

    • Skewness (long tail left or right)
    • Heavy tails or outliers
    • Irregular or uneven shapes
  • These deviations can affect statistical methods that assume normality

  • Next: how to detect and handle these situations

Four panels showing different data distributions that deviate from normality, including skewed and uneven shapes rather than a symmetric bell curve.
Figure 3: Examples of samples drawn from non-normal distributions, showing skewness and departures from symmetry.

Normal quantile (Q–Q) plots identify non-normality

  • Each point = one observation

    • X-axis: expected values if the data were normal with the same mean and standard deviation

    • Y-axis: observed values, sorted from smallest to largest

  • Interpretation:

    • Points on a straight line → approximately normal
    • Systematic curvature → non-normal (e.g., skew, heavy tails)
    • Points far from line → potential outliers
Scatterplot of sample quantiles versus theoretical normal quantiles. Points lie close to a diagonal reference line, indicating approximately normal data.
Figure 4: Normal quantile (Q–Q) plot for data sampled from a normal distribution. Points fall near a straight line, indicating that the data are approximately normal.
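In base R, a normal quantile plot takes two lines. A minimal sketch with simulated data (here `y` is a placeholder for your own measurements):

```r
# simulate a sample (placeholder for real data)
set.seed(1)
y <- rnorm(40, mean = 10, sd = 2)

# normal quantile (Q-Q) plot: observed values vs. theoretical normal quantiles
qqnorm(y)
qqline(y)  # reference line through the first and third quartiles
```

Points that hug the line suggest approximate normality; systematic curvature suggests skew or heavy tails.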

Example: biomass ratio

Halpern (2003) calculated biomass ratio as the total mass of all marine plants and animals per unit area of reserve divided by the same quantity in an unprotected control.

Figure 5: Brightly colored intertidal crab, an example of a species whose biomass may differ between protected and unprotected habitats. Whitlock & Schluter. The Analysis of Biological Data 3e. © 2020 W. H. Freeman and Company.
Top panel shows a histogram of biomass ratio values with a long right tail. Bottom panel shows a Q–Q plot where points curve away from a straight line, indicating non-normal data.
Figure 6: Histogram and normal quantile (Q–Q) plot of biomass ratio data showing right-skewness and deviation from normality.

Formal test of normality

  • A Shapiro-Wilk test evaluates the goodness of fit of a normal distribution to a sample.

  • Hypotheses:

    • H0: The data are sampled from a population having a normal distribution.

    • HA: The data are sampled from a population not having a normal distribution.

Warnings:

  • A small sample size might not yield enough power to reject the null even when data are non-normal

  • With large samples, even small deviations from normality can lead to rejection.

  • BUT, as sample size increases, assumption of normality becomes less important

  • Take-home: plot the data and use common sense
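Base R's stats package provides `shapiro.test()`; a minimal sketch with simulated data (`y` is a placeholder for your own sample):

```r
# Shapiro-Wilk test of normality (base R stats package)
set.seed(1)
y <- rnorm(30)    # simulated data; substitute your own measurements

shapiro.test(y)   # small p-value = evidence against H0 (normality)
```

As the take-home message says, interpret the p-value alongside a plot of the data, not in place of one.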

Option 1: When to ignore violations of assumptions

  • Ignore violations of assumptions when:
    • The violations are not extreme
    • A statistical procedure is robust
  • Robust procedures:
    • The answer is not sensitive to violations of the assumptions (some methods are robust, others are not; learn which)

    • Not unduly affected by outliers

    • Provide good performance even with small departures from normal distributions

  • \(t\)-test is less robust when…
    • sample sizes are small
    • there are outliers
    • two distributions differ from normality in different ways
    • variances are unequal (pooled-variance \(t\)-test)
  • For comparing variances:
    • \(F\)-test is not robust to non-normality

    • Levene’s test is

Option 2: Data transformation

  • Sometimes we can transform the data so that they better meet the assumptions

  • A data transformation changes each measurement by the same mathematical formula

  • Use a “prime” mark ( \(\prime\) ) to denote transformed data, e.g., \(Y^\prime\)

    • Pronounced “Y-prime”

Skewed histogram on the left with a crying emoji, arrow pointing to a bell-shaped histogram on the right with a heart-eyes emoji.
Figure 7: Transforming non-normal data toward normality.

The log transformation

  • The log transformation is the most common transformation in biology

  • Procedure:

    • Take the natural log of each measurement

\[ Y^\prime = \operatorname{ln}[Y] \]

R code:

# transform a vector
transformed <- log(y)

# transform a variable in a dataset (mutate() requires the dplyr package)
library(dplyr)
data |> 
  mutate(transformed = log(y))
Two panels comparing distributions before and after log transformation; left panel shows two right-skewed distributions, while the right panel shows their log-transformed versions appearing more symmetric and bell-shaped.
Figure 8: Log transformation makes right-skewed distributions more symmetric and stabilizes differences between groups.

When to use the log transformation

  • Most likely to be useful when:

    • Measurements are ratios or products of variables

    • Frequency distribution of data is skewed right (long tail on right)

    • The group having the larger mean (when comparing two groups) also has the higher standard deviation

    • The data span several orders of magnitude

  • Commonly used for:

    • Measurements such as body size and body mass

    • Count data (e.g. number of individual organisms in an area)

    • If the data contain zeros, you can’t take the log. Sometimes a small constant (e.g., +1) is added first, but this choice should be justified.

\[ Y^\prime = \operatorname{ln}[Y+1] \]

R code:

y_prime <- log1p(y)  # log1p(y) computes ln(y + 1), accurately even for small y

Other transformations

  • Arcsine square root transformation used on data that are proportions

\[ p^\prime = \operatorname{arcsin}[\sqrt{p}] \]

  • Square root transformation used on count data

\[ Y^\prime = \sqrt{Y} \]

  • For left-skewed data try the square transformation or antilog transformation

\[ Y^\prime = Y^2 \]

\[ Y^\prime = e^Y \]

  • For right-skewed data try the reciprocal transformation

\[ Y^\prime = \frac{1}{Y} \]
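In R, each of these transformations is a one-liner. A sketch with made-up values (`y` for positive measurements, `p` for proportions; both hypothetical):

```r
# hypothetical example data
y <- c(1.2, 3.5, 0.8, 10.4)   # positive measurements
p <- c(0.10, 0.45, 0.80)      # proportions between 0 and 1

log_y     <- log(y)           # log transformation (right skew)
sqrt_y    <- sqrt(y)          # square root (count data)
arcsine_p <- asin(sqrt(p))    # arcsine square root (proportions)
square_y  <- y^2              # square (left skew)
antilog_y <- exp(y)           # antilog (left skew)
recip_y   <- 1 / y            # reciprocal (right skew)
```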

Confidence interval for the mean of a non-normally distributed variable

  • Standard t-based confidence intervals assume approximate normality, especially for small samples
  • Steps for non-normal data:
    • Transform the raw data
    • Calculate confidence intervals using methods already discussed
    • Back-transform the upper and lower confidence limits so they are on the original scale of the raw data
  • Example: if you used a log transformation, back-transform using the antilog; note that the back-transformed interval is then a confidence interval for the geometric mean, not the arithmetic mean, on the original scale
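The steps above can be sketched in base R (simulated right-skewed data; all names are placeholders):

```r
# 1. transform, 2. compute a t-based CI, 3. back-transform
set.seed(42)
y <- rlnorm(25, meanlog = 1, sdlog = 0.8)  # simulated right-skewed data

log_y   <- log(y)                   # log transformation
ci_log  <- t.test(log_y)$conf.int   # 95% CI on the log scale
ci_orig <- exp(ci_log)              # antilog back-transform to original scale
ci_orig                             # CI for the geometric mean
```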

Back-transformations

Transform            Back-transform
Log                  Antilog
Arcsine square root  Squared sine, \((\operatorname{sin}[Y^\prime])^2\)
Square root          Square
Square               Square root
Antilog              Log
Reciprocal           Reciprocal

Other considerations when transforming data

  • Interpretability can suffer (values no longer in original units, harder to explain to others)

  • Some transforms do not work well with zero, small numbers, or negative numbers

    • Shifting values before transforming can help
    • E.g. add 1 before log transform, add 0.5 before square root transform
  • Avoid multiple testing

    • Do not just try different transformations until you find one that gives a \(p\)-value smaller than 0.05

    • This increases your chance of a Type I error

    • Instead, decide a priori (ahead of time) which transformation best yields data that meet the assumptions of the statistical method

Option 3: Nonparametric alternatives

Parametric methods

  • Assume normality

Nonparametric methods

  • Make fewer assumptions about the distribution of the variables

    • e.g., the data do not have to be normal
  • Usually based on ranks of the data points, not their actual values

Alternatives to one-sample \(t\)-tests

Use these when data are not approximately normal (for one-sample or paired comparisons).

Sign test

  • Tests whether the median differs from a null value
  • Uses only the direction (+/–) of differences
  • Makes no assumptions about the distribution
  • Low power (less sensitive to differences)

Wilcoxon signed-rank test

  • Tests whether the median differs from a null value
  • Uses ranks of differences (accounts for magnitude)
  • Assumes symmetric distribution of differences
  • More powerful than the sign test
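Both tests can be run in base R; a sketch with hypothetical paired differences (the sign test is carried out as a binomial test on the count of positive differences):

```r
# hypothetical paired differences
d <- c(1.2, -0.4, 0.8, 2.1, 0.3, -0.1, 1.5)

# sign test: binomial test on the number of positive differences
binom.test(sum(d > 0), n = sum(d != 0), p = 0.5)

# Wilcoxon signed-rank test: uses ranks of the absolute differences
wilcox.test(d, mu = 0)
```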

Alternative to two-sample \(t\)-tests

Wilcoxon rank-sum test (aka Mann–Whitney \(U\)-test)

  • Compares the distributions of two groups
  • Uses ranks instead of raw values
  • Requires fewer assumptions than a two-sample \(t\)-test
  • Does not assume normality
  • Sensitive to differences in location, spread, and shape
  • If distributions have similar shape → can be interpreted as a test of medians
  • If distributions differ in shape → difficult to attribute differences to a single cause

Nonparametric tests are typically less powerful than parametric tests
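In R, `wilcox.test()` with two samples performs the rank-sum test; a sketch with hypothetical data:

```r
# hypothetical measurements from two groups
group1 <- c(2.1, 3.5, 4.2, 1.8, 5.0)
group2 <- c(6.3, 7.1, 5.8, 8.0, 6.9)

# Wilcoxon rank-sum (Mann-Whitney U) test
wilcox.test(group1, group2)
```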

Option 4: Resampling methods (no normality required)

  • Do not assume normality
  • Use the data to build distributions (null or sampling)
  • Widely used in modern statistics

Permutation tests

  • Test hypotheses by simulating the null hypothesis using the data
  • Shuffle labels (e.g., group membership) to represent “no effect”
  • Recalculate the statistic many times → build a null distribution
  • p-value = how extreme the observed result is relative to this distribution
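The steps above can be sketched as a permutation test for a difference in group means in base R (simulated data; all names are placeholders):

```r
set.seed(7)
group1 <- rnorm(10, mean = 5)   # simulated data for illustration
group2 <- rnorm(10, mean = 7)
observed <- mean(group1) - mean(group2)

pooled <- c(group1, group2)
n1 <- length(group1)
perm_diffs <- replicate(9999, {
  shuffled <- sample(pooled)    # shuffle group labels ("no effect")
  mean(shuffled[1:n1]) - mean(shuffled[-(1:n1)])
})

# two-sided p-value: proportion of permutations at least as extreme
p_value <- (sum(abs(perm_diffs) >= abs(observed)) + 1) / (9999 + 1)
```

The "+ 1" counts the observed arrangement itself, which keeps the p-value from being exactly zero.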

Bootstrap

  • Estimate uncertainty by resampling the data (with replacement)
  • Recalculate the statistic many times
  • Use the distribution to estimate standard errors and confidence intervals
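A percentile bootstrap for the mean can be sketched in base R (simulated skewed data; names are placeholders):

```r
set.seed(3)
y <- rlnorm(30)   # simulated right-skewed sample

# resample with replacement, recomputing the statistic each time
boot_means <- replicate(9999, mean(sample(y, replace = TRUE)))

se_boot <- sd(boot_means)                         # bootstrap standard error
ci_boot <- quantile(boot_means, c(0.025, 0.975))  # percentile 95% CI
```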

Option 5: Use methods designed for the type of data

  • Some types of data require different models:

    • Counts (e.g., number of individuals) → Poisson or negative binomial models
    • Proportions (e.g., survival rate) → Binomial models
    • Binary outcomes (0/1) → Logistic regression
  • These are examples of generalized linear models (GLMs)

    • Specify a distribution for the response variable
    • Use a link function (e.g., log, logit) to relate variables
  • Avoid forcing normality with transformations when a suitable model exists
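In R, `glm()` fits these models directly; a sketch with simulated count data (variable names are hypothetical):

```r
# simulated count data: counts increase with area on the log scale
set.seed(9)
area   <- runif(50, min = 1, max = 10)
counts <- rpois(50, lambda = exp(0.2 + 0.15 * area))

# Poisson GLM with a log link
fit <- glm(counts ~ area, family = poisson(link = "log"))
summary(fit)

# analogous call for binary outcomes (logistic regression):
# glm(outcome ~ predictor, family = binomial(link = "logit"))
```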

Flowchart showing predictor variables feeding into a linear predictor, then through a link function to the mean of the response, which is associated with a distribution and produces the response variable.
Figure 9: Diagram of a generalized linear model showing how predictors are combined into a linear predictor, linked to the mean of the response, and connected to an appropriate distribution for the data.