Lecture 16
Handling Violations of Assumptions

ABD 3e Chapter 13

Chris Merkord

Learning Objectives

  • Identify when data deviate from normality
  • Interpret histograms and normal quantile (Q–Q) plots
  • Describe when it is reasonable to ignore assumption violations
  • Explain the purpose of data transformations
  • Recognize common transformations (e.g., log, square root)
  • Describe nonparametric alternatives to \(t\)-tests
  • Choose an appropriate approach when assumptions are violated

The animals haven’t read the books

  • Many methods assume approximate normality of the data, especially for small sample sizes.

  • Often frequency distributions aren’t normal, and variances aren’t equal

  • What to do?

Bright orange sea star with patterned arms resting on a textured coral reef surface.
Figure 1: Sea star on a coral reef, illustrating natural biological variation that may not follow ideal statistical assumptions.

Options when assumptions are violated

  1. Use a method that is robust to the violation
    • Many tests (e.g., \(t\)-tests) tolerate moderate departures from normality, especially with larger sample sizes
  2. Transform the data
    • Apply a mathematical transformation (e.g., log, square root) to better meet assumptions
    • Can improve symmetry and variance
  3. Use a nonparametric method
    • Makes fewer assumptions about the distribution
    • Often based on ranks rather than raw values
  4. Use resampling methods
    • Permutation tests or bootstrap methods
    • Do not rely on normality assumptions
  5. Use a method designed for that type of data
    • Counts (e.g., number of individuals)
    • Proportions (e.g., survival rate)
    • Binary outcomes (0/1)
    • These often require different statistical approaches than methods for normal data
    • Discussed later in lecture on multiple regression

Detecting deviations from normality

Graphical methods (eyeball test)

  • Plot the data

  • Histograms of variables

  • Split into categories if you have them

  • Make a normal quantile plot

Figure 2: Histograms of data, all sampled from a perfect normal distribution using a computer. None looks exactly like a normal distribution, but none is so far off that we should give up on the assumption of normality.

Examples of samples drawn from non-normally-distributed populations

  • Not all data follow a symmetric, bell-shaped distribution

  • Common deviations:

    • Skewness (long tail left or right)
    • Heavy tails or outliers
    • Irregular or uneven shapes
  • These deviations can affect statistical methods that assume normality

  • Next: how to detect and handle these situations

Four panels showing different data distributions that deviate from normality, including skewed and uneven shapes rather than a symmetric bell curve.
Figure 3: Examples of samples drawn from non-normal distributions, showing skewness and departures from symmetry.

Normal quantile (Q–Q) plots identify non-normality

  • Each point = one observation

    • X-axis: expected values if the data were normal with the same mean and standard deviation

    • Y-axis: observed values, sorted from smallest to largest

  • Interpretation:

    • Points on a straight line → approximately normal
    • Systematic curvature → non-normal (e.g., skew, heavy tails)
    • Points far from line → potential outliers
Scatterplot of sample quantiles versus theoretical normal quantiles. Points lie close to a diagonal reference line, indicating approximately normal data.
Figure 4: Normal quantile (Q–Q) plot for data sampled from a normal distribution. Points fall near a straight line, indicating that the data are approximately normal.
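In base R, a normal quantile plot takes two lines. A minimal sketch with simulated data (here `y` is a placeholder for your own measurements):

```r
# simulate a sample (placeholder for real data)
set.seed(1)
y <- rnorm(40, mean = 10, sd = 2)

# normal quantile (Q-Q) plot: observed values vs. theoretical normal quantiles
qqnorm(y)
qqline(y)  # reference line through the first and third quartiles
```

Points that hug the line suggest approximate normality; systematic curvature suggests skew or heavy tails.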

Example: biomass ratio

Halpern (2003) calculated biomass ratio as the total mass of all marine plants and animals per unit area of reserve divided by the same quantity in an unprotected control.

Figure 5: Brightly colored intertidal crab, an example of a species whose biomass may differ between protected and unprotected habitats. Whitlock & Schluter. The Analysis of Biological Data 3e. © 2020 W. H. Freeman and Company.
Top panel shows a histogram of biomass ratio values with a long right tail. Bottom panel shows a Q–Q plot where points curve away from a straight line, indicating non-normal data.
Figure 6: Histogram and normal quantile (Q–Q) plot of biomass ratio data showing right-skewness and deviation from normality.

Formal test of normality

  • A Shapiro-Wilk test evaluates the goodness of fit of a normal distribution to a sample.

  • Hypotheses:

    • H0: The data are sampled from a population having a normal distribution.

    • HA: The data are sampled from a population not having a normal distribution.

Warnings:

  • A small sample size might not yield enough power to reject the null even when data are non-normal

  • With large samples, even small deviations from normality can lead to rejection.

  • BUT, as sample size increases, assumption of normality becomes less important

  • Take-home: plot the data and use common sense
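Base R's stats package provides `shapiro.test()`; a minimal sketch with simulated data (`y` is a placeholder for your own sample):

```r
# Shapiro-Wilk test of normality (base R stats package)
set.seed(1)
y <- rnorm(30)    # simulated data; substitute your own measurements

shapiro.test(y)   # small p-value = evidence against H0 (normality)
```

As the take-home message says, interpret the p-value alongside a plot of the data, not in place of one.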

Option 1: When to ignore violations of assumptions

  • Ignore violations of assumptions when:
    • The violations are not extreme
    • A statistical procedure is robust
  • Robust procedures:
    • The answer is not sensitive to violations of the assumptions (some methods are robust, others are not; learn which)

    • Not unduly affected by outliers

    • Provide good performance even with small departures from normal distributions

  • \(t\)-test is less robust when…
    • sample sizes are small
    • there are outliers
    • two distributions differ from normality in different ways
    • variances are unequal (pooled-variance \(t\)-test)
  • For comparing variances:
    • \(F\)-test is not robust to non-normality

    • Levene’s test is

Option 2: Data transformation

  • Sometimes we can transform the data so that they better meet the assumptions

  • A data transformation changes each measurement by the same mathematical formula

  • Use a “prime” mark ( \(\prime\) ) to denote transformed data, e.g., \(Y^\prime\)

    • Pronounced “Y-prime”

Skewed histogram on the left with a crying emoji, arrow pointing to a bell-shaped histogram on the right with a heart-eyes emoji.
Figure 7: Transforming non-normal data toward normality.

The log transformation

  • The log transformation is the most common transformation in biology

  • Procedure:

    • Take the natural log of each measurement

\[ Y^\prime = \operatorname{ln}[Y] \]

R code:

# transform a vector
transformed <- log(y)

# transform a variable in a dataset (mutate() requires the dplyr package)
library(dplyr)
data |> 
  mutate(transformed = log(y))
Two panels comparing distributions before and after log transformation; left panel shows two right-skewed distributions, while the right panel shows their log-transformed versions appearing more symmetric and bell-shaped.
Figure 8: Log transformation makes right-skewed distributions more symmetric and stabilizes differences between groups.

When to use the log transformation

  • Most likely to be useful when:

    • Measurements are ratios or products of variables

    • Frequency distribution of data is skewed right (long tail on right)

    • The group having the larger mean (when comparing two groups) also has the higher standard deviation

    • The data span several orders of magnitude

  • Commonly used for:

    • Measurements such as body size and body mass

    • Count data (e.g. number of individual organisms in an area)

    • If the data contain zeros, you can’t take the log. Sometimes a small constant (e.g., +1) is added first, but this choice should be justified.

\[ Y^\prime = \operatorname{ln}[Y+1] \]

R code:

y_prime <- log1p(y)  # log1p(y) computes ln(y + 1), accurately even for small y

Other transformations

  • Arcsine square root transformation used on data that are proportions

\[ p^\prime = \operatorname{arcsin}[\sqrt{p}] \]

  • Square root transformation used on count data

\[ Y^\prime = \sqrt{Y} \]

  • For left-skewed data try the square transformation or antilog transformation

\[ Y^\prime = Y^2 \]

\[ Y^\prime = e^Y \]

  • For right-skewed data try the reciprocal transformation

\[ Y^\prime = \frac{1}{Y} \]
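In R, each of these transformations is a one-liner. A sketch with made-up values (`y` for positive measurements, `p` for proportions; both hypothetical):

```r
# hypothetical example data
y <- c(1.2, 3.5, 0.8, 10.4)   # positive measurements
p <- c(0.10, 0.45, 0.80)      # proportions between 0 and 1

log_y     <- log(y)           # log transformation (right skew)
sqrt_y    <- sqrt(y)          # square root (count data)
arcsine_p <- asin(sqrt(p))    # arcsine square root (proportions)
square_y  <- y^2              # square (left skew)
antilog_y <- exp(y)           # antilog (left skew)
recip_y   <- 1 / y            # reciprocal (right skew)
```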

Confidence interval for the mean of a non-normally distributed variable

  • Standard t-based confidence intervals assume approximate normality, especially for small samples
  • Steps for non-normal data:
    • Transform the raw data
    • Calculate confidence intervals using methods already discussed
    • Back-transform the upper and lower confidence limits so they are on the original scale of the raw data
  • Example: if you used a log transformation, back-transform using the antilog; note that the back-transformed interval is then a confidence interval for the geometric mean, not the arithmetic mean, on the original scale
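The steps above can be sketched in base R (simulated right-skewed data; all names are placeholders):

```r
# 1. transform, 2. compute a t-based CI, 3. back-transform
set.seed(42)
y <- rlnorm(25, meanlog = 1, sdlog = 0.8)  # simulated right-skewed data

log_y   <- log(y)                   # log transformation
ci_log  <- t.test(log_y)$conf.int   # 95% CI on the log scale
ci_orig <- exp(ci_log)              # antilog back-transform to original scale
ci_orig                             # CI for the geometric mean
```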

Back-transformations

Transform            Back-transform
Log                  Antilog
Arcsine square root  Squared sine, \((\operatorname{sin}[Y^\prime])^2\)
Square root          Square
Square               Square root
Antilog              Log
Reciprocal           Reciprocal

Other considerations when transforming data

  • Interpretability can suffer (values no longer in original units, harder to explain to others)

  • Some transforms do not work well with zero, small numbers, or negative numbers

    • Shifting values before transforming can help
    • E.g. add 1 before log transform, add 0.5 before square root transform
  • Avoid multiple testing

    • Do not just try different transformations until you find one that gives a \(p\)-value smaller than 0.05

    • This increases your chance of a Type I error

    • Instead, decide a priori (ahead of time) which transformation best yields data that meet the assumptions of the statistical method

Option 3: Nonparametric alternatives

Parametric methods

  • Assume normality

Nonparametric methods

  • Make fewer assumptions about the distribution of the variables

    • e.g., the data do not have to be normal
  • Usually based on ranks of the data points, not their actual values

Alternatives to one-sample \(t\)-tests

Use these when data are not approximately normal (for one-sample or paired comparisons).

Sign test

  • Tests whether the median differs from a null value
  • Uses only the direction (+/–) of differences
  • Makes no assumptions about the distribution
  • Low power (less sensitive to differences)

Wilcoxon signed-rank test

  • Tests whether the median differs from a null value
  • Uses ranks of differences (accounts for magnitude)
  • Assumes symmetric distribution of differences
  • More powerful than the sign test
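Both tests can be run in base R; a sketch with hypothetical paired differences (the sign test is carried out as a binomial test on the count of positive differences):

```r
# hypothetical paired differences
d <- c(1.2, -0.4, 0.8, 2.1, 0.3, -0.1, 1.5)

# sign test: binomial test on the number of positive differences
binom.test(sum(d > 0), n = sum(d != 0), p = 0.5)

# Wilcoxon signed-rank test: uses ranks of the absolute differences
wilcox.test(d, mu = 0)
```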

Alternative to two-sample \(t\)-tests

Wilcoxon rank-sum test (aka Mann–Whitney \(U\)-test)

  • Compares the distributions of two groups
  • Uses ranks instead of raw values
  • Requires fewer assumptions than a two-sample \(t\)-test
  • Does not assume normality
  • Sensitive to differences in location, spread, and shape
  • If distributions have similar shape → can be interpreted as a test of medians
  • If distributions differ in shape → difficult to attribute differences to a single cause

Nonparametric tests are typically less powerful than parametric tests
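In R, `wilcox.test()` with two samples performs the rank-sum test; a sketch with hypothetical data:

```r
# hypothetical measurements from two groups
group1 <- c(2.1, 3.5, 4.2, 1.8, 5.0)
group2 <- c(6.3, 7.1, 5.8, 8.0, 6.9)

# Wilcoxon rank-sum (Mann-Whitney U) test
wilcox.test(group1, group2)
```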

Option 4: Resampling methods (no normality required)

  • Do not assume normality
  • Use the data to build distributions (null or sampling)
  • Widely used in modern statistics

Permutation tests

  • Test hypotheses by simulating the null hypothesis using the data
  • Shuffle labels (e.g., group membership) to represent “no effect”
  • Recalculate the statistic many times → build a null distribution
  • p-value = how extreme the observed result is relative to this distribution
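The steps above can be sketched as a permutation test for a difference in group means in base R (simulated data; all names are placeholders):

```r
set.seed(7)
group1 <- rnorm(10, mean = 5)   # simulated data for illustration
group2 <- rnorm(10, mean = 7)
observed <- mean(group1) - mean(group2)

pooled <- c(group1, group2)
n1 <- length(group1)
perm_diffs <- replicate(9999, {
  shuffled <- sample(pooled)    # shuffle group labels ("no effect")
  mean(shuffled[1:n1]) - mean(shuffled[-(1:n1)])
})

# two-sided p-value: proportion of permutations at least as extreme
p_value <- (sum(abs(perm_diffs) >= abs(observed)) + 1) / (9999 + 1)
```

The "+ 1" counts the observed arrangement itself, which keeps the p-value from being exactly zero.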

Bootstrap

  • Estimate uncertainty by resampling the data (with replacement)
  • Recalculate the statistic many times
  • Use the distribution to estimate standard errors and confidence intervals
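A percentile bootstrap for the mean can be sketched in base R (simulated skewed data; names are placeholders):

```r
set.seed(3)
y <- rlnorm(30)   # simulated right-skewed sample

# resample with replacement, recomputing the statistic each time
boot_means <- replicate(9999, mean(sample(y, replace = TRUE)))

se_boot <- sd(boot_means)                         # bootstrap standard error
ci_boot <- quantile(boot_means, c(0.025, 0.975))  # percentile 95% CI
```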

Option 5: Use methods designed for the type of data

  • Some types of data require different models:

    • Counts (e.g., number of individuals) → Poisson or negative binomial models
    • Proportions (e.g., survival rate) → Binomial models
    • Binary outcomes (0/1) → Logistic regression
  • These are examples of generalized linear models (GLMs)

    • Specify a distribution for the response variable
    • Use a link function (e.g., log, logit) to relate variables
  • Avoid forcing normality with transformations when a suitable model exists
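In R, `glm()` fits these models directly; a sketch with simulated count data (variable names are hypothetical):

```r
# simulated count data: counts increase with area on the log scale
set.seed(9)
area   <- runif(50, min = 1, max = 10)
counts <- rpois(50, lambda = exp(0.2 + 0.15 * area))

# Poisson GLM with a log link
fit <- glm(counts ~ area, family = poisson(link = "log"))
summary(fit)

# analogous call for binary outcomes (logistic regression):
# glm(outcome ~ predictor, family = binomial(link = "logit"))
```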

Flowchart showing predictor variables feeding into a linear predictor, then through a link function to the mean of the response, which is associated with a distribution and produces the response variable.
Figure 9: Diagram of a generalized linear model showing how predictors are combined into a linear predictor, linked to the mean of the response, and connected to an appropriate distribution for the data.