BIOL 275 Biostatistics

Learning Objectives

Identify the explanatory variable and response variable in a regression study
Recognize when a scatterplot suggests an approximately linear relationship between two numerical variables
Explain how the least squares regression line summarizes the relationship between variables
Interpret the slope, intercept, and predicted values of a regression equation in context
Distinguish confidence bands for the mean response from prediction intervals for individual outcomes
Interpret the slope test, \(R^2\), and other evidence beyond the p-value
Use residual plots to assess linearity, constant variance, and unusual observations

What is regression?

Regression is used to predict values of one numerical variable from values of another numerical variable.
A regression line summarizes the relationship between two variables in a scatter plot.
It can be used to estimate the expected value of a response variable from an explanatory variable.
In this chapter, we focus on linear regression, where the relationship is described by a straight line.

Regression example

Genetic diversity in human populations is related to distance from East Africa.
A fitted line can be used to predict expected genetic diversity from migration distance.
The downward slope suggests diversity decreases as distance increases.

Black-and-white photograph showing two views of a reconstructed fossil human skull from Herto, Ethiopia. The left image shows a side view with an elongated cranial vault and facial bones; the right image shows a front view with missing portions and visible reconstruction gaps. The specimen is displayed against a dark background. — Figure 1: Fossil crania from Herto, Ethiopia, dated to approximately 160,000–154,000 years ago, representing early Homo sapiens from East Africa. Included here as contextual evidence for human origins relevant to geographic-distance examples in regression. Source: White et al. (2003), Nature.

Scatter plot of human populations showing geographic distance from East Africa on the x-axis and genetic diversity (Hs) on the y-axis. Populations are colored by region: Africa, Europe, Asia, Middle East, Oceania, and the Americas. Most African populations cluster at short distances with high diversity near 0.78, while populations farther from East Africa, especially in the Americas, show lower diversity. A downward sloping regression line indicates decreasing diversity with increasing distance. — Figure 2: Genetic diversity declines with increasing geographic distance from East Africa across sampled human populations. Points represent populations grouped by world region; line shows the fitted linear regression. Source: Prugnolle et al. (2005), Current Biology.

Regression vs correlation

Both methods describe relationships between two numerical variables.
Correlation measures strength and direction of association.
Regression fits an equation to predict one variable from another.
Regression also measures how much the response changes when the explanatory variable changes.

Two variables in regression

Response variable (\(Y\)): the outcome we want to predict.
Explanatory variable (\(X\)): the variable used to explain or predict \(Y\).
In scatter plots for regression:
- \(X\) is placed on the horizontal axis.
- \(Y\) is placed on the vertical axis.

Blank scatterplot with no plotted data points. The horizontal axis is labeled X (Explanatory Variable), and the vertical axis is labeled Y (Response Variable), illustrating the standard axis arrangement used in regression. — Figure 3: Axes used in regression scatterplots, with the explanatory variable on the horizontal axis and the response variable on the vertical axis.

Study designs for regression

Regression can be used with observational data.
Example: randomly sample individuals and measure both \(X\) and \(Y\).
Regression can also be used in experiments.
Example: choose treatment levels of \(X\), then measure response \(Y\).

Linear regression

The most common type of regression is linear regression.
It fits a straight line through data to predict \(Y\) from \(X\).
A key assumption is that the true relationship is approximately linear.
If the relationship is strongly curved, a straight line may be inappropriate.

Scatterplot with eight red data points rising from lower left to upper right. The horizontal axis is labeled X (Explanatory Variable) and the vertical axis is labeled Y (Response Variable). A gray straight line passes through the points, indicating a positive linear trend with modest scatter around the line. — Figure 4: Example scatterplot showing a positive linear relationship between an explanatory variable and a response variable, with a fitted regression line.

Example 17.1: The lion’s nose

Managers want to estimate the ages of male lions.
Older males may be removed with less disruption than younger males.
Black pigmentation on the nose increases with age.
We use proportion black on the nose to predict age.

Figure 5: Photo of a male African Lion (*Panthera leo*). Image: Clément Bardot (CC BY-SA 4.0)

Lion data

Data from 32 male lions of known age.
\(X\) = proportion black on the nose.
\(Y\) = age (years).
A scatter plot is the first step.

Scatterplot of lion age in years versus proportion black on the nose. Points show individual lions, with ages generally increasing as nose pigmentation increases. — Figure 6: Age of 32 male lions plotted against the proportion of black pigmentation on the nose. Data from Whitlock & Schluter 3e.

The method of least squares

Many lines can be drawn through a scatter plot.
We need a rule for choosing the best line.
Least squares chooses the line with the smallest sum of squared vertical deviations from the points.
These vertical deviations are called residuals.

Comparing possible lines

Poorly chosen lines have large deviations from the data.
Better lines have smaller deviations.
The least squares line minimizes the sum of squared deviations.

Three-panel figure using the same lion age data in each panel. Each panel shows age in years versus proportion black on the nose, a candidate straight line, and vertical segments from each point to the line. The left panel has a poorly fitting line with large deviations, the middle panel has a line with smaller deviations, and the right panel has the least-squares line with the smallest deviations overall. — Figure 7: Lion age data shown with three candidate regression lines that produce large, smaller, and smallest vertical deviations from the observed points.

Why square the deviations?

Points above the line have positive residuals.
Points below the line have negative residuals.
If we simply added deviations, positives and negatives could cancel.
Squaring avoids cancellation and gives more weight to large errors.
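
The comparison of candidate lines, and the reason for squaring, can be checked numerically. The following is a minimal Python sketch that uses small made-up data (not the lion measurements); the three candidate lines and the helper name `residuals` are illustrative assumptions. It shows that raw deviations can sum to zero even for a badly fitting line, while the sum of squared deviations separates the candidates.

```python
# Hypothetical toy data for illustration only (not the lion measurements).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def residuals(a, b, xs, ys):
    """Vertical deviations of each point from the line a + b*x."""
    return [yi - (a + b * xi) for xi, yi in zip(xs, ys)]

# Candidate lines as (intercept, slope). The first is a flat line at the
# mean of y; the last is the least squares line for this toy data.
candidates = {
    "flat at mean": (6.02, 0.0),
    "closer": (1.0, 1.7),
    "least squares": (0.05, 1.99),
}

raw_sum = {}
squared_sum = {}
for name, (a, b) in candidates.items():
    res = residuals(a, b, x, y)
    raw_sum[name] = sum(res)                      # positives and negatives cancel
    squared_sum[name] = sum(r ** 2 for r in res)  # squaring prevents cancellation
    print(f"{name}: sum = {raw_sum[name]:.3f}, sum of squares = {squared_sum[name]:.3f}")
```

The flat line has deviations that sum to essentially zero, yet its sum of squares is by far the largest of the three, so only the squared criterion identifies it as a poor fit.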

Formula for the line

A regression line is written as:

\[ Y = a + bX \]

\(a\) is the intercept.
\(b\) is the slope.
Together, they determine the location and tilt of the line.

The intercept and slope

Intercept: the predicted value of \(Y\) when \(X = 0\).
- Where the line crosses the \(Y\)-axis.
- Units are the same as those of the response variable.
Slope: the predicted change in \(Y\) for a one-unit increase in \(X\).
- The rate of change in \(Y\) per unit of \(X\); units are response units per unit of \(X\).
- Positive slope: larger \(X\) predicts larger \(Y\).
- Negative slope: larger \(X\) predicts smaller \(Y\).
- Zero slope: no linear trend.

Four-panel figure. The first panel shows points around an upward sloping line labeled b positive. The second shows points around a downward sloping line labeled b negative. The third shows points around a horizontal line labeled b = 0. The fourth shows two parallel upward sloping lines labeled a higher and a lower, with filled points for the higher intercept and open points for the lower intercept. — Figure 8: Example scatterplots showing positive, negative, and zero slopes, and a comparison of higher and lower intercepts.

Calculating the slope

For a sample, the least squares slope is:

\[ b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} \]

The numerator measures whether \(X\) and \(Y\) tend to increase or decrease together.
The denominator measures how much the \(X\) values vary.
Their ratio determines the slope of the fitted regression line.

Calculating the intercept

Once the slope is known, the intercept can be calculated as:

\[ a = \bar{Y} - b\bar{X} \]

The fitted regression line always passes through the point:

\[ (\bar{X}, \bar{Y}) \]

This means the line passes through the sample means of \(X\) and \(Y\).
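
The slope and intercept formulas can be applied directly. This is a minimal Python sketch on made-up data (not the lion measurements); the variable names `sxy` and `sxx` are shorthand introduced here for the numerator and denominator of the slope formula.

```python
# Hypothetical toy data for illustration only (not the lion measurements).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Numerator: do X and Y deviate from their means in the same direction?
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
# Denominator: how much do the X values vary?
sxx = sum((xi - x_bar) ** 2 for xi in x)

b = sxy / sxx          # least squares slope
a = y_bar - b * x_bar  # least squares intercept

# The fitted line evaluated at x_bar should equal y_bar.
fitted_at_mean = a + b * x_bar
print(f"slope = {b:.2f}, intercept = {a:.2f}")
```

Evaluating the line at \(\bar{X}\) returns \(\bar{Y}\), which confirms that the fitted line passes through the point of sample means.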

Lion regression equation

For the lion data, the fitted line is approximately:

\[ \hat{Y} = 0.88 + 10.65X \]

Equivalent interpretation:

\[ \widehat{\text{Age}} = 0.88 + 10.65(\text{proportion black}) \]

Scatterplot of lion age in years versus proportion black on the nose. Blue points represent individual lions. A gray upward sloping regression line is shown, with the equation y = 0.88 + 10.65x displayed inside the plot. Ages tend to increase as nose pigmentation increases. — Figure 9: Lion age plotted against the proportion of black pigmentation on the nose, with the fitted linear regression line and equation.

Interpreting the lion slope

Each 1.0 increase in proportion black predicts about 10.65 more years of age.
More practically, a 0.10 increase predicts about:

\[ 10.65 \times 0.10 = 1.065 \]

years older on average.

As shown in Figure 9, age increases with pigmentation.

Lion fitted line

The line summarizes the average trend.
Individual lions vary around the line.
Not every point falls exactly on the prediction line.

Populations and samples

The fitted line from data is a sample estimate.
We use it to estimate the true regression line in the population.
Population form:

\[ Y = \alpha + \beta X \]

Sample estimates:
- \(a\) estimates \(\alpha\)
- \(b\) estimates \(\beta\)

Predicted values

A predicted value lies on the regression line.
We use:

\[ \hat{Y} \]

for a prediction.
\(\hat{Y}\) estimates the mean response for all individuals with a given \(X\) value.

Lion prediction example

Predict age when proportion black is \(X = 0.50\):

\[ \hat{Y} = 0.88 + 10.65(0.50) = 6.205 \approx 6.2 \]

Lions with \(X = 0.50\) are predicted to average about 6.2 years old.
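
The same arithmetic can be wrapped in a small function. This is a sketch in Python using the rounded coefficients from the fitted lion equation; the function name `predict_age` is a hypothetical helper, not part of any library.

```python
# Rounded coefficients of the fitted lion line given above.
INTERCEPT = 0.88  # a, in years
SLOPE = 10.65     # b, in years per unit of proportion black

def predict_age(proportion_black):
    """Predicted mean age (years) for lions with the given proportion black."""
    return INTERCEPT + SLOPE * proportion_black

age_at_half = predict_age(0.50)
# Predicted change in age for a 0.10 increase in proportion black.
change_per_tenth = predict_age(0.60) - predict_age(0.50)

print(round(age_at_half, 3))
print(round(change_per_tenth, 3))
```

The first value reproduces the prediction of about 6.2 years at \(X = 0.50\); the second reproduces the slope interpretation of about 1.065 years per 0.10 increase.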

Use caution with prediction

Predictions are most reliable within the observed range of \(X\) values.
Predicting far outside the data range is called extrapolation.
Extrapolation may be misleading because the relationship may change beyond the sampled data.
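
One simple safeguard is to flag any prediction whose \(X\) value falls outside the observed range. The sketch below is in Python and assumes a made-up list of observed proportions (the actual lion measurements are not reproduced here); the function name `predict_with_range_check` is hypothetical.

```python
# Rounded coefficients of the fitted lion line.
INTERCEPT = 0.88
SLOPE = 10.65

# Hypothetical observed proportions, for illustration only.
observed_x = [0.10, 0.22, 0.35, 0.48, 0.61, 0.74]

def predict_with_range_check(x_new, xs):
    """Return (prediction, is_extrapolation) for a new X value."""
    is_extrapolation = x_new < min(xs) or x_new > max(xs)
    return INTERCEPT + SLOPE * x_new, is_extrapolation

inside = predict_with_range_check(0.50, observed_x)
outside = predict_with_range_check(0.95, observed_x)

for label, (pred, flag) in [("X = 0.50", inside), ("X = 0.95", outside)]:
    note = "extrapolation, interpret with caution" if flag else "within observed range"
    print(f"{label}: predicted {pred:.2f} ({note})")
```

The check does not make an extrapolated prediction more accurate; it only marks the cases where the straight-line assumption has not been examined against data.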

Diagram showing observed data clustered on the left side of a scatterplot, with a fitted blue linear regression line closely matching the data range. A red curved line represents the true underlying relationship. Far to the right, a new predictor value X* is outside the observed data range. The blue regression line gives a high predicted response there, while the red true curve is much lower, illustrating how extrapolation beyond the data can lead to inaccurate predictions. — Figure 10: Extrapolation can produce misleading predictions when a linear regression line is extended beyond the observed data range. Image: Kostia, via Stack Exchange

Figure 11: Very serious examples of extrapolation gone awry. Images: xkcd (CC BY-NC 2.5)

Confidence in predictions

A regression line gives the best predicted mean value of \(Y\) for each value of \(X\).
Predictions are most reliable within the observed range of the explanatory variable.
Every prediction includes uncertainty because the fitted line was estimated from sample data.
Two common goals:
- predict the mean response at a given \(X\)
- predict a single future observation at that \(X\)

Mean response vs individual response

These two questions are different:
- What is the average value of \(Y\) when \(X= x\)?
- What value of \(Y\) might one individual case have when \(X= x\)?
Both use the same regression line.
The prediction for an individual is less precise because individuals vary around the line.

Confidence bands vs prediction intervals

Confidence band: interval for the mean response
- Narrower.
- Reflects uncertainty in the fitted line only.
- Used when studying the overall trend.

Prediction interval: interval for a single observation
- Wider.
- Includes uncertainty in the line plus individual variation.
- Used when predicting one case.

Two-panel figure using the lion nose data. The left panel shows the scatterplot with the fitted regression line and dashed confidence bands for the mean predicted age. The right panel shows the same scatterplot with the fitted regression line and wider dashed prediction intervals for individual lion ages. — Figure 12: Confidence bands and prediction intervals for lion age predicted from the proportion of black pigmentation on the nose.

Uncertainty increases near the edges

Confidence bands are narrowest near \(\bar{X}\), the center of the observed \(X\) values.
Confidence bands widen toward the smallest and largest observed values of \(X\).
There is less information about the mean response at the extremes of the data.
Prediction intervals are wider overall because they must also include individual variation.
Prediction intervals may widen somewhat near the edges, but this is often less noticeable.
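
The widening near the edges follows from the standard errors used to build the two intervals. As a sketch in the notation used above, let \(X^*\) be the value of \(X\) at which we predict and let \(\mathrm{MS}_{\text{residual}}\) be the residual mean square (both symbols are introduced here for illustration):

\[ \mathrm{MS}_{\text{residual}} = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2} \]

\[ \mathrm{SE}_{\text{mean}} = \sqrt{\mathrm{MS}_{\text{residual}} \left( \frac{1}{n} + \frac{(X^* - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right)} \]

\[ \mathrm{SE}_{\text{individual}} = \sqrt{\mathrm{MS}_{\text{residual}} \left( 1 + \frac{1}{n} + \frac{(X^* - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right)} \]

Each interval is \(\hat{Y}\) plus or minus a critical \(t\) value (with \(n - 2\) degrees of freedom) times the corresponding standard error. Both standard errors are smallest at \(X^* = \bar{X}\) and grow as \(X^*\) moves away from it. The extra 1 in the second formula represents individual variation; because it usually dominates the other terms, prediction intervals are wider overall and their widening near the edges is proportionally smaller.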

Testing whether the slope is zero

A common hypothesis test in regression asks whether the population slope equals zero.

\[ H_0:\beta = 0 \]

\[ H_A:\beta \ne 0 \]

If \(\beta = 0\), there is no linear relationship between \(X\) and the mean of \(Y\).
If the p-value is small (less than the chosen significance level), we reject \(H_0\).
The significance level is conventionally written \(\alpha\); it is not the population intercept \(\alpha\) from the regression equation.
This test is usually called the t-test for the slope of the regression line.
R reports it automatically in the output of summary(lm_object).
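
The test statistic behind this test is the slope estimate divided by its standard error, compared with a \(t\) distribution with \(n - 2\) degrees of freedom. The following Python sketch computes the statistic by hand on made-up data (not the lion measurements); it stops short of the p-value, which requires the \(t\) distribution and is not available in the Python standard library.

```python
import math

# Hypothetical toy data for illustration only (not the lion measurements).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar

# Residual sum of squares and residual mean square.
ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
df = n - 2
ms_residual = ss_residual / df

# Standard error of the slope and the t statistic for H0: beta = 0.
se_b = math.sqrt(ms_residual / sxx)
t_stat = b / se_b

print(f"b = {b:.3f}, SE = {se_b:.4f}, t = {t_stat:.2f}, df = {df}")
```

A large absolute value of the statistic relative to the \(t\) distribution with \(n - 2\) degrees of freedom corresponds to a small p-value.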

Interpreting the slope test

A significant result (\(p<\alpha\)) provides evidence that \(X\) is linearly associated with \(Y\) and may help predict \(Y\).
It does not establish causation.
Statistical significance does not necessarily mean a strong relationship or a large slope.
A small slope can be statistically significant if the sample size is large.

Two-panel scatterplot. The left panel shows a small sample with a strong positive linear relationship and fitted regression line. The right panel shows a very large sample with a weak positive linear relationship and fitted regression line. Both panels include p-values from linear regression tests. — Figure 13: Two example datasets showing that statistical significance can occur with either a strong relationship in a small sample or a weak relationship in a very large sample.

Beyond the p-value

A \(p\)-value alone does not describe the size, precision, or practical importance of a relationship.
Interpret the slope test together with other evidence from the model and the data.
Always examine:
- slope estimate (effect size)
- confidence interval (precision)
- scatterplot (pattern, outliers, linearity)
- practical importance (real-world relevance)

Using \(R^2\) to measure fit

\(R^2\) measures the proportion of variation in \(Y\) explained by the regression model

\[ 0 \le R^2 \le 1 \]

Larger values mean points lie closer to the fitted line
Smaller values mean more unexplained scatter
Example: \(R^2 = 0.22\) means 22% of variation is explained
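
\(R^2\) can be computed as one minus the ratio of unexplained variation to total variation in \(Y\). The Python sketch below does this on made-up data (not the lion measurements) and, as a cross-check, compares the result with the squared correlation coefficient; the helper name `fit_line` is hypothetical.

```python
import math

# Hypothetical toy data for illustration only (not the lion measurements).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def fit_line(xs, ys):
    """Least squares intercept and slope for a single explanatory variable."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

a, b = fit_line(x, y)
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

ss_total = sum((yi - y_bar) ** 2 for yi in y)                       # all variation in Y
ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained part
r_squared = 1 - ss_residual / ss_total

# Correlation coefficient r, for comparison with R^2.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
r = sxy / math.sqrt(sxx * ss_total)

print(f"R^2 = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
```

With a single explanatory variable the two printed values agree, which is one way to see how \(r\) and \(R^2\) are connected.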

Do not confuse these regression statistics

Correlation coefficient (\(r\))
- measures strength and direction of a linear relationship
- ranges from \(-1\) to \(1\)
Coefficient of determination (\(R^2\))
- measures the proportion of variation in \(Y\) explained by the model
- ranges from \(0\) to \(1\)
P-value for the slope
- tests whether the population slope differs from zero
- does not measure effect size or model fit
These statistics are related (in regression with a single explanatory variable, \(R^2 = r^2\)), but they answer different questions.

Assumptions of linear regression

Relationship between \(X\) and mean \(Y\) is approximately linear
Residuals have similar spread across values of \(X\)
Observations are independent
Residuals are approximately normal (mainly important for tests and intervals)
Use graphs to evaluate assumptions

Schematic figure showing four vertical violin-shaped distributions of Y at four X values. The centers of the distributions lie on an upward sloping regression line, and all four distributions have the same spread, illustrating linearity, normality of Y at each X, and equal variance. — Figure 14: Illustration of the assumptions of linear regression: at each value of X, the response variable is normally distributed around the true regression line with equal variance.

Residual plots help assess regression model assumptions

A residual is the difference between the observed value and the predicted value:

\[ e_i = y_i - \hat{y}_i \]

Positive residuals: points above the regression line
Negative residuals: points below the line
Plot residuals against \(X\) or against fitted values (\(\hat{y}\))
Help assess linearity, constant variance, and unusual observations
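
Residuals are straightforward to compute once the line is fitted. This Python sketch uses made-up data (not the lion measurements) and a hypothetical helper `fit_line`; it lists each residual with its position relative to the line, which is the information a residual plot displays.

```python
# Hypothetical toy data for illustration only (not the lion measurements).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def fit_line(xs, ys):
    """Least squares intercept and slope for a single explanatory variable."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

a, b = fit_line(x, y)
fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]  # e_i = y_i - y_hat_i

for xi, ei in zip(x, residuals):
    position = "above the line" if ei > 0 else "below the line"
    print(f"x = {xi:.1f}: residual = {ei:+.3f} ({position})")

# For a least squares line with an intercept, residuals sum to (about) zero.
residual_sum = sum(residuals)
```

Plotting `residuals` against `x` or against `fitted` gives the residual plots described above.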

Two-panel figure of residuals versus explanatory variable X. The left panel shows residuals scattered randomly around zero with a similar vertical spread across the full range of X, consistent with constant variance. The right panel shows residuals with a funnel shape: residuals have wide spread at low X values and narrower spread at high X values, indicating unequal variance. — Figure 15: Example residual plots used to assess model assumptions. In the left panel, residuals are centered around zero with a similar vertical spread across values of X, consistent with constant variance. In the right panel, the spread of residuals decreases as X increases, forming a funnel shape that suggests unequal variance (heteroscedasticity).

Detecting nonlinearity

If the scatterplot bends or levels off, a straight-line model may be inappropriate.
Residual plots can reveal curved patterns more clearly.
A curved residual pattern suggests the line is not fitting well.
Random scatter around zero is more consistent with a good linear fit.
Do not force a straight line onto clearly curved data.
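
A curved residual pattern is easy to reproduce. In this Python sketch the data are made up to follow \(Y = X^2\) exactly, so the true relationship is known to be nonlinear; the helper `fit_line` is hypothetical.

```python
# Hypothetical data following Y = X^2 exactly, so the relationship is curved.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [xi ** 2 for xi in x]

def fit_line(xs, ys):
    """Least squares intercept and slope for a single explanatory variable."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Signs of the residuals from low X to high X.
signs = ["+" if e > 0 else "-" for e in residuals]
print(residuals)
print(" ".join(signs))
```

The residuals are positive at both ends and negative in the middle, the U-shaped pattern that signals a straight line is missing curvature in the data.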

Single-panel residual plot of residuals versus explanatory variable X. Points form a curved U-shaped pattern around the horizontal zero line, with positive residuals at low and high X values and negative residuals in the middle, indicating nonlinearity. — Figure 16: Residual plot showing a curved pattern caused by fitting a straight-line regression to a nonlinear relationship. Residuals tend to be positive at low and high values of X and negative in the middle, indicating that a linear model is inappropriate.

Transformations

Transformations can improve linearity or stabilize variance.

Common examples:
- \(\log(Y)\)
- \(\log(X)\)
- \(\log(X)\) and \(\log(Y)\)
- \(\sqrt{Y}\) for counts

After transformation, interpret results on the transformed scale unless back-transformed.
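
As an illustration of a log transformation, the Python sketch below uses made-up data generated from \(Y = 2e^{0.5X}\), which is curved on the original scale but a straight line on the \(\log(Y)\) scale. The data, the constants 2 and 0.5, and the helper `fit_line` are all assumptions chosen for the example.

```python
import math

# Hypothetical data from Y = 2 * exp(0.5 * X): curved on the original scale.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 * math.exp(0.5 * xi) for xi in x]

def fit_line(xs, ys):
    """Least squares intercept and slope for a single explanatory variable."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Fit the line to log(Y) rather than Y.
log_y = [math.log(yi) for yi in y]
a_log, b_log = fit_line(x, log_y)

# Coefficients are on the log scale; exponentiate to back-transform.
multiplier_at_zero = math.exp(a_log)  # predicted Y at X = 0
factor_per_unit = math.exp(b_log)     # multiplicative change in Y per unit X

print(f"log-scale intercept = {a_log:.4f}, slope = {b_log:.4f}")
print(f"back-transformed: Y = {multiplier_at_zero:.2f} at X = 0, times {factor_per_unit:.3f} per unit X")
```

On the log scale the slope is an additive change in \(\log(Y)\); after back-transformation it becomes a multiplicative change in \(Y\), which is why the scale of interpretation must be stated.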

Regression toward the mean

Extremely high or low observations often appear closer to average when measured again
This occurs when repeated measurements are imperfectly correlated and include random variation
Example:
- A very high cholesterol reading at a first visit tends to be lower at a later visit, even without treatment
Improvement after an extreme starting value does not necessarily mean the treatment worked
A control group helps separate real treatment effects from regression toward the mean
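
Regression toward the mean can be demonstrated with a small simulation. In this Python sketch every number is an assumption chosen for illustration: each simulated person has a stable underlying level, each visit adds independent random variation, and no treatment is applied between visits.

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

n = 10_000
# Hypothetical stable underlying level for each person (arbitrary units).
true_level = [random.gauss(200, 20) for _ in range(n)]
# Two visits: same underlying level, independent measurement variation.
first = [t + random.gauss(0, 15) for t in true_level]
second = [t + random.gauss(0, 15) for t in true_level]

# Select the people whose FIRST measurement was in the top 10%.
cutoff = sorted(first)[int(0.9 * n)]
extreme = [i for i in range(n) if first[i] >= cutoff]

overall_mean = sum(first) / n
mean_first = sum(first[i] for i in extreme) / len(extreme)
mean_second = sum(second[i] for i in extreme) / len(extreme)

print(f"overall mean at first visit:     {overall_mean:.1f}")
print(f"extreme group, first visit mean:  {mean_first:.1f}")
print(f"extreme group, second visit mean: {mean_second:.1f}")
```

The extreme group's second-visit mean falls between its first-visit mean and the overall mean even though nothing was done to these simulated people, which is why an untreated control group is needed to judge a treatment.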

Summary

Regression predicts one numerical variable from another
Linear regression fits the least squares line:

\[ \hat{Y} = a + bX \]

Slope describes rate of change
Intercept gives predicted \(Y\) when \(X=0\)
Predictions are most reliable within the observed range of \(X\)
Confidence bands estimate the mean response; prediction intervals estimate individual outcomes

Slope test: \(H_0: \beta = 0\)
\(R^2\) : proportion of variation in \(Y\) explained by the model
Residual plots help assess linearity, constant variance, and unusual observations
Curved patterns or unequal spread may require transformations or a different model
Extreme values often move closer to average on remeasurement (regression toward the mean)

Lecture 21 Regression