ABD 3e Chapter 17
Regression is used to predict values of one numerical variable from values of another numerical variable.
A regression line summarizes the relationship between two variables in a scatter plot.
It can be used to estimate the expected value of a response variable from an explanatory variable.
In this chapter, we focus on linear regression, where the relationship is described by a straight line.
Genetic diversity in human populations is related to distance from East Africa.
A fitted line can be used to predict expected genetic diversity from migration distance.
The downward slope suggests diversity decreases as distance increases.
Correlation and regression both describe relationships between two numerical variables.
Correlation measures strength and direction of association.
Regression fits an equation to predict one variable from another.
Regression also measures how much the response changes when the explanatory variable changes.
Response variable (\(Y\)): the outcome we want to predict.
Explanatory variable (\(X\)): the variable used to explain or predict \(Y\).
In scatter plots for regression, \(X\) is plotted on the horizontal axis and \(Y\) on the vertical axis.
Regression can be used with observational data.
Example: randomly sample individuals and measure both \(X\) and \(Y\).
Regression can also be used in experiments.
Example: choose treatment levels of \(X\), then measure response \(Y\).
The most common type of regression is linear regression.
It fits a straight line through data to predict \(Y\) from \(X\).
A key assumption is that the true relationship is approximately linear.
If the relationship is strongly curved, a straight line may be inappropriate.
Managers want to estimate the ages of male lions.
Older males may be removed with less disruption than younger males.
Black pigmentation on the nose increases with age.
We use proportion black on the nose to predict age.
Many lines can be drawn through a scatter plot.
We need a rule for choosing the best line.
Least squares chooses the line with the smallest total squared vertical deviations from the points.
These vertical deviations are called residuals.
Poorly chosen lines have large deviations from the data.
Better lines have smaller deviations.
The least squares line minimizes the sum of squared deviations.
Points above the line have positive residuals.
Points below the line have negative residuals.
If we simply added deviations, positives and negatives could cancel.
Squaring avoids cancellation and gives more weight to large errors.
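The cancellation problem can be seen in a small numerical sketch. The five points below are made up for illustration (they are not data from the chapter), and the two candidate lines are arbitrary choices: a flat line at \(Y = 4\) and a tilted line \(Y = 2.2 + 0.6X\).

```python
def residuals(xs, ys, intercept, slope):
    """Vertical deviations: observed Y minus the Y predicted by the line."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Made-up points for illustration only (not data from the chapter).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# Candidate 1: a flat line at Y = 4. Candidate 2: Y = 2.2 + 0.6 X.
flat = residuals(xs, ys, intercept=4.0, slope=0.0)
tilted = residuals(xs, ys, intercept=2.2, slope=0.6)

# Plain sums: positive and negative residuals cancel for both lines.
sum_flat = sum(flat)
sum_tilted = sum(tilted)

# Sums of squares: no cancellation, so the two lines can be compared.
ss_flat = sum(e ** 2 for e in flat)
ss_tilted = sum(e ** 2 for e in tilted)

print(f"sum of residuals: flat={sum_flat:.3f}, tilted={sum_tilted:.3f}")
print(f"sum of squares:   flat={ss_flat:.3f}, tilted={ss_tilted:.3f}")
```

Both lines have residuals that sum to essentially zero, so the plain sum cannot rank them. The sum of squares is smaller for the tilted line, which is the one least squares would prefer here.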
\[ Y = a + bX \]
\(a\) is the intercept.
\(b\) is the slope.
Together, they determine the location and tilt of the line.
Intercept: the predicted value of \(Y\) when \(X = 0\).
Slope: the change in predicted \(Y\) for a one-unit increase in \(X\).
\[ b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} \]
\[ a = \bar{Y} - b\bar{X} \]
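Because \(a = \bar{Y} - b\bar{X}\), the least squares line always passes through the point of means:
\[ (\bar{X}, \bar{Y}) \]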
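The formulas for \(b\) and \(a\) can be applied directly. The following is a minimal sketch in plain Python; the function name `least_squares` and the five data points are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line Y = a + bX using the formulas above."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    sum_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sum of squared deviations of X from its mean.
    sum_squares_x = sum((x - x_bar) ** 2 for x in xs)
    b = sum_products / sum_squares_x
    a = y_bar - b * x_bar
    return a, b

# Made-up points for illustration only (not data from the chapter).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = least_squares(xs, ys)
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
# The fitted line evaluated at the mean of X returns the mean of Y.
print(f"line at X-bar = {a + b * x_bar:.2f}, Y-bar = {y_bar:.2f}")
```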
\[ \hat{Y} = 0.88 + 10.65X \]
\[ \widehat{\text{Age}} = 0.88 + 10.65(\text{proportion black}) \]
Each 1.0 increase in proportion black predicts about 10.65 more years of age.
More practically, a 0.10 increase predicts about:
\[ 10.65 \times 0.10 = 1.065 \]
years older on average.
The line summarizes the average trend.
Individual lions vary around the line.
Not every point falls exactly on the prediction line.
The fitted line from data is a sample estimate.
We use it to estimate the true regression line in the population.
Population form:
\[ Y = \alpha + \beta X \]
Sample estimates: \(a\) estimates \(\alpha\), and \(b\) estimates \(\beta\).
A predicted value lies on the regression line.
We use:
\[ \hat{Y} \]
for a prediction.
\(\hat{Y}\) estimates the mean response for all individuals with a given \(X\) value.
For example, a male lion with proportion black of 0.50 has a predicted age of:
\[ \hat{Y} = 0.88 + 10.65(0.50) \approx 6.2 \text{ years} \]
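The same calculation can be wrapped in a small function. This sketch uses the fitted coefficients quoted in this chapter (0.88 and 10.65); the function name `predict_age` is a hypothetical helper, not something from the text.

```python
# Coefficients of the fitted lion regression line quoted in the chapter.
INTERCEPT = 0.88
SLOPE = 10.65

def predict_age(proportion_black):
    """Predicted mean age (years) for a given proportion of black on the nose."""
    return INTERCEPT + SLOPE * proportion_black

age_at_half = predict_age(0.50)
# Change in predicted age for a 0.10 increase in proportion black.
change_per_tenth = predict_age(0.60) - predict_age(0.50)

print(f"predicted age at 0.50: {age_at_half:.3f} years")
print(f"change per 0.10 increase: {change_per_tenth:.3f} years")
```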
Predictions are most reliable within the observed range of \(X\) values.
Predicting far outside the data range is called extrapolation.
Extrapolation may be misleading because the relationship may change beyond the sampled data.
A regression line gives the best predicted mean value of \(Y\) for each value of \(X\).
Predictions are most reliable within the observed range of the explanatory variable.
Every prediction includes uncertainty because the fitted line was estimated from sample data.
Predictions serve two common goals, which answer different questions:
What is the average value of \(Y\) when \(X= x\)?
What value of \(Y\) might one individual case have when \(X= x\)?
Both use the same regression line.
The prediction for an individual is less precise because individuals vary around the line.
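Confidence band: interval for the mean response at each value of \(X\)
Prediction interval: interval for a single new observation at a given \(X\); at the same confidence level it is wider than the confidence band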
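The difference in precision can be sketched numerically. The code below uses the standard textbook standard-error formulas for simple linear regression, which the text above does not state, so treat them as an assumption of this example. The function name and data are hypothetical, and the critical \(t\) value needed to turn a standard error into an interval is left out.

```python
import math

def standard_errors(xs, ys, x_new):
    """Standard errors at x_new for (mean response, single new observation)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_squares_x = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum_squares_x
    a = y_bar - b * x_bar
    # Residual mean square: squared residuals divided by n - 2.
    ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ms_residual = ss_residual / (n - 2)
    # Uncertainty in the fitted line grows with distance from the mean of X.
    distance_term = 1 / n + (x_new - x_bar) ** 2 / sum_squares_x
    se_mean = math.sqrt(ms_residual * distance_term)
    # A single observation adds individual scatter around the line (the extra 1).
    se_individual = math.sqrt(ms_residual * (1 + distance_term))
    return se_mean, se_individual

# Made-up points for illustration only (not data from the chapter).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

se_mean, se_individual = standard_errors(xs, ys, x_new=3.0)
print(f"SE for mean response: {se_mean:.3f}")
print(f"SE for one individual: {se_individual:.3f}")
```

The standard error for an individual is larger at every value of \(X\), which is why a prediction interval is wider than the confidence band.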
\[ H_0:\beta = 0 \]
\[ H_A:\beta \ne 0 \]
If \(\beta = 0\), there is no linear relationship between \(X\) and the mean of \(Y\)
If the \(p\)-value is small (less than the significance level \(\alpha\)), we reject \(H_0\). Here \(\alpha\) denotes the significance level, not the population intercept.
This test is usually called the \(t\)-test for the slope of the regression line.
R reports it automatically when you use summary(lm_object).
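For reference, the test statistic behind the slope test is commonly written as below. These are the standard textbook formulas, given as a sketch because the text above does not state them; \(\mathrm{SE}_b\) and \(\mathrm{MS}_{\text{residual}}\) are notation introduced here.
\[ t = \frac{b - 0}{\mathrm{SE}_b}, \qquad \mathrm{SE}_b = \sqrt{\frac{\mathrm{MS}_{\text{residual}}}{\sum (X_i - \bar{X})^2}}, \qquad \mathrm{MS}_{\text{residual}} = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2} \]
If \(H_0\) is true and the regression assumptions hold, \(t\) follows a \(t\)-distribution with \(n - 2\) degrees of freedom.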
A significant result (\(p<\alpha\)) provides evidence that \(X\) is linearly associated with \(Y\) and may help predict \(Y\).
It does not establish causation.
Statistical significance does not necessarily mean a strong relationship or a large slope.
A small slope can be statistically significant if the sample size is large.
A \(p\)-value alone does not describe the size, precision, or practical importance of a relationship.
Interpret the slope test together with other evidence from the model and the data.
Always examine:
slope estimate (effect size)
confidence interval (precision)
scatterplot (pattern, outliers, linearity)
practical importance (real-world relevance)
\[ 0 \le R^2 \le 1 \]
Larger values mean points lie closer to the fitted line
Smaller values mean more unexplained scatter
Example: \(R^2 = 0.22\) means 22% of the variation in \(Y\) is explained by the regression on \(X\)
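One way to see what "variation explained" means is to compute \(R^2\) from sums of squares. This sketch assumes the usual definition \(R^2 = 1 - SS_{\text{residual}} / SS_{\text{total}}\), which the text above does not spell out, and uses made-up data. It also computes the squared correlation coefficient, which equals \(R^2\) in simple linear regression.

```python
import math

def r_squared_and_r(xs, ys):
    """Return (R^2, r) for a least squares line of Y on X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_squares_x = sum((x - x_bar) ** 2 for x in xs)
    sum_squares_y = sum((y - y_bar) ** 2 for y in ys)  # total variation in Y
    sum_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sum_products / sum_squares_x
    a = y_bar - b * x_bar
    # Variation left unexplained by the line.
    ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - ss_residual / sum_squares_y
    r = sum_products / math.sqrt(sum_squares_x * sum_squares_y)
    return r_squared, r

# Made-up points for illustration only (not data from the chapter).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

r_squared, r = r_squared_and_r(xs, ys)
print(f"R^2 = {r_squared:.3f}, r = {r:.3f}, r^2 = {r ** 2:.3f}")
```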
Correlation coefficient (\(r\))
measures strength and direction of a linear relationship
ranges from \(-1\) to \(1\)
Coefficient of determination (\(R^2\))
measures the proportion of variation in \(Y\) explained by the model
ranges from \(0\) to \(1\)
P-value for the slope
tests whether the population slope differs from zero
does not measure effect size or model fit
These statistics are related, but they answer different questions.
A residual is the observed value minus the predicted value:
\[ e_i = Y_i - \hat{Y}_i \]
Positive residuals: points above the regression line
Negative residuals: points below the line
Plot residuals against \(X\) or against fitted values (\(\hat{Y}\))
Residual plots help assess linearity, constant variance, and unusual observations
If the scatterplot bends or levels off, a straight-line model may be inappropriate.
Residual plots can reveal curved patterns more clearly.
A curved residual pattern suggests the line is not fitting well.
Random scatter around zero is more consistent with a good linear fit.
Do not force a straight line onto clearly curved data.
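A curved residual pattern can be illustrated with a deliberately curved, made-up data set (\(Y = X^2\)). This is only a sketch: real residual plots are judged by eye, and the helper name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for Y = a + bX."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

# Made-up curved data for illustration: Y is exactly X squared.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
# Residual = observed Y minus fitted Y.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

for x, e in zip(xs, residuals):
    print(f"X = {x:.0f}  residual = {e:+.1f}")
```

The residuals are positive at both ends and negative in the middle. That U-shaped run of signs, rather than random scatter around zero, is the kind of pattern that signals a straight line is missing curvature.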
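When the relationship is curved, a transformation can sometimes make it approximately linear. Common transformations: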
\(\log(Y)\)
\(\log(X)\)
\(\log(X)\) and \(\log(Y)\)
\(\sqrt{Y}\) for counts
After transformation, interpret results on the transformed scale unless back-transformed.
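The following sketch shows a \(\log(Y)\) transformation and back-transformation. The data are made up so that \(Y\) grows exactly exponentially (\(Y = 2e^{0.5X}\)), an assumption chosen so the log scale is perfectly linear; real data would scatter around the line.

```python
import math

def fit_line(xs, ys):
    """Least squares intercept and slope for Y = a + bX."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

# Made-up exponential data for illustration: Y = 2 * exp(0.5 * X).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

# Fit the line on the transformed scale: log(Y) = a + bX.
log_ys = [math.log(y) for y in ys]
a_log, b_log = fit_line(xs, log_ys)

# A prediction on the log scale, then back-transformed to the original units.
log_prediction = a_log + b_log * 2.0
prediction = math.exp(log_prediction)

print(f"slope on log scale: {b_log:.3f}")
print(f"prediction at X = 2: log scale {log_prediction:.3f}, original scale {prediction:.3f}")
```

The slope here describes change in \(\log(Y)\) per unit of \(X\), not change in \(Y\) itself, which is why the scale of interpretation matters.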
Extremely high or low observations often appear closer to average when measured again
This occurs when repeated measurements are imperfectly correlated and include random variation
Example:
Improvement after an extreme starting value does not necessarily mean the treatment worked
A control group helps separate real treatment effects from regression toward the mean
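Regression toward the mean can be reproduced in a small simulation with no treatment at all. Everything here is an assumption made for illustration: each individual has a stable true value, each measurement adds independent random noise, and the "extreme" group is the top 10% on the first measurement.

```python
import random

random.seed(1)  # fixed seed so the simulation is repeatable
n = 10_000

# Hypothetical model: stable true value plus independent measurement noise.
true_values = [random.gauss(100, 10) for _ in range(n)]
first = [t + random.gauss(0, 10) for t in true_values]
second = [t + random.gauss(0, 10) for t in true_values]

def mean(values):
    return sum(values) / len(values)

# Select individuals in roughly the top 10% on the FIRST measurement only.
cutoff = sorted(first)[int(0.9 * n)]
extreme = [i for i in range(n) if first[i] >= cutoff]

mean_first_extreme = mean([first[i] for i in extreme])
mean_second_extreme = mean([second[i] for i in extreme])
mean_second_all = mean(second)

print(f"extreme group, first measurement:  {mean_first_extreme:.1f}")
print(f"extreme group, second measurement: {mean_second_extreme:.1f}")
print(f"everyone, second measurement:      {mean_second_all:.1f}")
```

The extreme group's second measurement falls back toward the overall average even though nothing was done to them. This is the apparent "improvement" that a control group is meant to account for.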
Regression predicts one numerical variable from another
Linear regression fits the least squares line:
\[ \hat{Y} = a + bX \]
Slope describes rate of change
Intercept gives predicted \(Y\) when \(X=0\)
Predictions are most reliable within the data range
Confidence bands estimate the mean response; prediction intervals estimate individual outcomes
Slope test: \(H_0: \beta = 0\)
\(R^2\): proportion of variation in \(Y\) explained by the model
Residual plots help assess linearity, constant variance, and unusual observations
Curved patterns or unequal spread may require transformations or a different model
Extreme values often move closer to average on remeasurement (regression toward the mean)

BIOL 275 Biostatistics | Spring 2026