BIOL 275 Biostatistics

Learning Objectives

By the end of this lecture, you should be able to:

Distinguish between categorical and numerical variables
Identify common sources of confusion in variable classification
Determine explanatory and response variables in a study
Explain why observational studies cannot establish causation
Describe how random assignment reduces confounding in experiments

Data is made of observations, variables, and values

Observations are sample units
Variables are characteristics of sample units

Source: Data Science with R by Garrett Grolemund

Variables are characteristics that differ among individuals or other sampling units

Data are the measurements of one or more variables
Also called observations, especially in the tidyverse in R
Variables can be categorical or numerical
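To make the vocabulary concrete, here is a minimal Python sketch (the course itself works in R, and the bird measurements below are invented purely for illustration). Each dictionary is one observation, each key is a variable, and each entry is a value.

```python
# Hypothetical data: three sampled birds (values invented for illustration).
birds = [
    {"species": "finch", "beak_depth_mm": 9.5, "n_offspring": 2},
    {"species": "finch", "beak_depth_mm": 10.1, "n_offspring": 3},
    {"species": "wren", "beak_depth_mm": 4.2, "n_offspring": 5},
]

n_observations = len(birds)                         # one entry per sampling unit
variables = list(birds[0].keys())                   # characteristics recorded for each unit
beak_depths = [b["beak_depth_mm"] for b in birds]   # the values of one variable

print(n_observations)
print(variables)
print(beak_depths)
```

In this toy table, species is categorical, beak depth is a numerical measurement, and number of offspring is a numerical count.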

Knowing what kind of variable we have determines which methods of data analysis are appropriate, so it is critical to the enterprise of statistics

Variables come in two basic types: categorical and numerical

Categorical

Values represent groups or labels
Describes which kind or which category
Arithmetic operations do not make sense
Values are names or labels rather than measurements
Sometimes called qualitative
The categories are called levels
- Example: For the variable “sex chromosome genotype”, the levels might be XX, XY, XO, XXY, or XYY

Numerical

Values represent quantities
Describes how much or how many
Arithmetic operations make sense
Values are numbers with meaningful magnitudes
- Example: 4 is twice as many as 2
Sometimes called quantitative

Categorical variables are nominal or ordinal

Nominal

No inherent order (sequence)
Example:
- treatment (levels: placebo, dosage 1, dosage 2)
Binary variables: special case with only 2 levels
- Example: survival (levels: alive, dead)

Ordinal

Inherent order
Often represent discretized numerical variables
Examples:
- life stage (levels: egg, larva, juvenile, subadult, adult)
- size class (levels: small, medium, large)
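One way to see what "inherent order" buys you is a small Python sketch (an illustration, not the course's R workflow). The level order is taken from the life stage example; the helper names `stage_rank` and `is_earlier` are made up for this sketch.

```python
# Levels listed from earliest to latest life stage.
LIFE_STAGES = ["egg", "larva", "juvenile", "subadult", "adult"]

def stage_rank(stage):
    """Return the position of a life stage in the ordered list of levels."""
    return LIFE_STAGES.index(stage)

def is_earlier(a, b):
    """True if life stage `a` comes before life stage `b`."""
    return stage_rank(a) < stage_rank(b)

sample = ["adult", "egg", "juvenile", "larva"]
ordered = sorted(sample, key=stage_rank)   # sorting uses the level order, not the alphabet

print(ordered)
print(is_earlier("larva", "adult"))
```

The ranks support comparison and sorting only. Subtracting or averaging them would assume equal spacing between stages, which an ordinal variable does not provide.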

More examples:

Illustration showing examples of nominal, ordinal, and binary variables. The nominal variable is species (turtle, snail, butterfly), the ordinal variable is happiness level (unhappy, okay, awesome), and the binary variable is extinction status (extinct, not extinct). (c) Allison Horst

Numerical variables are continuous or discrete

Continuous

Measurements
Infinite number of values within possible range
Examples:
- Core body temperature
- Territory size
- Cigarette consumption rate

Discrete

Can only exist at limited values
Counts (often)
Examples:
- Number of mates
- Number of amino acids in a protein

More examples:

Illustration showing examples of continuous and discrete variables. — (c) Allison Horst

Common confusions, with examples to follow

In practice, it can be hard to decide between:
- Ordinal vs. Numerical
- Continuous vs. Discrete
- Numerical vs. Categorical but stored as numbers

Numeric labels do not imply numerical variables

Some variables use numbers only as labels
The order of values is meaningful
Distances between values are not defined or consistent
Arithmetic operations are usually inappropriate
Some such variables are cyclical rather than linear
- E.g., month 12 is equidistant from months 1 and 11
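The cyclical point can be checked with a short Python sketch. The function names are hypothetical; the idea is simply that distance around a 12-month cycle differs from distance on a number line.

```python
def linear_distance(a, b):
    """Distance if month numbers were treated as ordinary numbers."""
    return abs(a - b)

def cyclic_distance(a, b, period=12):
    """Shortest distance around a cycle of length `period`."""
    d = abs(a - b) % period
    return min(d, period - d)

print(linear_distance(12, 1))   # 11: December and January look far apart
print(cyclic_distance(12, 1))   # 1: they are actually adjacent
print(cyclic_distance(12, 11))  # 1: the same distance as December to November
```

Treating month numbers as a numerical variable would put December and January at opposite ends of the scale, which is why arithmetic on these labels is usually inappropriate.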

Ozone concentration by calendar month. Month is treated as a categorical variable, even though month numbers appear on the axis. Data from the airquality dataset included with base R (datasets package).

Year can be numerical

Year represents elapsed time
Differences between years are meaningful
Treat year as a numeric time variable

Line graph (time series) showing year as a numerical variable. Average unemployment rate plotted against year treated as a quantitative measure of time, where differences between years are meaningful. Data: economics dataset (ggplot2).

Year can be categorical

Year is used as a group label
Compare a measured variable in selected years
Treat year as a categorical variable

Bar chart showing year as a categorical variable. Average unemployment rate compared across selected, non-equidistant years, with year treated as labeled groups rather than a continuous scale. Data: economics dataset (ggplot2).

Age is numerical; age class is ordinal

Age as a measured quantity
- Often in units of days, weeks, years, etc.
- Differences and averages are meaningful.
Age classes group ages into ordered categories
- The order matters, but distances between classes are inconsistent or not possible to define
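As a rough illustration, the Python sketch below bins a numerical age into ordered classes. The cut points and class labels are hypothetical, chosen only to show the mechanics, and are not real Bald Eagle age classes.

```python
from bisect import bisect_right

# Hypothetical cut points (in years) and labels, for illustration only.
CUTS = [1, 4]                                  # boundaries between classes
CLASSES = ["juvenile", "subadult", "adult"]    # ordered from youngest to oldest

def age_class(age_years):
    """Map a numerical age onto an ordered age class."""
    return CLASSES[bisect_right(CUTS, age_years)]

ages = [0.5, 2.0, 3.5, 6.0, 20.0]              # invented ages in years
classes = [age_class(a) for a in ages]
mean_age = sum(ages) / len(ages)               # meaningful for the numerical variable

print(classes)
print(mean_age)
```

With these made-up cut points, the "juvenile" class spans one year while "adult" is open-ended, so the classes are ordered but not evenly spaced. An average is defined for the ages, not for the class labels.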

Illustration showing a sequence of Bald Eagle heads from juvenile to adult, depicting gradual changes in feather coloration and markings as the bird matures. — Progression of Bald Eagle head plumage from juvenile to adult, illustrating how feather color and pattern change with age. This figure highlights that age classes represent ordered categories, and that the spacing between classes is not necessarily equal. Source: Loudoun Wildlife Conservancy (adapted from Avian Report).

Likert data are ordinal, not numerical

Response levels have a clear order
Numbers, if used, are labels, not measurements
Differences between levels are not defined
Treat Likert data as ordered categories
Do not assign numerical values to the categories and then analyze them as numbers
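A minimal Python sketch of summarizing Likert responses as ordered categories follows. The responses are invented, and the three middle labels are an assumption (the source names only the two endpoints of a five-point scale).

```python
from collections import Counter

# Five ordered levels; the middle three labels are assumed for illustration.
LEVELS = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]

# Hypothetical survey responses.
responses = ["Agree", "Neutral", "Strongly agree", "Agree",
             "Disagree", "Agree", "Neutral"]

# Frequency table in level order (levels never chosen get a count of 0).
counts = Counter(responses)
table = [(level, counts[level]) for level in LEVELS]

# Median level: sort by level order and take the middle response (n is odd here).
ranked = sorted(responses, key=LEVELS.index)
median_level = ranked[len(ranked) // 2]

print(table)
print(median_level)
```

Counts and the median level use only the order of the categories, so they stay within what ordinal data can support.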

Horizontal Likert scale with five ordered response options ranging from “Strongly disagree” on the left to “Strongly agree” on the right, illustrating ordered but non-quantitative survey responses. — An example questionnaire about a website design, with answers as a Likert scale. Source: Wikimedia Commons. Image by Nicholas Smith, CC BY-SA 3.0.

Rankings are ordinal, not numerical

Values indicate order only
Differences between ranks are not meaningful
Rank numbers are labels, not quantities
Do not treat ranks as measurements

Bar chart of Amazon product ratings showing the percentage of reviews for 1-, 2-, 3-, 4-, and 5-star categories, with an overall average rating of 4.3 out of 5 stars displayed above the chart. — Amazon product rating summary showing the percentage of 1–5 star reviews and a reported average of 4.3 out of 5. Star ratings are ordinal categories, even though they are commonly summarized using numerical averages.

0/1 variables are usually categorical

Values represent two categories (binary outcome)
Numbers are a coding choice, not a measurement
Point estimates are proportions, not means of a quantity
Inference depends on modeling a binomial process
Do not let variable type be determined by how data are coded
Safe default in R: convert to a logical (TRUE / FALSE) variable
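The Python sketch below treats a binary outcome as a proportion with a binomial-based interval. The counts are hypothetical, and the Wilson score interval is used here as one common choice among several binomial intervals; it is not necessarily the method behind the course figure.

```python
import math

def proportion(outcomes):
    """Proportion of True values in a list of booleans."""
    return sum(outcomes) / len(outcomes)

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical sample: 3 of 20 individuals show the outcome.
outcomes = [True] * 3 + [False] * 17      # stored as logical values, not 0/1 numbers
p_hat = proportion(outcomes)
low, high = wilson_interval(sum(outcomes), len(outcomes))

print(p_hat)
print(low, high)
```

Unlike an interval built from the mean and standard error of 0/1 values, the Wilson interval always stays between 0 and 1.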

Estimating a binary outcome with uncertainty. Both panels show the same point estimates for low birth weight by smoking status. Left: confidence intervals computed correctly by treating the data as binomial proportions. Right: confidence intervals computed incorrectly by treating 0/1 data as numerical means. Data: birthwt (MASS).

Explanatory and response variables

Variables are often related to one another
Explanatory variable is used to explain or predict variation in another variable
Response variable is the outcome being measured
In graphs, explanatory usually on x-axis, response on y-axis
These terms do not imply causation
Which variable plays which role depends on the scientific question

Diagram of a two-axis graph labeled to show variable roles, with financial damage on the horizontal x-axis as the explanatory variable and number of firefighters on the vertical y-axis as the response variable, with arrows pointing to each axis. — Illustration showing the placement of explanatory and response variables in a graph. Financial damage (thousands of dollars) is shown on the x-axis as the explanatory variable, and number of firefighters is shown on the y-axis as the response variable. Source: Sophia Learning, “Explanatory and Response Variables,” Sophia.org.

Frequency distributions

Different individuals in a population will have different values of a given variable (natural variation)
The frequency of a particular value is the number of times it is observed in a sample
The frequencies of all values can be plotted to produce a frequency distribution
- Histogram for numerical variable
- Bar chart for categorical variable
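Tabulating frequencies can be sketched in Python as follows. Both samples are invented; the 0.5-mm bin width simply mirrors the interval width used in the finch figure.

```python
import math
from collections import Counter

# Categorical variable: count each level (the heights of a bar chart).
genotypes = ["XX", "XY", "XY", "XX", "XXY", "XX"]        # hypothetical sample
genotype_freq = Counter(genotypes)

# Numerical variable: count observations per interval (the heights of a histogram).
depths_mm = [8.1, 8.4, 8.6, 9.2, 9.3, 9.4, 9.8, 10.1]    # hypothetical measurements
bin_width = 0.5
# Each key is the left edge of the interval that contains the measurement.
bin_freq = Counter(math.floor(d / bin_width) * bin_width for d in depths_mm)

print(genotype_freq)
print(sorted(bin_freq.items()))
```

The categorical case counts exact labels, while the numerical case must first choose intervals, so the shape of a histogram depends on the bin width.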

The frequency distribution of beak depths in a sample of 100 finches from a Galápagos island population (Boag and Grant 1984). The vertical axis indicates the frequency, the number of observations in each 0.5-mm interval. Source: Whitlock & Schluter, The Analysis of Biological Data, 3rd ed.

Probability distributions

The distribution of a variable in the whole population is called its probability distribution
Provides the probability of occurrence of different possible outcomes
The true probability distribution is usually unknown (latent) and is estimated from a sample
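As a hedged illustration of estimating a population distribution from a sample, the Python sketch below fits a normal distribution to a small invented sample. It assumes a normal model is reasonable, which would need checking with real data.

```python
from statistics import NormalDist

# Hypothetical sample of measurements (e.g., in mm).
sample = [9.0, 9.5, 10.0, 10.5, 11.0]

# Estimate the population distribution using the sample mean and
# sample standard deviation.
fitted = NormalDist.from_samples(sample)

# Estimated probability that a randomly chosen individual falls in [9.5, 10.5].
p_between = fitted.cdf(10.5) - fitted.cdf(9.5)

print(fitted.mean, fitted.stdev)
print(p_between)
```

The fitted curve is only an estimate. A different sample from the same population would give a slightly different mean and standard deviation.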

A normal distribution. This probability distribution is often used to approximate the distribution of a variable in the population from which a sample has been drawn. Source: Whitlock & Schluter, The Analysis of Biological Data, 3rd ed.

Experimental vs. observational studies

Experimental Studies

Researcher assigns treatments (randomly) to individuals
Treatments are values (i.e. levels) of a categorical explanatory variable
Can establish a causal relationship between explanatory and response variables

Observational Studies

Treatments are not assigned by a researcher
Organisms might choose their own treatment
Or the treatment might occur naturally
Can only determine association between variables

Observational data can show that variables move together, but not why they do

Line graph with two time series from 1990 to 2020 showing ice cream consumption and violent crime rates declining together over time, demonstrating a strong correlation between variables with no causal relationship. — Time series showing a strong correlation between ice cream consumption and violent crime rates in the United States, illustrating a spurious relationship between unrelated variables. Source: Tyler Vigen, Spurious Correlations (tylervigen.com).

Direction of causation is ambiguous in observational studies

EXAMPLE: A study finds that plants with fewer pests tend to have greater biomass.

Pests may reduce plant growth and vigor
Alternatively, larger or healthier plants may better resist or tolerate pests
Both explanations are consistent with the observed pattern
Because no treatments were assigned, the direction of causation is unclear

Close-up photograph of a potato beetle on a green plant leaf, showing an insect herbivore feeding on plant tissue. — Potato beetle feeding on a plant leaf, illustrating a plant–herbivore interaction where observed associations (e.g., pest presence and plant biomass) do not, by themselves, reveal the direction of causation. Photo: © Diana Griffin (iNaturalist).

Why experiments can identify causation

In experiments, the researcher assigns treatments to units
Random assignment balances other variables across treatments
This reduces the influence of confounding variables

Confounding variables
- Variables related to both the explanatory and response variables
- Can create associations that are not causal
- Randomization breaks their systematic influence

Result: Differences in outcomes can be attributed to the treatment itself
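Random assignment itself is mechanically simple. The Python sketch below shuffles the units and deals them out to treatments; the plant IDs, treatment names, and the function `randomly_assign` are all hypothetical.

```python
import random
from collections import Counter

def randomly_assign(units, treatments, seed=None):
    """Shuffle the units, then deal them out to treatments in turn."""
    rng = random.Random(seed)          # a fixed seed makes the assignment reproducible
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {unit: treatments[i % len(treatments)]
            for i, unit in enumerate(shuffled)}

# Hypothetical experiment: 12 plants, two treatments.
plants = [f"plant_{i}" for i in range(1, 13)]
assignment = randomly_assign(plants, ["control", "pesticide"], seed=1)
group_sizes = Counter(assignment.values())

print(assignment)
print(group_sizes)
```

Because chance alone decides which plant gets which treatment, plant size, location, and any unmeasured variable are expected to be balanced across groups on average.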

Example of a confounding variable

Example confounding variable. Body mass index may affect end stage renal disease. However, body mass index may also affect blood pressure, which may independently affect renal disease.

Result: It is impossible to assign the cause of end stage renal disease to one variable or the other.
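A small simulation can show how a confounder produces an association with no direct causal link. This Python sketch uses generic simulated variables, not the renal disease data: `x` and `y` each depend on a shared confounder but not on each other.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(42)
n = 500
confounder = [rng.gauss(0, 1) for _ in range(n)]
x = [c + rng.gauss(0, 0.5) for c in confounder]   # driven by the confounder only
y = [c + rng.gauss(0, 0.5) for c in confounder]   # driven by the confounder only

r_observed = pearson(x, y)
# Subtract the confounder's contribution (possible only because it was simulated).
r_adjusted = pearson([a - c for a, c in zip(x, confounder)],
                     [b - c for b, c in zip(y, confounder)])

print(r_observed)   # strong association despite no direct link
print(r_adjusted)   # close to zero once the confounder is removed
```

In a real observational study the confounder is often unmeasured, so this subtraction is not available. That is the gap random assignment is meant to close.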

Lecture 2 Data and Variables