Lecture 2
Data and Variables

ABD 3e Chapter 1

Chris Merkord

Learning Objectives

By the end of this lecture, you should be able to:

  • Distinguish between categorical and numerical variables
  • Identify common sources of confusion in variable classification
  • Determine explanatory and response variables in a study
  • Explain why observational studies cannot establish causation
  • Describe how random assignment reduces confounding in experiments

Data is made of observations, variables, and values

  • Observations are sample units
  • Variables are characteristics of sample units

Source: Data Science with R by Garrett Grolemund

Variables are characteristics that differ among individuals or other sampling units

  • Data are the measurements of one or more variables

  • Also called observations, especially in the tidyverse in R

  • Variables can be categorical or numerical

Knowing what kind of variable we have determines which methods of data analysis are appropriate and is therefore critical to the enterprise of statistics

Variables come in two basic types: categorical and numerical

Categorical

  • Values represent groups or labels
  • Describes which kind or which category
  • Arithmetic operations do not make sense
  • Values are names or labels rather than measurements
  • Sometimes called qualitative
  • The categories are called levels
    • Example: For the variable “sex chromosome genotype”, the levels might be XX, XY, XO, XXY, or XYY

Numerical

  • Values represent quantities
  • Describes how much or how many
  • Arithmetic operations make sense
  • Values are numbers with meaningful magnitudes
    • Example: 4 is twice as many as 2
  • Sometimes called quantitative
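
The contrast between the two types can be sketched in a few lines of code. This is a minimal illustration in Python rather than the R used elsewhere in the course, and the genotype labels and clutch sizes are hypothetical values made up for the example.

```python
from collections import Counter
from statistics import mean

# Hypothetical data: one categorical and one numerical variable
# measured on the same six individuals.
genotypes = ["XX", "XY", "XY", "XX", "XO", "XY"]   # categorical (labels)
clutch_sizes = [2, 4, 3, 4, 2, 3]                  # numerical (quantities)

# Categorical: the natural summary is how often each level occurs.
level_counts = Counter(genotypes)
levels = sorted(level_counts)          # the distinct levels that were observed

# Numerical: arithmetic such as averaging is meaningful.
mean_clutch = mean(clutch_sizes)

print(levels)               # ['XO', 'XX', 'XY']
print(level_counts["XY"])   # 3
print(mean_clutch)          # 3
```

Averaging the genotype labels would not even run, which is a useful reminder that arithmetic only makes sense for numerical variables.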

Categorical Variables are nominal or ordinal

Nominal

  • No inherent order (sequence)
  • Example:
    • treatment (levels: placebo, dosage 1, dosage 2)
  • Binary variables: special case with only 2 levels
    • Example: survival (levels: alive, dead)

Ordinal

  • Inherent order
  • Often represent discretized numerical variables
  • Examples:
    • life stage (levels: egg, larva, juvenile, subadult, adult)
    • size class (levels: small, medium, large)

More examples:

Illustration showing examples of nominal, ordinal, and binary variables. The nominal variable is species (turtle, snail, butterfly), the ordinal variable is happiness level (unhappy, okay, awesome), and the binary variable is extinction status (extinct, not extinct).

(c) Allison Horst

Numerical Variables are continuous or discrete

Continuous

  • Measurements
  • Infinite number of values within possible range
  • Examples:
    • Core body temperature
    • Territory size
    • Cigarette consumption rate

Discrete

  • Can take only a limited set of values (often whole numbers)
  • Counts (often)
  • Examples:
    • Number of mates
    • Number of amino acids in a protein

More examples:

Illustration showing examples of continuous and discrete variables.

(c) Allison Horst

Common confusions, with examples to follow

  • In practice, it can be hard to decide between:

    • Ordinal vs. Numerical

    • Continuous vs. Discrete

    • Numerical vs. Categorical but stored as numbers

Numeric labels do not imply numerical variables

  • Some variables use numbers only as labels
  • The order of values is meaningful
  • Distances between values are not defined or consistent
  • Arithmetic operations are usually inappropriate
  • Some such variables are cyclical rather than linear
    • E.g., month 12 is equidistant from months 1 and 11
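
The cyclical point can be made concrete with a small sketch. The two helper functions below are hypothetical names written in Python for illustration; they compare the distance you get by subtracting month numbers with the distance around the calendar.

```python
def linear_distance(a, b):
    """Distance if month numbers are treated as ordinary quantities."""
    return abs(a - b)


def cyclic_distance(a, b, period=12):
    """Shortest distance around a cycle of the given length."""
    d = abs(a - b) % period
    return min(d, period - d)


# December (12) compared with January (1) and November (11)
print(linear_distance(12, 1))    # 11
print(cyclic_distance(12, 1))    # 1
print(cyclic_distance(12, 11))   # 1
```

Subtracting the labels says December and January are 11 months apart, while on the calendar they are neighbors. That mismatch is one reason month is usually safer as a categorical variable.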

Ozone concentration by calendar month. Month is treated as a categorical variable, even though month numbers appear on the axis. Data from the airquality dataset included with base R (datasets package).

Year can be numerical

  • Year represents elapsed time
  • Differences between years are meaningful
  • Treat year as a numeric time variable

Line graph (time series) showing year as a numerical variable. Average unemployment rate plotted against year treated as a quantitative measure of time, where differences between years are meaningful. Data: economics dataset (ggplot2).

Year can be categorical

  • Year is used as a group label
  • Compare a measured variable in selected years
  • Treat year as a categorical variable

Bar chart showing year as a categorical variable. Average unemployment rate compared across selected, non-equidistant years, with year treated as labeled groups rather than a continuous scale. Data: economics dataset (ggplot2).

Age is numerical; age class is ordinal

  • Age as a measured quantity
    • Often in units of days, weeks, years, etc.
    • Differences and averages are meaningful.
  • Age classes group ages into ordered categories
    • The order matters, but distances between classes are inconsistent or not possible to define
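
Grouping a numerical age into ordered classes might look like the sketch below. It is written in Python, and the cut points and class names are assumptions chosen for illustration, not values from any real aging scheme.

```python
from bisect import bisect_right

# Hypothetical cut points (in years) and the ordered classes they define.
AGE_BREAKS_YEARS = [1, 4]
AGE_CLASSES = ["juvenile", "subadult", "adult"]


def age_class(age_years):
    """Map a numerical age onto an ordered age class."""
    return AGE_CLASSES[bisect_right(AGE_BREAKS_YEARS, age_years)]


print(age_class(0.5))   # juvenile
print(age_class(3.0))   # subadult
print(age_class(20.0))  # adult
```

Under these assumed cut points the classes span 1 year, 3 years, and an open-ended range, so moving up one class does not correspond to a fixed amount of age. The order survives the conversion, but the distances do not.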

Illustration showing a sequence of Bald Eagle heads from juvenile to adult, depicting gradual changes in feather coloration and markings as the bird matures.

Progression of Bald Eagle head plumage from juvenile to adult, illustrating how feather color and pattern change with age. This figure highlights that age classes represent ordered categories, and that the spacing between classes is not necessarily equal. Source: Loudoun Wildlife Conservancy (adapted from Avian Report).

Likert data are ordinal, not numerical

  • Response levels have a clear order
  • Numbers, if used, are labels, not measurements
  • Differences between levels are not defined
  • Treat Likert data as ordered categories
  • Do not assign the categories numerical values and then analyze them as numbers
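
One way to summarize Likert responses while respecting their order is a frequency table plus the median level. The sketch below uses Python with hypothetical responses. The level positions are used only to put the responses in order, never as measurements.

```python
from collections import Counter
from statistics import median_low

# Ordered response levels, from lowest to highest agreement.
LEVELS = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]

# Hypothetical survey responses.
responses = ["Agree", "Neutral", "Strongly agree", "Agree",
             "Disagree", "Agree", "Neutral"]

# Frequency table listed in level order (levels with no responses show 0).
counts = Counter(responses)
table = [(level, counts[level]) for level in LEVELS]

# Median level: positions are used only for ordering, and median_low
# always returns a position that corresponds to an actual level.
positions = sorted(LEVELS.index(r) for r in responses)
median_level = LEVELS[median_low(positions)]

print(table)
print(median_level)   # Agree
```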

Horizontal Likert scale with five ordered response options ranging from “Strongly disagree” on the left to “Strongly agree” on the right, illustrating ordered but non-quantitative survey responses.

An example questionnaire about a website design, with answers as a Likert scale. Source: Wikimedia Commons. Image by Nicholas Smith, CC BY-SA 3.0.

Rankings are ordinal, not numerical

  • Values indicate order only
  • Differences between ranks are not meaningful
  • Rank numbers are labels, not quantities
  • Do not treat ranks as measurements

Bar chart of Amazon product ratings showing the percentage of reviews for 1-, 2-, 3-, 4-, and 5-star categories, with an overall average rating of 4.3 out of 5 stars displayed above the chart.

Amazon product rating summary showing the percentage of 1–5 star reviews and a reported average of 4.3 out of 5. Star ratings are ordinal categories, even though they are commonly summarized using numerical averages.

0/1 variables are usually categorical

  • Values represent two categories (binary outcome)
  • Numbers are a coding choice, not a measurement
  • Point estimates are proportions, not means of a quantity
  • Inference depends on modeling a binomial process
  • Do not let variable type be determined by how data are coded
  • Safe default in R: convert to a logical (TRUE / FALSE) variable
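
The idea of estimating a proportion from a binary outcome can be sketched as follows. This is Python rather than R, the 0/1 values are hypothetical, and the Wilson score interval is one common binomial interval chosen here as an assumption. The source does not say which method the course figure uses.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical 0/1 coding of a binary outcome (1 = low birth weight).
coded = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]

# Make the categorical nature explicit by converting to True/False.
low_birth_weight = [value == 1 for value in coded]

successes = sum(low_birth_weight)   # number of True values
n = len(low_birth_weight)
proportion = successes / n
lower, upper = wilson_interval(successes, n)

print(proportion)                        # 0.3
print(round(lower, 3), round(upper, 3))  # 0.108 0.603
```

The interval always stays between 0 and 1, which is one practical difference from an interval built by treating the 0/1 codes as a numerical mean.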

Estimating a binary outcome with uncertainty. Both panels show the same point estimates for low birth weight by smoking status. Left: confidence intervals computed correctly by treating the data as binomial proportions. Right: confidence intervals computed incorrectly by treating 0/1 data as numerical means. Data: birthwt (MASS).

Explanatory and response variables

  • Variables are often related to one another
  • Explanatory variable is used to explain or predict variation in another variable
  • Response variable is the outcome being measured
  • In graphs, explanatory usually on x-axis, response on y-axis
  • These terms do not imply causation
  • Which variable plays which role depends on the scientific question

Diagram of a two-axis graph labeled to show variable roles, with financial damage on the horizontal x-axis as the explanatory variable and number of firefighters on the vertical y-axis as the response variable, with arrows pointing to each axis.

Illustration showing the placement of explanatory and response variables in a graph. Financial damage (thousands of dollars) is shown on the x-axis as the explanatory variable, and number of firefighters is shown on the y-axis as the response variable. Source: Sophia Learning, “Explanatory and Response Variables,” Sophia.org.

Frequency distributions

  • Different individuals in a population will have different values of a given variable (natural variation)

  • The frequency of a particular value is the number of times it is observed in a sample

  • The frequencies of all values can be plotted to produce a frequency distribution

    • Histogram for numerical variable

    • Bar chart for categorical variable
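
Counting frequencies is simple enough to sketch directly. The Python example below uses hypothetical measurements and a hypothetical helper, binned_frequencies, to show the counts behind a bar chart and a histogram.

```python
from collections import Counter
from math import floor


def binned_frequencies(values, bin_width, origin=0.0):
    """Count observations per interval; keys are the lower edge of each bin."""
    counts = Counter(floor((v - origin) / bin_width) for v in values)
    return {round(origin + k * bin_width, 10): counts[k] for k in sorted(counts)}


# Categorical variable: count each level directly (bar chart heights).
size_class = ["small", "large", "small", "medium", "small"]
class_frequencies = Counter(size_class)

# Numerical variable: count observations in 0.5-mm intervals (histogram heights).
beak_depths_mm = [8.1, 8.4, 8.6, 9.0, 9.2, 9.3, 9.7, 10.1]   # hypothetical
depth_frequencies = binned_frequencies(beak_depths_mm, bin_width=0.5)

print(class_frequencies["small"])   # 3
print(depth_frequencies)            # {8.0: 2, 8.5: 1, 9.0: 3, 9.5: 1, 10.0: 1}
```

The choice of bin width is a judgment call for numerical data, whereas the categories of a categorical variable are fixed by its levels.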

The frequency distribution of beak depths in a sample of 100 finches from a Galápagos island population (Boag and Grant 1984). The vertical axis indicates the frequency, the number of observations in each 0.5-mm interval. Source: Whitlock & Schluter, The Analysis of Biological Data, 3rd ed.

Probability distributions

  • The distribution of a variable in the whole population is called its probability distribution

  • Provides the probability of occurrence of different possible outcomes

  • The true probability distribution is usually unknown (latent) and is estimated from a sample
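
A short simulation can show what it means to estimate an unknown distribution from a sample. The sketch below is Python, and the "true" population (a normal distribution with mean 9 and standard deviation 1) is an assumption made only so that the estimate has something to be compared against.

```python
from statistics import NormalDist

# Hypothetical "true" population distribution (unknown in a real study).
population = NormalDist(mu=9.0, sigma=1.0)

# Draw a random sample and fit a normal distribution to it.
sample = population.samples(500, seed=1)
estimate = NormalDist.from_samples(sample)

# Probability that a value falls between 8 and 10 under each distribution.
p_true = population.cdf(10.0) - population.cdf(8.0)
p_estimated = estimate.cdf(10.0) - estimate.cdf(8.0)

print(round(p_true, 4))                  # 0.6827
print(estimate.mean, estimate.stdev)     # close to 9 and 1, but not exact
print(p_estimated)                       # close to p_true, but not exact
```

The estimated values differ slightly from the true ones because they come from a sample, which is the situation in every real study.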

A normal distribution. This probability distribution is often used to approximate the distribution of a variable in the population from which a sample has been drawn. Source: Whitlock & Schluter, The Analysis of Biological Data, 3rd ed.

Experimental vs. observational studies

Experimental Studies

  • Researcher assigns treatments (randomly) to individuals

  • Treatments are values (i.e. levels) of a categorical explanatory variable

  • Can determine causal relationship between explanatory and response variables

Observational Studies

  • Treatments are not assigned by a researcher

  • Organisms might choose their own treatment

  • Or the treatment might occur naturally

  • Can only determine association between variables

Observational data can show that variables move together, but not why they do

Line graph with two time series from 1990 to 2020 showing ice cream consumption and violent crime rates declining together over time, demonstrating a strong correlation between variables with no causal relationship.

Time series showing a strong correlation between ice cream consumption and violent crime rates in the United States, illustrating a spurious relationship between unrelated variables. Source: Tyler Vigen, Spurious Correlations (tylervigen.com).

Direction of causation is ambiguous in observational studies

EXAMPLE: A study finds that plants with fewer pests tend to have greater biomass.

  • Pests may reduce plant growth and vigor
  • Alternatively, larger or healthier plants may better resist or tolerate pests
  • Both explanations are consistent with the observed pattern
  • Because no treatments were assigned, the direction of causation is unclear

Close-up photograph of a potato beetle on a green plant leaf, showing an insect herbivore feeding on plant tissue.

Potato beetle feeding on a plant leaf, illustrating a plant–herbivore interaction where observed associations (e.g., pest presence and plant biomass) do not, by themselves, reveal the direction of causation. Photo: © Diana Griffin (iNaturalist).

Why experiments can identify causation

  • In experiments, the researcher assigns treatments to units
  • Random assignment balances other variables across treatments
  • This reduces the influence of confounding variables

Confounding variables

  • Variables related to both the explanatory and response variables
  • Can create associations that are not causal
  • Randomization breaks their systematic influence

Result: Differences in outcomes can be attributed to the treatment itself
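
The balancing effect of random assignment can be illustrated with a small simulation. This Python sketch uses a hypothetical helper, randomly_assign, and a made-up background variable (initial plant size) that the researcher does not control.

```python
import random
from statistics import mean


def randomly_assign(units, seed=None):
    """Shuffle the units and split them into two equally sized groups."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical background variable for 1000 experimental units.
rng = random.Random(42)
initial_size = [rng.gauss(50, 10) for _ in range(1000)]

# Random assignment to treatment and control.
treatment, control = randomly_assign(initial_size, seed=7)

# With random assignment the groups should be similar, on average,
# in variables the researcher never measured or controlled.
imbalance = abs(mean(treatment) - mean(control))
print(len(treatment), len(control))   # 500 500
print(imbalance)                      # small relative to the spread of 10
```

Any single randomization leaves some chance imbalance, but it is not systematic, and it shrinks as the number of units grows.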

Example of a confounding variable

General properties of a confounder.

Example confounding variable. Body mass index may affect end stage renal disease. However, body mass index may also affect blood pressure, which may independently affect renal disease.

Result: It is impossible to assign the cause of end stage renal disease to one variable or the other.
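
A simulation can show how a confounder produces an association without any causal link. The Python sketch below uses generic simulated variables (z, x, y), not the renal disease data. By construction z affects both x and y, and x has no effect on y.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(0)

# Observational setting: the confounder z drives both x and y.
z = [rng.gauss(0, 1) for _ in range(2000)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]
r_observed = pearson_r(x, y)

# Experimental setting: x is assigned at random, independently of z.
x_assigned = [rng.gauss(0, 1) for _ in z]
r_randomized = pearson_r(x_assigned, y)

print(r_observed)     # strongly positive, although x never affects y
print(r_randomized)   # near zero once x no longer depends on z
```

The observed correlation is entirely the work of z. Assigning x at random removes its link to z, and the association disappears.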