Lecture 2
Data and Variables
ABD 3e Chapter 1
Learning Objectives
By the end of this lecture, you should be able to:
- Distinguish between categorical and numerical variables
- Identify common sources of confusion in variable classification
- Determine explanatory and response variables in a study
- Explain why observational studies cannot establish causation
- Describe how random assignment reduces confounding in experiments
Data are made of observations, variables, and values
- Observations are sample units
- Variables are characteristics of sample units
Variables are characteristics that differ among individuals or other sampling units
Data are the measurements of one or more variables
The measurements taken on a single unit are also called an observation, especially in the tidyverse in R
Variables can be categorical or numerical
Knowing what kind of variable we have determines which methods of data analysis are appropriate and is therefore critical to the enterprise of statistics
Variables come in two basic types: categorical and numerical
Categorical
- Values represent groups or labels
- Describes which kind or which category
- Arithmetic operations do not make sense
- Values are names or labels rather than measurements
- Sometimes called qualitative
- The categories are called levels
- Example: For the variable “sex chromosome genotype”, the levels might be XX, XY, XO, XXY, or XYY
Numerical
- Values represent quantities
- Describes how much or how many
- Arithmetic operations make sense
- Values are numbers with meaningful magnitudes
- Example: 4 is twice as many as 2
- Sometimes called quantitative
Categorical Variables are nominal or ordinal
Nominal
- No inherent order (sequence)
- Example:
- treatment (levels: placebo, dosage 1, dosage 2)
- Binary variables: special case with only 2 levels
- Example: survival (levels: alive, dead)
Ordinal
- Inherent order
- Often represent discretized numerical variables
- Examples:
- life stage (levels: egg, larva, juvenile, subadult, adult)
- size class (levels: small, medium, large)
Numerical Variables are continuous or discrete
Continuous
- Measurements
- Infinite number of values within possible range
- Examples:
- Core body temperature
- Territory size
- Cigarette consumption rate
Discrete
- Can take only certain separated values (often whole numbers)
- Counts (often)
- Examples:
- Number of mates
- Number of amino acids in a protein
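The four variable types above can be sketched in a few lines of Python. This is only an illustration: the observations, the variable names (`genotype`, `life_stage`, `body_temp_c`, `n_mates`), and the values are all made up, and only the standard library is used. The point is which operations make sense for each type.

```python
from collections import Counter
from statistics import mean

# Hypothetical observations: one dict per sampling unit, one key per variable.
observations = [
    {"genotype": "XX", "life_stage": "adult", "body_temp_c": 37.0, "n_mates": 2},
    {"genotype": "XY", "life_stage": "larva", "body_temp_c": 36.5, "n_mates": 0},
    {"genotype": "XX", "life_stage": "juvenile", "body_temp_c": 38.0, "n_mates": 1},
    {"genotype": "XO", "life_stage": "adult", "body_temp_c": 36.5, "n_mates": 3},
]

# Nominal categorical: the only sensible summary is a count per level.
genotype_counts = Counter(obs["genotype"] for obs in observations)

# Ordinal categorical: levels can be sorted, but only by their declared order.
LIFE_STAGE_ORDER = ["egg", "larva", "juvenile", "subadult", "adult"]
stages_in_order = sorted(
    (obs["life_stage"] for obs in observations),
    key=LIFE_STAGE_ORDER.index,
)

# Continuous numerical: arithmetic such as a mean is meaningful.
mean_temp = mean(obs["body_temp_c"] for obs in observations)

# Discrete numerical: counts can be summed.
total_mates = sum(obs["n_mates"] for obs in observations)

print(genotype_counts, stages_in_order, mean_temp, total_mates)
```

Note that nothing in the code stops you from sorting genotypes alphabetically or averaging a coded category; the variable type is a decision made by the analyst, not by the software.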
Common confusions, with examples to follow
Numeric labels do not imply numerical variables
- Some variables use numbers only as labels
- The order of values is meaningful
- Distances between values are not defined or consistent
- Arithmetic operations are usually inappropriate
- Some such variables are cyclical rather than linear
- E.g., month 12 is equidistant from months 1 and 11
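The cyclical case can be made concrete with a small Python sketch. The helper `month_distance` is a hypothetical name introduced here for illustration; it measures distance around the calendar rather than along a number line.

```python
def month_distance(a: int, b: int) -> int:
    """Number of months between a and b, going the shorter way around the year."""
    diff = abs(a - b) % 12
    return min(diff, 12 - diff)

# Plain subtraction treats the month label as a quantity and gets this wrong.
naive = abs(12 - 1)

print(naive)                   # distance along a number line
print(month_distance(12, 1))   # distance around the calendar
print(month_distance(12, 11))
```

Subtracting the labels directly says December and January are 11 months apart, which is why arithmetic on numeric labels is usually inappropriate.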
Year can be numerical
- Year represents elapsed time
- Differences between years are meaningful
- Treat year as a numeric time variable
Year can be categorical
- Year is used as a group label
- Compare a measured variable in selected years
- Treat year as a categorical variable
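As a rough sketch of the two roles that year can play, the Python below uses a made-up list of `(year, measurement)` pairs. In the first use, year is a number and subtraction gives elapsed time. In the second, year is converted to a text label and used only to group the measurements.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (year sampled, measured value).
records = [(2019, 12.0), (2019, 14.0), (2023, 18.0), (2023, 20.0)]

# Year as numerical: the difference between years is elapsed time.
years = [year for year, _ in records]
elapsed_years = max(years) - min(years)

# Year as categorical: the year is only a group label, so store it as text.
values_by_year = defaultdict(list)
for year, value in records:
    values_by_year[str(year)].append(value)

group_means = {label: mean(values) for label, values in values_by_year.items()}

print(elapsed_years)
print(group_means)
```

Storing the label as a string is one simple way to keep later code from doing arithmetic on it by accident.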
Age is numerical; age class is ordinal
- Age as a measured quantity
- Often in units of days, weeks, years, etc.
- Differences and averages are meaningful.
- Age classes group ages into ordered categories
- The order matters, but distances between classes are inconsistent or not possible to define
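Converting age to age class can be sketched as a simple binning function. The class names and cut points below are hypothetical and chosen only for illustration. Notice that the classes cover unequal ranges of age, which is why distances between classes are not well defined.

```python
def age_class(age_years: float) -> str:
    """Map a measured age to an ordered class (hypothetical cut points)."""
    if age_years < 1:
        return "juvenile"   # covers 1 year
    if age_years < 3:
        return "subadult"   # covers 2 years
    return "adult"          # open-ended

ages = [0.5, 1.0, 2.9, 3.0, 12.0]
classes = [age_class(age) for age in ages]

print(classes)
```

Ages 3.0 and 12.0 land in the same class, so information is lost in the conversion and cannot be recovered from the class labels.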
Likert data are ordinal, not numerical
- Response levels have a clear order
- Numbers, if used, are labels, not measurements
- Differences between levels are not defined
- Treat Likert data as ordered categories
- Do not assign numerical values to the categories and then analyze them as numbers
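One way to see the problem with numeric coding is to compare two codings that both preserve the order of the levels. In the Python sketch below, the responses and both codings are made up for illustration. The group with the higher mean changes depending on the coding, while the median level, which uses order only, does not depend on it.

```python
from statistics import mean, median_low

LEVELS = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Hypothetical responses from two groups.
group_a = ["agree", "agree", "agree", "agree"]
group_b = ["neutral", "neutral", "neutral", "strongly agree"]

# Two arbitrary codings that both respect the order of the levels.
equal_spacing = dict(zip(LEVELS, [1, 2, 3, 4, 5]))
stretched = dict(zip(LEVELS, [1, 2, 3, 4, 10]))

def coded_mean(responses, coding):
    """Mean of the numeric codes; depends on the arbitrary coding chosen."""
    return mean(coding[r] for r in responses)

def median_level(responses):
    """Median level based on order alone; independent of any numeric coding."""
    positions = [LEVELS.index(r) for r in responses]
    return LEVELS[median_low(positions)]

print(coded_mean(group_a, equal_spacing), coded_mean(group_b, equal_spacing))
print(coded_mean(group_a, stretched), coded_mean(group_b, stretched))
print(median_level(group_a), median_level(group_b))
```

With equal spacing, group A has the higher mean; with the stretched coding, group B does. Since neither coding is more correct than the other, a conclusion that depends on the coding is not a conclusion about the data.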
Rankings are ordinal, not numerical
- Values indicate order only
- Differences between ranks are not meaningful
- Rank numbers are labels, not quantities
- Do not treat ranks as measurements
0/1 variables are usually categorical
- Values represent two categories (binary outcome)
- Numbers are a coding choice, not a measurement
- Point estimates are proportions, not means of a quantity
- Inference depends on modeling a binomial process
- Do not let variable type be determined by how data are coded
- Safe default in R: convert to a logical (TRUE / FALSE) variable
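The same idea can be sketched in Python, where the closest analogue of an R logical is `bool`. The survival data here are made up. Once the 0/1 codes are converted to `True`/`False`, the natural summary is a count and a proportion.

```python
# Hypothetical survival data, coded 1 = alive, 0 = dead.
survived_coded = [1, 0, 1, 1, 0, 1, 1, 1]

# Convert the numeric codes to explicit True/False values.
survived = [bool(code) for code in survived_coded]

# Summarise as a count and a proportion of the "alive" category.
n_alive = sum(survived)  # True counts as 1 when summed
proportion_alive = n_alive / len(survived)

print(n_alive, proportion_alive)
```

The proportion is numerically the same as the mean of the 0/1 codes, but it is interpreted as the fraction of units in one category, not as the average of a measured quantity.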
Explanatory and response variables
- Variables are often related to one another
- Explanatory variable is used to explain or predict variation in another variable
- Response variable is the outcome being measured
- In graphs, explanatory usually on x-axis, response on y-axis
- These terms do not imply causation
- Which variable plays which role depends on the scientific question
Frequency distributions
Different individuals in a population will have different values of a given variable (natural variation)
The frequency of a particular value is the number of times it is observed in a sample
The frequencies of all values can be plotted to produce a frequency distribution
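A frequency distribution can be tallied in a few lines of Python. The sample of mate counts below is made up for illustration; `Counter` from the standard library does the counting, and the loop prints a crude text histogram.

```python
from collections import Counter

# Hypothetical sample: number of mates for 10 individuals.
n_mates = [0, 1, 1, 2, 1, 0, 3, 1, 2, 1]

# Frequency of each value = number of times it is observed in the sample.
frequencies = Counter(n_mates)

# Print the values in order, with one '#' per observation.
for value in sorted(frequencies):
    print(f"{value}: {'#' * frequencies[value]}")
```

The frequencies always add up to the sample size, which is a quick check that no observation was dropped or counted twice.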
Probability distributions
The distribution of a variable in the whole population is called its probability distribution
Provides the probability of occurrence of different possible outcomes
The true probability distribution is usually unknown (latent) and is estimated from a sample
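The relationship between a sample and the population distribution can be explored by simulation. In the sketch below, the "true" distribution is invented so that the estimate can be compared against it; in real studies it is unknown. The estimate is the relative frequency of each value in a random sample.

```python
import random
from collections import Counter

# Hypothetical population distribution for number of mates (unknown in practice).
true_distribution = {0: 0.2, 1: 0.5, 2: 0.2, 3: 0.1}

# Draw a random sample from the population; the seed only makes the run repeatable.
rng = random.Random(42)
sample = rng.choices(
    population=list(true_distribution),
    weights=list(true_distribution.values()),
    k=10_000,
)

# Estimate each probability by its relative frequency in the sample.
counts = Counter(sample)
estimated = {value: counts[value] / len(sample) for value in true_distribution}

for value in true_distribution:
    print(value, true_distribution[value], round(estimated[value], 3))
```

Reducing `k` to a small number such as 20 shows how much the estimate can wander from the true distribution when the sample is small.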
Experimental vs. observational studies
Experimental Studies
Researcher assigns treatments (randomly) to individuals
Treatments are values (i.e., levels) of a categorical explanatory variable
Can determine causal relationship between explanatory and response variables
Observational Studies
Treatments are not assigned by a researcher
Organisms might choose their own treatment
Or the treatment might occur naturally
Can only determine association between variables
Observational data can show that variables move together, but not why they do
Direction of causation is ambiguous in observational studies
EXAMPLE: A study finds that plants with fewer pests tend to have greater biomass.
- Pests may reduce plant growth and vigor
- Alternatively, larger or healthier plants may better resist or tolerate pests
- Both explanations are consistent with the observed pattern
- Because no treatments were assigned, the direction of causation is unclear
Why experiments can identify causation
- In experiments, the researcher assigns treatments to units
- Random assignment balances other variables across treatments
- This reduces the influence of confounding variables
Confounding variables
- Variables related to both the explanatory and response variables
- Can create associations that are not causal
- Randomization breaks their systematic influence
Result: Differences in outcomes can be attributed to the treatment itself
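The effect of random assignment can be illustrated with a simulation. Everything below is an assumption made for the sketch: a single confounder called `health`, a response that depends only on health plus noise, and a treatment that has no effect at all. When healthier units end up in the treatment group on their own, the groups differ in the response even though the treatment does nothing. When treatment is assigned at random, that gap largely disappears.

```python
import random
from statistics import mean

rng = random.Random(1)
N = 1000

# Hypothetical confounder (health) and a response driven by health alone.
# The treatment is assumed to have NO effect on the response.
health = [rng.gauss(0, 1) for _ in range(N)]
response = [h + rng.gauss(0, 0.5) for h in health]

def response_gap(treated_ids):
    """Mean response of treated units minus mean response of control units."""
    treated = [response[i] for i in range(N) if i in treated_ids]
    control = [response[i] for i in range(N) if i not in treated_ids]
    return mean(treated) - mean(control)

# Observational study: healthier units "choose" the treatment themselves.
self_selected = {i for i in range(N) if health[i] > 0}
observational_gap = response_gap(self_selected)

# Experiment: the researcher assigns half of the units at random.
shuffled = list(range(N))
rng.shuffle(shuffled)
randomly_assigned = set(shuffled[: N // 2])
experimental_gap = response_gap(randomly_assigned)

print(round(observational_gap, 2), round(experimental_gap, 2))
```

The observational gap is large and entirely spurious, because health is related to both treatment and response. The experimental gap is close to zero, which matches the true (null) treatment effect in this simulation.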
Example of a confounding variable
Result: It is impossible to assign the cause of end-stage renal disease to one variable or the other.