Lecture 5
Describing Data

ABD 3e Chapter 3

Chris Merkord

Learning Objectives

  • Differentiate between estimates of location and estimates of width
  • Recognize that variability is not simply noise but is a key parameter that can be estimated
  • Become familiar with the most common descriptive statistics
  • Know when the mean or median is a more appropriate summary of location

Three Common Descriptions of Data

  • Location (central tendency)

  • Width (spread)

  • Association* (correlation)

*later in term

We do statistics to learn about the World Out There (the population) from Data (the sample)

Estimates and Parameters

  • We almost never sample an entire population. So we almost never know parameters from populations.
  • But we can make educated guesses of parameters by making estimates from samples.

Histograms reveal location

Also called “center” or “central tendency” of values

Figure 1: Distribution of sepal length for the three Iris species, shown as histograms faceted by species. Source: iris dataset in R.


Figure 2: Distribution of sepal length for the three Iris species, shown as histograms faceted by species. Red vertical lines indicate species-specific means. Source: iris dataset in R.

Choose the right measure of location

  • Location summarizes where values tend to fall in a distribution.

  • Different summaries answer different questions.

Mean

  • Balance point of the distribution
  • Strongly influenced by extreme values

Median

  • Typical value for an individual
  • Robust to skew and outliers

Mode

  • Most common value or range
  • Highlights peaks in the distribution

The mean

Estimate = \(\bar{Y}\)

“Sample mean”

Parameter = \(\mu\)

“Population mean”

\[ \bar{Y} = \frac{\sum Y_i}{n} \]

where

  • \(\bar{Y}\) is the sample mean
  • \(\sum\) denotes summation over all observations
  • \(n\) is the number of observations
  • \(Y_i\) is the observed value for the \(i^{\text{th}}\) individual
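
As a quick check on the formula, here is a minimal Python sketch. Python and the sample values are assumptions made for illustration; they are not part of the lecture. It computes the sample mean by summing the observations and dividing by n, then compares the result with the standard library's `statistics.mean`.

```python
import statistics

# Hypothetical sample of five measurements (illustrative values only)
y = [2.0, 3.0, 5.0, 6.0, 9.0]

n = len(y)            # number of observations
y_bar = sum(y) / n    # sample mean: sum of Y_i divided by n

print(y_bar)                 # 5.0
print(statistics.mean(y))    # same value from the standard library
```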

The median

The median is the middle value of an ordered dataset.

  • Order the observations from smallest to largest
  • If \(n\) is odd, the median is the \((n + 1)/2\)th value
  • If \(n\) is even, the median is the mean of the \(n/2\)th and \((n/2 + 1)\)th values
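
The ordering rule above can be sketched in Python as follows. The function name `median_by_rule` and the sample values are hypothetical, chosen only to illustrate the odd and even cases. Positions in the rule count from 1, while Python lists count from 0, hence the index shifts.

```python
import statistics

def median_by_rule(values):
    """Median via the ordered-position rule (positions counted from 1)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    if n % 2 == 1:
        # odd n: the (n + 1)/2-th value; subtract 1 for 0-based indexing
        return ordered[(n + 1) // 2 - 1]
    # even n: mean of the (n/2)-th and (n/2 + 1)-th values
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

odd_sample = [7, 1, 4, 9, 3]         # ordered: 1 3 4 7 9
even_sample = [7, 1, 4, 9, 3, 10]    # ordered: 1 3 4 7 9 10

print(median_by_rule(odd_sample))    # 4
print(median_by_rule(even_sample))   # 5.5
```

The standard library's `statistics.median` gives the same answers for both samples.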

Source: Blythwood / Wikipedia


The mode

The mode is the most common value in a dataset.

  • For numerical data, it usually represents a range of values (a peak in a histogram)
  • A distribution can have one mode, multiple modes, or no clear mode
  • Modes are useful for identifying common outcomes or clusters, not for summarizing overall center
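
A small Python sketch may help show why the mode of numerical data is usually reported as a range. All values here are hypothetical, and the bin width of 1.0 is an arbitrary choice for illustration. Discrete data can have tied modes, while continuous data rarely repeat exactly and so are grouped into bins first.

```python
import statistics
from collections import Counter

# Hypothetical discrete counts; illustrative only
counts = [2, 3, 3, 4, 4, 5]
print(statistics.multimode(counts))   # [3, 4]: two equally common values

# Hypothetical continuous measurements: exact repeats are rare,
# so group values into bins of width 1.0 and find the fullest bin
lengths = [4.9, 5.1, 5.4, 5.8, 6.3, 5.2, 5.6, 7.1]
bin_width = 1.0
bins = Counter(int(x // bin_width) for x in lengths)
modal_bin, modal_count = bins.most_common(1)[0]

# Modal range runs from 5.0 up to (not including) 6.0 and holds 5 values
print(modal_bin * bin_width, (modal_bin + 1) * bin_width, modal_count)
```

Changing the bin width can move the modal range, which is one reason the mode is a weaker summary of overall center.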

Mean, median, and mode summarize location differently

  • These measures summarize different aspects of location
  • They coincide in symmetric distributions
  • They diverge in skewed distributions
  • Differences reveal the shape of the distribution, not noise.
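
To see the mean and median coincide and then diverge, consider this minimal Python sketch with two hypothetical samples. They differ only in their largest value, which is made extreme in the second sample to create right skew.

```python
import statistics

# Hypothetical symmetric sample
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
# Hypothetical right-skewed sample: same values, but the largest is extreme
right_skewed = [1.0, 2.0, 3.0, 4.0, 50.0]

# Symmetric: mean and median are both 3.0
print(statistics.mean(symmetric), statistics.median(symmetric))

# Right-skewed: the mean is pulled up to 12.0, the median stays at 3.0
print(statistics.mean(right_skewed), statistics.median(right_skewed))
```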


Geometric visualisation of the mode, median and mean of an arbitrary probability density function. Credit: Cmglee, CC BY-SA 3.0, via Wikimedia Commons

There are multiple ways to measure width (variability)

Different measures of width emphasize different aspects of spread.

  • Range
    Difference between the largest and smallest values

  • Interquartile range (IQR)
    Spread of the middle 50% of values

  • Variance and standard deviation
    Typical deviation from the mean

  • Coefficient of variation (CV)
    Variability expressed relative to the mean

Range is the difference between the smallest and largest values

Use the range as a simple measure of spread, when working with small datasets, or as a quick exploratory metric

Example:

  • Minimum value = \(1.25\)

  • Maximum value = \(3.55\)

  • Range = \(3.55 - 1.25 = 2.30\)

Because small samples tend to give lower estimates of the range than large samples, the sample range is a biased estimator of the true range of the population.
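
The bias of the sample range can be illustrated with a short simulation. This is a sketch under stated assumptions: the population (10,000 draws from a standard normal distribution), the sample sizes, and the helper names are all hypothetical and chosen only for demonstration.

```python
import random

def sample_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

# The example above: minimum 1.25, maximum 3.55
print(round(sample_range([1.25, 3.55]), 2))   # 2.3

# Hypothetical population: 10,000 draws from a normal distribution
rng = random.Random(1)    # fixed seed so the sketch is repeatable
population = [rng.gauss(0, 1) for _ in range(10_000)]

def mean_sample_range(n, reps=500):
    """Average range across `reps` random samples of size n."""
    total = sum(sample_range(rng.sample(population, n)) for _ in range(reps))
    return total / reps

small = mean_sample_range(5)
large = mean_sample_range(50)

# Small samples give smaller ranges, and both fall short of the true range
print(small < large < sample_range(population))   # True
```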

Interquartile range describes spread of the middle 50% of values

  • Median = middle observation, partitions data into two halves

  • Quartiles = Q1, Q2 (median), Q3, partition data into four quarters

  • Interquartile range (IQR) = \(Q3 - Q1\)

Example: \(Q3 - Q1 = 3.045 - 2.34 = 0.705\)
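
Here is a minimal Python sketch of the quartile and IQR calculation, using hypothetical data rather than the data behind the example above. Software packages use several different conventions for interpolating quartiles, so results can differ slightly from a hand calculation. The `method="inclusive"` option used here is one such convention, chosen as an assumption for this sketch.

```python
import statistics

# Hypothetical sample (illustrative values only)
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# method="inclusive" interpolates between observations, treating the
# minimum as the 0th percentile and the maximum as the 100th
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# For this sample: Q1 = 3, Q2 (median) = 5, Q3 = 7, so IQR = 4
print(q1, q2, q3)
print(iqr)
```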

Boxplots display the median, IQR, and outliers

Example of a boxplot comparing groups

  • IQR is the yellow span

  • Location of the horizontal line (median) within the box indicates skew

    • Toward the top = left skew

    • Toward the bottom = right skew

  • Points are outliers

    • Common definition of outliers: observations that lie more than \(1.5 \times \mathrm{IQR}\) beyond the edge of the box
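
The outlier rule above can be sketched in Python as follows. The function name `boxplot_outliers`, the sample values, and the choice of quartile convention (`method="inclusive"`) are assumptions for illustration; plotting software may compute the box edges slightly differently.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return values lying more than k * IQR beyond the box edges (Q1, Q3)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical sample with one unusually large value
sample = [1, 2, 3, 4, 5, 6, 7, 8, 30]

# Q1 = 3 and Q3 = 7, so the fences sit at -3 and 13
print(boxplot_outliers(sample))   # [30]
```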

Variance, standard deviation, and CV quantify spread around the mean

These measures quantify how far observations typically lie from the average.

  • Observations fall both above and below the mean
  • Simple differences cancel out, so variability needs a different approach
  • Squared differences avoid cancellation and work well mathematically
  • Variance and standard deviation measure absolute spread
  • The coefficient of variation (CV) measures relative spread

Key idea:
Variability is a property of the data, not noise.

The variance

  • Average squared distance of observations from the mean
  • Emphasizes large deviations
  • Expressed in squared units
  • Numerator often called the “sum of squared differences” or the “sum of squares”.
  • Parameter (true population value) \(\sigma^2\)
  • Statistic (estimated from sample) \(s^2\)

Credit: Rod Pierce / mathsisfun.com


\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{n} \]

\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]

where

  • \(x_i\) is the observed value for the \(i^{\text{th}}\) individual
  • \(\mu\) is the population mean
  • \(\bar{x}\) is the sample mean
  • \(\sum\) denotes summation over all observations
  • \(n\) is the number of observations
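
To make the two variance formulas concrete, here is a minimal Python sketch with hypothetical values. It builds the sum of squares step by step, then divides by n (treating the values as a whole population) and by n - 1 (treating them as a sample). The standard library's `statistics.pvariance` and `statistics.variance` correspond to those two denominators.

```python
import statistics

# Hypothetical values (illustrative only)
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)
center = sum(x) / n    # mean of the values: 5.0

# Numerator shared by both formulas: the sum of squared differences
sum_of_squares = sum((xi - center) ** 2 for xi in x)   # 32.0

pop_variance = sum_of_squares / n            # divide by n: 4.0
sample_variance = sum_of_squares / (n - 1)   # divide by n - 1: about 4.57

print(pop_variance, statistics.pvariance(x))
print(sample_variance, statistics.variance(x))
```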

Why \(n\) for population and \(n-1\) for sample? (Bessel’s correction)

  • When you compute the sample mean, you’re letting the data pick the center that minimizes the squared distances.
  • That makes the distances look smaller than they really are in the population
  • So we slightly inflate the average squared distance by dividing by \(n - 1\) instead of \(n\).
  • Simply:
    • Population: we already know the true center, so we divide by \(n\)
    • Sample: we had to estimate the center first, so we divide by \(n - 1\)
  • Most important for small samples
  • What if we didn’t use the correction? The variance would be underestimated, giving a false impression of precision and making chance differences between groups look real
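
The effect of the correction can be seen in a short simulation. This is a sketch under stated assumptions: the population (normal with mean 10 and standard deviation 2, so true variance 4), the sample size, and the number of repetitions are hypothetical choices for illustration.

```python
import random

rng = random.Random(42)    # fixed seed so the sketch is repeatable
true_variance = 4.0        # hypothetical population: normal, mean 10, SD 2
n = 5                      # small samples, where the correction matters most
reps = 20_000

total_div_n = 0.0
total_div_n_minus_1 = 0.0
for _ in range(reps):
    sample = [rng.gauss(10, 2) for _ in range(n)]
    x_bar = sum(sample) / n
    ss = sum((x - x_bar) ** 2 for x in sample)   # sum of squares
    total_div_n += ss / n
    total_div_n_minus_1 += ss / (n - 1)

mean_div_n = total_div_n / reps
mean_div_n_minus_1 = total_div_n_minus_1 / reps

print(mean_div_n)            # close to 3.2, i.e. (n - 1)/n times 4.0
print(mean_div_n_minus_1)    # close to 4.0, the true variance
```

On average, dividing by n falls short of the true variance by the factor (n - 1)/n, while dividing by n - 1 does not.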

The standard deviation

  • Typical distance of observations from the mean
  • Expressed in the same units as the data
  • Larger values indicate greater spread
  • Parameter (true population value) \(\sigma\)
  • Statistic (estimated from sample) \(s\)

\[ \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}} \]

\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]
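
As a minimal Python sketch with hypothetical values, the standard deviation is simply the square root of the corresponding variance. The standard library's `statistics.pstdev` and `statistics.stdev` are used here as a cross-check.

```python
import math
import statistics

# Hypothetical values (illustrative only)
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

pop_sd = math.sqrt(statistics.pvariance(x))     # denominator n: 2.0
sample_sd = math.sqrt(statistics.variance(x))   # denominator n - 1: about 2.14

print(pop_sd, statistics.pstdev(x))
print(sample_sd, statistics.stdev(x))
```

Unlike the variance, both results are in the same units as the original values.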

The coefficient of variation

  • Standard deviation relative to the mean
  • Unitless measure of variability
  • Useful for comparing spread across different scales

\[ \mathrm{CV} = \frac{s}{\bar{x}} \times 100\% \]

where

  • \(s\) is the sample standard deviation
  • \(\bar{x}\) is the sample mean
  • \(\mathrm{CV}\) is expressed as a percentage
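
Here is a minimal Python sketch of the CV, assuming two hypothetical samples measured on very different scales. The function name `coefficient_of_variation` and all values are made up for illustration. The standard deviations differ greatly, but the relative spread is the same.

```python
import statistics

def coefficient_of_variation(values):
    """CV as a percentage: sample SD divided by sample mean, times 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical body masses on two very different scales
mouse_mass_g = [18.0, 20.0, 22.0]              # grams
elephant_mass_kg = [4500.0, 5000.0, 5500.0]    # kilograms

print(statistics.stdev(mouse_mass_g))        # SD of about 2 g
print(statistics.stdev(elephant_mass_kg))    # SD of about 500 kg

# Both samples have a CV of about 10%
print(coefficient_of_variation(mouse_mass_g))
print(coefficient_of_variation(elephant_mass_kg))
```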

Choose the right measure of spread

Use different measures of variability depending on your goal.

Use variance when

  • You are working with equations or statistical models
  • Variability needs to be combined or compared mathematically

Use standard deviation when

  • You want variability expressed in the same units as the data
  • You are describing spread to a general audience

Use coefficient of variation (CV) when

  • You are comparing variability across different units or scales
  • Relative variability matters more than absolute differences

Rounding: when and how much?

Rounding affects how results are interpreted.

  • Do not round intermediate steps. Keep full precision during calculations to avoid compounding error

  • Round only final results. This preserves accuracy while improving readability

  • Report appropriate precision. Typically one or two decimal places more than the original measurements

Goal: Communicate results clearly without implying false precision.

Example: If the values are integers, report the mean and SD with one decimal place
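
The rounding advice can be sketched in Python as follows, using hypothetical integer measurements. The calculation keeps full precision and rounds only when formatting the final report. The last few lines show that reusing a rounded mean in a later step slightly changes the standard deviation.

```python
import statistics

counts = [3, 4, 4, 5, 7, 8]    # hypothetical integer measurements

mean_full = statistics.mean(counts)    # keep full precision
sd_full = statistics.stdev(counts)

# Round only for reporting: one decimal place for integer data
report = f"mean = {mean_full:.1f}, SD = {sd_full:.1f}"
print(report)    # mean = 5.2, SD = 1.9

# Rounding the mean first and then reusing it shifts the SD slightly
mean_rounded = round(mean_full, 1)
ss_rounded = sum((c - mean_rounded) ** 2 for c in counts)
sd_from_rounded = (ss_rounded / (len(counts) - 1)) ** 0.5
print(sd_full, sd_from_rounded)
```

The discrepancy is small here, but such errors can compound across many intermediate steps.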

Binary data: the mean is a proportion

Binary outcomes can be coded as 0/1 (e.g., no/yes, absent/present).

  • Code one outcome as 1 and the other as 0
  • The mean equals the proportion of 1s
  • This is equivalent to the fraction of observations with that outcome

Example: If 4 out of 5 individuals survive, the mean of the 0/1 data is 0.8 (80%).
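
The survival example can be checked with a few lines of Python. The variable names are hypothetical; survivors are coded as 1 and non-survivors as 0.

```python
# 4 of 5 individuals survive: code survival as 1, death as 0
survived = [1, 1, 1, 1, 0]

# The mean of the 0/1 values equals the proportion of 1s
proportion = sum(survived) / len(survived)
print(proportion)    # 0.8
```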