Lecture 4
Visualizing Data Types

ABD 3e Chapter 2

Chris Merkord

Why Graphs Matter

Graphs help us see patterns that are hard to detect in raw numbers.

They allow us to:

  • Summarize large amounts of data
  • Compare groups or values quickly
  • See trends, differences, and relationships

But different graphs highlight different things.
Choosing the right graph is necessary to show the most important features of the data clearly.

In data visualization, form follows data structure

The type of graph you choose should be determined by:

  • How many variables you have
  • Whether each variable is categorical or numerical
  • What relationship or pattern you want to reveal

Visualizing One Variable

Distributions: Looking at One Variable

A distribution describes how the values of a single variable are spread out.

When you make a histogram or bar chart, you are showing:

  • What values occur
  • How often each value occurs
  • Where values are common or rare

Visualizing distributions helps you:

  • Understand typical values
  • See variability and spread
  • Detect skew, gaps, or unusual values

When you graph one variable, you are almost always graphing its distribution.

One Categorical Variable: Bar Charts

Goal: Show how observations are distributed across categories

  • Each bar represents a category
  • Bar height encodes frequency (number of observations) or proportion
  • Bars are separated (categories are distinct)
  • Best practice: order by frequency
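A minimal sketch of a frequency-ordered bar chart in R, assuming the ggplot2 and palmerpenguins packages are installed. The object names (species_counts, penguins_ordered, p_bar) are illustrative and not part of the lecture.

```r
library(ggplot2)
library(palmerpenguins)

# Count observations per species and sort from most to least frequent
species_counts <- sort(table(penguins$species), decreasing = TRUE)

# Re-level the factor so bars are drawn in order of decreasing frequency
penguins_ordered <- penguins
penguins_ordered$species <- factor(penguins$species,
                                   levels = names(species_counts))

# geom_bar() counts the rows in each category and leaves gaps between bars
p_bar <- ggplot(penguins_ordered, aes(x = species)) +
  geom_bar() +
  labs(x = "Species", y = "Number of penguins")

print(p_bar)
```

Re-leveling the factor, rather than sorting the data frame, is what controls bar order, because ggplot2 draws categories in factor-level order.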

Bar chart showing how penguin observations are distributed among three species in the Palmer Penguins dataset (Horst et al. 2020)

Tables can be used if necessary

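library(gt)              # table formatting
library(dplyr)           # count() and arrange()
library(palmerpenguins)  # penguins dataset

# Count penguins per species, sort largest first, and format as a table
penguins |>
  count(species) |>
  arrange(desc(n)) |>
  gt() |>
  cols_label(
    species = "Species",
    n = "Number of penguins"
  )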

Species     Number of penguins
Adelie                     152
Gentoo                     124
Chinstrap                   68

Order factors meaningfully

Ordinal variables:

  • Use the pre-defined order (e.g., months Jan–Dec)

Nominal variables:

  • Most interested in the largest categories
  • Order by frequency
  • Place catch-all category at the bottom
  • Swap axes to give category labels more room
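The ordering rules above can be sketched in base R. The category labels and counts below are hypothetical, invented only to illustrate the mechanics; the variable names are likewise assumptions.

```r
# Ordinal variable: impose the natural order explicitly.
# month.abb is R's built-in vector "Jan", "Feb", ..., "Dec".
obs_month <- factor(c("Mar", "Jan", "Dec", "Feb", "Jan"),
                    levels = month.abb, ordered = TRUE)

# Nominal variable: hypothetical labels, invented for illustration
category <- c("B", "Other", "A", "B", "Other", "C", "B", "Other", "Other", "A")

# Order categories by decreasing frequency ...
by_freq <- names(sort(table(category), decreasing = TRUE))
# ... then move the catch-all category to the end, whatever its size
lvls <- c(setdiff(by_freq, "Other"), "Other")
category_f <- factor(category, levels = lvls)

# Swap axes with horiz = TRUE so labels have room. Base R draws the first
# bar at the bottom, so reverse the counts to put the largest on top.
counts <- table(category_f)
counts_vec <- setNames(as.vector(counts), names(counts))
barplot(rev(counts_vec), horiz = TRUE, las = 1, xlab = "Frequency")
```

Here "Other" is the most frequent label, yet it is still placed last because it is a catch-all rather than a category of interest.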

Example: Leading Causes of Death

Bar chart before reordering.

Bar chart after reordering.

Visualizing One Numerical Variable

Goal: Understand the distribution of values in a single numerical variable.

These graphs are used to show:

  • Where values are concentrated (center, location)
  • How spread out the data are (width, variation)
  • Whether the distribution is symmetric or skewed

Common graph types:

  • Histograms
  • Density plots
  • Cumulative frequency distributions

When you graph one numerical variable, you are visualizing its distribution.

Example: Gettysburg Address

Word length   Number of words
          1                 7
          2                50
          3                60
          4                58
          5                34
          6                24
          7                15
          8                 6
          9                10
         10                 4
         11                 3

Histograms visually describe summary tables

Word length distribution in the Gettysburg address
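As a sketch of how the summary table maps onto a histogram, the base R code below rebuilds the individual word lengths from the counts in the table and plots them. The variable names are illustrative.

```r
# Frequency table from the slide: word length and number of words
word_length <- 1:11
n_words <- c(7, 50, 60, 58, 34, 24, 15, 6, 10, 4, 3)

# Expand the table back into one value per word
lengths_raw <- rep(word_length, times = n_words)

# One bin per integer word length: breaks at 0.5, 1.5, ..., 11.5
h <- hist(lengths_raw,
          breaks = seq(0.5, 11.5, by = 1),
          main = "Word lengths in the Gettysburg Address",
          xlab = "Word length (letters)",
          ylab = "Number of words")

# The bar heights reproduce the table's counts
h$counts
```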

Histograms display the distribution of a numerical variable using columns

  • Values grouped into intervals called bins
  • Height = number of observations (frequency)
  • Show distribution shape
  • Show skewness or symmetry
  • Show gaps or unusual values
  • Bars touch

Histogram showing the distribution of bill lengths for penguins in the Palmer Penguins dataset (Horst et al. 2020)

Choosing Bin Widths

  • Bin width defines how numerical values are grouped
  • Different bin widths can reveal or obscure patterns
  • Bin widths should be chosen deliberately
  • Default settings are rarely optimal
  • In R, you can set either the binwidth or the bins argument of geom_histogram()
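A short sketch of the two ways to control binning with geom_histogram(), assuming ggplot2 and palmerpenguins are installed. The specific widths (0.2 mm, 2 mm) and the bin count (4) are arbitrary choices for illustration, not recommendations from the lecture.

```r
library(ggplot2)
library(palmerpenguins)

# Drop rows with a missing bill length so no warnings are raised
bills <- subset(penguins, !is.na(bill_length_mm))

base_plot <- ggplot(bills, aes(x = bill_length_mm)) +
  labs(x = "Bill length (mm)", y = "Number of penguins")

# binwidth sets the width of each bin in data units (here, mm)
p_narrow   <- base_plot + geom_histogram(binwidth = 0.2)  # noisy, spiky
p_moderate <- base_plot + geom_histogram(binwidth = 2)

# bins sets the number of bins instead of their width
p_wide <- base_plot + geom_histogram(bins = 4)            # hides structure

print(p_moderate)
```

Printing p_narrow and p_wide alongside p_moderate shows how the same data can look jagged or featureless depending on the bin width.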

A histogram with bins that are too narrow.

A histogram with bins that are too wide.

Density plots show the shape of a numerical distribution using a smooth curve

  • Same goal as histograms: reveal shape, center, spread, skew
  • No bins; based on a continuous density estimate
  • Emphasize overall patterns rather than exact counts
  • Well suited for comparing multiple groups on the same axis
  • Shape depends strongly on bandwidth (smoothing level)
  • Defaults can mislead — adjust bandwidth intentionally
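To see how strongly bandwidth shapes a density plot, the sketch below draws the same data at three smoothing levels using the adjust argument of geom_density(), which multiplies the default bandwidth. It assumes ggplot2 and palmerpenguins are installed; the helper density_plot() and the adjust values are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

bills <- subset(penguins, !is.na(bill_length_mm))

# Hypothetical helper: one density plot at a given smoothing multiplier
density_plot <- function(adjust) {
  ggplot(bills, aes(x = bill_length_mm)) +
    geom_density(adjust = adjust) +
    labs(x = "Bill length (mm)", y = "Density",
         title = paste("adjust =", adjust))
}

p_rough   <- density_plot(0.25)  # small bandwidth: bumpy curve
p_default <- density_plot(1)     # default bandwidth
p_smooth  <- density_plot(3)     # large bandwidth: peaks are flattened

print(p_default)
```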

Density plot showing the distribution of bill lengths for penguins in the Palmer Penguins dataset (Horst et al. 2020)

Cumulative frequency distributions show quantiles and cumulative proportions

  • Show the proportion of observations ≤ a given value
  • x-axis: data values; y-axis: cumulative proportion (0–1)
  • Quantiles read directly from the curve (median, quartiles, percentiles)
  • Always non-decreasing; ends at 1
  • Well suited for comparing distributions across groups
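The properties listed above can be checked with base R's ecdf(), which returns the empirical cumulative distribution as a function. This is a sketch assuming the palmerpenguins package is installed; the 45 mm threshold is an arbitrary example value.

```r
library(palmerpenguins)

# Bill lengths with missing values removed
bill <- penguins$bill_length_mm[!is.na(penguins$bill_length_mm)]

# ecdf() returns a function: bill_cdf(x) is the proportion of values <= x
bill_cdf <- ecdf(bill)
bill_cdf(45)

# Quantiles correspond to reading the curve at fixed cumulative proportions
quantile(bill, probs = c(0.25, 0.5, 0.75))

# Step plot of the cumulative proportion, with a guide line at the median
plot(bill_cdf,
     main = "Cumulative distribution of bill length",
     xlab = "Bill length (mm)",
     ylab = "Cumulative proportion")
abline(h = 0.5, lty = 2)
```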

Cumulative distribution function (CDF) of penguin bill lengths, showing the proportion of individuals with bill lengths less than or equal to a given value in the Palmer Penguins dataset (Horst et al. 2020).

CDFs are used when how values accumulate matters

  • Time-to-event data (e.g., survival, germination, development time)
  • Comparing entire distributions across populations or treatments
  • Evaluating biologically meaningful thresholds (e.g., size at maturity, tolerance limits)
  • Formal distribution comparisons (e.g., Kolmogorov–Smirnov tests)

Kaplan–Meier survival curve (1 − CDF) for patients with advanced lung cancer, showing the proportion of individuals who have not yet experienced the event (death) over time. Each downward step corresponds to one or more observed events. Data are from the lung dataset in the R survival package (North Central Cancer Treatment Group).
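A curve of this kind can be sketched with the survival package, which ships with standard R installations. This is one plausible way to draw it, not necessarily how the lecture figure was produced; the object name km is illustrative.

```r
library(survival)

# In the lung dataset, time is in days and status is coded
# 1 = censored, 2 = dead; Surv() understands this 1/2 coding.
km <- survfit(Surv(time, status) ~ 1, data = lung)

# Step curve of the estimated proportion still alive over time
plot(km,
     xlab = "Days since enrollment",
     ylab = "Proportion surviving")
```

The shaded or dashed bands that plot() adds by default are confidence limits around the estimated curve.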

Describing a distribution means describing patterns in data

  • A distribution shows how values of a numerical variable are arranged
  • Description comes before calculation or modeling
  • Goal: describe what is typical, how values vary, and what stands out
  • Distributions are described using shape, location, spread, and outliers
  • These features can be seen visually before any statistics are computed

Histogram of simulated data drawn from a normal distribution, illustrating the basic idea of a distribution: how numerical values are arranged and how frequently they occur across the range of the data.

Distribution shape describes the overall pattern of values

  • Shape refers to the form of the distribution as a whole
  • Common shapes include left-skewed, symmetric (e.g., normal), and right-skewed
  • Some data are evenly distributed (uniform)
  • Multiple peaks indicate mixed groups or processes (bimodal, multimodal)
  • Shape often reflects underlying biological or sampling processes

Examples of common distribution shapes—left-skewed, normal (symmetric), right-skewed, uniform, bimodal, and multimodal—illustrating how numerical data can vary in overall pattern depending on underlying processes or population structure.

Location describes where a distribution is centered

  • Where values tend to cluster along the x-axis
  • Distributions can share shape but differ in location
  • Common summaries: mean and median
  • Visual comparison reveals shifts between groups or conditions
  • Differences often reflect biological or environmental change

Histograms showing two distributions with similar shape and spread but different locations, illustrating how shifts in the center (mean or median) move the distribution along the x-axis without changing its overall form.

Spread describes how variable the data are

  • How wide or narrow a distribution is
  • Distributions can share a center but differ in spread
  • Greater spread indicates more heterogeneity in sample units
  • Spread affects uncertainty and overlap between groups
  • Common summaries: range and standard deviation
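The location and spread summaries named in these slides can be computed directly. The sketch below uses simulated data; the means, standard deviations, and sample sizes are arbitrary assumptions chosen only to make the contrasts visible.

```r
set.seed(42)  # make the simulation reproducible

# Simulated samples (hypothetical values, for illustration only)
narrow  <- rnorm(500, mean = 50, sd = 5)   # reference distribution
wide    <- rnorm(500, mean = 50, sd = 15)  # same center, larger spread
shifted <- rnorm(500, mean = 70, sd = 5)   # same spread, different location

# Hypothetical helper: location (mean, median) and spread (sd, range width)
summarise_dist <- function(x) {
  c(mean = mean(x), median = median(x),
    sd = sd(x), range_width = diff(range(x)))
}

round(rbind(narrow  = summarise_dist(narrow),
            wide    = summarise_dist(wide),
            shifted = summarise_dist(shifted)), 1)
```

Comparing rows, narrow and wide differ mainly in sd and range width, while narrow and shifted differ mainly in mean and median.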

Histograms showing two distributions with the same center but different spread, illustrating how increased variability widens the distribution and increases overlap and uncertainty.

Outliers are observations that do not follow the main pattern

  • Values far from the bulk of the data
  • May reflect rare events, error, or different processes
  • Can strongly influence summaries and models
  • Should be identified visually before analysis
  • Investigate outliers; do not remove automatically
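A small worked example of how a single extreme value can influence summaries. The numbers are hypothetical and chosen to keep the arithmetic easy to follow.

```r
# Hypothetical measurements, invented for illustration
values <- c(10, 11, 12, 13, 14)
with_outlier <- c(values, 100)  # add one extreme observation

# Without the outlier, mean and median agree
mean(values)    # 12
median(values)  # 12

# With the outlier, the mean is pulled far upward; the median barely moves
mean(with_outlier)    # 160 / 6, about 26.7
median(with_outlier)  # 12.5
```

The mean more than doubles while the median shifts by 0.5, which is why outliers should be spotted and investigated before summaries are interpreted.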

Histogram showing a distribution with a central cluster of values and a small number of extreme observations (outliers) far from the main pattern.

Visualizing Two Variables

Visualizing relationships between two categorical variables

  • Goal: assess whether categories are associated
  • Common tools: bar plots, contingency tables, mosaic plots
  • Focus on comparing proportions, not raw counts
  • Ask questions like: “Does category A differ across levels of category B?”
  • Useful for exploring relationships before formal tests

Example: Is Breeding Dangerous for Birds?

  • Study by Oppliger et al. (1996) on great tits (Parus major)
  • Researchers experimentally increased breeding effort by removing eggs from some nests
  • Control nests were left unchanged
  • Female birds were later assayed for malaria infection
  • Question: is increased reproductive effort associated with higher malaria risk?

Photo: Great Tit (Parus major). Credit: Frank Vassen. Source: Wikimedia Commons.

Is breeding dangerous for birds?
Malaria outcomes by experimental breeding treatment

                    Treatment
              Control   Eggs removed   Total
Malaria             7             15      22
No malaria         28             15      43
Total              35             30      65

Source: Oppliger et al. (1996)

Grouped bar graphs compare category frequencies across groups

  • Bars show counts or proportions for each category
  • Categories are displayed side-by-side within groups
  • Used for two categorical variables
  • Best for comparing categories side by side within each group (e.g., malaria outcomes within each treatment)
  • Use proportions when group sizes differ
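Because the two treatment groups differ in size (35 vs 30 nests), proportions are the fairer comparison. The base R sketch below enters the counts from the great tit table, converts them to within-treatment proportions, and draws a grouped bar graph. The object names and the "Eggs removed" label are illustrative.

```r
# Counts from the contingency table (Oppliger et al. 1996);
# matrix() fills column by column: Control first, then Eggs removed
malaria <- matrix(
  c(7, 28, 15, 15), nrow = 2,
  dimnames = list(Outcome = c("Malaria", "No malaria"),
                  Treatment = c("Control", "Eggs removed"))
)

# Add row and column totals to check against the table
addmargins(malaria)

# Proportions within each treatment (each column sums to 1)
props <- prop.table(malaria, margin = 2)
round(props, 2)

# Grouped bars: outcomes side by side within each treatment
barplot(props, beside = TRUE, legend.text = TRUE,
        xlab = "Treatment", ylab = "Proportion of females")
```

The proportion with malaria is 7/35 = 0.20 in control nests and 15/30 = 0.50 in egg-removal nests.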

Grouped bar graph showing counts of malaria outcomes for female great tits in control nests versus nests where eggs were removed to increase breeding effort (Oppliger et al. 1996).

Stacked bar graphs show composition within groups

  • Bars represent groups; segments represent categories
  • Emphasize within-group composition
  • Prefer when grouped bars become cluttered
  • Use proportional stacks when group sizes differ
  • Less effective for comparing categories across groups

Stacked bar graph showing counts of malaria outcomes within each treatment group (control vs egg removal) for female great tits (Oppliger et al. 1996).

Mosaic plots show association between two categorical variables

  • Alternative to grouped and stacked bar graphs
  • Best when the goal is to assess association, not exact counts
  • Width shows group size; height shows category proportions
  • Makes differences in proportions and imbalances in group size explicit
  • Especially useful before or alongside chi-square tests
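A mosaic plot of the great tit data can be sketched with base R's mosaicplot(). The counts come from the study's contingency table; the object names, colors, and the "Eggs removed" label are illustrative choices.

```r
# Counts from the contingency table (Oppliger et al. 1996)
malaria <- matrix(
  c(7, 28, 15, 15), nrow = 2,
  dimnames = list(Outcome = c("Malaria", "No malaria"),
                  Treatment = c("Control", "Eggs removed"))
)

# mosaicplot() splits the horizontal axis by the first dimension,
# so transpose to make tile width reflect treatment group size
by_treatment <- t(malaria)

mosaicplot(by_treatment,
           main = "Malaria outcome by breeding treatment",
           color = c("firebrick", "grey80"))
```

With treatment first, tile widths show group sizes (35 vs 30) and tile heights show the outcome proportions within each treatment.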

Mosaic plot showing the association between treatment (control vs egg removal) and malaria outcome in female great tits; tile areas are proportional to counts, with widths showing group sizes and heights showing outcome proportions (Oppliger et al. 1996).

Visualizing one categorical and one numerical variable

  • Goal: compare the distribution of a numerical variable across categories
  • Multiple histograms:
    • Show full distribution shape within each category
    • Best for moderate to large sample sizes
    • Use consistent bin widths for fair comparison
  • Strip charts:
    • Show individual observations directly
    • Best for small to moderate sample sizes
    • Reveal clustering, overlap, and outliers clearly
  • Choice depends on sample size and whether you want to emphasize shape or individual values

Compare the distribution of a numerical variable across categories

Multiple Histograms

  • Shows full distributions
  • Use consistent bin widths for fair comparison

Strip Chart

  • Shows individual observations
  • Reveals clustering, overlap, outliers

Violin Plot

  • Shows probability densities
  • Good for comparing shape (needs large sample size)
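The three graph types above can be sketched with ggplot2, assuming ggplot2 and palmerpenguins are installed. Body mass by species is an assumed example pairing; the bin width and jitter settings are arbitrary.

```r
library(ggplot2)
library(palmerpenguins)

# Keep only rows with a recorded body mass
peng <- subset(penguins, !is.na(body_mass_g))

# Multiple histograms: one panel per species, same bin width throughout
p_hist <- ggplot(peng, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 250) +
  facet_wrap(~ species, ncol = 1) +
  labs(x = "Body mass (g)", y = "Number of penguins")

# Strip chart: every observation is a point; horizontal jitter reduces overlap
p_strip <- ggplot(peng, aes(x = species, y = body_mass_g)) +
  geom_jitter(width = 0.15, height = 0, alpha = 0.5) +
  labs(x = "Species", y = "Body mass (g)")

# Violin plot: a mirrored density estimate for each species
p_violin <- ggplot(peng, aes(x = species, y = body_mass_g)) +
  geom_violin() +
  labs(x = "Species", y = "Body mass (g)")

print(p_hist)
print(p_strip)
print(p_violin)
```

Setting height = 0 in geom_jitter() keeps the plotted body masses exact; only the horizontal position is randomized.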

Scatterplots visualize relationships between two numerical variables

  • Goal: assess how one numerical variable changes with another
  • Show individual observations as points
  • Patterns reveal direction, strength, and form of association
  • Useful for identifying trends, clusters, and outliers
  • Foundation for correlation, regression, and modeling
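A minimal scatterplot sketch for two numerical variables, assuming ggplot2 and palmerpenguins are installed. The correlation coefficient is included only as a numerical companion to the visual pattern.

```r
library(ggplot2)
library(palmerpenguins)

# Keep rows where both measurements are present
peng <- subset(penguins,
               !is.na(flipper_length_mm) & !is.na(body_mass_g))

# Each point is one penguin
p_scatter <- ggplot(peng, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.6) +
  labs(x = "Flipper length (mm)", y = "Body mass (g)")

print(p_scatter)

# Pearson correlation summarizes direction and strength of a linear trend
r <- cor(peng$flipper_length_mm, peng$body_mass_g)
r
```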

Scatterplot showing the relationship between flipper length and body mass in penguins, illustrating how two numerical variables vary together and revealing overall trends, spread, and potential outliers.

Time series graphs show change in a variable over time

  • Time on x-axis (ordinal or numerical variable)
  • Points = observations at specific times
  • Lines emphasize continuity and temporal trends
  • Useful for detecting trends, cycles, and sudden changes
  • Order matters: nearby points are often related
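A base R sketch of a time series graph with one annotation. The West Nile virus data are not available here, so this uses lynx, a dataset built into R (annual Canadian lynx trappings, 1821 to 1934), purely as a stand-in.

```r
# lynx is a built-in annual time series; extract years and counts
years <- as.numeric(time(lynx))
trapped <- as.numeric(lynx)

# type = "o" overplots points (observations) and lines (continuity)
plot(years, trapped, type = "o", pch = 16, cex = 0.6,
     xlab = "Year", ylab = "Lynx trapped",
     main = "Annual lynx trappings in Canada")

# Annotate the highest value to draw attention to it
peak <- which.max(trapped)
text(years[peak], trapped[peak],
     labels = paste("Peak:", years[peak]), pos = 4)
```

The regular rise and fall in this series is an example of the cycles that a line graph makes easy to see.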

Annual reported West Nile virus hospitalizations by case type, shown as stacked counts by year. Neuroinvasive and non-neuroinvasive cases are displayed separately to illustrate changes in both total burden and case composition over time. Source: Arbonet.

Annotations can help tell your story in a time series graph

Maps show how a variable varies over space

  • Location is part of the data, not just context
  • Values are encoded by color, size, or symbols
  • Useful for identifying spatial patterns and gradients
  • Reveal clustering, hotspots, and gaps
  • Interpretation depends on scale and map design

Example: choropleth maps use color shading to represent values aggregated within geographic areas.

Visualizing three or more variables

  • One variable mapped to position (x or y); others mapped to color, size, or shape
  • Faceting splits data into small multiples for clearer comparisons
  • Interaction between variables becomes the focus, not individual effects
  • Choose encodings intentionally to avoid clutter and misinterpretation
  • Add variables only if they clarify the question being asked
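Both strategies, mapping an extra variable to color and splitting into small multiples, can be sketched with ggplot2. This assumes ggplot2 and palmerpenguins are installed; the object names are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Keep rows with sex and both measurements recorded
peng <- subset(penguins,
               !is.na(sex) & !is.na(flipper_length_mm) & !is.na(body_mass_g))

# Three variables: two mapped to position, species mapped to color
p_color <- ggplot(peng, aes(x = flipper_length_mm, y = body_mass_g,
                            color = species)) +
  geom_point(alpha = 0.7) +
  labs(x = "Flipper length (mm)", y = "Body mass (g)", color = "Species")

# Four variables: facet by sex (rows) and species (columns)
p_facets <- ggplot(peng, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.7) +
  facet_grid(sex ~ species) +
  labs(x = "Flipper length (mm)", y = "Body mass (g)")

print(p_color)
print(p_facets)
```

By default facet_grid() shares axis scales across panels, which keeps the comparisons between groups fair.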

Example: Scatterplot (two numerical variables) with colors (categorical variable)

Example: Scatterplot small multiples comparing penguin body mass and flipper length across species and sex

Scatterplots showing the relationship between flipper length and body mass in penguins, faceted by sex (rows) and species (columns), illustrating how this relationship varies across biological groups.