Lecture 4
Visualizing Data Types

ABD 3e Chapter 2

Chris Merkord

Why Graphs Matter

Graphs help us see patterns that are hard to detect in raw numbers.

They allow us to:

Summarize large amounts of data
Compare groups or values quickly
See trends, differences, and relationships

But different graphs highlight different things.
Choosing the right graph is necessary to show the most important features of the data clearly.

In data visualization, form follows data structure

The type of graph you choose should be determined by:

How many variables you have
Whether each variable is categorical or numerical
What relationship or pattern you want to reveal

Visualizing One Variable

Distributions: Looking at One Variable

A distribution describes how the values of a single variable are spread out

When you make a histogram or bar chart, you are showing:

What values occur
How often each value occurs
Where values are common or rare

Visualizing distributions helps you:

Understand typical values
See variability and spread
Detect skew, gaps, or unusual values

When you graph one variable, you are almost always graphing its distribution.

One Categorical Variable: Barcharts

Goal: Show how observations are distributed across categories

Each bar represents a category
Bar height encodes frequency (number of observations) or proportion
Bars are separated (categories are distinct)
Best practice: order by frequency

Barchart showing how penguin observations are distributed among three species in the Palmer Penguins dataset (Horst et al. 2020)
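A bar chart like the one described above could be drawn along these lines. This is a minimal sketch, not necessarily the code behind the slide; the object names `species_counts` and `p_bar` are made up for illustration.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

# Count observations per species (one row per category)
species_counts <- penguins |>
  count(species)

# reorder(species, -n) sorts the categories by descending count,
# so the most frequent species is drawn first
p_bar <- ggplot(species_counts, aes(x = reorder(species, -n), y = n)) +
  geom_col() +
  labs(x = "Species", y = "Number of penguins")

p_bar
```

Using `geom_col()` on pre-computed counts keeps the ordering step explicit; `geom_bar()` on the raw data would count the rows itself.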

Tables can be used if necessary

library(gt)
library(dplyr)
library(palmerpenguins)

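penguins |>
  count(species) |>        # one row per species, with its count in column n
  arrange(desc(n)) |>      # most common species first
  gt() |>                  # turn the data frame into a gt display table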
  cols_label(
    species = "Species",
    n = "Number of penguins"
  )

Species	Number of penguins
Adelie	152
Gentoo	124
Chinstrap	68

Order factors meaningfully

Ordinal variables:

Use the predefined order (e.g., months Jan–Dec)

Nominal variables:

Most interested in the largest categories
Order by frequency
Place catch-all category at the bottom
Swap axes to give category labels more room
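The ordering rules for nominal variables can be sketched as follows. The counts in `toy` are invented purely for illustration and are not real data; the category names are placeholders.

```r
library(ggplot2)

# Toy counts invented purely for illustration; not real data
toy <- data.frame(
  category = c("A", "B", "C", "Other"),
  n        = c(12, 30, 21, 25)
)

# Order the named categories by frequency, then append the catch-all last
named <- toy[toy$category != "Other", ]
ordered_levels <- c(named$category[order(-named$n)], "Other")

# With categories on the y-axis the first factor level is drawn at the bottom,
# so reverse the levels to put the largest category on top and "Other" last
toy$category <- factor(toy$category, levels = rev(ordered_levels))

# Mapping the categories to y swaps the axes and gives labels more room
p_ordered <- ggplot(toy, aes(x = n, y = category)) +
  geom_col() +
  labs(x = "Count", y = NULL)

p_ordered
```

Note that the catch-all stays at the bottom even though it is not the smallest bar.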

Example: Leading Causes of Death

Visualizing One Numerical Variable

Goal: Understand the distribution of values in a single numerical variable.

These graphs are used to show:

Where values are concentrated (center, location)
How spread out the data are (width, variation)
Whether the distribution is symmetric or skewed

Common graph types:

Histograms
Density plots
Cumulative frequency distributions

When you graph one numerical variable, you are visualizing its distribution.

Example: Gettysburg Address

Word length	Number of words
1	7
2	50
3	60
4	58
5	34
6	24
7	15
8	6
9	10
10	4
11	3

Histograms visually describe summary tables

Word length distribution in the Gettysburg address
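The word-length table can be turned back into a histogram directly. This is a small base R sketch; the variable names are chosen here for illustration.

```r
# Frequency table transcribed from the word-length table above
gettysburg <- data.frame(
  word_length = 1:11,
  n_words     = c(7, 50, 60, 58, 34, 24, 15, 6, 10, 4, 3)
)

# Expand the table back into one value per word
lengths_long <- rep(gettysburg$word_length, times = gettysburg$n_words)

total_words   <- length(lengths_long)
mean_length   <- mean(lengths_long)
median_length <- median(lengths_long)

# One bin per word length: breaks sit halfway between the integers
hist(lengths_long, breaks = seq(0.5, 11.5, by = 1),
     xlab = "Word length", ylab = "Number of words", main = "")
```

For this table the counts add up to 271 words, with a mean length of about 4.24 letters and a median of 4.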

Histograms display the distribution of a numerical variable using columns

Values grouped into intervals called bins
Height = number of observations (frequency)
Show distribution shape
Show skewness or symmetry
Show gaps or unusual values
Bars touch

Histogram showing the distribution of bill lengths for penguins in the Palmer Penguins dataset (Horst et al. 2020)

Choosing Bin Widths

Bin width defines how numerical values are grouped
Different bin widths can reveal or obscure patterns
Bin widths should be chosen deliberately
Default settings are rarely optimal
In R, set either the binwidth or the bins argument of geom_histogram()

A histogram with bins that are too narrow.
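The effect of bin width can be explored with a few lines of ggplot2. The widths below (0.25 mm, 2 mm, 15 bins) are arbitrary choices for illustration, not recommended values.

```r
library(ggplot2)
library(palmerpenguins)

# Drop missing bill lengths up front to avoid warnings
bills <- subset(penguins, !is.na(bill_length_mm))

# Very narrow bins: many near-empty bars, noisy shape
p_narrow <- ggplot(bills, aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 0.25)

# Wider bins: smoother shape, less detail
p_wide <- ggplot(bills, aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 2)

# Alternatively, fix the number of bins instead of their width
p_bins <- ggplot(bills, aes(x = bill_length_mm)) +
  geom_histogram(bins = 15)

p_narrow
p_wide
```

Comparing two or three widths side by side is a quick way to check that a pattern is not an artifact of the binning.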

Density plots show the shape of a numerical distribution using a smooth curve

Same goal as histograms: reveal shape, center, spread, skew
No bins; based on a continuous density estimate
Emphasize overall patterns rather than exact counts
Well suited for comparing multiple groups on the same axis
Shape depends strongly on bandwidth (smoothing level)
Defaults can mislead — adjust bandwidth intentionally

Density plot showing the distribution of bill lengths for penguins in the Palmer Penguins dataset (Horst et al. 2020)
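A sketch of how bandwidth changes a density plot, assuming the same penguin bill lengths. The `adjust` argument multiplies the default bandwidth; the values 0.5 and 2 are arbitrary.

```r
library(ggplot2)
library(palmerpenguins)

bills <- subset(penguins, !is.na(bill_length_mm))

# adjust < 1 smooths less (bumpier curve); adjust > 1 smooths more
p_rough  <- ggplot(bills, aes(x = bill_length_mm)) + geom_density(adjust = 0.5)
p_smooth <- ggplot(bills, aes(x = bill_length_mm)) + geom_density(adjust = 2)

# Overlaying groups on one axis; transparency keeps all curves visible
p_groups <- ggplot(bills, aes(x = bill_length_mm, fill = species)) +
  geom_density(alpha = 0.5)

# Base R's density() reports the bandwidth it actually used
bw_rough  <- density(bills$bill_length_mm, adjust = 0.5)$bw
bw_smooth <- density(bills$bill_length_mm, adjust = 2)$bw

p_groups
```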

Cumulative frequency distributions show quantiles and cumulative proportions

Show the proportion of observations ≤ a given value
x-axis: data values; y-axis: cumulative proportion (0–1)
Quantiles read directly from the curve (median, quartiles, percentiles)
Always non-decreasing; ends at 1
Well suited for comparing distributions across groups

Cumulative distribution function (CDF) of penguin bill lengths, showing the proportion of individuals with bill lengths less than or equal to a given value in the Palmer Penguins dataset (Horst et al. 2020).
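An empirical CDF of this kind can be computed and plotted as below. This is a sketch; the 45 mm threshold is an arbitrary value picked for illustration.

```r
library(ggplot2)
library(palmerpenguins)

bill <- penguins$bill_length_mm
bill <- bill[!is.na(bill)]

# ecdf() returns a step function: F(x) = proportion of observations <= x
F_bill <- ecdf(bill)
prop_le_45 <- F_bill(45)  # proportion of bills no longer than 45 mm

# Quantiles are the inverse reading: which value sits at a given proportion?
quartiles <- quantile(bill, probs = c(0.25, 0.5, 0.75))

p_ecdf <- ggplot(data.frame(bill = bill), aes(x = bill)) +
  stat_ecdf() +
  labs(x = "Bill length (mm)", y = "Cumulative proportion")

p_ecdf
```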

CDFs are used when how values accumulate matters

Time-to-event data (e.g., survival, germination, development time)
Comparing entire distributions across populations or treatments
Evaluating biologically meaningful thresholds (e.g., size at maturity, tolerance limits)
Formal distribution comparisons (e.g., Kolmogorov–Smirnov tests)

Kaplan–Meier survival curve (1 − CDF) for patients with advanced lung cancer, showing the proportion of individuals who have not yet experienced the event (death) over time. Each downward step corresponds to one or more observed events. Data are from the lung dataset in the R survival package (North Central Cancer Treatment Group).
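A survival curve like this one can be estimated with the survival package. This is a minimal sketch and may differ from the code used for the figure. Because some patients are censored, the curve is an estimate of 1 − CDF rather than a simple count.

```r
library(survival)

# In the lung dataset, status is coded 1 = censored, 2 = dead;
# Surv() treats the larger code as the event
km_fit <- survfit(Surv(time, status) ~ 1, data = lung)

# Step curve of the estimated proportion still alive over time (days)
plot(km_fit, conf.int = FALSE,
     xlab = "Days", ylab = "Proportion surviving")

# Estimated cumulative proportion that has experienced the event
cum_events <- 1 - km_fit$surv
```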

Describing a distribution means describing patterns in data

A distribution shows how values of a numerical variable are arranged
Description comes before calculation or modeling
Goal: describe what is typical, how values vary, and what stands out
Distributions are described using shape, location, spread, and outliers
These features can be seen visually before any statistics are computed

Histogram of simulated data drawn from a normal distribution, illustrating the basic idea of a distribution: how numerical values are arranged and how frequently they occur across the range of the data.

Distribution shape describes the overall pattern of values

Shape refers to the form of the distribution as a whole
Common shapes include left-skewed, symmetric (e.g., normal), and right-skewed
Some data are evenly distributed (uniform)
Multiple peaks indicate mixed groups or processes (bimodal, multimodal)
Shape often reflects underlying biological or sampling processes

Examples of common distribution shapes—left-skewed, normal (symmetric), right-skewed, uniform, bimodal, and multimodal—illustrating how numerical data can vary in overall pattern depending on underlying processes or population structure.
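Shapes like these are easy to simulate. The sketch below uses an arbitrary seed and arbitrary parameter values; it is not the code behind the figure.

```r
set.seed(1)  # arbitrary seed so the simulated values are reproducible
n <- 1000

shapes <- list(
  symmetric    = rnorm(n, mean = 50, sd = 10),
  right_skewed = rexp(n, rate = 1 / 10),
  left_skewed  = 100 - rexp(n, rate = 1 / 10),
  uniform      = runif(n, min = 0, max = 100),
  bimodal      = c(rnorm(n / 2, mean = 30, sd = 5),
                   rnorm(n / 2, mean = 70, sd = 5))
)

# One histogram panel per shape
old_par <- par(mfrow = c(2, 3))
for (name in names(shapes)) {
  hist(shapes[[name]], main = name, xlab = "Value")
}
par(old_par)

# In a skewed distribution the mean is pulled toward the long tail
mean_minus_median <- sapply(shapes, function(x) mean(x) - median(x))
```

A positive difference between mean and median suggests a right skew and a negative one a left skew, although the histogram should always be checked as well.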

Location describes where a distribution is centered

Where values tend to cluster along the x-axis
Distributions can share shape but differ in location
Common summaries: mean and median
Visual comparison reveals shifts between groups or conditions
Differences often reflect biological or environmental change

Histograms showing two distributions with similar shape and spread but different locations, illustrating how shifts in the center (mean or median) move the distribution along the x-axis without changing its overall form.

Spread describes how variable the data are

How wide or narrow a distribution is
Distributions can share a center but differ in spread
Greater spread indicates more heterogeneity in sample units
Spread affects uncertainty and overlap between groups
Common summaries: range and standard deviation

Histograms showing two distributions with the same center but different spread, illustrating how increased variability widens the distribution and increases overlap and uncertainty.
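The summaries named for location (mean, median) and spread (range, standard deviation) can be computed per group. This sketch uses penguin bill lengths as an example; the column names in `bill_summary` are made up here.

```r
library(dplyr)
library(palmerpenguins)

bill_summary <- penguins |>
  filter(!is.na(bill_length_mm)) |>
  group_by(species) |>
  summarise(
    mean_mm   = mean(bill_length_mm),    # location
    median_mm = median(bill_length_mm),  # location, robust to extreme values
    min_mm    = min(bill_length_mm),     # range, lower end
    max_mm    = max(bill_length_mm),     # range, upper end
    sd_mm     = sd(bill_length_mm)       # spread around the mean
  )

bill_summary
```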

Outliers are observations that do not follow the main pattern

Values far from the bulk of the data
May reflect rare events, error, or different processes
Can strongly influence summaries and models
Should be identified visually before analysis
Investigate outliers; do not remove automatically

Histogram showing a distribution with a central cluster of values and a small number of extreme observations (outliers) far from the main pattern.
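One common way to flag candidate outliers for a closer look is the 1.5 × IQR rule. That rule is an assumption added here, not something these slides prescribe, and the data below are simulated with two extreme values planted on purpose.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated values with two planted extremes (95 and 100)
values <- c(rnorm(200, mean = 50, sd = 5), 95, 100)

# Fences at 1.5 * IQR beyond the quartiles
q     <- quantile(values, probs = c(0.25, 0.75))
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Flag, do not delete: these are candidates to investigate
flagged <- values[values < lower | values > upper]

# Look at the picture first; the fences are only a screening aid
hist(values, breaks = 30, xlab = "Value", main = "")
abline(v = c(lower, upper), lty = 2)
```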

Visualizing Two Variables

Visualizing relationships between two categorical variables

Goal: assess whether categories are associated
Common tools: bar plots, contingency tables, mosaic plots
Focus on comparing proportions, not raw counts
Ask questions like: “Does category A differ across levels of category B?”
Useful for exploring relationships before formal tests

Example: Is Breeding Dangerous for Birds?

Study by Oppliger et al. (1996) on great tits (Parus major)
Researchers experimentally altered breeding effort by removing eggs from some nests
Control nests were left unchanged
Female birds were later assayed for malaria infection
Question: is increased reproductive effort associated with higher malaria risk?

Is breeding dangerous for birds?
Malaria outcomes by experimental breeding treatment (columns are treatment groups)
Outcome	Control	Removed	Total
Malaria	7	15	22
No malaria	28	15	43
Total	35	30	65
Source: Oppliger et al. (1996)
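The contingency table can be entered in R and converted to proportions, which is what the comparison between treatments rests on. This is a base R sketch with counts taken from the table above.

```r
# Counts from the table above; matrix() fills column by column
malaria <- as.table(matrix(
  c(7, 28,    # Control: malaria, no malaria
    15, 15),  # Removed: malaria, no malaria
  nrow = 2,
  dimnames = list(
    outcome   = c("Malaria", "No malaria"),
    treatment = c("Control", "Removed")
  )
))

# Add row and column totals
with_totals <- addmargins(malaria)

# Proportions within each treatment group (each column sums to 1)
col_props <- prop.table(malaria, margin = 2)

with_totals
col_props
```

Within-treatment proportions show malaria in 20% of control females (7 of 35) and 50% of egg-removal females (15 of 30).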

Grouped bar graphs compare category frequencies across groups

Bars show counts or proportions for each category
Categories are displayed side-by-side within groups
Used for two categorical variables
Best for comparing differences between groups within categories
E.g. compare outcomes within treatments
Use proportions when group sizes differ

Grouped bar graph showing counts of malaria outcomes for female great tits in control nests versus nests where eggs were removed to increase breeding effort (Oppliger et al. 1996).
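A grouped bar graph of these counts might be built as follows. This is a sketch; the data frame `birds` simply restates the contingency table in long form.

```r
library(ggplot2)

# Long-form version of the malaria contingency table
birds <- data.frame(
  treatment = rep(c("Control", "Removed"), each = 2),
  outcome   = rep(c("Malaria", "No malaria"), times = 2),
  n         = c(7, 28, 15, 15)
)

# position = "dodge" places the outcome bars side by side within each treatment
p_grouped <- ggplot(birds, aes(x = treatment, y = n, fill = outcome)) +
  geom_col(position = "dodge") +
  labs(x = "Treatment", y = "Number of birds", fill = "Outcome")

p_grouped
```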

Stacked bar graphs show composition within groups

Bars represent groups; segments represent categories
Emphasize within-group composition
Prefer when grouped bars become cluttered
Use proportional stacks when group sizes differ
Less effective for comparing categories across groups

Stacked bar graph showing counts of malaria outcomes within each treatment group (control vs egg removal) for female great tits (Oppliger et al. 1996).
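The same counts can be stacked, either as raw counts or as proportions. This sketch assumes the long-form data frame is rebuilt here so the example is self-contained.

```r
library(ggplot2)

birds <- data.frame(
  treatment = rep(c("Control", "Removed"), each = 2),
  outcome   = rep(c("Malaria", "No malaria"), times = 2),
  n         = c(7, 28, 15, 15)
)

# Stacked counts: bar height is the group size
p_stack <- ggplot(birds, aes(x = treatment, y = n, fill = outcome)) +
  geom_col(position = "stack") +
  labs(x = "Treatment", y = "Number of birds", fill = "Outcome")

# Proportional stacks: every bar is rescaled to 1, which is the fairer
# comparison here because the groups differ in size (35 vs 30)
p_fill <- ggplot(birds, aes(x = treatment, y = n, fill = outcome)) +
  geom_col(position = "fill") +
  labs(x = "Treatment", y = "Proportion of birds", fill = "Outcome")

p_fill
```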

Mosaic plots show association between two categorical variables

Alternative to grouped and stacked bar graphs
Best when the goal is to assess association, not exact counts
Width shows group size; height shows category proportions
Makes differences in proportions and imbalances in group size explicit
Especially useful before or alongside chi-square tests

Mosaic plot showing the association between treatment (control vs egg removal) and malaria outcome in female great tits; tile areas are proportional to counts, with widths showing group sizes and heights showing outcome proportions (Oppliger et al. 1996).
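Base R can draw a mosaic plot from a two-way table. In this sketch treatment is the first dimension so that it controls tile widths; the title text is a placeholder.

```r
# Treatment in rows, outcome in columns; matrix() fills column by column
malaria_tab <- as.table(matrix(
  c(7, 15,    # Malaria: control, removed
    28, 15),  # No malaria: control, removed
  nrow = 2,
  dimnames = list(
    treatment = c("Control", "Removed"),
    outcome   = c("Malaria", "No malaria")
  )
))

# The first dimension sets tile widths (group sizes);
# the second sets tile heights (outcome proportions within each group)
mosaicplot(malaria_tab, main = "Malaria by breeding treatment",
           xlab = "Treatment", ylab = "Outcome")
```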

Visualizing one categorical and one numerical variable

Goal: compare the distribution of a numerical variable across categories
Multiple histograms:
- Show full distribution shape within each category
- Best for moderate to large sample sizes
- Use consistent bin widths for fair comparison
Strip charts:
- Show individual observations directly
- Best for small to moderate sample sizes
- Reveal clustering, overlap, and outliers clearly
Choice depends on sample size and whether you want to emphasize shape or individual values

Compare the distribution of a numerical variable across categories

Multiple Histograms

Shows full distributions
Use consistent bin widths for fair comparison

Strip Chart

Shows individual observations
Reveals clustering, overlap, outliers

Violin Plot

Shows probability densities
Good for comparing shape (needs large sample size)
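The three options can be sketched with penguin body mass by species. The bin width and jitter settings are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

mass <- subset(penguins, !is.na(body_mass_g))

# Multiple histograms: one panel per species, same bin width in every panel
p_hist <- ggplot(mass, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 250) +
  facet_wrap(~ species, ncol = 1)

# Strip chart: every observation is a point; jitter only sideways so the
# plotted body masses are not altered
p_strip <- ggplot(mass, aes(x = species, y = body_mass_g)) +
  geom_jitter(width = 0.15, height = 0, alpha = 0.5)

# Violin plot: mirrored density estimate for each species
p_violin <- ggplot(mass, aes(x = species, y = body_mass_g)) +
  geom_violin()

p_hist
p_strip
p_violin
```

Stacking the histogram panels in one column keeps the x-axes aligned, which makes shifts in location easy to see.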

Scatterplots visualize relationships between two numerical variables

Goal: assess how one numerical variable changes with another
Show individual observations as points
Patterns reveal direction, strength, and form of association
Useful for identifying trends, clusters, and outliers
Foundation for correlation, regression, and modeling

Scatterplot showing the relationship between flipper length and body mass in penguins, illustrating how two numerical variables vary together and revealing overall trends, spread, and potential outliers.
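A scatterplot of this kind, with a correlation coefficient as a first numerical summary, might look like this. It is a sketch; Pearson correlation is used here as one possible measure of strength.

```r
library(ggplot2)
library(palmerpenguins)

measured <- subset(penguins, !is.na(flipper_length_mm) & !is.na(body_mass_g))

# One point per penguin; transparency helps where points overlap
p_scatter <- ggplot(measured, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.6) +
  labs(x = "Flipper length (mm)", y = "Body mass (g)")

# Pearson correlation: direction and strength of the linear association
r <- cor(measured$flipper_length_mm, measured$body_mass_g)

p_scatter
```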

Time series graphs show change in a variable over time

Time on x-axis (ordinal or numerical variable)
Points = observations at specific times
Lines emphasize continuity and temporal trends
Useful for detecting trends, cycles, and sudden changes
Order matters: nearby points are often related

Annual reported West Nile virus hospitalizations by case type, shown as stacked counts by year. Neuroinvasive and non-neuroinvasive cases are displayed separately to illustrate changes in both total burden and case composition over time. Source: Arbonet.

Annotations can help tell your story in a time series graph
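A time series graph with a simple annotation can be sketched as below. The West Nile data are not included here, so this example uses R's built-in `lynx` series (annual lynx trappings in Canada, 1821 to 1934) as a stand-in.

```r
library(ggplot2)

# Convert the built-in ts object to a data frame; a stand-in dataset,
# not the West Nile data from the figure
lynx_df <- data.frame(
  year    = as.numeric(time(lynx)),
  trapped = as.numeric(lynx)
)

# Find the year with the highest count so it can be labelled
peak <- lynx_df[which.max(lynx_df$trapped), ]

p_ts <- ggplot(lynx_df, aes(x = year, y = trapped)) +
  geom_line() +           # lines emphasize continuity between years
  geom_point(size = 1) +  # points mark the actual observations
  annotate("text", x = peak$year, y = peak$trapped,
           label = paste("Peak:", peak$year), vjust = -0.5) +
  labs(x = "Year", y = "Lynx trapped")

p_ts
```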

Maps show how a variable varies over space

Location is part of the data, not just context
Values are encoded by color, size, or symbols
Useful for identifying spatial patterns and gradients
Reveal clustering, hotspots, and gaps
Interpretation depends on scale and map design

Example: choropleth maps use color shading to represent values aggregated within geographic areas.

Visualizing three or more variables

One variable mapped to position (x or y); others mapped to color, size, or shape
Faceting splits data into small multiples for clearer comparisons
Interaction between variables becomes the focus, not individual effects
Choose encodings intentionally to avoid clutter and misinterpretation
Add variables only if they clarify the question being asked

Example: Scatterplot (two numerical variables) with colors (categorical variable)

Example: Scatterplot small multiples comparing penguin body mass and flipper length across species and sex

Scatterplots showing the relationship between flipper length and body mass in penguins, faceted by sex (rows) and species (columns), illustrating how this relationship varies across biological groups.
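Both approaches to a third or fourth variable, color and faceting, can be sketched with the penguin data. This is not necessarily the code behind the figures above.

```r
library(ggplot2)
library(palmerpenguins)

# Keep only penguins with known sex and both measurements
complete <- subset(penguins,
                   !is.na(sex) & !is.na(flipper_length_mm) & !is.na(body_mass_g))

# Third variable as color: species on top of two numerical variables
p_color <- ggplot(complete, aes(x = flipper_length_mm, y = body_mass_g,
                                color = species)) +
  geom_point(alpha = 0.7)

# Third and fourth variables as small multiples: sex in rows, species in columns
p_facets <- ggplot(complete, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.7) +
  facet_grid(sex ~ species)

p_color
p_facets
```

By default `facet_grid()` shares axis scales across panels, which keeps the panels directly comparable.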
