Datasets

For the EDA project, you will work with a real dataset that you select early in the semester. Choosing an appropriate dataset is a critical first step, as it determines what kinds of questions you can explore and how much time you will have for analysis and interpretation. Guidance and options are provided below to help you identify a dataset that is suitable for this project.

How to Choose a Dataset

Not all datasets are appropriate for this project. Before committing to a dataset, make sure it meets the following criteria:

  • The data are in a rectangular format (rows are observations, columns are variables).
  • The dataset includes at least one numerical variable and at least one categorical or grouping variable, or a clear way to define groups.
  • There are enough observations to support meaningful plots and comparisons (very small datasets are usually not appropriate).
  • The variables are clearly documented so you know what each column represents and what the units are.
  • The data can be downloaded, imported into R, and shared within your Posit Cloud project.
  • The dataset supports questions about patterns, differences, or relationships, not just single summary values.
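Several of the criteria above can be checked quickly once a candidate dataset is loaded into R. The following is a minimal sketch, not a required workflow: the helper name check_dataset() and the 30-row threshold are assumptions made for illustration, and the built-in iris data stands in for your own file.

```r
# Hypothetical helper: summarize how a data frame measures up against the
# checklist. The function name and the default of 30 rows are illustrative
# assumptions, not course requirements.
check_dataset <- function(df, min_rows = 30) {
  stopifnot(is.data.frame(df))  # rectangular: rows = observations, columns = variables

  is_num <- vapply(df, is.numeric, logical(1))
  is_grp <- vapply(
    df,
    function(x) is.factor(x) || is.character(x) || is.logical(x),
    logical(1)
  )

  list(
    n_rows             = nrow(df),
    enough_rows        = nrow(df) >= min_rows,
    numeric_vars       = names(df)[is_num],
    grouping_vars      = names(df)[is_grp],
    has_numeric        = any(is_num),
    has_grouping       = any(is_grp),
    missing_per_column = colSums(is.na(df))
  )
}

# Example with a dataset that ships with R
result <- check_dataset(iris)
str(result)
```

With your own data, you would first read the file, for example with read.csv() on a file in your Posit Cloud project, and pass the resulting data frame to the helper. A check like this does not replace reading the documentation; it only tells you what R sees in each column.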

If you are unsure whether a dataset is appropriate, ask early. Waiting too long to confirm dataset suitability is the most common reason projects fall behind.

Your own dataset

If you are involved in an ongoing research project (for example, working with a faculty member at MSUM), you may be able to use data from that project for your BIOL 275 EDA project.

To use your own dataset, the following conditions must be met:

  • The dataset must be approved by the instructor before you begin analysis.

  • The data must be complete enough to support exploratory analysis and basic statistical methods used in this course.

  • The dataset must be shareable with your project team and usable within a shared Posit Cloud project.

  • The data must not include restricted, sensitive, or private information unless explicit permission has been granted.

  • You must be able to clearly explain how the data were collected and what the variables represent.

Instructor approval is based on feasibility and suitability for the course project, not on the perceived “importance” of the dataset.

Biological trait data

Species occurrence data

  • eBird data on bird observations. This is a huge dataset with many possible questions to explore. Challenge level: Difficult

  • The Botanical Information and Ecology Network (BIEN) brings together data on plant distribution, abundance, and traits, with the goal of predicting and mitigating the effects of climate change on plant species and communities. You can download geolocated observations and trait data, but you would probably need to combine them with other Earth observation data, such as those listed below. Challenge level: Moderate

  • iNaturalist. Challenge level: Easy

  • GBIF (Global Biodiversity Information Facility). Geolocated occurrence data for all species worldwide, aggregated from many other data sources. Challenge level: Moderate

    • The rgbif package on GitHub. Read the introduction in the README for links to vignettes, reference documentation, articles, and a published paper.
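As a rough illustration of what working with rgbif can look like, the sketch below wraps the package's occ_search() function in a small helper. The helper name fetch_occurrences() and the example species are assumptions made for illustration. Running the example call requires the rgbif package and an internet connection, so it is left commented out.

```r
# Sketch only: fetch_occurrences() is a hypothetical helper, not part of rgbif.
# It needs the rgbif package installed and contacts the GBIF API when called.
fetch_occurrences <- function(species, n = 100) {
  if (!requireNamespace("rgbif", quietly = TRUE)) {
    stop("Install rgbif first: install.packages('rgbif')")
  }
  res <- rgbif::occ_search(
    scientificName = species,
    hasCoordinate  = TRUE,  # keep only georeferenced records
    limit          = n      # cap the number of records returned
  )
  res$data  # occurrence records as a data frame
}

# Example call (uncomment to run; requires internet access):
# monarchs <- fetch_occurrences("Danaus plexippus", n = 50)
# head(monarchs[, c("scientificName", "decimalLatitude", "decimalLongitude")])
```

Keeping the limit small while you explore makes downloads fast; the README and vignettes describe the options for larger requests.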

Environmental data

  • NEON. The National Science Foundation’s National Ecological Observatory Network (NEON) is a continental-scale observation facility operated by Battelle and designed to collect long-term open access ecological data to better understand how U.S. ecosystems are changing. The comprehensive data, spatial extent and remote sensing technology provided by NEON will enable a large and diverse user community to tackle new questions at scales not accessible to previous generations of ecologists. Challenge level: Difficult

    • Users can browse data products and associated documentation, then select time frames and field sites to download the data.

    • The neonUtilities R package allows you to access and download NEON data as well as to work with NEON data downloaded from the portal.

Public health datasets

Many datasets for the USA can be found at:

  • National Center for Health Statistics. Includes datasets, documentation, and questionnaires from NCHS data collection systems. Some of these are included in the table below, but there are many more than what is given here.

Quite a bit of health data may be downloaded at:

  • CDC WONDER. You choose the dataset, which variables to include, and download it.

| Dataset | Description | Spatial Coverage | Spatial Resolution | Temporal Coverage | Temporal Resolution |
|---|---|---|---|---|---|
| Behavioral Risk Factor Surveillance System (BRFSS) Prevalence Data | Prevalence data based on telephone surveys | USA | State | 2011-present | Yearly |
| County Health Rankings & Roadmaps | You can download data by state and year (see Minnesota, for example) | USA | State, County | | Yearly |
| KIDS COUNT | A source of data on children and families and a project of the Annie E. Casey Foundation. You choose and download the variables necessary to answer your question. | USA | State, County | | Yearly |
| National Comorbidity Survey (NCS) Series | Prevalence, risk factors, and consequences of psychiatric morbidity and comorbidity | USA | Individual | 1990-2004 | Baseline, reinterview, replication |

Other pages that provide lists of available datasets:

Geospatial data

AρρEEARS

The Application for Extracting and Exploring Analysis Ready Samples (AρρEEARS) offers a simple and efficient way to access and transform geospatial data from a variety of federal data archives.

AρρEEARS enables users to subset geospatial datasets using spatial, temporal, and band/layer parameters.

Two types of sample requests are available:

  • point samples for geographic coordinates and
  • area samples for spatial areas via vector polygons.

Sample requests submitted to AρρEEARS provide users not only with data values, but also associated quality data values. Interactive visualizations with summary statistics are provided for each sample within the application, which allow users to preview and interact with their samples before downloading their data. Visit the Help page to learn more.

There are handy videos on how to use the system to get data.

  • Some datasets include:

    • Land Surface Temperature (min, max, mean)
    • Sea Surface Temperature
    • Precipitation
    • Snow cover
    • Land cover
    • Soil moisture, soil temperature
    • Vegetation indices (e.g. NDVI)
    • Gridded population data
  • The temporal and spatial range and resolution of these datasets vary.

You could explore geospatial data on their own or, if you have GPS coordinates for other types of data (e.g., georeferenced specimen or observation data), you could use AρρEEARS to extract environmental data associated with those points.

Other geospatial data

Cellular and molecular biology and biochemistry

  • The Actinobacteriophage Database at PhagesDB.org is a website that collects and shares data, pictures, protocols, and analysis tools associated with the discovery, sequencing, and characterization of mycobacteriophages (viruses that infect Mycobacteria) and of phages that infect other bacterial hosts in the phylum Actinobacteria. It was developed at, and is maintained from, the Pittsburgh Bacteriophage Institute, a joint venture of Dr. Graham Hatfull and Dr. Roger Hendrix, both of the Department of Biological Sciences at the University of Pittsburgh.

Online data repositories

Datasets in R

Many R packages include example datasets, and base R itself ships with a package called datasets.
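To see what is available, you can list and load bundled datasets directly from the R console. The sketch below uses only base R. ToothGrowth is chosen here simply as one example with a numerical variable and a grouping variable; it is not a recommendation for the project.

```r
# List the datasets bundled with the base 'datasets' package
available <- data(package = "datasets")$results
head(available[, c("Item", "Title")])

# Load one dataset and take a first look at its structure
data("ToothGrowth", package = "datasets")
str(ToothGrowth)
summary(ToothGrowth)

# Compare a numerical variable across groups:
# mean tooth length for each supplement type
group_means <- aggregate(len ~ supp, data = ToothGrowth, FUN = mean)
print(group_means)
```

To browse the datasets in another installed package, replace "datasets" with that package's name in the call to data(). Each bundled dataset also has a help page (for example, ?ToothGrowth) that documents the variables and their units.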