Data Readiness Check

Overview

Before your team starts the full EDA project, you must complete a Data Readiness Check to confirm:

  • Your team project is set up correctly (everyone can run it)
  • The dataset loads without errors
  • Basic structure, variable types, and missingness are understood
  • You can produce and save two simple plots (one numeric distribution, one categorical distribution)
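
One way to check the missingness mentioned above is to count `NA` values in each column. The sketch below is illustrative only: it uses R's built-in `airquality` table as a stand-in for your team's dataset, and the names `air` and `missing_counts` are made up for this example.

```r
library(tidyverse)

# Stand-in data: R's built-in airquality table (replace with your own dataset)
air <- as_tibble(airquality)

# Count missing values in every column
missing_counts <- air |>
  summarise(across(everything(), ~ sum(is.na(.x))))

print(missing_counts)
```

Columns with a nonzero count are the ones to mention when your team describes missingness.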

Deliverable (submit on D2L):

The PDF created by your data-readiness-check.qmd file.

Prerequisite: Team + dataset selection. See: Team Formation and Dataset Selection (link below).


Step 1 — Confirm your team and dataset

  1. Confirm all members can access the team project workspace in Posit Cloud.

  2. Add the dataset to your project in a clear location:

    • Recommended: data/

    You can skip this step if your dataset is accessed through an R package or read directly from the web.
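
To confirm the file landed where you expect, a quick check from the R console can help. This is a minimal sketch; the file name `my-dataset.csv` is hypothetical and should be replaced with your team's actual file.

```r
# Hypothetical file name; replace with your team's actual data file
data_file <- file.path("data", "my-dataset.csv")

if (file.exists(data_file)) {
  message("Found: ", data_file)
} else {
  message("Not found: ", data_file)
  # List what the data/ folder currently contains
  print(list.files("data"))
}
```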

Step 2 — Create data-readiness-check.R

Create a new script named:

  • data-readiness-check.R

This script should be only for the readiness check.

Script requirements

Your script must:

  1. Load required packages
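  2. Read the dataset using tidyverse functions (e.g. read_csv(), read_excel(), or read_rds()). Note that read_excel() comes from the readxl package, which library(tidyverse) does not load, so add library(readxl) if you use it.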
  3. Display the data using:
    • print() and
    • glimpse()
  4. Create two plots using ggplot2:
    • A histogram of one numeric variable
    • A bar chart of one categorical variable
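
The requirements above could be met by a script shaped roughly like the following. This is a sketch under stated assumptions, not a required layout: it reads `mtcars.csv`, an example file bundled with the readr package, as a stand-in for your team's dataset, and the names `cars`, `numeric_plot`, and `categorical_plot` are made up for illustration.

```r
# 1. Load required packages
library(tidyverse)

# 2. Read the dataset
# Assumption: readr's bundled example file stands in for your own file,
# which you would read with something like read_csv("data/your-file.csv")
cars <- read_csv(readr_example("mtcars.csv"), show_col_types = FALSE) |>
  mutate(cyl = factor(cyl))  # treat number of cylinders as categorical

# 3. Display the data
print(cars)
glimpse(cars)

# 4. Create two plots
# Histogram of one numeric variable
numeric_plot <- ggplot(cars, aes(x = mpg)) +
  geom_histogram(bins = 10)
print(numeric_plot)

# Bar chart of one categorical variable
categorical_plot <- ggplot(cars, aes(x = cyl)) +
  geom_bar()
print(categorical_plot)
```

Storing each plot in a variable and printing it explicitly is a choice made here so the plots also appear when the script is run with source(); typing the ggplot() call on its own works as well when you run the script line by line.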

Step 3 — Run the script and verify outputs

  1. Run data-readiness-check.R from a fresh R session.
  2. Confirm the script runs without errors.
  3. Confirm the console shows the data as expected.
  4. Confirm the two graphs are displayed in the Plots tab as expected.
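
The overview mentions saving plots as well as producing them. If your team wants saved copies, ggplot2's ggsave() is one option. The sketch below is hedged: it uses ggplot2's bundled `mpg` table as stand-in data, writes to a temporary folder so it runs anywhere, and the names `hwy_histogram` and `plot_file` are illustrative.

```r
library(tidyverse)

# Stand-in plot built from ggplot2's bundled mpg table
hwy_histogram <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 20)

# Assumption: the plot is saved to a temporary folder here; in a project you
# might use a folder such as "figures/" instead (hypothetical name)
plot_file <- file.path(tempdir(), "hwy-histogram.png")
ggsave(plot_file, plot = hwy_histogram, width = 6, height = 4)

# Confirm the file was written
file.exists(plot_file)
```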

Step 4 — Write a short readiness summary

  1. Create a Quarto Document
    1. Go to File > New File > Quarto document…
    2. Click the “Create Empty Document” button
  2. Install the rmarkdown package
    1. To create the PDF output, you will need the rmarkdown package. Click the yellow banner at the top of the new file that prompts you to install it.
  3. Make sure your QMD file is in Source mode, not Visual mode (buttons in top left of source pane).
  4. Copy and paste the QMD template below into your QMD file, then switch to Visual mode.
  5. Copy code from your data-readiness-check.R to your QMD file as necessary.
    1. Put library commands in the first R code chunk.
    2. Put code to read and clean the data in the second R code chunk.
    3. Put code to display the data (print, glimpse) in the third R code chunk.
    4. Put code to display the histogram and bar chart in the fourth and fifth R code chunks.
    5. Update the text between the R code chunks, e.g. by listing the packages and variables you will use.
  6. Click the “Render” button (blue arrow pointing right) to render the QMD file as a PDF document.
  7. Open the PDF (click the file name) to make sure it shows the desired output.

Step 5 - Submit on D2L

  1. Download the PDF document to your computer using the More > Export command in the Files tab (video example).
  2. Submit the PDF of your data readiness check on D2L.

QMD Template

---
title: "EDA Project Data Readiness Check"
author: "Team number and member names"
date: today
format: typst
execute:
  echo: true
  warning: false
  message: false
  error: false
  cache: false
  freeze: false
editor: visual
---

## Load packages

This project will use the following packages:

-   tidyverse - for data wrangling
-   list others here or delete this line.

```{r}

```

## Dataset

We plan to use the following dataset:

-   Dataset name:
-   Dataset source: (who created it)
-   Dataset link:
-   Dataset citation:

If the dataset was downloaded from Dryad, provide the Dryad link here:

## Reading the data

The following code reads the data:

```{r}

```

## Viewing the data

The following output shows that the data is a table with ___ rows and ___ columns.

```{r}

```

## Variable choice

We plan to use the following variables in our dataset:

-   `variable_1` - explain what it is
-   `variable_2` - explain what it is

## Variable distributions

Distribution of one numerical variable:

```{r}

```

Distribution of one categorical variable:

```{r}

```

QMD Example with Penguins Dataset

---
title: "EDA Project Data Readiness Check"
author: "Team 1: Chris Merkord"
date: today
format: typst
execute:
  echo: true
  warning: false
  message: false
  error: false
  cache: false
  freeze: false
editor: visual
---

## Load packages

This project will use the following packages:

-   tidyverse - for data wrangling
-   palmerpenguins - provides the data table

```{r}
library(tidyverse)
library(palmerpenguins)
```

## Dataset

We plan to use the following dataset:

-   Dataset name: Palmer Penguins

-   Dataset source: Data originally published in Gorman et al. (2014). R package published by Horst et al. (2020).

-   Dataset link: https://allisonhorst.github.io/palmerpenguins/

-   Dataset citations: 

    - Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081
    - Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.
    
## Reading and cleaning the data

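The data is included in the palmerpenguins package, which is already loaded above, so no reading step is needed. If we were instead reading it from a file like `penguins.csv` in a data folder, we would do so like this (the chunk is shown but not run, because that file is not part of this example project):

```{r}
#| eval: false
# Not run: data/penguins.csv is a hypothetical path used for illustration
penguins <- read_csv("data/penguins.csv")
```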

- The data were already in a clean format, with rows representing observations and columns representing variables.
- The variable names were left as is because they were already in a format suitable for coding (no spaces or capital letters).
- The variable types were left as is because they were already set appropriately. For example, species, island, and sex were already `fct` types. If they had been imported as `chr` types, we would have converted them to factors in the previous code chunk.
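
A minimal sketch of that conversion, assuming a small made-up table (`penguins_chr`, hypothetical) whose categorical columns arrived as `chr`. The fence below is a plain display block, so it is not executed when the document is rendered:

```r
library(tidyverse)

# Hypothetical table where the categorical columns arrived as character
penguins_chr <- tibble(
  species = c("Adelie", "Gentoo", "Chinstrap", "Adelie"),
  island = c("Torgersen", "Biscoe", "Dream", "Dream"),
  sex = c("male", "female", NA, "female")
)

# Convert the three character columns to factors
penguins_fct <- penguins_chr |>
  mutate(across(c(species, island, sex), as.factor))

glimpse(penguins_fct)
```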

## Viewing the data

The following output shows that the data is a table with **344** rows and **8** columns.

```{r}
print(penguins)
glimpse(penguins)
```

## Variable choice

We plan to use the following variables in our dataset:

-   `species` - the penguin species, one of Adelie, Chinstrap, or Gentoo
-   `bill_length_mm` - bill length in mm

## Variable distributions

Visualization of one numerical variable:

```{r}
# histogram of bill lengths
ggplot(penguins, aes(x = bill_length_mm)) +
  geom_histogram()
```

Visualization of one categorical variable:

```{r}
# bar chart of species counts
ggplot(penguins, aes(x = species)) +
  geom_bar()
```