Data wrangling with the tidyverse

Topic 1 · Day 2 · 4 hours

Christian González Martel

Department of Quantitative Methods in Economics and Management · ULPGC

Juan M. Hernández Guerra

Department of Quantitative Methods in Economics and Management · ULPGC

April 29, 2026

Branch before you edit

Yesterday a few of you committed straight to main. Easy to fix, but let’s get the muscle memory right today.

The two clicks, every day, before opening the exercise template:

  1. Pull · Git pane → blue down arrow (after Sync fork on GitHub).
  2. New Branch · Git pane → branch dropdown → New Branch… → name day-2 → tick Sync branch with remoteCreate.

Look at the branch selector at the top of the Git pane. If it does not say day-2, stop and fix it before you type anything in the template.

Tip

Mantra: Sync · Pull · Branch · then edit.

Outline

  • The pipe |> and the tidyverse grammar.
  • dplyr verbs: filter, select, mutate, arrange, summarise, group_by.
  • Reshaping with tidyr: pivot_longer, pivot_wider, separate, unite.
  • Joining tables: left_join, inner_join, anti_join.
  • Missing values and type coercion gotchas.

The pipe

hotels |>
  filter(nights > 7) |>
  mutate(revenue = nights * price) |>
  arrange(desc(revenue))

Read as: take hotels, keep long stays, compute revenue, sort descending.

The six dplyr verbs

Verb Purpose
filter() keep rows that match a condition
select() keep / drop columns
mutate() add or transform columns
arrange() sort rows
summarise() collapse to one row per group
group_by() set the grouping context

Group-wise summaries · Eurostat nights spent

library(dplyr)

nights |>
  filter(year == 2024, purpose == "total") |>
  group_by(country) |>
  summarise(
    total_nights = sum(value, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(total_nights))

Wide vs long · pivot_longer()

library(tidyr)

nights_wide <- nights |>
  pivot_wider(names_from = year, values_from = value)

nights_wide |>
  pivot_longer(cols = `2019`:`2024`,
               names_to = "year", values_to = "value")

Joins · occupancy + capacity

occupancy |>
  left_join(capacity, by = c("island", "year", "month")) |>
  mutate(occupancy_rate = nights / available_beds)

Know your joins:

  • left_join() — keep every row of the left table.
  • inner_join() — keep only matches.
  • anti_join() — rows from left without a match on the right.

NA handling

x <- c(2, NA, 7, 4)
mean(x)                # NA
[1] NA
mean(x, na.rm = TRUE)  # 4.33
[1] 4.333333
sum(is.na(x))          # count NAs
[1] 1

Recap

  • Pipe |> chains verbs, reads left-to-right.
  • group_by() |> summarise() is the bread and butter of reporting.
  • Always pivot when a column stores values, not variables.

Hands-on

Join an ISTAC occupancy file with an ISTAC capacity file, compute monthly occupancy rate by island, and produce a summary table for 2024.