Data in R

Module I · Day 1 · 3 hours

Christian González Martel

christian.gonzalez@ulpgc.es

Department of Quantitative Methods in Economics and Management · ULPGC

Juan M. Hernández Guerra

juan.hernandez@ulpgc.es

Department of Quantitative Methods in Economics and Management · ULPGC

April 23, 2026

Outline

R, RStudio, Quarto, tidyverse — what each is for.
R as a calculator · expressions, assignment, functions.
Atomic types · numeric, character, logical, factor, date.
Vectors · creation, arithmetic, subsetting.
Data frames and tibbles · the rectangular data model.
Importing tourism data · CSV, Excel, SPSS, Eurostat API.

1 · R, RStudio, Quarto, tidyverse

Four pieces, one stack

Piece	What it is	You write…
R	The language and engine.	Code that computes.
RStudio / Positron	The IDE where you work.	Nothing R-specific — it’s the editor.
Quarto	The publishing system.	Reports, slides, web pages, books.
tidyverse	A family of R packages with consistent design.	`library(tidyverse)` at the top of every script.

Install R from https://cloud.r-project.org, RStudio from https://posit.co/download/rstudio-desktop/.

The RStudio layout

┌───────────────────────────────┬──────────────────────────┐
│                               │                          │
│   Editor                      │   Environment / History  │
│   (your .R or .qmd files)     │   (objects currently     │
│                               │    in memory)            │
│                               │                          │
├───────────────────────────────┼──────────────────────────┤
│                               │                          │
│   Console                     │   Files / Plots /        │
│   (live R session)            │   Packages / Help        │
│                               │                          │
└───────────────────────────────┴──────────────────────────┘

Write in the editor, send a line with Ctrl + Enter (Cmd + Enter on Mac).
The console shows the output.
Files pane is your file browser; Git pane appears when the folder is a repo.

Projects · always work inside an `.Rproj`

Double-click quantitative-methods-master-tides.Rproj to open the project.
Working directory is fixed to the project root → paths are portable.
History and workspace (.Rhistory, .RData) are kept per project; installed packages are shared across projects unless you add a tool such as renv.

Tip

Never run setwd("C:/Users/yourname/..."). If you see that in any tutorial, close the tab.

Packages · extending R

Base R is a small kernel. Everything useful (tidyverse, plotting, model fitting, reading Excel, hitting the Eurostat API) ships as a package that you install separately from CRAN, the canonical package repository.

# install once per laptop
install.packages("tidyverse")
install.packages(c("readxl", "haven", "here", "eurostat"))

# load at the top of every script that uses them
library(tidyverse)
library(here)

Tip

install.packages() downloads the package (and compiles it when no pre-built binary is available, typically on Linux): slow, but only once. library() is fast and goes at the top of every script. A single library(tidyverse) attaches the core packages (dplyr, ggplot2, tibble, readr, tidyr, purrr, stringr, forcats and, since tidyverse 2.0.0, lubridate), the workhorses for the rest of the course.
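The install-once, load-every-time split can be made explicit in a setup script. The sketch below is one possible pattern, not something shipped with the course repository: `missing_packages()` is a hypothetical helper that uses base R's `requireNamespace()` to report which packages still need installing. The package names in `needed` are placeholders.

```r
# Hypothetical helper: which of these packages are not installed yet?
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed     <- c("stats", "utils")   # placeholder: put the course packages here
to_install <- missing_packages(needed)

# Install only what is missing, so re-running the script is cheap
if (length(to_install) > 0) install.packages(to_install)
```

Because the check runs before `install.packages()`, the script can sit at the top of a project without triggering a download every time.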

2 · R as a calculator

First expressions

2 + 2

[1] 4

17 %% 5        # modulo

[1] 2

2 ^ 10         # power

[1] 1024

sqrt(81)

[1] 9

log(exp(1))    # natural log of e

[1] 1

R evaluates each line top-to-bottom and prints the result. Nothing is stored unless you assign it to a name.

Assignment

visitors <- 1250          # store a value in a name
nights   <- 4.2
spending <- 85

total <- visitors * nights * spending
total

[1] 446250

Use <- for assignment (Alt + - in RStudio inserts it). Names can contain letters, digits, _ and ., and must start with a letter (or with a . that is not followed by a digit). Names are case-sensitive.
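A short illustration of the naming rules. The variable names are made up for this example; `make.names()` is the base R function that shows how R would repair an invalid name.

```r
visitors_2024 <- 1250        # valid: starts with a letter
avg.nights    <- 4.2         # valid: dots are allowed

# 2024_visitors <- 1250      # syntax error: a name cannot start with a digit

# make.names() shows what R turns an invalid name into
make.names("2024_visitors")
# [1] "X2024_visitors"

# Case matters: Visitors_2024 was never created
exists("Visitors_2024")
# [1] FALSE
```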

Functions

Functions take inputs and return an output:

round(4.2, digits = 0)

[1] 4

mean(c(2, 5, 7, 10, 14))

[1] 7.6

seq(from = 2024, to = 2026, by = 1)

[1] 2024 2025 2026
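You can also write your own functions. As a minimal sketch, the hypothetical `total_spend()` below wraps the visitors × nights × spending calculation from the Assignment slide; the default value for `spending` is an assumption made for the example.

```r
# Hypothetical helper: total tourist spending
total_spend <- function(visitors, nights, spending = 85) {
  visitors * nights * spending
}

total_spend(1250, 4.2)                      # arguments matched by position
total_spend(nights = 4.2, visitors = 1250)  # matched by name, any order
total_spend(1250, 4.2, spending = 100)      # override the default
```

The first two calls return the same value as `total` on the Assignment slide; naming arguments makes the call order-independent and easier to read.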

Tip

Type mean( and press F1 in RStudio to open the help page for the function. ?mean does the same from the console.

Getting help

?mean              # help for one function
??"linear model"   # search across packages
example(lm)        # run the documented examples

Sites you will use a lot:

https://stat.ethz.ch/R-manual/ — canonical reference.
https://r4ds.hadley.nz/ — R for Data Science, 2nd edition (free online).
Stack Overflow, tagged [r].

3 · Atomic types

The five types you meet every day

typeof(3.14)              # "double"   — real numbers

[1] "double"

typeof(1L)                # "integer"  — note the L suffix

[1] "integer"

typeof("Gran Canaria")    # "character"

[1] "character"

typeof(TRUE)              # "logical"

[1] "logical"

class(factor(c("a","b","a")))   # "factor"

[1] "factor"

class(Sys.Date())         # "Date"

[1] "Date"

We lump numeric (double + integer) together in practice.

Numeric

price    <- 82.50            # double
stars    <- 4L               # integer
is.numeric(price)

[1] TRUE

is.numeric(stars)

[1] TRUE

# be careful with floating-point comparisons
0.1 + 0.2 == 0.3             # FALSE!

[1] FALSE

all.equal(0.1 + 0.2, 0.3)    # TRUE

[1] TRUE
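One detail worth knowing before using `all.equal()` inside `if ()`: when the values differ it returns a character description rather than `FALSE`. A small sketch of the usual workaround, with an arbitrary tolerance chosen for the hand-rolled alternative:

```r
x <- 0.1 + 0.2

# Wrap all.equal() in isTRUE() to get a plain TRUE/FALSE
isTRUE(all.equal(x, 0.3))     # TRUE
isTRUE(all.equal(x, 0.31))    # FALSE

# Hand-rolled alternative with an explicit tolerance (1e-8 is arbitrary)
abs(x - 0.3) < 1e-8           # TRUE
```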

Character

island <- "Lanzarote"
nchar(island)

[1] 9

toupper(island)

[1] "LANZAROTE"

# paste strings together
paste("Hotel", island, sep = " · ")

[1] "Hotel · Lanzarote"

# case matters
"Lanzarote" == "lanzarote"

[1] FALSE

Logical

is_open <- TRUE
T == TRUE                # TRUE — but don't use T, it can be overwritten

[1] TRUE

sum(c(TRUE, FALSE, TRUE, TRUE))   # 3 — logicals coerce to 0/1

[1] 3

Operators return logicals:

82.50 > 70

[1] TRUE

"Lanzarote" %in% c("Tenerife", "Lanzarote", "La Palma")

[1] TRUE

Factor · categories with known levels

island <- factor(
  c("Gran Canaria", "Tenerife", "Gran Canaria", "Lanzarote"),
  levels = c("Gran Canaria", "Tenerife", "Lanzarote",
             "Fuerteventura", "La Palma", "La Gomera", "El Hierro")
)

levels(island)

[1] "Gran Canaria"  "Tenerife"      "Lanzarote"     "Fuerteventura"
[5] "La Palma"      "La Gomera"     "El Hierro"

table(island)

island
 Gran Canaria      Tenerife     Lanzarote Fuerteventura      La Palma 
            2             1             1             0             0 
    La Gomera     El Hierro 
            0             0

Factors matter when the order of categories is meaningful (e.g. low, medium, high) or when all levels should appear in a table even if some have zero observations.
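For the first case (meaningful order), an ordered factor makes comparisons and sorting follow the level order instead of the alphabet. The `satisfaction` data below are invented for illustration.

```r
satisfaction <- factor(
  c("medium", "high", "low", "medium"),
  levels  = c("low", "medium", "high"),
  ordered = TRUE
)

satisfaction >= "medium"   # comparisons follow the level order
# [1]  TRUE  TRUE FALSE  TRUE

sort(satisfaction)         # low, medium, medium, high
table(satisfaction)        # columns in level order, not alphabetical
```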

Date and time

today <- Sys.Date()
today

[1] "2026-04-29"

class(today)

[1] "Date"

# arithmetic on dates
today + 30

[1] "2026-05-29"

as.Date("2026-06-08") - today     # days to the first exam

Time difference of 40 days

format(today, "%A, %d %B %Y")     # pretty-print

[1] "Wednesday, 29 April 2026"

Use the lubridate package (inside the tidyverse) for anything more serious — we’ll cover it in Data wrangling.
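Base R is enough for the common case of dates that arrive as day/month/year text, as in many Spanish sources. A minimal sketch with made-up dates, fixed so that the output is reproducible:

```r
# Parse a non-ISO date by describing its layout
arrival <- as.Date("29/04/2026", format = "%d/%m/%Y")
arrival
# [1] "2026-04-29"

departure <- arrival + 7
as.numeric(departure - arrival)   # length of stay as a plain number: 7

format(arrival, "%Y-%m")          # "2026-04", handy as a month key

seq(arrival, by = "month", length.out = 3)
# [1] "2026-04-29" "2026-05-29" "2026-06-29"
```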

Coercion gotchas

c(1, 2, "three")      # everything becomes character

[1] "1"     "2"     "three"

c(1, 2, TRUE)         # TRUE becomes 1 — all numeric

[1] 1 2 1

c(1L, 2L, 3.14)       # integer → double

[1] 1.00 2.00 3.14

R’s rule: when mixing types in one vector, coerce up to the most flexible type (character > double > integer > logical).
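Going down the ladder has to be done explicitly with the `as.*()` functions, and values that cannot be converted become `NA`. A short sketch with invented input:

```r
raw  <- c("1", "2", "three")
nums <- suppressWarnings(as.numeric(raw))  # without suppressWarnings(): a warning
nums
# [1]  1  2 NA

raw[is.na(nums)]          # which entries failed to convert
# [1] "three"

as.integer(TRUE)          # 1
as.logical(c("TRUE", "false", "yes"))
# [1]  TRUE FALSE    NA
```

Checking `is.na()` after a conversion is a cheap way to catch dirty values in imported columns.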

4 · Vectors

One type per vector

An atomic vector is homogeneous: every element must be of the same type — all double, or all integer, or all character, or all logical, etc.

typeof(c(1.5, 2.7, 3.0))          # "double"

[1] "double"

typeof(c(1L, 2L, 3L))             # "integer"

[1] "integer"

typeof(c("GC", "TF", "LZ"))       # "character"

[1] "character"

typeof(c(TRUE, FALSE, TRUE))      # "logical"

[1] "logical"

Important

Mixing types is not forbidden: R silently coerces everything to one common type (see Coercion gotchas above), which is almost never what you want. If you need several types in one object, use a data frame, with one column per type.

Creating vectors

# combine — the main way to build a vector
nights  <- c(2, 5, 7, 10, 14)
islands <- c("Gran Canaria", "Tenerife", "Lanzarote",
             "Fuerteventura", "La Palma")

# sequences
1:5

[1] 1 2 3 4 5

seq(from = 0, to = 1, by = 0.25)

[1] 0.00 0.25 0.50 0.75 1.00

rep("GC", times = 3)

[1] "GC" "GC" "GC"

length(nights)

[1] 5

Vectorised arithmetic

price   <- c(82, 95, 110, 100, 78)
revenue <- nights * price                 # element-wise — no loops
revenue

[1]  164  475  770 1000 1092

mean(revenue)

[1] 700.2

sum(revenue)

[1] 3501

range(revenue)       # min and max

[1]  164 1092

This is the point of R. Whenever you are tempted to write a for loop, ask yourself whether vectorisation solves it.
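To make the comparison concrete, here is the same revenue calculation written both ways. The loop is only there for contrast.

```r
nights <- c(2, 5, 7, 10, 14)
price  <- c(82, 95, 110, 100, 78)

# Loop version: pre-allocate, then fill element by element
revenue_loop <- numeric(length(nights))
for (i in seq_along(nights)) {
  revenue_loop[i] <- nights[i] * price[i]
}

# Vectorised version: one line, same result
revenue_vec <- nights * price

identical(revenue_loop, revenue_vec)
# [1] TRUE
```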

Subsetting vectors

nights

[1]  2  5  7 10 14

nights[1]             # first element — R is 1-indexed

[1] 2

nights[c(1, 3, 5)]    # several positions

[1]  2  7 14

nights[-1]            # everything EXCEPT the first

[1]  5  7 10 14

nights[nights > 5]    # logical subsetting — most useful

[1]  7 10 14

The logical form (x[condition]) is the one you will use 95 % of the time.
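Conditions can be combined with `&` (and) and `|` (or), and because logicals coerce to 0/1 the same condition also counts matches. A small sketch reusing the `nights` vector:

```r
nights <- c(2, 5, 7, 10, 14)

nights[nights > 5 & nights < 14]     # both conditions
# [1]  7 10

nights[nights <= 2 | nights >= 14]   # either condition
# [1]  2 14

which(nights > 5)                    # positions instead of values
# [1] 3 4 5

sum(nights > 5)                      # how many stays exceed 5 nights: 3
mean(nights > 5)                     # share of such stays: 0.6
```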

Named vectors

price <- c(
  "Gran Canaria"  = 82,
  "Tenerife"      = 95,
  "Lanzarote"     = 110,
  "Fuerteventura" = 100,
  "La Palma"      = 78
)

price["Tenerife"]

Tenerife 
      95

price[price > 90]

     Tenerife     Lanzarote Fuerteventura 
           95           110           100

Named vectors are a stepping stone to data frames.
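One way to see the connection: the names become one column and the values another. The sketch uses base R's `data.frame()` and a shortened version of the `price` vector; the object name `price_df` is made up.

```r
price <- c("Gran Canaria" = 82, "Tenerife" = 95, "Lanzarote" = 110)

# Names go into one column, values into another
price_df <- data.frame(
  island = names(price),
  price  = unname(price)
)

dim(price_df)                                   # 3 rows, 2 columns
price_df$price[price_df$island == "Tenerife"]   # 95
```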

5 · Data frames and tibbles

The rectangular data model

library(tibble)

hotels <- tibble(
  island = c("Gran Canaria", "Tenerife", "Lanzarote",
             "Fuerteventura", "La Palma"),
  stars  = c(4L, 5L, 4L, 3L, 3L),
  price  = c(82, 95, 110, 100, 78),
  nights = c(12.5, 18.3, 9.8, 11.2, 6.4)
)

hotels

# A tibble: 5 × 4
  island        stars price nights
  <chr>         <int> <dbl>  <dbl>
1 Gran Canaria      4    82   12.5
2 Tenerife          5    95   18.3
3 Lanzarote         4   110    9.8
4 Fuerteventura     3   100   11.2
5 La Palma          3    78    6.4

Rows = observations (one hotel per row).
Columns = variables (each has one type).
Every column has the same length.
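Under the hood a data frame is a list of equal-length vectors, which is where these three rules come from. A minimal check using a base `data.frame` (assuming R 4.0 or later, where strings are no longer converted to factors by default); `hotels_df` is a made-up two-row example.

```r
hotels_df <- data.frame(
  island = c("Gran Canaria", "Tenerife"),
  stars  = c(4L, 5L),
  price  = c(82, 95)
)

is.list(hotels_df)           # TRUE: each column is one list element
sapply(hotels_df, typeof)    # one type per column
lengths(hotels_df)           # every column has length 2

# Columns of different lengths are rejected
bad <- try(data.frame(island = c("a", "b", "c"), stars = c(4L, 5L)),
           silent = TRUE)
inherits(bad, "try-error")   # TRUE
```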

Inspecting a data frame

dim(hotels)      # rows, columns

[1] 5 4

nrow(hotels)

[1] 5

ncol(hotels)

[1] 4

names(hotels)

[1] "island" "stars"  "price"  "nights"

head(hotels)     # first 6 rows
tail(hotels)     # last 6 rows
glimpse(hotels)  # tidyverse — compact overview
summary(hotels)  # base R — stats per column
View(hotels)     # spreadsheet view (RStudio)

Selecting columns

hotels$price           # dollar notation — a single column as a vector

[1]  82  95 110 100  78

hotels[["price"]]      # equivalent, bracket form

[1]  82  95 110 100  78

hotels[, "price"]      # still a one-column tibble

# A tibble: 5 × 1
  price
  <dbl>
1    82
2    95
3   110
4   100
5    78

mean(hotels$price)

[1] 93
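The `[, "price"]` form is where tibbles and base data frames differ: a base `data.frame` drops a single column to a vector unless you ask it not to. A short sketch with a made-up base data frame:

```r
df <- data.frame(island = c("GC", "TF"), price = c(82, 95))

df[, "price"]                  # base data.frame: dropped to a vector
# [1] 82 95

df[, "price", drop = FALSE]    # keep it as a one-column data frame

is.numeric(df[["price"]])      # [[ ]] and $ return a vector for both kinds
# [1] TRUE
```

Using `$` or `[[ ]]` when you want a vector avoids depending on which kind of data frame you were handed.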

Adding and modifying columns

hotels$revenue <- hotels$price * hotels$nights
hotels

# A tibble: 5 × 5
  island        stars price nights revenue
  <chr>         <int> <dbl>  <dbl>   <dbl>
1 Gran Canaria      4    82   12.5   1025 
2 Tenerife          5    95   18.3   1738.
3 Lanzarote         4   110    9.8   1078 
4 Fuerteventura     3   100   11.2   1120 
5 La Palma          3    78    6.4    499.

Later, with dplyr, we’ll write this as:

library(dplyr)
hotels <- hotels |> mutate(revenue = price * nights)

Filtering rows

# base R
hotels[hotels$stars >= 4, ]

# A tibble: 3 × 5
  island       stars price nights revenue
  <chr>        <int> <dbl>  <dbl>   <dbl>
1 Gran Canaria     4    82   12.5   1025 
2 Tenerife         5    95   18.3   1738.
3 Lanzarote        4   110    9.8   1078

# dplyr preview
library(dplyr)
hotels |> filter(stars >= 4)

# A tibble: 3 × 5
  island       stars price nights revenue
  <chr>        <int> <dbl>  <dbl>   <dbl>
1 Gran Canaria     4    82   12.5   1025 
2 Tenerife         5    95   18.3   1738.
3 Lanzarote        4   110    9.8   1078

Pick one style and stick to it. We’ll standardise on dplyr from Data wrangling onwards.

6 · Importing tourism data

Get the data first

The repo does not ship CSV/XLSX files — they are gitignored and fetched on demand from the source APIs. After cloning, run this once from the R console at the project root:

install.packages(c("eurostat", "here", "fs", "readr", "dplyr"))
source("datasets/download.R")

The script writes Eurostat and ISTAC files into datasets/raw/ and a MANIFEST.md recording filenames, source URLs and download date.

Warning

If install.packages("eurostat") fails (CRAN occasionally archives it when a transitive dependency drops), use the R-universe binary build:

options(repos = c(
  ropengov = "https://ropengov.r-universe.dev",
  CRAN     = "https://cloud.r-project.org"
))
install.packages("eurostat")

Fallback if R-universe is unreachable: remotes::install_github("rOpenGov/eurostat") (slower, compiles).

The usual suspects

Format	Package	Function
CSV	`readr`	`read_csv()`
Excel (`.xlsx`, `.xls`)	`readxl`	`read_excel()`
SPSS (`.sav`)	`haven`	`read_sav()`
Stata (`.dta`)	`haven`	`read_dta()`
JSON	`jsonlite`	`read_json()`
Eurostat	`eurostat`	`get_eurostat()`

All of these are part of — or play nicely with — the tidyverse.

CSV with `readr::read_csv`

library(readr)
library(here)

occupancy <- read_csv(
  here("datasets", "raw", "istac-nights_by_island.csv"),
  locale = locale(decimal_mark = ",", grouping_mark = ".")
)

glimpse(occupancy)

Tip

Always go through here::here(). Your script will work on the lab machines, on your laptop, and on my laptop with zero changes.
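Files that use a decimal comma are often semicolon-delimited as well, and in that case `read_csv()` would not split the columns. The sketch below assumes such a layout (the layout of the ISTAC file itself is not confirmed here) and uses inline text via `I()`, which needs readr 2.0 or later, so it runs without any file on disk.

```r
library(readr)

# Inline stand-in for a semicolon-delimited file with decimal commas
raw <- I("island;nights\nTenerife;18,3\nLanzarote;9,8\n")

nights_by_island <- read_delim(
  raw,
  delim  = ";",
  locale = locale(decimal_mark = ",", grouping_mark = "."),
  show_col_types = FALSE
)

nights_by_island$nights   # parsed as numbers: 18.3 and 9.8
```

`read_csv2()` is a shortcut for the same delimiter and decimal mark.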

Excel with `readxl`

library(readxl)

capacity <- read_excel(
  here("datasets", "raw", "istac-hotel-capacity.xlsx"),
  sheet = "2024",
  skip  = 3           # skip the first three header rows
)

glimpse(capacity)

Excel files often have merged cells, titles and footers. Use skip and range = "B5:K120" to carve out the actual data table.
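To try `range` without the course files, the sketch below uses the demo workbook that ships with readxl (`readxl_example("datasets.xlsx")`); the cell range is chosen arbitrarily.

```r
library(readxl)

path <- readxl_example("datasets.xlsx")   # demo workbook bundled with readxl
excel_sheets(path)                        # list the sheet names first

# Read only the block C1:E4: one header row plus three data rows
block <- read_excel(path, sheet = "iris", range = "C1:E4")
dim(block)
# [1] 3 3
```

Listing the sheets before reading is a quick way to avoid guessing the `sheet` argument.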

SPSS and Stata with `haven`

library(haven)

survey <- read_sav(
  here("datasets", "raw", "tourism-survey.sav")
)

# haven preserves SPSS variable and value labels
attributes(survey$origin_country)

Use as_factor(survey) to convert labelled numeric variables into R factors in one go.
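What a labelled variable looks like, and what `as_factor()` does to it, can be seen without an SPSS file. The vector below is invented for illustration and built with `haven::labelled()`.

```r
library(haven)

# Numeric codes with SPSS-style value labels (made-up data)
origin <- labelled(
  c(1, 2, 1, 3),
  labels = c(Spain = 1, Germany = 2, UK = 3)
)

origin_f <- as_factor(origin)
levels(origin_f)
# [1] "Spain"   "Germany" "UK"

as.character(origin_f)
# [1] "Spain"   "Germany" "Spain"   "UK"
```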

Eurostat · live from the API

library(eurostat)
library(dplyr)

nights <- get_eurostat("tour_occ_nim", time_format = "date") |>
  filter(unit    == "NR",     # number of nights
         c_resid == "TOTAL",  # all residency statuses
         nace_r2 == "I551")   # hotels only

nights

The eurostat package caches downloads locally, so the slow step only happens the first time.

Reproducible paths with `here::here()`

library(here)

here()
# [1] "C:/Users/you/.../quantitative-methods-master-tides"

here("datasets", "raw", "eurostat-nights.csv")
# [1] "C:/Users/you/.../datasets/raw/eurostat-nights.csv"

here() always resolves relative to the project root (the folder with the .Rproj), regardless of which subfolder your .R or .qmd file lives in.

A first tourism pipeline

library(dplyr)
library(tibble)

hotels <- tribble(
  ~island,          ~month,  ~nights,  ~beds,
  "Gran Canaria",   "2024-06",  142500, 165000,
  "Gran Canaria",   "2024-07",  168300, 165000,
  "Tenerife",       "2024-06",  198200, 220000,
  "Tenerife",       "2024-07",  231100, 220000,
  "Lanzarote",      "2024-06",   88100, 105000,
  "Lanzarote",      "2024-07",   98400, 105000
)

hotels |>
  mutate(occupancy = nights / beds) |>
  group_by(island) |>
  summarise(mean_occupancy = mean(occupancy), .groups = "drop") |>
  arrange(desc(mean_occupancy))

# A tibble: 3 × 2
  island       mean_occupancy
  <chr>                 <dbl>
1 Tenerife              0.976
2 Gran Canaria          0.942
3 Lanzarote             0.888

This is essentially every report in tourism statistics: load · clean · group · summarise. Tomorrow we unpack each verb.

Recap

R objects climb a ladder: atomic vectors → named vectors → data frames.
Learn the five types (numeric, character, logical, factor, Date) and you understand 90 % of what R prints.
Arithmetic on vectors is element-wise — avoid loops.
tibble / data.frame is the 99 %-of-the-time workhorse.
Import with the package that matches the file: readr, readxl, haven, eurostat.
Always use here::here() for paths.

Next up

Short break, then one hour on Git & GitHub to set up your submission workflow.

After that, the Day 1 exercise: open exercises/day1/exercise-template.R, load one ISTAC CSV and one Eurostat dataset, and inspect them with glimpse(), summary() and head().
