Data in R

Module I · Day 1 · 3 hours

Christian González Martel

Department of Quantitative Methods in Economics and Management · ULPGC

Juan M. Hernández Guerra

Department of Quantitative Methods in Economics and Management · ULPGC

April 23, 2026

Outline

  1. R, RStudio, Quarto, tidyverse — what each is for.
  2. R as a calculator · expressions, assignment, functions.
  3. Atomic types · numeric, character, logical, factor, date.
  4. Vectors · creation, arithmetic, subsetting.
  5. Data frames and tibbles · the rectangular data model.
  6. Importing tourism data · CSV, Excel, SPSS, Eurostat API.

1 · R, RStudio, Quarto, tidyverse

Four pieces, one stack

Piece                What it is                                        You write…
R                    The language and engine.                          Code that computes.
RStudio / Positron   The IDE where you work.                           Nothing R-specific: it’s the editor.
Quarto               The publishing system.                            Reports, slides, web pages, books.
tidyverse            A family of R packages with a consistent design.  library(tidyverse) at the top of every script.

Install R from https://cloud.r-project.org, RStudio from https://posit.co/download/rstudio-desktop/.

The RStudio layout

┌───────────────────────────────┬──────────────────────────┐
│                               │                          │
│   Editor                      │   Environment / History  │
│   (your .R or .qmd files)     │   (objects currently     │
│                               │    in memory)            │
│                               │                          │
├───────────────────────────────┼──────────────────────────┤
│                               │                          │
│   Console                     │   Files / Plots /        │
│   (live R session)            │   Packages / Help        │
│                               │                          │
└───────────────────────────────┴──────────────────────────┘
  • Write in the editor, send a line with Ctrl + Enter (Cmd + Enter on Mac).
  • The console shows the output.
  • Files pane is your file browser; Git pane appears when the folder is a repo.

Projects · always work inside an .Rproj

  • Double-click quantitative-methods-master-tides.Rproj to open the project.
  • Working directory is fixed to the project root → paths are portable.
  • History and workspace are kept per project. Installed packages are still shared across projects unless you add a tool such as renv.

Tip

Never run setwd("C:/Users/yourname/..."). If you see that in any tutorial, close the tab.

Packages · extending R

Base R is a small kernel. Most of what we use in this course (the tidyverse, reading Excel files, querying the Eurostat API) ships as packages that you install separately, usually from CRAN, the Comprehensive R Archive Network and R's main package repository.

# install once per laptop
install.packages("tidyverse")
install.packages(c("readxl", "haven", "here", "eurostat"))

# load at the top of every script that uses them
library(tidyverse)
library(here)

Tip

install.packages() downloads the package (and compiles it when no pre-built binary is available), so it is slow, but you run it only once per machine. library() is fast and goes at the top of every script. A single library(tidyverse) attaches the core packages (dplyr, ggplot2, tibble, readr, tidyr, purrr, stringr, forcats and, since tidyverse 2.0.0, lubridate), the workhorses for the rest of the course.
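
The install-once, load-every-time pattern can be sketched as a small guard that installs only what is missing. The helper name ensure_installed is hypothetical (it is not part of any package); requireNamespace() and install.packages() are base R.

```r
# Hypothetical helper: install only the packages that are not yet available.
ensure_installed <- function(pkgs) {
  # requireNamespace() returns TRUE when the package can be loaded
  available  <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  to_install <- pkgs[!available]
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)   # report what had to be installed
}

# "stats" and "utils" ship with R, so nothing is installed here
needed <- ensure_installed(c("stats", "utils"))
length(needed)
```

In a course script you would pass the package names you actually need, then call library() as usual.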

2 · R as a calculator

First expressions

2 + 2
[1] 4
17 %% 5        # modulo
[1] 2
2 ^ 10         # power
[1] 1024
sqrt(81)
[1] 9
log(exp(1))    # natural log of e
[1] 1

R evaluates each line top to bottom and prints the result. Nothing is stored unless you assign it to a name.

Assignment

visitors <- 1250          # store a value in a name
nights   <- 4.2
spending <- 85

total <- visitors * nights * spending
total
[1] 446250

Use <- for assignment (Alt + - in RStudio inserts it; Option + - on Mac). Names can contain letters, digits, _ and ., and must start with a letter (or with a dot that is not followed by a digit). Names are case-sensitive.
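
One consequence of assignment worth seeing once: a name stores a value, not a formula. The sketch below reuses the numbers from the slide; the new value 2000 is made up for illustration.

```r
visitors <- 1250
nights   <- 4.2
spending <- 85
total    <- visitors * nights * spending   # computed once, stored as a number

visitors <- 2000    # changing an input later...
total               # ...does not update total: still 446250

total <- visitors * nights * spending      # re-run the line to refresh it
total
```

This is why scripts are run top to bottom: each line sees the values that exist at that moment.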

Functions

Functions take inputs and return an output:

round(4.2, digits = 0)
[1] 4
mean(c(2, 5, 7, 10, 14))
[1] 7.6
seq(from = 2024, to = 2026, by = 1)
[1] 2024 2025 2026
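
You can also write your own functions. A minimal sketch, assuming a made-up function total_spend with an illustrative default of 85 for spending (neither comes from a package):

```r
# Hypothetical function: total spending for a group of visitors
total_spend <- function(visitors, nights, spending = 85) {
  visitors * nights * spending    # the last expression is the return value
}

total_spend(1250, 4.2)                       # arguments matched by position
total_spend(nights = 4.2, visitors = 1250)   # or by name, in any order
total_spend(1250, 4.2, spending = 100)       # override the default
```

Naming arguments, as in round(4.2, digits = 0), makes calls easier to read and independent of argument order.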

Tip

Type mean( and press F1 in RStudio to open the help page for the function. ?mean does the same from the console.

Getting help

?mean              # help for one function
??"linear model"   # search across packages
example(lm)        # run the documented examples

Sites you will use a lot:

3 · Atomic types

The five types you meet every day

typeof(3.14)              # "double"   — real numbers
[1] "double"
typeof(1L)                # "integer"  — note the L suffix
[1] "integer"
typeof("Gran Canaria")    # "character"
[1] "character"
typeof(TRUE)              # "logical"
[1] "logical"
class(factor(c("a","b","a")))   # "factor"
[1] "factor"
class(Sys.Date())         # "Date"
[1] "Date"

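In practice we lump double and integer together as numeric. Factor and Date are not atomic types in the strict sense: they are classes built on top of integer and double vectors, which is why the last two lines call class() rather than typeof().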

Numeric

price    <- 82.50            # double
stars    <- 4L               # integer
is.numeric(price)
[1] TRUE
is.numeric(stars)
[1] TRUE
# be careful with floating-point comparisons
0.1 + 0.2 == 0.3             # FALSE!
[1] FALSE
all.equal(0.1 + 0.2, 0.3)    # TRUE
[1] TRUE
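
When the comparison feeds an if (), note that all.equal() returns either TRUE or a text description of the difference, so it is usually wrapped in isTRUE(). A minimal sketch; the tolerance 1e-9 is an arbitrary choice for illustration.

```r
x <- 0.1 + 0.2

# all.equal() may return a character string, so wrap it before testing
if (isTRUE(all.equal(x, 0.3))) {
  message("equal within tolerance")
}

# an explicit tolerance does the same job and makes the threshold visible
close_enough <- abs(x - 0.3) < 1e-9
close_enough
```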

Character

island <- "Lanzarote"
nchar(island)
[1] 9
toupper(island)
[1] "LANZAROTE"
# paste strings together
paste("Hotel", island, sep = " · ")
[1] "Hotel · Lanzarote"
# case matters
"Lanzarote" == "lanzarote"
[1] FALSE

Logical

is_open <- TRUE
T == TRUE                # TRUE — but don't use T, it can be overwritten
[1] TRUE
sum(c(TRUE, FALSE, TRUE, TRUE))   # 3 — logicals coerce to 0/1
[1] 3

Operators return logicals:

82.50 > 70
[1] TRUE
"Lanzarote" %in% c("Tenerife", "Lanzarote", "La Palma")
[1] TRUE
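
Because logicals coerce to 0/1, counting and proportions come for free. A short sketch on made-up prices (the vector name prices is only for this example):

```r
prices <- c(82, 95, 110, 100, 78)

expensive <- prices > 90          # one TRUE/FALSE per element
n_expensive     <- sum(expensive)    # how many are above 90
share_expensive <- mean(expensive)   # share above 90

any(prices > 105)                 # is at least one above 105?
all(prices > 70)                  # are all above 70?
mid_range <- prices > 80 & prices < 100   # combine with & (and), | (or)
mid_range
```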

Factor · categories with known levels

island <- factor(
  c("Gran Canaria", "Tenerife", "Gran Canaria", "Lanzarote"),
  levels = c("Gran Canaria", "Tenerife", "Lanzarote",
             "Fuerteventura", "La Palma", "La Gomera", "El Hierro")
)

levels(island)
[1] "Gran Canaria"  "Tenerife"      "Lanzarote"     "Fuerteventura"
[5] "La Palma"      "La Gomera"     "El Hierro"    
table(island)
island
 Gran Canaria      Tenerife     Lanzarote Fuerteventura      La Palma 
            2             1             1             0             0 
    La Gomera     El Hierro 
            0             0 

Factors matter when the order of categories is meaningful (e.g. low, medium, high) or when all levels should appear in a table even if some have zero observations.
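
For the ordered case (low, medium, high), a minimal sketch with made-up survey answers; the variable name satisfaction is hypothetical.

```r
satisfaction <- factor(
  c("medium", "high", "low", "high"),
  levels  = c("low", "medium", "high"),
  ordered = TRUE                       # declare that low < medium < high
)

satisfaction
at_least_medium <- satisfaction >= "medium"   # comparisons respect level order
sorted          <- sort(satisfaction)         # sorted by level, not alphabetically
codes           <- as.integer(satisfaction)   # the underlying integer codes
```

Without ordered = TRUE the levels still control the order in tables and plots, but comparisons such as >= are not meaningful.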

Date and time

today <- Sys.Date()
today
[1] "2026-04-29"
class(today)
[1] "Date"
# arithmetic on dates
today + 30
[1] "2026-05-29"
as.Date("2026-06-08") - today     # days to the first exam
Time difference of 40 days
format(today, "%A, %d %B %Y")     # pretty-print
[1] "Wednesday, 29 April 2026"

Use the lubridate package (inside the tidyverse) for anything more serious — we’ll cover it in Data wrangling.
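
Base R already covers a few everyday date tasks. A sketch using fixed dates (rather than Sys.Date()) so the result does not depend on when you run it:

```r
start <- as.Date("2026-01-01")

# first day of three consecutive months
month_starts <- seq(start, by = "month", length.out = 3)
month_starts

# a date difference as a plain number of days
days_left <- as.numeric(as.Date("2026-06-08") - as.Date("2026-04-29"))
days_left

# year-month label, handy for grouping monthly data
ym <- format(start, "%Y-%m")
ym
```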

Coercion gotchas

c(1, 2, "three")      # everything becomes character
[1] "1"     "2"     "three"
c(1, 2, TRUE)         # TRUE becomes 1 — all numeric
[1] 1 2 1
c(1L, 2L, 3.14)       # integer → double
[1] 1.00 2.00 3.14

R’s rule: when mixing types in one vector, coerce up to the most flexible type (character > double > integer > logical).
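
You can also coerce on purpose with the as.*() family. A short sketch on made-up values; when a value cannot be converted, R returns NA rather than stopping.

```r
as.numeric("82.5")                  # character to double
as.integer(3.99)                    # truncates towards zero, does not round
as.character(1250)                  # number to text

# "three" cannot be parsed: the result is NA, with a warning
parsed <- suppressWarnings(as.numeric(c("1", "2", "three")))
parsed

# only recognised spellings convert to logical; "no" becomes NA
flags <- as.logical(c("TRUE", "no", "T"))
flags
```

Checking for NA after a conversion (for example with is.na()) is a cheap way to spot dirty input.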

4 · Vectors

One type per vector

An atomic vector is homogeneous: every element must be of the same type — all double, or all integer, or all character, or all logical, etc.

typeof(c(1.5, 2.7, 3.0))          # "double"
[1] "double"
typeof(c(1L, 2L, 3L))             # "integer"
[1] "integer"
typeof(c("GC", "TF", "LZ"))       # "character"
[1] "character"
typeof(c(TRUE, FALSE, TRUE))      # "logical"
[1] "logical"

Important

Mixing types is not forbidden: R silently coerces everything to one common type (see Coercion gotchas above), which is almost never what you want. If you need several types in one object, use a data frame, with one column per type.

Creating vectors

# combine — the main way to build a vector
nights  <- c(2, 5, 7, 10, 14)
islands <- c("Gran Canaria", "Tenerife", "Lanzarote",
             "Fuerteventura", "La Palma")

# sequences
1:5
[1] 1 2 3 4 5
seq(from = 0, to = 1, by = 0.25)
[1] 0.00 0.25 0.50 0.75 1.00
rep("GC", times = 3)
[1] "GC" "GC" "GC"
length(nights)
[1] 5

Vectorised arithmetic

price   <- c(82, 95, 110, 100, 78)
revenue <- nights * price                 # element-wise — no loops
revenue
[1]  164  475  770 1000 1092
mean(revenue)
[1] 700.2
sum(revenue)
[1] 3501
range(revenue)       # min and max
[1]  164 1092

This is the core idea of R: operations apply to whole vectors at once. Whenever you are tempted to write a for loop, ask yourself whether a vectorised operation solves it.
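
To make the comparison concrete, here is the same revenue calculation written both ways. The 7 % surcharge at the end is a made-up figure, used only to show recycling of a single number.

```r
nights <- c(2, 5, 7, 10, 14)
price  <- c(82, 95, 110, 100, 78)

# loop version: pre-allocate, then fill element by element
revenue_loop <- numeric(length(nights))
for (i in seq_along(nights)) {
  revenue_loop[i] <- nights[i] * price[i]
}

# vectorised version: one line, same result
revenue_vec <- nights * price
identical(revenue_loop, revenue_vec)

# a single number is recycled across the whole vector
price_with_tax <- price * 1.07
```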

Subsetting vectors

nights
[1]  2  5  7 10 14
nights[1]             # first element — R is 1-indexed
[1] 2
nights[c(1, 3, 5)]    # several positions
[1]  2  7 14
nights[-1]            # everything EXCEPT the first
[1]  5  7 10 14
nights[nights > 5]    # logical subsetting — most useful
[1]  7 10 14

The logical form (x[condition]) is the one you will use 95 % of the time.
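
A few variations on the logical form, sketched on the same nights vector; the cap of 7 is an arbitrary value for illustration.

```r
nights <- c(2, 5, 7, 10, 14)

both   <- nights[nights > 5 & nights < 14]   # both conditions must hold
either <- nights[nights < 5 | nights > 10]   # either condition
pos    <- which(nights > 5)                  # positions, not values
in_set <- nights[nights %in% c(5, 10, 99)]   # keep values found in a set

# subsetting on the left-hand side replaces the selected elements
capped <- nights
capped[capped > 7] <- 7
capped
```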

Named vectors

price <- c(
  "Gran Canaria"  = 82,
  "Tenerife"      = 95,
  "Lanzarote"     = 110,
  "Fuerteventura" = 100,
  "La Palma"      = 78
)

price["Tenerife"]
Tenerife 
      95 
price[price > 90]
     Tenerife     Lanzarote Fuerteventura 
           95           110           100 

Named vectors are a stepping stone to data frames.
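
The step from a named vector to a data frame can be sketched with base R only; price_df is a name chosen for this example.

```r
price <- c("Gran Canaria" = 82, "Tenerife" = 95, "Lanzarote" = 110,
           "Fuerteventura" = 100, "La Palma" = 78)

names(price)                                # the labels
unname(price)                               # the values without labels
priciest <- names(price)[which.max(price)]  # label of the largest value
priciest

# the same information as a two-column data frame
price_df <- data.frame(island = names(price), price = unname(price))
price_df
```

Labels become one column and values another, which is exactly the rectangular layout of the next section.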

5 · Data frames and tibbles

The rectangular data model

library(tibble)

hotels <- tibble(
  island = c("Gran Canaria", "Tenerife", "Lanzarote",
             "Fuerteventura", "La Palma"),
  stars  = c(4L, 5L, 4L, 3L, 3L),
  price  = c(82, 95, 110, 100, 78),
  nights = c(12.5, 18.3, 9.8, 11.2, 6.4)
)

hotels
# A tibble: 5 × 4
  island        stars price nights
  <chr>         <int> <dbl>  <dbl>
1 Gran Canaria      4    82   12.5
2 Tenerife          5    95   18.3
3 Lanzarote         4   110    9.8
4 Fuerteventura     3   100   11.2
5 La Palma          3    78    6.4
  • Rows = observations (one hotel per row).
  • Columns = variables (each has one type).
  • Every column has the same length.

Inspecting a data frame

dim(hotels)      # rows, columns
[1] 5 4
nrow(hotels)
[1] 5
ncol(hotels)
[1] 4
names(hotels)
[1] "island" "stars"  "price"  "nights"
head(hotels)     # first 6 rows
tail(hotels)     # last 6 rows
glimpse(hotels)  # tidyverse — compact overview
summary(hotels)  # base R — stats per column
View(hotels)     # spreadsheet view (RStudio)

Selecting columns

hotels$price           # dollar notation — a single column as a vector
[1]  82  95 110 100  78
hotels[["price"]]      # equivalent, bracket form
[1]  82  95 110 100  78
hotels[, "price"]      # still a one-column tibble
# A tibble: 5 × 1
  price
  <dbl>
1    82
2    95
3   110
4   100
5    78
mean(hotels$price)
[1] 93
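
One difference to be aware of: a base data.frame behaves differently from a tibble with the [, "col"] form. A minimal sketch on a made-up three-row data frame:

```r
df <- data.frame(
  island = c("Gran Canaria", "Tenerife", "Lanzarote"),
  price  = c(82, 95, 110)
)

a <- df[, "price"]                 # base data.frame: silently drops to a vector
b <- df[, "price", drop = FALSE]   # drop = FALSE keeps a one-column data frame
c1 <- df["price"]                  # single bracket, no comma: a data frame
d <- df[["price"]]                 # double bracket: the vector
```

With tibbles the comma form always returns a tibble, which removes this surprise.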

Adding and modifying columns

hotels$revenue <- hotels$price * hotels$nights
hotels
# A tibble: 5 × 5
  island        stars price nights revenue
  <chr>         <int> <dbl>  <dbl>   <dbl>
1 Gran Canaria      4    82   12.5   1025 
2 Tenerife          5    95   18.3   1738.
3 Lanzarote         4   110    9.8   1078 
4 Fuerteventura     3   100   11.2   1120 
5 La Palma          3    78    6.4    499.

Later, with dplyr, we’ll write this as:

library(dplyr)
hotels <- hotels |> mutate(revenue = price * nights)

Filtering rows

# base R
hotels[hotels$stars >= 4, ]
# A tibble: 3 × 5
  island       stars price nights revenue
  <chr>        <int> <dbl>  <dbl>   <dbl>
1 Gran Canaria     4    82   12.5   1025 
2 Tenerife         5    95   18.3   1738.
3 Lanzarote        4   110    9.8   1078 
# dplyr preview
library(dplyr)
hotels |> filter(stars >= 4)
# A tibble: 3 × 5
  island       stars price nights revenue
  <chr>        <int> <dbl>  <dbl>   <dbl>
1 Gran Canaria     4    82   12.5   1025 
2 Tenerife         5    95   18.3   1738.
3 Lanzarote        4   110    9.8   1078 

Pick one style and stick to it. We’ll standardise on dplyr from Data wrangling onwards.
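
The two styles differ when the condition contains missing values: base bracket filtering returns an all-NA row for each NA in the condition, whereas dplyr's filter() keeps only rows where the condition is TRUE. A base R sketch on made-up data:

```r
df <- data.frame(
  island = c("Gran Canaria", "Tenerife", "Lanzarote"),
  stars  = c(4L, NA, 3L)           # NA = star rating unknown (made-up)
)

df$stars >= 4                      # TRUE NA FALSE
with_na  <- df[df$stars >= 4, ]    # the NA produces an extra all-NA row
no_na    <- df[which(df$stars >= 4), ]   # which() keeps only TRUE positions
via_sub  <- subset(df, stars >= 4)       # subset() also drops NA

nrow(with_na)
nrow(no_na)
```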

6 · Importing tourism data

Get the data first

The repo does not ship CSV/XLSX files — they are gitignored and fetched on demand from the source APIs. After cloning, run this once from the R console at the project root:

install.packages(c("eurostat", "here", "fs", "readr", "dplyr"))
source("datasets/download.R")

The script writes Eurostat and ISTAC files into datasets/raw/ and a MANIFEST.md recording filenames, source URLs and download date.

Warning

If install.packages("eurostat") fails (CRAN occasionally archives the package when one of its dependencies is removed), use the R-universe binary build:

options(repos = c(
  ropengov = "https://ropengov.r-universe.dev",
  CRAN     = "https://cloud.r-project.org"
))
install.packages("eurostat")

Fallback if R-universe is unreachable: remotes::install_github("rOpenGov/eurostat") (slower, compiles).

The usual suspects

Format               Package    Function
CSV                  readr      read_csv()
Excel (.xlsx, .xls)  readxl     read_excel()
SPSS (.sav)          haven      read_sav()
Stata (.dta)         haven      read_dta()
JSON                 jsonlite   read_json()
Eurostat             eurostat   get_eurostat()

All of these are part of — or play nicely with — the tidyverse.

CSV with readr::read_csv

library(readr)
library(here)

occupancy <- read_csv(
  here("datasets", "raw", "istac-nights_by_island.csv"),
  locale = locale(decimal_mark = ",", grouping_mark = ".")
)

glimpse(occupancy)
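
A caveat on the locale: files that use a decimal comma are often semicolon-delimited, because a comma cannot be both the separator and the decimal mark. Whether that applies to the ISTAC file is an assumption to check by opening it. The sketch below uses a made-up two-row text in that convention, read from an in-memory string with I() (readr 2.0 or later).

```r
library(readr)

# Made-up data: semicolon delimiter, decimal comma, dot as grouping mark
txt <- "island;nights\nGran Canaria;1.250,5\nTenerife;980,25\n"

# explicit delimiter and locale
occ <- read_delim(
  I(txt),
  delim     = ";",
  locale    = locale(decimal_mark = ",", grouping_mark = "."),
  col_types = cols(island = col_character(), nights = col_number())
)
occ$nights

# read_csv2() is shorthand for the same convention
occ2 <- read_csv2(
  I(txt),
  col_types = cols(island = col_character(), nights = col_number())
)
```

If glimpse() shows a numeric column as <chr>, a delimiter or decimal-mark mismatch is a likely cause.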

Tip

Always go through here::here(). Your script will work on the lab machines, on your laptop, and on my laptop with zero changes.

Excel with readxl

library(readxl)

capacity <- read_excel(
  here("datasets", "raw", "istac-hotel-capacity.xlsx"),
  sheet = "2024",
  skip  = 3           # skip the first three header rows
)

glimpse(capacity)

Excel files often have merged cells, titles and footers. Use skip to jump over leading rows, or range = "B5:K120" to carve out the exact data table (when range is given, it takes precedence over skip).
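
To try range without the course files, readxl ships an example workbook. A minimal sketch; the cell range A1:C4 is an arbitrary choice for illustration.

```r
library(readxl)

path <- readxl_example("datasets.xlsx")   # example workbook shipped with readxl
excel_sheets(path)                        # list the sheet names first

# read only a rectangle of cells from one sheet;
# the first row of the range is used as column names
small <- read_excel(path, sheet = "mtcars", range = "A1:C4")
dim(small)
```

Listing the sheets first avoids guessing at names such as "2024".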

SPSS and Stata with haven

library(haven)

survey <- read_sav(
  here("datasets", "raw", "tourism-survey.sav")
)

# haven preserves SPSS variable and value labels
attributes(survey$origin_country)

Use as_factor(survey) to convert labelled numeric variables into R factors in one go.
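
What a labelled variable looks like can be sketched without an SPSS file. The codes and country labels below are made up; labelled(), as_factor() and zap_labels() are haven functions.

```r
library(haven)

# Made-up labelled variable, mimicking what read_sav() returns
origin <- labelled(
  c(1, 2, 1, 3),
  labels = c(Spain = 1, Germany = 2, "United Kingdom" = 3),
  label  = "Country of origin"
)

attr(origin, "labels")          # value labels: text attached to each code
origin_f <- as_factor(origin)   # factor with the labels as levels
origin_f
origin_n <- zap_labels(origin)  # plain numeric codes, labels removed
```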

Eurostat · live from the API

library(eurostat)
library(dplyr)

nights <- get_eurostat("tour_occ_nim", time_format = "date") |>
  filter(unit    == "NR",     # number of nights
         c_resid == "TOTAL",  # all residency statuses
         nace_r2 == "I551")   # hotels only

nights

The eurostat package caches downloads locally, so the slow step only happens the first time.

Reproducible paths with here::here()

library(here)

here()
# [1] "C:/Users/you/.../quantitative-methods-master-tides"

here("datasets", "raw", "eurostat-nights.csv")
# [1] "C:/Users/you/.../datasets/raw/eurostat-nights.csv"

here() always resolves relative to the project root (the folder with the .Rproj), regardless of which subfolder your .R or .qmd file lives in.

A first tourism pipeline

library(dplyr)
library(tibble)

hotels <- tribble(
  ~island,          ~month,  ~nights,  ~beds,
  "Gran Canaria",   "2024-06",  142500, 165000,
  "Gran Canaria",   "2024-07",  168300, 165000,
  "Tenerife",       "2024-06",  198200, 220000,
  "Tenerife",       "2024-07",  231100, 220000,
  "Lanzarote",      "2024-06",   88100, 105000,
  "Lanzarote",      "2024-07",   98400, 105000
)

hotels |>
  mutate(occupancy = nights / beds) |>
  group_by(island) |>
  summarise(mean_occupancy = mean(occupancy), .groups = "drop") |>
  arrange(desc(mean_occupancy))
# A tibble: 3 × 2
  island       mean_occupancy
  <chr>                 <dbl>
1 Tenerife              0.976
2 Gran Canaria          0.942
3 Lanzarote             0.888

This is essentially every report in tourism statistics: load · clean · group · summarise. Tomorrow we unpack each verb.
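
For comparison, the same pipeline in base R, as a sketch on the same toy numbers. Each step is annotated with the dplyr verb it replaces; by_island is a name chosen for this example.

```r
hotels <- data.frame(
  island = rep(c("Gran Canaria", "Tenerife", "Lanzarote"), each = 2),
  nights = c(142500, 168300, 198200, 231100, 88100, 98400),
  beds   = c(165000, 165000, 220000, 220000, 105000, 105000)
)

hotels$occupancy <- hotels$nights / hotels$beds            # mutate()
by_island <- aggregate(occupancy ~ island, data = hotels,  # group_by() + summarise()
                       FUN = mean)
by_island <- by_island[order(-by_island$occupancy), ]      # arrange(desc())
by_island
```

The result matches the tibble above; the dplyr version reads closer to the load · clean · group · summarise description.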

Recap

  • R objects climb a ladder: atomic vectors → named vectors → data frames.
  • Learn the five types (numeric, character, logical, factor, Date) and you understand 90 % of what R prints.
  • Arithmetic on vectors is element-wise — avoid loops.
  • tibble / data.frame is the 99 %-of-the-time workhorse.
  • Import with the package that matches the file: readr, readxl, haven, eurostat.
  • Always use here::here() for paths.

Next up

Short break, then one hour on Git & GitHub to set up your submission workflow.

After that, the Day 1 exercise: open exercises/day1/exercise-template.R, load one ISTAC CSV and one Eurostat dataset, and inspect them with glimpse(), summary() and head().