Mutating joins
Often our data are spread out over different data frames, and we need to merge them together. There are several ways to do this depending on how the data are arranged.
Mutating joins merge columns. Inner joins keep observations that show up in both data frames. Outer joins keep observations from one or both data frames. Left and right joins maintain observations from the respective side, and full joins maintain all observations.
Questions
Here are two data frames:
x
y
Merging flight data
Now let's work with the nycflights13
data sets.
library(nycflights13)
library(dplyr)
We're used to working with the flights
data set, but let's also look at the airlines
data set, which includes the carrier abbreviations and the full names of the carriers.
Let's trim down the flights
data to make it more manageable.
set.seed(1)
(flights2 <- select(flights, carrier, flight, origin, dest) |>
slice_sample(n = 30))
1. Check whether all of the carriers in airlines
are included in flights2
.
## [1] TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
## [13] TRUE TRUE TRUE FALSE
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Do airlines
and flights2
have the same or different carriers?
2. Now, what if we want to merge the airline names into our flights2
data set? Which join would we use for this? Join the airlines
data into the flights2
data?
flights2 |>
???_join(...)
flights2 |>
left_join(airlines, by = "carrier")
3. What happens if you try a right_join()
? How many rows are there, and where did the extra rows come from?
flights2 |>
right_join(airlines, by = "carrier")
Joining columns with different names
Sometimes, primary keys that link different datasets have different names. That's easy to fix with the join_by()
function. All we have to do is pass the left key name and right key name separated by ==
. For example join_by(left_key == right_key)
.
Let's create a new dataset called penguin_names
that includes scientific and common names for the species in the penguins
dataset.
(penguin_names <- data.frame(common_name = c("Adelie", "Chinstrap", "Gentoo"),
scientific_name = c("Pygoscelis adeliae",
"Pygoscelis antarcticus",
"Pygoscelis papua")))
Let's also create a simpler penguins dataset.
(penguins2 <- penguins |>
select(species, island, sex, year) |>
slice_sample(n = 3, by = species))
4. Now merge the penguins2
dataset with the penguin_names
data, using species
and common_name
as keys.
penguins2 |>
???_join(...)
penguins2 |>
left_join(penguin_names, by = join_by(species == common_name))
5. Shoot--the scientific name is added as the last column. How do we get it next to the species column? Copy the code from the previous exercise and add a new line to move the scientific_name
column.
penguins2 |>
left_join(penguin_names, by = join_by(species == common_name)) |>
relocate(scientific_name, .after = species)
Wrap-up
Congratulations, you finished the tutorial!
To get credit for this assignment, replace my name with the first name that you submitted in the course introduction form in the code below and click Run Code to generate the text for you to submit to Canvas.
# replace my name below with your first name (surrounded by quotes)
first_name <- "Jeff"
generate_text(first_name)
Assignment complete!
Great! Copy that code into Canvas, and you're all set for this tutorial.