Merging columns

Mutating joins

Often our data are spread out over different data frames, and we need to merge them together. There are several ways to do this depending on how the data are arranged.

Mutating joins merge columns. Inner joins keep observations that show up in both data frames. Outer joins keep observations from one or both data frames. Left and right joins maintain observations from the respective side, and full joins maintain all observations.

Questions

Here are two data frames:

Which type of join would produce the following data frame?

Merging flight data

Now let's work with the nycflights13 data sets.

library(nycflights13)
library(dplyr)

We're used to working with the flights data set, but let's also look at the airlines data set, which includes the carrier abbreviations and the full names of the carriers.

Let's trim down the flights data to make it more manageable.

set.seed(1)
(flights2 <- select(flights, carrier, flight, origin, dest) |> 
    slice_sample(n = 30))

1. Check whether all of the carriers in airlines are included in flights2.

##  [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
## [13]  TRUE  TRUE  TRUE FALSE

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Do airlines and flights2 have the same or different carriers?

2. Now, what if we want to merge the airline names into our flights2 data set? Which join would we use for this? Join the airlines data into the flights2 data?

flights2 |> 
  ???_join(...)

flights2 |> 
  left_join(airlines, by = "carrier")

3. What happens if you try a right_join()? How many rows are there, and where did the extra rows come from?

flights2 |> 
  right_join(airlines, by = "carrier")

Joining columns with different names

Sometimes, primary keys that link different datasets have different names. That's easy to fix with the join_by() function. All we have to do is pass the left key name and right key name separated by ==. For example join_by(left_key == right_key).

Let's create a new dataset called penguin_names that includes scientific and common names for the species in the penguins dataset.

(penguin_names <- data.frame(common_name = c("Adelie", "Chinstrap", "Gentoo"),
                            scientific_name = c("Pygoscelis adeliae", 
                                                "Pygoscelis antarcticus", 
                                                "Pygoscelis papua")))

Let's also create a simpler penguins dataset.

(penguins2 <- penguins |> 
  select(species, island, sex, year) |> 
  slice_sample(n = 3, by = species))

4. Now merge the penguins2 dataset with the penguin_names data, using species and common_name as keys.

penguins2 |> 
  ???_join(...)

penguins2 |> 
  left_join(penguin_names, by = join_by(species == common_name))

5. Shoot--the scientific name is added as the last column. How do we get it next to the species column? Copy the code from the previous exercise and add a new line to move the scientific_name column.

penguins2 |> 
  left_join(penguin_names, by = join_by(species == common_name)) |> 
  relocate(scientific_name, .after = species)

Wrap-up

Congratulations, you finished the tutorial!

To get credit for this assignment, replace my name with the first name that you submitted in the course introduction form in the code below and click Run Code to generate the text for you to submit to Canvas.

# replace my name below with your first name (surrounded by quotes)
first_name <- "Jeff"
generate_text(first_name)

Assignment complete!

Great! Copy that code into Canvas, and you're all set for this tutorial.