Separating and uniting data

Separating data

Set-up

For this exercise, we'll use functions from the {tidyr} and {dplyr} packages, so let's load those as well as the {palmerpenguins} package for example data.

library(tidyr)
library(dplyr)
library(palmerpenguins)

Remember, tidy data refers to a specific format of data (not just being 'clean'). Tidy data meet the following criteria:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

To create tidy data, you may need to separate values from a single column into multiple columns or combine multiple columns into a single one.

Separating data by delimiters

If there are multiple values inside a single cell, you may want to separate() them. An important concept here is the delimiter, which is the character (or characters) used to separate independent elements. For instance, if you wanted to separate an ISO 8601 date such as 2021-05-17 into the year, month, and day, the delimiter would be a dash -. But in the date style 05/17/2021, the delimiter is a slash /. Other times, no specific character separates the elements, and position in the string determines them instead.
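As a minimal sketch of delimiter-based splitting, the example below uses a hypothetical data frame called visits (not part of the tutorial data) that stores the same dates in both styles and passes each delimiter explicitly through the sep argument. The column names are assumptions chosen for illustration.

```r
library(tidyr)

# Hypothetical example data (not from the tutorial): the same dates in two styles
visits <- data.frame(
  iso_date = c("2021-05-17", "2022-11-03"),
  us_date  = c("05/17/2021", "11/03/2022")
)

# Split the ISO 8601 date on the dash delimiter
iso_parts <- separate(visits, iso_date, into = c("year", "month", "day"), sep = "-")

# Split the US-style date on the slash delimiter
us_parts <- separate(visits, us_date, into = c("month", "day", "year"), sep = "/")
```

Note that the new columns are character vectors and that the original column is dropped from the result.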

By default, the separate() function treats any run of non-alphanumeric characters as the delimiter, so if the column contains only one type of delimiter, you do not need to specify it. Separate the date column into three new columns called year, month, and day.

separate(my_dates, ...)

Let's work on a more complicated task. Some characters/symbols have special meanings when we are searching for patterns in strings. We'll see more about this later, but trust me for now. One of those characters is the period or dot. If we want to use . as a delimiter, we have to 'escape' its special meaning by placing two \ in front of it (\\.). Let's say we want to carve up a number into its whole number and decimal parts. We want to use the decimal point as the delimiter.
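To illustrate the escaping, here is a small sketch on a hypothetical data frame called weights (an assumption for illustration, not tutorial data) whose numbers are stored as text.

```r
library(tidyr)

# Hypothetical example data: numbers stored as text
weights <- data.frame(kg = c("3.75", "12.5"))

# "\\." matches a literal dot; an unescaped "." would match any character
weight_parts <- separate(weights, kg, into = c("whole", "decimal"), sep = "\\.")
```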

Split the bill_length_mm column into mm and decimal columns using \\. as the delimiter.

separate(penguins, ...)

Separating by position

We can also separate by position instead of by a delimiter. Let's say we want to abbreviate each species name to its first three letters. We can separate the species name after the first three characters and then throw away the column holding the remaining characters. Note that this is not the most efficient way to do this, and we'll learn better ways later. But for now, it illustrates how to separate based on position.

When separating by position, we give the number of characters we want in the first column to the sep argument.

For the penguins data, split species into species and trash after the third character, then remove the trash column.

penguins |> 
  separate(...) |> 
  ...(...)

penguins |> 
  separate(species, into = c("species", "trash"), sep = 3) |> 
  select(-trash)

If you want to count the characters from right to left instead of left to right, simply use a negative number.

penguins |> 
  separate(sex, into = c("firstpart", "secondpart"), sep = -4) |> 
  select(-c(bill_length_mm:body_mass_g))
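To make the direction of counting easier to see, the sketch below applies a positive and a negative sep to a hypothetical data frame called items (an assumption for illustration) whose codes have different lengths.

```r
library(tidyr)

# Hypothetical example data: codes of different lengths
items <- data.frame(code = c("AB123", "XYZ789"))

# sep = 2 splits after the second character, counting from the left
from_left <- separate(items, code, into = c("prefix", "rest"), sep = 2)

# sep = -3 splits before the last three characters, counting from the right
from_right <- separate(items, code, into = c("start", "last3"), sep = -3)
```

With the positive position, the first column always has two characters. With the negative position, the second column always has three, so the first column varies in length.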

Uniting data

If you need to combine multiple columns into a single one, use unite(). The default delimiter is an underscore _, so if you want something different, you need to specify it with the sep argument.
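As a minimal sketch, assuming a hypothetical data frame called people (not tutorial data), the first call below uses the default underscore and the second sets sep to a space.

```r
library(tidyr)

# Hypothetical example data: name parts in separate columns
people <- data.frame(first = c("Ada", "Alan"), last = c("Lovelace", "Turing"))

# The default delimiter is an underscore
default_sep <- unite(people, "full_name", first, last)

# Use sep to choose a different delimiter, here a single space
space_sep <- unite(people, "full_name", first, last, sep = " ")
```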

Let's take our data frame with the date separated out and combine it to reproduce my_dates by combining year, month, and day into a column called date. Make sure you run your code to check it before submitting.

unite(my_dates2, ...)

Now combine the two bill measures, separated by an x, to show the length by the depth in a column called bill_area, but keep the original columns. Also, make sure NAs don't show up in the new column (check the unite() documentation). It should look like this:

unite(penguins, ...)
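The unite() documentation describes two arguments that are relevant here: remove, which controls whether the input columns are dropped, and na.rm, which removes missing values before the values are pasted together. The sketch below shows both on a hypothetical data frame called dims (an assumption for illustration) with character columns.

```r
library(tidyr)

# Hypothetical example data with missing values
dims <- data.frame(len = c("10", "12", NA), wid = c("4", NA, NA))

# remove = FALSE keeps len and wid; na.rm = TRUE drops NAs before pasting
dims_united <- unite(dims, "size", len, wid, sep = "x", remove = FALSE, na.rm = TRUE)
```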

Combinations of factors

We can find all possible combinations of factors in the data by using the expand() function. What are all possible combinations of the penguin species and island?

penguins |> 
  expand(...)

We can find all combinations of factors existing in the data by adding the nesting() function inside the expand() function. What are all existing combinations of the penguin species and island?

penguins |> 
  expand(...)
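The difference between the two calls is easiest to see on a small case. The sketch below assumes a hypothetical data frame called shapes (not tutorial data) in which only three of the four possible pairings occur.

```r
library(tidyr)

# Hypothetical example data: only three of the four possible pairs occur
shapes <- data.frame(
  shape  = c("circle", "circle", "square"),
  colour = c("red", "blue", "red")
)

# Every possible pairing of the observed values (2 x 2 = 4 rows)
all_pairs <- expand(shapes, shape, colour)

# Only the pairings that actually occur in the data (3 rows)
existing_pairs <- expand(shapes, nesting(shape, colour))
```

Here the pairing of square with blue appears in all_pairs but not in existing_pairs, because it never occurs in the data.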

Wrap-up

Congratulations, you finished the tutorial!

To get credit for this assignment, replace my name in the code below with the first name that you submitted in the course introduction form, then click Run Code to generate the text to submit to Canvas.

# replace my name below with your first name (surrounded by quotes)
first_name <- "Jeff"
generate_text(first_name)

Assignment complete!

Great! Copy that code into Canvas, and you're all set for this tutorial.