Dplyr + Joins

Suggested Answers

Important

Go to the course GitHub organization and locate the repo titled ae-04-YOUR_GITHUB_USERNAME to get started.

This AE is due Friday, May 26th at 11:59pm.

Data

These data were collected from 2007 - 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data were imported directly from the Environmental Data Initiative (EDI) Data Portal, and are available for use by CC0 license (“No Rights Reserved”) in accordance with the Palmer Station Data Policy.

filter()

  • Demo: Filter the data frame by selecting the rows where the destination airport is RDU. Comment the code below.
flights |>
  filter(dest == "RDU")
# A tibble: 8,163 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      800        810     -10     949     955      -6 MQ     
 2  2013     1     1      832        840      -8    1006    1030     -24 MQ     
 3  2013     1     1      851        851       0    1032    1036      -4 EV     
 4  2013     1     1      917        920      -3    1052    1108     -16 B6     
 5  2013     1     1     1024       1030      -6    1204    1215     -11 MQ     
 6  2013     1     1     1127       1129      -2    1303    1309      -6 EV     
 7  2013     1     1     1157       1205      -8    1342    1345      -3 MQ     
 8  2013     1     1     1240       1235       5    1415    1415       0 MQ     
 9  2013     1     1     1317       1325      -8    1454    1505     -11 MQ     
10  2013     1     1     1449       1450      -1    1651    1640      11 MQ     
# … with 8,153 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
# = is the same as <-

Now, run the following code with one equals sign instead of two. Does it still work?

(=) is a Assignment operator while (==) is a Equal to operator. (=) is used for assigning the values from right to left while (==) is used for showing equality between values.

  • Demo: We can also filter using more than one condition. Here we select all rows where the destination airport is RDU and the arrival delay is less than 0. As we’ve learned, conditions within functions are separated by a ,.
flights |>
  filter(dest == "RDU" , arr_delay < 0 )
# A tibble: 4,232 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      800        810     -10     949     955      -6 MQ     
 2  2013     1     1      832        840      -8    1006    1030     -24 MQ     
 3  2013     1     1      851        851       0    1032    1036      -4 EV     
 4  2013     1     1      917        920      -3    1052    1108     -16 B6     
 5  2013     1     1     1024       1030      -6    1204    1215     -11 MQ     
 6  2013     1     1     1127       1129      -2    1303    1309      -6 EV     
 7  2013     1     1     1157       1205      -8    1342    1345      -3 MQ     
 8  2013     1     1     1317       1325      -8    1454    1505     -11 MQ     
 9  2013     1     1     1505       1510      -5    1654    1655      -1 MQ     
10  2013     1     1     1800       1800       0    1945    1951      -6 B6     
# … with 4,222 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

We can do more complex tasks using logical operators:

operator definition
< is less than?
<= is less than or equal to?
> is greater than?
>= is greater than or equal to?
== is exactly equal to?
!= is not equal to?
x & y is x AND y?
x | y is x OR y?
is.na(x) is x NA?
!is.na(x) is x not NA?
x %in% y is x in y?
!(x %in% y) is x not in y?
!x is not x?

The final operator only makes sense if x is logical (TRUE / FALSE).

  • Your turn (4 minutes): Describe what the code is doing in words.
flights |> # data set and 
  filter(dest %in% "RDU", 
         arr_delay < 0 | dep_delay < 0
         )
# A tibble: 5,308 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      800        810     -10     949     955      -6 MQ     
 2  2013     1     1      832        840      -8    1006    1030     -24 MQ     
 3  2013     1     1      851        851       0    1032    1036      -4 EV     
 4  2013     1     1      917        920      -3    1052    1108     -16 B6     
 5  2013     1     1     1024       1030      -6    1204    1215     -11 MQ     
 6  2013     1     1     1127       1129      -2    1303    1309      -6 EV     
 7  2013     1     1     1157       1205      -8    1342    1345      -3 MQ     
 8  2013     1     1     1317       1325      -8    1454    1505     -11 MQ     
 9  2013     1     1     1449       1450      -1    1651    1640      11 MQ     
10  2013     1     1     1505       1510      -5    1654    1655      -1 MQ     
# … with 5,298 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
  • What if we want to land at destinations of RDU and GSO? How does the below code change?
flights |>
  filter(dest %in% c("RDU", "GSO"),
         arr_delay < 0 | dep_delay < 0
         )
# A tibble: 6,203 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      800        810     -10     949     955      -6 MQ     
 2  2013     1     1      832        840      -8    1006    1030     -24 MQ     
 3  2013     1     1      851        851       0    1032    1036      -4 EV     
 4  2013     1     1      917        920      -3    1052    1108     -16 B6     
 5  2013     1     1     1024       1030      -6    1204    1215     -11 MQ     
 6  2013     1     1     1127       1129      -2    1303    1309      -6 EV     
 7  2013     1     1     1157       1205      -8    1342    1345      -3 MQ     
 8  2013     1     1     1317       1325      -8    1454    1505     -11 MQ     
 9  2013     1     1     1449       1450      -1    1651    1640      11 MQ     
10  2013     1     1     1505       1510      -5    1654    1655      -1 MQ     
# … with 6,193 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Why c?

Combine. Use when we have a list larger than one.

count()

  • Demo: Create a frequency table of the destination locations for flights from New York.
flights |>
  count(dest)
# A tibble: 105 × 2
   dest      n
   <chr> <int>
 1 ABQ     254
 2 ACK     265
 3 ALB     439
 4 ANC       8
 5 ATL   17215
 6 AUS    2439
 7 AVL     275
 8 BDL     443
 9 BGR     375
10 BHM     297
# … with 95 more rows
  • Demo: In which month was there the fewest number of flights? How many flights were there in that month? Hint: Type ?min into the console.
flights |>
  count(month) |>
  filter(n == min(n))
  • On which date (month + day) was there the largest number of flights? How many flights were there on that day? Comment the code below.
flights |>
  count(month, day) |>
  filter(n == max(n))
# A tibble: 1 × 3
  month   day     n
  <int> <int> <int>
1    11    27  1014

mutate()

Use mutate() to create a new variable.

  • Demo: In the code chunk below, air_time (minutes in the air) is converted to hours, and then new variable mph is created, corresponding to the miles per hour of the flight. Comment each line of code below.
flights |> 
  mutate(hours = air_time / 60, 
         mph = distance / hours) |>
select(air_time, distance, hours, mph)
# A tibble: 336,776 × 4
   air_time distance hours   mph
      <dbl>    <dbl> <dbl> <dbl>
 1      227     1400 3.78   370.
 2      227     1416 3.78   374.
 3      160     1089 2.67   408.
 4      183     1576 3.05   517.
 5      116      762 1.93   394.
 6      150      719 2.5    288.
 7      158     1065 2.63   404.
 8       53      229 0.883  259.
 9      140      944 2.33   405.
10      138      733 2.3    319.
# … with 336,766 more rows
  • Your turn (4 minutes): Create a new variable to calculate the percentage of flights in each month. What percentage of flights take place in July?
flights |> 
  count(month) |>
  mutate(perc = (n / sum(n)) *100)
# A tibble: 12 × 3
   month     n  perc
   <int> <int> <dbl>
 1     1 27004  8.02
 2     2 24951  7.41
 3     3 28834  8.56
 4     4 28330  8.41
 5     5 28796  8.55
 6     6 28243  8.39
 7     7 29425  8.74
 8     8 29327  8.71
 9     9 27574  8.19
10    10 28889  8.58
11    11 27268  8.10
12    12 28135  8.35

We can also use mutate to change the type of the same variable (think back to Monday slides)

First, show / justify why we can not currently make boxplots of dep_delay by month without altering these data. Next write the following code to create boxplots of dep_delay by month.

flights |> 
  mutate(month = as.factor(month)) 
# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <fct> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013 1         1      517        515       2     830     819      11 UA     
 2  2013 1         1      533        529       4     850     830      20 UA     
 3  2013 1         1      542        540       2     923     850      33 AA     
 4  2013 1         1      544        545      -1    1004    1022     -18 B6     
 5  2013 1         1      554        600      -6     812     837     -25 DL     
 6  2013 1         1      554        558      -4     740     728      12 UA     
 7  2013 1         1      555        600      -5     913     854      19 B6     
 8  2013 1         1      557        600      -3     709     723     -14 EV     
 9  2013 1         1      557        600      -3     838     846      -8 B6     
10  2013 1         1      558        600      -2     753     745       8 AA     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
glimpse(flights)
Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

summarize()

summarize() collapses the rows into summary statistics and removes columns irrelevant to the calculation.

Below, find the mean dep_delay time.

flights |> 
  summarize(mean_dep_delay = mean(dep_delay))

Question: Why did this code return NA?

Let’s fix it! We can use na.rm to remove NAs.

flights |> 
  summarize(mean_dep_delay = mean(dep_delay, na.rm = T))

group_by()

group_by() is used for grouped operations. It’s very powerful when paired with summarise() to calculate summary statistics by group.

Here we find the mean and standard deviation of departure delay for each month. Comment each line of code below.

flights |> 
  group_by(month) |>
  summarize(mean_dep_delay = mean(dep_delay, na.rm=T),
            sd_dep_delay = sd(dep_delay, na.rm=T)
            )
# A tibble: 12 × 3
   month mean_dep_delay sd_dep_delay
   <int>          <dbl>        <dbl>
 1     1          10.0          36.4
 2     2          10.8          36.3
 3     3          13.2          40.1
 4     4          13.9          43.0
 5     5          13.0          39.4
 6     6          20.8          51.5
 7     7          21.7          51.6
 8     8          12.6          37.7
 9     9           6.72         35.6
10    10           6.24         29.7
11    11           5.44         27.6
12    12          16.6          41.9
  • Your turn (4 minutes): What is the median departure delay for each airports around NYC (origin)?
flights |>
  group_by(origin) |>
  summarize(med_dep_delay = median(dep_delay, na.rm = T))
# A tibble: 3 × 2
  origin med_dep_delay
  <chr>          <dbl>
1 EWR               -1
2 JFK               -1
3 LGA               -3

Joining Fisheries

fisheries <- read_csv("data/fisheries.csv")
Rows: 82 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): country
dbl (3): capture, aquaculture, total

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
continents <- read_csv("data/continents.csv")
Rows: 245 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, continent

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Working with multiple data frames

Often instead of being provided the data you need for your analysis in a single data frame, you will need to bring information from multiple datasets together into a data frame yourself. These datasets will be linked to each other via a column (usually an identifier, something that links the two datasets together) that you can use to join them together.

There are many possible types of joins. All have the format something_join(x, y).

x <- tibble(
  value = c(1, 2, 3),
  xcol = c("x1", "x2", "x3")
  )

y <- tibble(
  value = c(1, 2, 4),
  ycol = c("y1", "y2", "y4")
  )

x
# A tibble: 3 × 2
  value xcol 
  <dbl> <chr>
1     1 x1   
2     2 x2   
3     3 x3   
y
# A tibble: 3 × 2
  value ycol 
  <dbl> <chr>
1     1 y1   
2     2 y2   
3     4 y4   

We will demonstrate each of the joins on these small, toy datasets.

Note: These functions below know to join x and y by value because each dataset has value as a column. See for yourself!

[1] "value" "xcol" 
[1] "value" "ycol" 

inner_join() join all rows in x where there are matching y values

Joining, by = "value"
# A tibble: 2 × 3
  value xcol  ycol 
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   

left_join() keeps all values of x

left_join(x , y)
Joining, by = "value"
# A tibble: 3 × 3
  value xcol  ycol 
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 

right_join() keeps all rows of y

right_join(x, y) 
Joining, by = "value"
# A tibble: 3 × 3
  value xcol  ycol 
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     4 <NA>  y4   

full_join() keeping all rows (both x and y)

full_join(x, y)
Joining, by = "value"
# A tibble: 4 × 3
  value xcol  ycol 
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 
4     4 <NA>  y4   

semi_join()returns all rows from x only that match y

semi_join(x, y)
Joining, by = "value"
# A tibble: 2 × 2
  value xcol 
  <dbl> <chr>
1     1 x1   
2     2 x2   

anti_join() reutrns all rows from x that do not match y

Joining, by = "value"
# A tibble: 1 × 2
  value xcol 
  <dbl> <chr>
1     3 x3   

Global aquaculture production

The Fisheries and Aquaculture Department of the Food and Agriculture Organization of the United Nations collects data on fisheries production of countries.

Our goal is to create a visualization of the mean share of aquaculture by continent.

Let’s start by looking at the fisheries data frame.

glimpse(fisheries)
Rows: 82
Columns: 4
$ country     <chr> "Angola", "Argentina", "Australia", "Bangladesh", "Brazil"…
$ capture     <dbl> 486490, 755226, 174629, 1674770, 705000, 629950, 233190, 8…
$ aquaculture <dbl> 655, 3673, 96847, 2203554, 581230, 172500, 2315, 200765, 9…
$ total       <dbl> 487145, 758899, 271476, 3878324, 1286230, 802450, 235505, …

We have the countries, but our goal is to make a visualization by continent. Let’s take a look at the continents data frame.

glimpse(continents)
Rows: 245
Columns: 2
$ country   <chr> "Afghanistan", "Åland Islands", "Albania", "Algeria", "Ameri…
$ continent <chr> "Asia", "Europe", "Europe", "Africa", "Oceania", "Europe", "…
  • Your turn (2 minutes):
    • Which variable(s) will we use to join the fisheries and continents data frames?
    • We want to keep all rows and columns from fisheries and add a column for corresponding continents. Which join function should we use?
  • Demo: Join the two data frames and name assign the joined data frame back to fisheries.
fisheries <- fisheries |>
  left_join(continents)
Joining, by = "country"
#same as the following

left_join(fisheries, continents)
Joining, by = c("country", "continent")
# A tibble: 82 × 5
   country    capture aquaculture   total continent
   <chr>        <dbl>       <dbl>   <dbl> <chr>    
 1 Angola      486490         655  487145 Africa   
 2 Argentina   755226        3673  758899 Americas 
 3 Australia   174629       96847  271476 Oceania  
 4 Bangladesh 1674770     2203554 3878324 Asia     
 5 Brazil      705000      581230 1286230 Americas 
 6 Cambodia    629950      172500  802450 Asia     
 7 Cameroon    233190        2315  235505 Africa   
 8 Canada      874727      200765 1075492 Americas 
 9 Chad        110000          94  110094 Africa   
10 Chile      1829238     1050117 2879355 Americas 
# … with 72 more rows