Dplyr + Joins
Suggested Answers
Data
These data were collected from 2007 - 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data were imported directly from the Environmental Data Initiative (EDI) Data Portal, and are available for use by CC0 license (“No Rights Reserved”) in accordance with the Palmer Station Data Policy.
filter()
- Demo: Filter the data frame by selecting the rows where the destination airport is RDU. Comment the code below.
flights |>
filter(dest == "RDU")
# A tibble: 8,163 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 800 810 -10 949 955 -6 MQ
2 2013 1 1 832 840 -8 1006 1030 -24 MQ
3 2013 1 1 851 851 0 1032 1036 -4 EV
4 2013 1 1 917 920 -3 1052 1108 -16 B6
5 2013 1 1 1024 1030 -6 1204 1215 -11 MQ
6 2013 1 1 1127 1129 -2 1303 1309 -6 EV
7 2013 1 1 1157 1205 -8 1342 1345 -3 MQ
8 2013 1 1 1240 1235 5 1415 1415 0 MQ
9 2013 1 1 1317 1325 -8 1454 1505 -11 MQ
10 2013 1 1 1449 1450 -1 1651 1640 11 MQ
# … with 8,153 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
# = is the same as <-
Now, run the following code with one equals sign instead of two. Does it still work?
(=) is a Assignment operator while (==) is a Equal to operator. (=) is used for assigning the values from right to left while (==) is used for showing equality between values.
- Demo: We can also filter using more than one condition. Here we select all rows where the destination airport is RDU and the arrival delay is less than 0. As we’ve learned, conditions within functions are separated by a
,
.
flights |>
filter(dest == "RDU" , arr_delay < 0 )
# A tibble: 4,232 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 800 810 -10 949 955 -6 MQ
2 2013 1 1 832 840 -8 1006 1030 -24 MQ
3 2013 1 1 851 851 0 1032 1036 -4 EV
4 2013 1 1 917 920 -3 1052 1108 -16 B6
5 2013 1 1 1024 1030 -6 1204 1215 -11 MQ
6 2013 1 1 1127 1129 -2 1303 1309 -6 EV
7 2013 1 1 1157 1205 -8 1342 1345 -3 MQ
8 2013 1 1 1317 1325 -8 1454 1505 -11 MQ
9 2013 1 1 1505 1510 -5 1654 1655 -1 MQ
10 2013 1 1 1800 1800 0 1945 1951 -6 B6
# … with 4,222 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
We can do more complex tasks using logical operators:
operator | definition |
---|---|
< |
is less than? |
<= |
is less than or equal to? |
> |
is greater than? |
>= |
is greater than or equal to? |
== |
is exactly equal to? |
!= |
is not equal to? |
x & y |
is x AND y? |
x | y |
is x OR y? |
is.na(x) |
is x NA? |
!is.na(x) |
is x not NA? |
x %in% y |
is x in y? |
!(x %in% y) |
is x not in y? |
!x |
is not x? |
The final operator only makes sense if x
is logical (TRUE / FALSE).
- Your turn (4 minutes): Describe what the code is doing in words.
# A tibble: 5,308 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 800 810 -10 949 955 -6 MQ
2 2013 1 1 832 840 -8 1006 1030 -24 MQ
3 2013 1 1 851 851 0 1032 1036 -4 EV
4 2013 1 1 917 920 -3 1052 1108 -16 B6
5 2013 1 1 1024 1030 -6 1204 1215 -11 MQ
6 2013 1 1 1127 1129 -2 1303 1309 -6 EV
7 2013 1 1 1157 1205 -8 1342 1345 -3 MQ
8 2013 1 1 1317 1325 -8 1454 1505 -11 MQ
9 2013 1 1 1449 1450 -1 1651 1640 11 MQ
10 2013 1 1 1505 1510 -5 1654 1655 -1 MQ
# … with 5,298 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
- What if we want to land at destinations of RDU and GSO? How does the below code change?
# A tibble: 6,203 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 800 810 -10 949 955 -6 MQ
2 2013 1 1 832 840 -8 1006 1030 -24 MQ
3 2013 1 1 851 851 0 1032 1036 -4 EV
4 2013 1 1 917 920 -3 1052 1108 -16 B6
5 2013 1 1 1024 1030 -6 1204 1215 -11 MQ
6 2013 1 1 1127 1129 -2 1303 1309 -6 EV
7 2013 1 1 1157 1205 -8 1342 1345 -3 MQ
8 2013 1 1 1317 1325 -8 1454 1505 -11 MQ
9 2013 1 1 1449 1450 -1 1651 1640 11 MQ
10 2013 1 1 1505 1510 -5 1654 1655 -1 MQ
# … with 6,193 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
Why c
?
Combine. Use when we have a list larger than one.
count()
- Demo: Create a frequency table of the destination locations for flights from New York.
flights |>
count(dest)
# A tibble: 105 × 2
dest n
<chr> <int>
1 ABQ 254
2 ACK 265
3 ALB 439
4 ANC 8
5 ATL 17215
6 AUS 2439
7 AVL 275
8 BDL 443
9 BGR 375
10 BHM 297
# … with 95 more rows
- Demo: In which month was there the fewest number of flights? How many flights were there in that month? Hint: Type
?min
into the console.
- On which date (month + day) was there the largest number of flights? How many flights were there on that day? Comment the code below.
mutate()
Use mutate()
to create a new variable.
- Demo: In the code chunk below,
air_time
(minutes in the air) is converted to hours, and then new variablemph
is created, corresponding to the miles per hour of the flight. Comment each line of code below.
flights |>
mutate(hours = air_time / 60,
mph = distance / hours) |>
select(air_time, distance, hours, mph)
# A tibble: 336,776 × 4
air_time distance hours mph
<dbl> <dbl> <dbl> <dbl>
1 227 1400 3.78 370.
2 227 1416 3.78 374.
3 160 1089 2.67 408.
4 183 1576 3.05 517.
5 116 762 1.93 394.
6 150 719 2.5 288.
7 158 1065 2.63 404.
8 53 229 0.883 259.
9 140 944 2.33 405.
10 138 733 2.3 319.
# … with 336,766 more rows
- Your turn (4 minutes): Create a new variable to calculate the percentage of flights in each month. What percentage of flights take place in July?
# A tibble: 12 × 3
month n perc
<int> <int> <dbl>
1 1 27004 8.02
2 2 24951 7.41
3 3 28834 8.56
4 4 28330 8.41
5 5 28796 8.55
6 6 28243 8.39
7 7 29425 8.74
8 8 29327 8.71
9 9 27574 8.19
10 10 28889 8.58
11 11 27268 8.10
12 12 28135 8.35
We can also use mutate
to change the type of the same variable (think back to Monday slides)
First, show / justify why we can not currently make boxplots of dep_delay
by month without altering these data. Next write the following code to create boxplots of dep_delay
by month.
# A tibble: 336,776 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <fct> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 517 515 2 830 819 11 UA
2 2013 1 1 533 529 4 850 830 20 UA
3 2013 1 1 542 540 2 923 850 33 AA
4 2013 1 1 544 545 -1 1004 1022 -18 B6
5 2013 1 1 554 600 -6 812 837 -25 DL
6 2013 1 1 554 558 -4 740 728 12 UA
7 2013 1 1 555 600 -5 913 854 19 B6
8 2013 1 1 557 600 -3 709 723 -14 EV
9 2013 1 1 557 600 -3 838 846 -8 B6
10 2013 1 1 558 600 -2 753 745 8 AA
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
glimpse(flights)
Rows: 336,776
Columns: 19
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
summarize()
summarize()
collapses the rows into summary statistics and removes columns irrelevant to the calculation.
Below, find the mean dep_delay
time.
Question: Why did this code return NA
?
Let’s fix it! We can use na.rm
to remove NA
s.
group_by()
group_by()
is used for grouped operations. It’s very powerful when paired with summarise()
to calculate summary statistics by group.
Here we find the mean and standard deviation of departure delay for each month. Comment each line of code below.
flights |>
group_by(month) |>
summarize(mean_dep_delay = mean(dep_delay, na.rm=T),
sd_dep_delay = sd(dep_delay, na.rm=T)
)
# A tibble: 12 × 3
month mean_dep_delay sd_dep_delay
<int> <dbl> <dbl>
1 1 10.0 36.4
2 2 10.8 36.3
3 3 13.2 40.1
4 4 13.9 43.0
5 5 13.0 39.4
6 6 20.8 51.5
7 7 21.7 51.6
8 8 12.6 37.7
9 9 6.72 35.6
10 10 6.24 29.7
11 11 5.44 27.6
12 12 16.6 41.9
-
Your turn (4 minutes): What is the median departure delay for each airports around NYC (
origin
)?
Joining Fisheries
fisheries <- read_csv("data/fisheries.csv")
Rows: 82 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): country
dbl (3): capture, aquaculture, total
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
continents <- read_csv("data/continents.csv")
Rows: 245 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, continent
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Working with multiple data frames
Often instead of being provided the data you need for your analysis in a single data frame, you will need to bring information from multiple datasets together into a data frame yourself. These datasets will be linked to each other via a column (usually an identifier, something that links the two datasets together) that you can use to join them together.
There are many possible types of joins. All have the format something_join(x, y)
.
x <- tibble(
value = c(1, 2, 3),
xcol = c("x1", "x2", "x3")
)
y <- tibble(
value = c(1, 2, 4),
ycol = c("y1", "y2", "y4")
)
x
# A tibble: 3 × 2
value xcol
<dbl> <chr>
1 1 x1
2 2 x2
3 3 x3
y
# A tibble: 3 × 2
value ycol
<dbl> <chr>
1 1 y1
2 2 y2
3 4 y4
We will demonstrate each of the joins on these small, toy datasets.
Note: These functions below know to join x
and y
by value
because each dataset has value
as a column. See for yourself!
inner_join()
join all rows in x where there are matching y values
inner_join(x, y)
Joining, by = "value"
# A tibble: 2 × 3
value xcol ycol
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
left_join()
keeps all values of x
left_join(x , y)
Joining, by = "value"
# A tibble: 3 × 3
value xcol ycol
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
right_join()
keeps all rows of y
right_join(x, y)
Joining, by = "value"
# A tibble: 3 × 3
value xcol ycol
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 4 <NA> y4
full_join()
keeping all rows (both x and y)
full_join(x, y)
Joining, by = "value"
# A tibble: 4 × 3
value xcol ycol
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
4 4 <NA> y4
semi_join()
returns all rows from x only that match y
anti_join()
reutrns all rows from x that do not match y
Global aquaculture production
The Fisheries and Aquaculture Department of the Food and Agriculture Organization of the United Nations collects data on fisheries production of countries.
Our goal is to create a visualization of the mean share of aquaculture by continent.
Let’s start by looking at the fisheries
data frame.
glimpse(fisheries)
Rows: 82
Columns: 4
$ country <chr> "Angola", "Argentina", "Australia", "Bangladesh", "Brazil"…
$ capture <dbl> 486490, 755226, 174629, 1674770, 705000, 629950, 233190, 8…
$ aquaculture <dbl> 655, 3673, 96847, 2203554, 581230, 172500, 2315, 200765, 9…
$ total <dbl> 487145, 758899, 271476, 3878324, 1286230, 802450, 235505, …
We have the countries, but our goal is to make a visualization by continent. Let’s take a look at the continents data frame.
glimpse(continents)
Rows: 245
Columns: 2
$ country <chr> "Afghanistan", "Åland Islands", "Albania", "Algeria", "Ameri…
$ continent <chr> "Asia", "Europe", "Europe", "Africa", "Oceania", "Europe", "…
-
Your turn (2 minutes):
- Which variable(s) will we use to join the
fisheries
andcontinents
data frames? - We want to keep all rows and columns from
fisheries
and add a column for corresponding continents. Which join function should we use?
- Which variable(s) will we use to join the
-
Demo: Join the two data frames and name assign the joined data frame back to
fisheries
.
fisheries <- fisheries |>
left_join(continents)
Joining, by = "country"
#same as the following
left_join(fisheries, continents)
Joining, by = c("country", "continent")
# A tibble: 82 × 5
country capture aquaculture total continent
<chr> <dbl> <dbl> <dbl> <chr>
1 Angola 486490 655 487145 Africa
2 Argentina 755226 3673 758899 Americas
3 Australia 174629 96847 271476 Oceania
4 Bangladesh 1674770 2203554 3878324 Asia
5 Brazil 705000 581230 1286230 Americas
6 Cambodia 629950 172500 802450 Asia
7 Cameroon 233190 2315 235505 Africa
8 Canada 874727 200765 1075492 Americas
9 Chad 110000 94 110094 Africa
10 Chile 1829238 1050117 2879355 Americas
# … with 72 more rows