Day 5
Dr. Elijah Meyer
Duke University
STA 199 - Summer 2023
May 25th, 2023
– Clone ae-05
– Make sure you are keeping up with Preparation Videos
– HW-1 due Monday (29th at 11:59 PM)
– Lab-1 due Today (25th at 11:59 PM) <- We will talk about this deadline
– All AEs for this week due Friday (26th at 11:59 PM)
– Group Formation
– 3-4 students
– GitHub is a tool for collaboration
– It is a skill to be able to communicate and work together on common projects
– left_join(x, y); right_join(x, y); full_join(x, y)
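As a minimal sketch of how the three joins differ, the example below uses two small made-up tables (x, y, and their columns are hypothetical, not course data) that share an id column. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical example data (not from the course materials)
x <- tibble(id = c(1, 2, 3), value_x = c("x1", "x2", "x3"))
y <- tibble(id = c(1, 2, 4), value_y = c("y1", "y2", "y4"))

# Keep every row of x; ids missing from y get NA in value_y
left <- left_join(x, y, by = "id")

# Keep every row of y; ids missing from x get NA in value_x
right <- right_join(x, y, by = "id")

# Keep every row from both x and y
full <- full_join(x, y, by = "id")
```

Here left and right each have three rows, while full has four, because ids 3 and 4 each appear in only one of the tables.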
if_else
– If this, do this, else this
– Commonly used to create new variables
New column added
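One way this could look in practice is if_else() inside mutate() to add a new column. The sketch below uses the built-in iris data; the column name sepal_size and the cutoff of 3 are arbitrary choices for illustration.

```r
library(dplyr)

# If Sepal.Width is greater than 3, label the row "wide", else "narrow".
# The cutoff and the new column name are illustrative assumptions.
iris_flagged <- iris |>
  mutate(sepal_size = if_else(Sepal.Width > 3, "wide", "narrow"))
```

The original five columns are unchanged; sepal_size is added as a sixth.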
fct_reorder
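library(ggplot2)
library(forcats)
# Order the Species boxes by median Sepal.Width (the fct_reorder default)
iris |>
  ggplot(
    aes(x = fct_reorder(Species, Sepal.Width), y = Sepal.Width)
  ) +
  geom_boxplot()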
– Finish Joins
– Define Tidy Data
– Play with pivot functions in R
– Wide data contains values that do not repeat in the first column. Also called “unstacked”. Tabular format.
– Long data contains values that do repeat in the first column. Each row is a single observation of a particular group.
– Which have we typically used to create plots in this class?
There are three interrelated rules that make a dataset tidy:
Each variable is a column; each column is a variable.
Each observation is a row; each row is an observation.
Each value is a cell; each cell is a single value.
This typically describes long data
– Sometimes, data are not in this format…
– pivot_longer
– pivot_wider
– Making tables for quick comparison / display purposes
– names_to
– values_to
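A minimal sketch of names_to and values_to, assuming tidyr and tibble are installed. The points_wide table, with one column of points per game, is made up for illustration and is not the data used in class.

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one row per player, one column per game
points_wide <- tibble(
  player = c("A", "B"),
  game_1 = c(10, 7),
  game_2 = c(12, 9)
)

points_long <- points_wide |>
  pivot_longer(
    cols = c(game_1, game_2),  # columns to stack
    names_to = "game",         # new column holding the old column names
    values_to = "points"       # new column holding the cell values
  )
```

The result has one row per player and game, with columns player, game, and points.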
Look at points by game
There are many different types of joins. Think critically about your goal in order to decide which join you should use.
When pivoting longer, variable names that turn into values are characters by default. If you need them in another format, you must make that transformation explicitly, which you can do within the pivot_longer() function.
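One way to do that conversion inside pivot_longer() is with its names_prefix and names_transform arguments. The sketch below uses a made-up points_wide table (hypothetical, not course data) and turns the game labels into integers.

```r
library(tidyr)
library(tibble)

# Hypothetical wide data: one row per player, one column per game
points_wide <- tibble(
  player = c("A", "B"),
  game_1 = c(10, 7),
  game_2 = c(12, 9)
)

points_long <- points_wide |>
  pivot_longer(
    cols = -player,                             # stack every column except player
    names_to = "game",
    names_prefix = "game_",                     # strip the prefix, leaving "1", "2"
    names_transform = list(game = as.integer),  # convert the names to integers
    values_to = "points"
  )
```

Without names_transform, the game column would stay a character vector.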
pivot_wider() makes data sets wider by increasing the number of columns and reducing the number of rows. It has the opposite interface to pivot_longer(): we need to provide the existing column that supplies the values (values_from) and the column that supplies the new column names (names_from).
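A minimal sketch of that interface, assuming tidyr and tibble are installed. The points_long table is hypothetical and only meant to show how names_from and values_from map onto columns.

```r
library(tidyr)
library(tibble)

# Hypothetical long data: one row per player and game
points_long <- tibble(
  player = c("A", "A", "B", "B"),
  game = c("game_1", "game_2", "game_1", "game_2"),
  points = c(10, 12, 7, 9)
)

points_wide <- points_long |>
  pivot_wider(
    names_from = game,     # values of game become the new column names
    values_from = points   # values of points fill those new columns
  )
```

The four long rows collapse into two rows, one per player, with a column for each game. This is the kind of table that is handy for quick side-by-side comparison.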