Day-2
Dr. Elijah Meyer
Duke University
STA 199 - Summer 2023
May 18th 2023
– Go to https://github.com/sta199-summer-1/ae-02-summer and find your ae-02-summer
– Clone the repo in your container, open the Quarto document in the repo (as we did with ae-01)
– Have you filled out the Getting to know you survey?
– Are you on Slack?
– Prepare Material?
Due Dates + Turn In
– All AEs are due the upcoming Friday night by 11:59 - GitHub
– Labs due Thursday and Monday (before lecture)- Gradescope
– HWs due ~ 1 week from assigned - Gradescope
Videos
– Will not post
– Can always request
Create plots!
– Understand geoms
– Scatterplots, boxplots, histograms, etc
– Practice with the fundamentals of ggplot
– Let the types of variables dictate the plot
– Informative title
– Axes should be labeled
– Careful consideration of aesthetic choices (like color)
Why is the following code so important?
library(tidyverse)
What word do we associate with |>
?
Let’s practice reading code as a sentence
mtcars |>
ggplot(
aes(
x = mpg, y = wt)
)
+
geom_point()
Q2
mtcars |>
summarize(med = median(wt))
Goal: Calculate the mean weight of cards over 4000 lbs.
Good:
mtcars |>
filter(wt > 4) |>
summarize(avg = mean(wt))
Not as good:
large_cars <- mtcars |>
filter(wt > 4)
large_cars |>
summarize(avg = mean(wt))
https://ggplot2.tidyverse.org/reference/
A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom that the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom.
But how do we talk about them?
– Look for patterns
– Look at shape
– Look for outliers or anything unusual
– Look for spread
of the data
We often talk about skew when describing shape of data
– Positive Skew (Right Skew)
– Negative Skew (Left Skew)
– Roughlyl Symmentric
A data point that does not follow the general trend of the data
– What’s the mean
– What’s the median
How spread out things are…
– Standard deviation
– IQR
names()
There are many ways we can extract variable names from our data set. I find myself using this a lot in practice.
These data were collected from 2007 - 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data were imported directly from the Environmental Data Initiative (EDI) Data Portal, and are available for use by CC0 license (“No Rights Reserved”) in accordance with the Palmer Station Data Policy.
https://github.com/sta199-summer-1/ae-02-summer
Pick geoms based on data types.
Manipulate graphs to be more appropriate with arguments
Take control of your labels
Use color to your advantage. https://ggplot2.tidyverse.org/reference/ggtheme.html & https://ggplot2.tidyverse.org/reference/scale_viridis.html