Lecture 1
Dr. Elijah Meyer
Duke University
STA 199 - Summer 2023
May 17th, 2023
Get organized
Please share with your neighbors:
“Data science is a concept to unify statistics, data analysis, machine learning and their related methods in order to understand and analyze actual phenomena with data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, information science, and computer science.”
Learn to explore, visualize, and analyze data in a reproducible and shareable manner
Gain experience in data wrangling, exploratory data analysis, predictive modeling, and data visualization
Work on problems and case studies inspired by and based on real-world questions and data
Learn to effectively communicate results through written assignments and final project presentation
– Fundamentals of R
– Data visualization
– Version control with GitHub
– Reproducible reports with Quarto
– Regression
– Statistical inference
Before Class
Watch lecture content videos (we will locate these during the website tour)
Clone the application exercise (can be done right before class; we will practice this today)
During Class
Warm up question
Mix of lecture and live coding
Homework: Individual assignments combining conceptual and computational skills.
Labs: Individual or team assignments focusing on computational skills.
Exams: Two take-home exams.
Final Project: Team project presented during the final exam period.
Application Exercises: Exercises worked on during the live lecture session.
Are not graded for the first week
Turned in on GitHub (You will have this ability after Lab-0)
What is due is whatever we get through in class
Run by TA Pritam Dey
Focus on computing using R tidyverse syntax
Apply concepts from lecture to case study scenarios
Work on labs individually or in teams of 3-4
R for Data Science by Wickham, Çetinkaya-Rundel & Grolemund (2nd ed., O’Reilly)
Introduction to Modern Statistics by Cetinkaya-Rundel & Hardin (1st ed. OpenIntro)
– Specialization in data visualization
– Computing tools to fit models
– Well respected
– You can take these skills with you
The language has grown significantly in popularity and is now used in a range of professions including software development, business analysis, statistical reporting and scientific research
GitHub, Inc., is an Internet hosting service for software development and version control.
– If you have not set up:
GitHub Account
Slack Account
Reserved a Duke Container
Please do this before the Getting to know you survey
Go to https://github.com/, and create an account (unless you already have one).
Some tips from Happy Git with R.
– Incorporate your actual name!
– Reuse your username from other contexts if you can, e.g., Twitter or Slack.
– Pick a username you will be comfortable revealing to your future boss.
– Be as unique as possible in as few characters as possible. Shorter is better than longer.
– Avoid words with special meaning in programming (e.g. NA).
Invite Example
https://slack.com/get-started#/createnew
– Reserve a STA198-199 RStudio container
– Go to https://cmgr.oit.duke.edu/containers
– Click Reserve Container for the STA198-199 container
– We will clone this from GitHub here: https://github.com/sta199-summer-1/ae-01-summer
– You will do this every day before class & with homework & with labs
– You will not have the capability to “push changes” to GitHub (yet… this will happen during your first lab!)
– Functions are (normally) verbs, followed by what they will be applied to in parentheses:
– Packages are installed with the install.packages function and loaded with the library function, once per session.
– If you are using R through the container, almost all packages are already installed for you!
library(tidyverse)
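As a minimal sketch of the points above, the snippet below calls a few base R functions and shows the install-versus-load distinction. The object names (`values`, `average`, `rounded`) are made up for illustration, and the install line is left commented out on the assumption that the container already has the package.

```r
# Functions are verbs applied to whatever sits inside the parentheses.
values <- c(2, 4, 6, 8)                 # c() combines values into a vector
average <- mean(values)                 # mean() is applied to `values`
rounded <- round(3.14159, digits = 2)   # extra arguments can be named

# Installing happens once per machine; loading happens once per session.
# install.packages("tidyverse")         # commented out: assumed already installed
library(tidyverse)
```

Typing `average` or `rounded` in the console prints the stored result.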
– The tidyverse is a collection of R packages designed for data science.
– All packages share an underlying philosophy and a common grammar.
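One way to see the common grammar is a short dplyr pipeline: each verb takes a data frame first and returns a data frame, so the steps chain together. This is only a sketch; the choice of `mtcars`, the 3.5 cutoff, and the name `heavy_cars` are assumptions made for illustration.

```r
library(tidyverse)

# Each verb takes a data frame as its first argument and returns a data frame.
heavy_cars <- mtcars |>
  filter(wt > 3.5) |>        # keep cars whose weight (in 1000 lbs) exceeds 3.5
  select(mpg, wt, cyl) |>    # keep three variables
  arrange(desc(mpg))         # sort from highest to lowest miles per gallon

heavy_cars
```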
– an open-source scientific and technical publishing system
– publish high-quality articles, reports, presentations, websites, blogs, and books in HTML, PDF, MS Word, ePub, and more
– Code goes in chunks, delimited by three backticks; narrative goes outside of chunks
– Every assignment / lab / project will be given to you as a Quarto document
– You will always have a Quarto template document to start with
– As we get more familiar with R, the more code you will construct on your own
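To make the chunk-versus-narrative layout concrete, the sketch below uses R to write a tiny, hypothetical Quarto file to a temporary folder. The file name `example.qmd`, its title, and its contents are invented for illustration; the templates handed out in this course will look different.

```r
# Three backticks open and close a code chunk in a Quarto document.
fence <- strrep("`", 3)

# Lines of a minimal, hypothetical .qmd file
qmd_lines <- c(
  "---",
  "title: \"Hypothetical example\"",
  "format: html",
  "---",
  "",
  "Narrative text goes outside of chunks.",
  "",
  paste0(fence, "{r}"),      # chunk opens
  "summary(mtcars$mpg)",     # code lives inside the chunk
  fence                      # chunk closes
)

# Write the file to a temporary folder and print it back
qmd_path <- file.path(tempdir(), "example.qmd")
writeLines(qmd_lines, qmd_path)
cat(readLines(qmd_path), sep = "\n")
```

Opening a file like this in RStudio and clicking Render should produce a report with the narrative followed by the chunk's output.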
mtcars
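Printing `mtcars` shows the whole data set at once. A few base R functions give a quicker first look; this is just one possible sketch of that first step.

```r
dim(mtcars)       # number of rows and number of columns
names(mtcars)     # variable names
head(mtcars, 3)   # first three rows
```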
We want to create a visualization. The first thing we need to do is set up the canvas…
mtcars |>
  ggplot()
mtcars |>
  ggplot(
    aes(x = variable.name, y = variable.name)
  )
aes: describe how variables in the data are mapped to your canvas
+
“and”
When working with ggplot functions, we will add to our canvas using +
mtcars |>
  ggplot(
    aes(x = variable.name, y = variable.name)
  ) +
  geom_point()
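Filling in the template with real variables might look like the sketch below. The choice of `wt` and `mpg` is an assumption made for illustration; any two numeric columns of `mtcars` would work. Note that `+` sits at the end of a line, so R knows the plot continues on the next line.

```r
library(ggplot2)  # also loaded by library(tidyverse)

p <- mtcars |>
  ggplot(aes(x = wt, y = mpg)) +  # map weight to x and miles per gallon to y
  geom_point()                    # add one point per car

p  # printing the object draws the plot
```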
– There are a lot of moving parts in this course
– Coding is not learned in a day
– Ask questions often
– What is version control? Why is it important?
– What is R vs RStudio?
– What is Quarto?
– Starting to work with code!