Day 11
Dr. Elijah Meyer
Duke University
STA 199 - Summer 2023
June 12th
– Clone ae-11
– Homework 3 Due Tuesday (6-13)
– Project Proposal due tonight (6-12)
— Turn in on GitHub
— Feedback coming in the next day or two
– Will be on GitHub Issues
— Demo
What are the main differences between:
– Simple Linear Regression vs Multiple Linear Regression
– Logistic Regression vs Linear Regression
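One way to see the differences is to write out each model form. A sketch in standard notation (using \(b\)'s for estimated coefficients, as is common in intro courses):
\[
\begin{aligned}
\text{SLR:}\quad & \hat{y} = b_0 + b_1 x \\
\text{MLR:}\quad & \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k \\
\text{Logistic:}\quad & \log\!\left(\frac{\hat{p}}{1-\hat{p}}\right) = b_0 + b_1 x_1 + \cdots + b_k x_k
\end{aligned}
\]
SLR and MLR predict a numerical outcome directly; logistic regression instead models the log-odds of a binary outcome, which is why interpreting its estimates takes an extra step.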
Recall: Fit a logistic regression model to predict whether an email is spam.
– Interpret the estimate associated with exclaim_mess
– Using this output, how can we calculate the probability that an email is spam? (see the sketch below)
Note: If you are interested in log-odds, I found this small article useful here
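A minimal sketch of that model, assuming the openintro package's email data (where exclaim_mess, the number of exclamation points in a message, lives); the exact course workflow may differ:

```r
library(openintro)   # provides the `email` data set
library(broom)       # tidy() for model output

# Fit: log-odds of spam as a function of exclamation points in the message
spam_fit <- glm(spam ~ exclaim_mess, data = email, family = "binomial")
tidy(spam_fit)

# The estimate for exclaim_mess is the change in the *log-odds* of spam
# for each additional exclamation point.

# To get a probability, undo the log-odds: p = e^(log-odds) / (1 + e^(log-odds))
log_odds <- predict(spam_fit, newdata = data.frame(exclaim_mess = 10))
exp(log_odds) / (1 + exp(log_odds))

# Or ask predict() for the probability directly
predict(spam_fit, newdata = data.frame(exclaim_mess = 10), type = "response")
```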
R-squared (if the number of variables is the same)
Adj-R-squared (if the number of variables is the same or different)
AIC
– tells us which of multiple candidate models is most likely to be the best model for a given data set
– based on the data, which model is the most likely
– No “nice” intuitive definition
– Lower is better (unlike Adj-R-squared, where higher is better)
– Lower AIC and BIC values mean that a model is considered to be closer to the ‘truth’
– glance(model1)$AIC
– Common to see AIC and Adj-R-squared used for model selection
– The differences are clearer in a theory course. In their own way, they each provide evidence of the “best” model.
– In short: use one in practice, but do not mix them… (see the sketch below)
— They are built on different theoretical foundations
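For example, a sketch comparing two candidate models fit to the same data; mtcars is just a stand-in data set:

```r
library(broom)

# Two candidate models for the same outcome
model1 <- lm(mpg ~ wt, data = mtcars)
model2 <- lm(mpg ~ wt + hp, data = mtcars)

# AIC: lower is better
glance(model1)$AIC
glance(model2)$AIC

# Adj-R-squared: higher is better
glance(model1)$adj.r.squared
glance(model2)$adj.r.squared

# Pick one criterion and apply it consistently; don't mix the two
```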
Want to build a model that predicts well.
– What should that model look like?
– How do we build it?
– How do we know if it predicts well?
Which model would you prefer?
– We can most definitely select an overfit model
– Overfitting occurs when a statistical model fits its training data too closely, capturing the noise in the data rather than the underlying pattern
– This doesn’t make sense if our goal is to predict!
Vocab…
– Testing Data Set
– Training Data Set
– ROC Curve
– Sensitivity (True Positive Rate)
– Specificity (True Negative Rate)
When the goal is prediction…
– When able, it may be advantageous to withhold a part of your data when creating your model
– Can use what’s withheld to evaluate how well your model predicts
– training data is the dataset you use to build your model
– roughly 80% of a larger data set
“Sandbox” for model building.
– data to be used to evaluate your model
– evaluate the predictive performance
– roughly 20% of the larger data set
– Training and Testing data sets are created at random (see the sketch below)
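A minimal sketch of an 80/20 split with initial_split() from rsample (part of tidymodels), reusing the email data from above:

```r
library(openintro)   # for the `email` data, reused from above
library(rsample)     # part of tidymodels

set.seed(123)        # so the random split is reproducible

email_split <- initial_split(email, prop = 0.80)
train_data  <- training(email_split)   # ~80%: the "sandbox" for model building
test_data   <- testing(email_split)    # ~20%: reserved for evaluating predictions
```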
– True positive rate: also known as sensitivity
— the probability of correctly detecting a “success”
– False positive: a result which incorrectly indicates that a particular condition or attribute is present (something is not there, but we say it is)
— i.e., incorrectly predicting “present” when the truth is “not present”
– False positive rate = 1 - specificity
Specificity: how well a test can classify cases that truly do not have the condition of interest (see the worked example below)
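A worked example with made-up confusion-matrix counts, just to pin down the arithmetic:

```r
# Made-up counts from a classifier's confusion matrix
TP <- 40   # condition present, predicted present
FN <- 10   # condition present, predicted absent (missed)
TN <- 85   # condition absent, predicted absent
FP <- 15   # condition absent, predicted present (false alarm)

TP / (TP + FN)   # sensitivity (true positive rate): 40/50 = 0.80
TN / (TN + FP)   # specificity (true negative rate): 85/100 = 0.85
FP / (TN + FP)   # false positive rate = 1 - specificity: 15/100 = 0.15
```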
Goal: Create a spam filter (want to predict if an email is spam)
– Define a true positive
– Define a false positive
– An ROC (Receiver Operating Characteristic) curve is a graph showing the performance of a classification model at different classification thresholds
– The larger the area under the curve, the better the performance of the model across all thresholds (see the sketch below)
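A sketch with roc_curve() and roc_auc() from yardstick (tidymodels), reusing the email split and model from above; it assumes spam is a factor with levels "0" and "1" (coerce with factor() if your copy stores it numerically):

```r
library(openintro)   # `email` data
library(rsample)     # initial_split(), training(), testing()
library(yardstick)   # roc_curve(), roc_auc()
library(ggplot2)     # autoplot()

# Recreate the split and fit the spam model on the training data
set.seed(123)
email_split <- initial_split(email, prop = 0.80)
train_data  <- training(email_split)
test_data   <- testing(email_split)

spam_fit <- glm(spam ~ exclaim_mess, data = train_data, family = "binomial")

# Predicted probabilities of spam on the held-out test data
preds <- data.frame(
  truth = test_data$spam,   # assumed factor with levels "0" (not spam), "1" (spam)
  prob  = predict(spam_fit, newdata = test_data, type = "response")
)

# roc_curve() sweeps over all classification thresholds
roc_curve(preds, truth, prob, event_level = "second") |> autoplot()

# Area under the curve: closer to 1 means better performance
roc_auc(preds, truth, prob, event_level = "second")
```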
Cleaning Data → EDA → Modeling → Statistical Inference
— Hypothesis Testing
— Confidence Intervals
How tall are Duke Students?
– What we are interested in: population parameters
— \(\mu\): the population mean (e.g., the true mean height of all Duke students)
— \(\pi\): the population proportion
– Are these population parameters different from some value? (hypothesis testing)
– What is a range of plausible values that the population parameter could be? (confidence intervals)
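In symbols, using \(\mu\) as the example parameter (\(\mu_0\) is just a placeholder for the hypothesized value):
\[
H_0: \mu = \mu_0 \qquad \text{vs.} \qquad H_A: \mu \neq \mu_0
\]
\[
\text{confidence interval:}\quad \text{point estimate} \pm \text{margin of error}
\]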
– Decide what we want to investigate
– Collect data
– We need to quantify variability!
– Assess how good your model is at prediction
– Visualize how well your model predicts new observations