Proposal due Thur, Oct 20
Draft 1 due M, Nov 7
Draft 2 due Tue, Nov 29 (optional)
Peer review due Wed, Nov 30
Presentation + slides and final GitHub repo due Thu, Dec 8
Final report due Thu, Dec 8
TL;DR: Ask a question you’re curious about and answer it with a dataset of your choice. This is your project in a nutshell.
May be too long, but please do read
The final project for this class will consist of analysis on a dataset of your own choosing. The dataset may already exist, or you may collect your own data using a survey or by conducting an experiment. You can choose the data based on your teams’ interests or based on work in other courses or research projects. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like) and apply them to a novel dataset in a meaningful way.
The goal is not to do an exhaustive data analysis i.e., do not calculate every statistic and procedure you have learned for every variable, but rather let me know that you are proficient at asking meaningful questions and answering them with results of data analysis, that you are proficient in using R, and that you are proficient at interpreting and presenting the results. Focus on methods that help you begin to answer your research questions. You do not have to apply every statistical procedure we learned. Also, critique your own methods and provide suggestions for improving your analysis. Issues pertaining to the reliability and validity of your data, andappropriateness of the statistical analysis should be discussed here.
The project is very open ended. You should create some kind of compelling visualization(s) of this data in R. There is no limit on what tools or packages you may use but sticking to packages we learned in class is required. You do not need to visualize all of the data at once. A single high-quality visualization will receive a much higher grade than a large number of poor-quality visualizations. Also pay attention to your presentation. Neatness, coherency, and clarity will count. All analyses must be done in RStudio, using R, and all components of the project must be reproducible (with the exception of the presentation).
You will work on the project with your lab teams.
The four primary deliverables for the final project are
You will not be submitting anything on Gradescope for the project. Submission of these deliverables will happen on GitHub and feedback will be provided as GitHub issues that you need to engage with and close. The collection of the documents in your GitHub repo will create a webpage for your project. To create the webpage go to the Build tab in RStudio, and click on Render Website.
There are two main purposes of the project proposal:
Identify 3 data sets you’re interested in potentially using for the final project. If you’re unsure where to find data, you can use the list of potential data sources in the Tips + Resources section as a starting point. It may also help to think of topics you’re interested in investigating and find data sets on those topics.
Important
You must use one of the data sets in the proposal for the final project, unless instructed otherwise when given feedback.
The data sets should meet the following criteria:
At least 500 observations (or approved by me)
At least 8 columns (or approved by me)
At least 6 of the columns must be useful and unique explanatory variables (or approved by me)
You may not use data that has previously been used in any course materials, or any derivation of data that has been used in course materials.
We strongly recommend curating at least one of your datasets via web scraping.
Please ask a member of the teaching team if you’re unsure whether your data set meets the criteria.
If you set your hearts on a dataset that has fewer observations or variables than what’s suggested here, that might still be ok; use these numbers as guidance for a successful proposal, not as minimum requirements.
You can find data wherever you like, but here are some recommendations to get you started. You shouldn’t feel constrained to datasets that are already in a tidy format, you can start with data that needs cleaning and tidying, scrape data off the web, or collect your own data.
For each data set, include the following:
For each data set:
Identify the source of the data.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
Write a brief description of the observations.
Address ethical concerns about the data, if any.
Your research question should contain at least three variables, and should be a mix of categorical and quantitative variables. When writing a research question, please think about the following:
What is your target population?
Is the question original?
Can the question be answered?
For each data set, include the following:
For each data set:
data
folder of the project repo.glimpse()
function to provide a glimpse of the data set.Total | 10 pts |
---|---|
Introduction and data | 3 |
Research question | 3 |
Glimpse of data | 3 |
Workflow and formatting | 1 |
Each component will be graded as follows:
Meets expectations (full credit): All required elements are completed and are accurate. The narrative is written clearly, all tables and visualizations are nicely formatted, and the work would be presentable in a professional setting.
Close to expectations (half credit): There are some elements missing and/or inaccurate. There are some issues with formatting.
Does not meet expectations (no credit): Major elements missing. Work is not neatly formatted and would not be presentable in a professional setting.
It is critical to check feedback on your project proposal. Even if you earn full credit, it may not mean that your proposal is perfect.
The purpose of the draft and peer review is to give you an opportunity to get early feedback on your analysis. Therefore, the draft and peer review will focus primarily on the exploratory data analysis and initial interpretations.
Write the draft in the report.qmd
file in your project repo.
As you work on the draft, the focus should be on the analysis and less on crafting the final report. Your draft must include a reasonable attempt at each analysis component - exploratory data analysis, inference or modeling, and deriving initial results and conclusions.
Below is a brief description of the sections to focus on in the draft.
The introduction provides motivation and context for your research. Describe your topic (citing sources) and provide a concise, clear statement of your research question and hypotheses.
Then identify the source of the data, when and how it was collected, the cases, a general description of relevant variables.
The methodology section should include visualizations and summary statistics relevant to your research question. You should also justify the choice of statistical method(s) used to answer your research question.
Showcase how you arrived at answers to your research question using the techniques we have learned in class (and beyond, if you’re feeling adventurous).
Provide only the main results from your analysis. The goal is not to do an exhaustive data analysis (calculate every possible statistic and perform every possible procedure for all variables). Rather, you should demonstrate that you are proficient at asking meaningful questions and answering them using data, that you are skilled in interpreting and presenting results, and that you can accomplish these tasks using R. More is not better.
Your first draft will be reviewed and graded by your TAs. We recommend you incorporate their suggestions into your second (optional) draft before the second round of feedback by your peers.
Critically reviewing others’ work is a crucial part of the scientific process, and STA 199 is no exception. You will be assigned two teams to review. This feedback is intended to help you create a high quality final project, as well as give you experience reading and constructively critiquing the work of others.
During the peer feedback process, you will be provided read-only access to your partner team’s GitHub repo. You will provide your feedback in the form of GitHub issues to your partner team’s GitHub repo.
Spend ~30 mins to review each team’s project.
Peer review by: [NAME OF TEAM DOING THE REVIEW]
Names of team members that participated in this review: [FULL NAMES OF TEAM MEMBERS DOING THE REVIEW]
Describe the goal of the project.
Describe the data used or collected, if any. If the proposal does not include the use of a specific dataset, comment on whether the project would be strengthened by the inclusion of a dataset.
Describe the approaches, tools, and methods that will be used.
Is there anything that is unclear from the proposal?
Provide constructive feedback on how the team might be able to improve their project. Make sure your feedback includes at least one comment on the statistical modeling aspect of the project, but do feel free to comment on aspects beyond the modeling.
What aspect of this project are you most interested in and would like to see highlighted in the presentation.
Provide constructive feedback on any issues with file and/or code organization.
(Optional) Any further comments or feedback?
Peer reviews will be graded on the extent to which it comprehensively and constructively addresses the components of the partner team’s report: the research context and motivation, exploratory data analysis, and any inference, modeling, or conclusions.
Your written report must be completed in the report.qmd
file and must be reproducible. All team members should contribute to the GitHub repository, with regular meaningful commits.
Before you finalize your write up, make sure the printing of code chunks is off with the option echo: false
in the YAML.
The mandatory components of the report are below. You are free to add additional sections as necessary. The report, including visualizations, should be no more than 10 pages long. There is no minimum page requirement; however, you should comprehensively address all of the analysis in your report.
Note
To check how many pages your report is, open it in your browser and go to File > Print > Save as PDF and review the number of pages.
Be selective in what you include in your final write-up. The goal is to write a cohesive narrative that demonstrates a thorough and comprehensive analysis rather than explain every step of the analysis.
You are welcome to include an appendix with additional work at the end of the written report document; however, grading will largely be based on the content in the main body of the report. You should assume the reader will not see the material in the appendix unless prompted to view it in the main body of the report. The appendix should be neatly formatted and easy for the reader to navigate. It is not included in the 10-page limit.
This section includes an introduction to the project motivation, data, and research question. Describe the data and definitions of key variables. It should also include some exploratory data analysis. All of the EDA won’t fit in the paper, so focus on the EDA for the response variable and a few other interesting variables and relationships.
The research question and motivation are clearly stated in the introduction, including citations for the data source and any external research. The data are clearly described, including a description about how the data were originally collected and a concise definition of the variables relevant to understanding the report. The data cleaning process is clearly described, including any decisions made in the process (e.g., creating new variables, removing observations, etc.) The explanatory data analysis helps the reader better understand the observations in the data along with interesting and relevant relationships between the variables. It incorporates appropriate visualizations and summary statistics.
This section includes a brief description of your modeling process. Explain the reasoning for the type of model you’re fitting, predictor variables considered for the model including any interactions. Additionally, show how you arrived at the final model by describing the model selection process, interactions considered, variable transformations (if needed), assessment of conditions and diagnostics, and any other relevant considerations that were part of the model fitting process.
The analysis steps are appropriate for the data and research question. The group used a thorough and careful approach to determine analyses types and addressed any concerns over appropriateness of analyses chosen.
This is where you will discuss your overall finding and describe the key results from your analysis. The goal is not to interpret every single element of an output shown, but instead to address the research questions, using the interpretations to support your conclusions. Focus on the variables that help you answer the research question and that provide relevant context for the reader.
The analysis results are clearly assessed and interesting findings from the analysis are described. Interpretations are used to to support the key findings and conclusions, rather than merely listing, e.g., the interpretation of every model coefficient.
In this section you’ll include a summary of what you have learned about your research question along with statistical arguments supporting your conclusions. In addition, discuss the limitations of your analysis and provide suggestions on ways the analysis could be improved. Any potential issues pertaining to the reliability and validity of your data and appropriateness of the statistical analysis should also be discussed here. Lastly, this section will include ideas for future work.
Overall conclusions from analysis are clearly described, and the analysis results are put into the larger context of the subject matter and original research question. There is thoughtful consideration of potential limitations of the data and/or analysis, and ideas for future work are clearly described.
This is an assessment of the overall presentation and formatting of the written report.
The report neatly written and organized with clear section headers and appropriately sized figures with informative labels. Numerical results are displayed with a reasonable number of digits, and all visualizations are neatly formatted. All citations and links are properly formatted. If there is an appendix, it is reasonably organized and easy for the reader to find relevant information. All code, warnings, and messages are suppressed. The main body of the written report (not including the appendix) is no longer than 10 pages.
The written report is worth 40 points, broken down as follows
Total | 40 pts |
---|---|
Introduction/data | 6 pts |
Methodology | 10 pts |
Results | 14 pts |
Discussion | 6 pts |
Organization + formatting | 4 pts |
In addition to the written report, your team will also create presentation slides and record a video presentation that summarize and showcase your project. Introduce your research question and data set, showcase visualizations, and discuss the primary conclusions. These slides should serve as a brief visual addition to your written report and will be graded for content and quality.
You can create your slides with any software you like (Keynote, PowerPoint, Google Slides, etc.). We recommend choosing an option that’s easy to collaborate with, e.g., Google Slides.
Note
You can also use Quarto to make your slides! While we won’t be covering making slides with Quarto in the class, we would be happy to help you with it in office hours. It’s no different than writing other documents with Quarto, so the learning curve will not be steep!
The slide deck should have no more than 6 content slides + 1 title slide. Here is a suggested outline as you think through the slides; you do not have to use this exact format for the 6 slides.
Presentations will take place in class during the last lab of the semester. The presentation must be no longer than 5 minutes. You can choose to present live in class (recommended) or pre-record a video to be shown in class. Either way you must attend the lab session for the Q&A following your presentation.
If you choose to pre-record your presentation, you may use can use any platform that works best for your group to record your presentation. Below are a few resources on recording videos:
Once your video is ready, upload the video to Warpwire or another video platform (e.g., YouTube), then add a link to your video in your repo README.
To upload your video to Warpwire:
All written work (with exception of presentation slides) should be reproducible, and the GitHub repo should be neatly organized.
Points for reproducibility + organization will be based on the reproducibility of the written report and the organization of the project GitHub repo. The repo should be neatly organized as described above, there should be no extraneous files, all text in the README should be easily readable.
You will be asked to fill out a survey where you rate the contribution and teamwork of each team member by assigning a contribution percentage for each team member. Filling out the survey is a prerequisite for getting credit on the team member evaluation. If you are suggesting that an individual did less than half the expected contribution given your team size (e.g., for a team of four students, if a student contributed less than 12.5% of the total effort), please provide some explanation. If any individual gets an average peer score indicating that this was the case, their grade will be assessed accordingly and penalties may apply beyond the teamwork component of the grade.
If you have concerns with the teamwork and/or contribution from any team members, please email me by the project presentation deadline. You only need to email me if you have concerns. Otherwise, I will assume everyone on the team equally contributed and will receive full credit for the teamwork portion of the grade.
The grade breakdown is as follows:
Total | 100 pts |
---|---|
Project proposal | 10 pts |
Draft report | 10 pts |
Peer review | 5 pts |
Final report | 40 pts |
Slides + presentation | 20 pts |
Reproducibility + organization | 5 pts |
Teamwork | 10 pts |
Grading of the project will take into account the following:
A general breakdown of scoring is as follows:
There is no late work accepted on this project. Be sure to turn in your work early to avoid any technological mishaps.