project-description-s23
Timeline
Proposal June 12th
Draft 1 June 22nd
Presentation + slides June 29th
Final report and [final GitHub repo] June 29th
Introduction
TL;DR: Ask a question you’re curious about and answer it with a dataset of your choice. This is your project in a nutshell.
May be too long, but please do read
The final project for this class will consist of analysis on a dataset of your own choosing. The dataset should already exist. You can choose the data based on your team’s interests or based on work in other courses or research projects. The goal of this project is for you to demonstrate proficiency in the techniques we have covered in this class (and beyond, if you like) and apply them to a novel dataset in a meaningful way.
The goal is not to do an exhaustive data analysis i.e., do not calculate every statistic and procedure you have learned for every variable, but rather let me know that you are proficient at asking meaningful questions and answering them with results of data analysis, that you are proficient in using R, and that you are proficient at interpreting and presenting the results. Focus on methods that help you begin to answer your research questions. You do not have to apply every statistical procedure we learned.
The expectation is that you apply at least one statistical procedure. Statistical procedures that we will cover include confidence intervals, modeling, and hypothesis testing. If using more than one statistical procedure helps answer your question, feel free to do so. However, more does not mean better.
Once finished, think critically and critique your own methods and provide suggestions for improving your analysis. Issues pertaining to the reliability and validity of your data, and appropriateness of the statistical analysis should be discussed here.
The project is very open ended. You should start by exploring your data and create some kind of compelling visualization(s) of these data in R. There is no limit on what tools or packages you may use. You do not need to visualize all of the data at once. A single high-quality visualization will receive a much higher grade than a large number of poor-quality visualizations. Also pay attention to your presentation. Neatness, coherency, and clarity will count. All analyses must be done in RStudio, using R, and all components of the project must be reproducible (with the exception of the presentation).
You will work on the project with your lab teams.
The primary deliverables for the final project are:
- A project proposal with two dataset ideas.
- A reproducible project writeup of your analysis, with one required and one optional draft along the way.
- A presentation with slides.
You will not be submitting anything on Gradescope for the project. Submission of these deliverables will happen on GitHub and feedback will be provided as GitHub issues that you need to engage with and close. The collection of the documents in your GitHub repo will create a webpage for your project. To create the webpage go to the Build tab in RStudio, and click on Render Website.
A large part of this project is the ability to demonstrate that you can work functionally in a group setting. Thus, there may be periodic surveys given to ensure that the group dynamic is productive and everyone is contributing. An individual’s grade may differ from that of their group members for each part of the project if there is overwhelming evidence that the individual is not contributing or communicating with the other team members. Survey evidence, GitHub commit history, and lab attendance will be used to determine whether any deductions need to be made to project component grades.
Proposal
There are two main purposes of the project proposal:
- To help you think about the project early, so you can get a head start on finding data, reading relevant literature, thinking about the questions you wish to answer, etc.
- To ensure that the data you wish to analyze, methods you plan to use, and the scope of your analysis are feasible and will allow you to be successful for this project.
Identify 2 data sets you’re interested in potentially using for the final project. If you’re unsure where to find data, you can use the list of potential data sources in the Tips + Resources section as a starting point. It may also help to think of topics you’re interested in investigating and find data sets on those topics.
Criteria for datasets
The data sets should meet the following criteria:
- At least 300 observations (or approved by me)
- At least 6 unique columns that are useful and not simply identifiers (or approved by me)
- Data must be real
- Identifier variables such as “name”, “social security number”, etc. are not useful explanatory variables.
- If you have multiple columns with the same information (e.g. “state abbreviation” and “state name”), then they are not unique explanatory variables.
You may not use data that has previously been used in any course materials, or any derivation of data that has been used in course materials.
Please ask a member of the teaching team if you’re unsure whether your data set meets the criteria.
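As a quick sanity check on the size criteria, something like the following minimal sketch may help. The helper name `meets_size_criteria()` is hypothetical, and the built-in `mtcars` data is used only for illustration (it would not be an eligible project dataset). The check counts all columns, so you still need to judge for yourself which columns are identifiers or duplicates.

```r
# Hypothetical helper: checks only the row and column counts.
# It cannot tell whether a column is a mere identifier or a duplicate.
meets_size_criteria <- function(df, min_rows = 300, min_cols = 6) {
  c(
    enough_rows = nrow(df) >= min_rows,
    enough_cols = ncol(df) >= min_cols
  )
}

# Built-in data used purely as an example: 32 rows, 11 columns.
meets_size_criteria(mtcars)
```

Here `mtcars` has enough columns but too few rows, so it would fail the first criterion.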
Resources for datasets
You can find data wherever you like, but here are some recommendations to get you started. You shouldn’t feel constrained to datasets that are already in a tidy format. You can start with data that needs cleaning and tidying, scrape data off the web, or collect your own data.
- Awesome public datasets
- Bikeshare data portal
- CDC
- Data.gov
- Data is Plural
- Durham Open Data Portal
- Edinburgh Open Data
- Election Studies
- European Statistics
- CORGIS: The Collection of Really Great, Interesting, Situated Datasets
- General Social Survey
- Google Dataset Search
- Harvard Dataverse
- International Monetary Fund
- IPUMS survey data from around the world
- Los Angeles Open Data
- NHS Scotland Open Data
- NYC OpenData
- Open access to Scotland’s official statistics
- Pew Research
- PRISM Data Archive Project
- Statistics Canada
- The National Bureau of Economic Research
- UCI Machine Learning Repository
- UK Government Data
- UNICEF Data
- United Nations Data
- United Nations Statistics Division
- US Census Data
- US Government Data
- World Bank Data
- Youth Risk Behavior Surveillance System (YRBSS)
Proposal components
For each data set, include the following:
Introduction and data
For each data set:
Identify the source of the data.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
Write a brief description of the observations.
Address ethical concerns about the data, if any.
Research question
Your research question should be clear and easily understood by a general audience. When writing a research question, please think about the following:
What is your target population?
Is the question original?
Can the question be answered?
For each data set, include the following:
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- Statement on why this question is important.
- A description of the research topic along with a concise statement of your hypotheses on this topic.
- Identify the types of variables in your research question. Categorical? Quantitative?
Literature
A literature review is an overview of the previously published works on a topic. It is good practice to familiarize yourself with published work that has already been done on the topic you are researching. Often, a literature review spans many articles, and allows the researcher to communicate how their research fits in / extends our current understanding of a topic. For this project, we will do a “mini” literature review. For each data set:
Find one published credible article on the topic you are interested in researching.
Provide a one paragraph summary about the article.
In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.
You can find articles using Google Scholar or some other academic search engine. Please reach out if you have any questions about this.
Glimpse of data
For each data set:
- Place the file containing your data in the data folder of the project repo.
- Use the glimpse() function to provide a glimpse of the data set.
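A minimal sketch of this step is shown below. The file name `my_dataset.csv` is a placeholder, not a required name, and the example assumes a CSV file and that the dplyr package is installed. The `file.exists()` guard is only there so the chunk still runs before the file has been added.

```r
library(dplyr)

# Placeholder file name: swap in the file you placed in the data folder.
data_path <- file.path("data", "my_dataset.csv")

# Guarded so the chunk still renders before the file has been added.
if (file.exists(data_path)) {
  my_data <- read.csv(data_path)
  glimpse(my_data)
} else {
  message("No file found at ", data_path)
}
```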
Proposal grading
Total | 15 pts |
---|---|
Introduction and data | 3 |
Research question | 3 |
Literature | 3 |
Glimpse of data | 3 |
Workflow and formatting | 3 |
Each component will be graded as follows:
Meets expectations (full credit): All required elements are completed and are accurate. The narrative is written clearly, all tables and visualizations are nicely formatted, and the work would be presentable in a professional setting.
Close to expectations (half credit): There are some elements missing and/or inaccurate. There are some issues with formatting.
Does not meet expectations (no credit): Major elements missing. Work is not neatly formatted and would not be presentable in a professional setting.
It is critical to check feedback on your project proposal. Even if you earn full credit, it may not mean that your proposal is perfect.
Reproducibility + organization
All written work (with exception of presentation slides) should be reproducible, and the GitHub repo should be neatly organized.
Points for reproducibility + organization will be based on the reproducibility of the written report and the organization of the project GitHub repo. The repo should be neatly organized as described above, with no extraneous files, and all text in the README should be easily readable.
Draft
The purpose of the draft and peer review is to give you an opportunity to get early feedback on your analysis. Therefore, the draft and peer review will focus primarily on the exploratory data analysis and initial interpretations.
Write the draft in the report.qmd file in your project repo after you have selected ONE data set to use and have responded to all feedback given on the proposal. Note that your research question (RQ) can change from the proposal if you are interested in some of the new methodology we are starting to learn now.
As you work on the draft, the focus should be on the analysis and less on crafting the final report. Your draft must include a reasonable attempt at each analysis component - exploratory data analysis, prediction or modeling, and deriving initial results and conclusions.
Below is a brief description of the sections to focus on in the draft.
Draft components
Introduction and data
The introduction provides motivation and context for your research. Describe your topic (citing sources) and provide a concise, clear statement of your research question and hypotheses. Additionally, consider any potential ethical issues that you believe should be addressed (if any) regarding how your data were collected.
Then identify the source of the data, when and how it was collected, the cases, and a general description of relevant variables.
Methodology
The methodology section should include visualizations and summary statistics relevant to your research question. You should also justify the choice of statistical method(s) used to answer your research question.
Results
Showcase how you arrived at answers to your research question using the techniques we have learned in class (and beyond, if you’re feeling adventurous).
Provide only the main results from your analysis. The goal is not to do an exhaustive data analysis (calculate every possible statistic and perform every possible procedure for all variables). Rather, you should demonstrate that you are proficient at asking meaningful questions and answering them using data, that you are skilled in interpreting and presenting results, and that you can accomplish these tasks using R. More is not better.
Workflow + Formatting
To earn full credit, please be mindful of the following:
Have an updated about.qmd for your project.
Responded to / Closed all issues from your proposal.
Your website must be able to be rendered by me and your lab leader prior to submission.
EVERYONE in your group must have at least 1 meaningful commit.
Pipes %>%, |> and ggplot layers + should be followed by a new line.
All binary operators should be surrounded by space. For example x + y is appropriate. x+y is not.
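The two formatting rules above might look like this in practice. This is an illustrative sketch only: it uses the `mpg` data that ships with ggplot2, and the object name `highway_by_class` is made up for the example.

```r
library(dplyr)
library(ggplot2)

# mpg ships with ggplot2 and is used here only to illustrate formatting.
# Each pipe ends its line, and binary operators such as = and + have spaces.
highway_by_class <- mpg |>
  group_by(class) |>
  summarize(mean_hwy = mean(hwy))

# Each ggplot layer is followed by + and a new line.
ggplot(highway_by_class, aes(x = class, y = mean_hwy)) +
  geom_col() +
  labs(
    x = "Vehicle class",
    y = "Mean highway mileage (mpg)"
  )
```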
Draft grading
The project draft report is worth 15 points and will be graded by your Lab Leaders as follows:
Meets expectations (full credit): All required elements are completed and are accurate. The narrative is written clearly, all tables and visualizations are nicely formatted, and the work would be presentable in a professional setting.
Close to expectations (half credit): There are some elements missing and/or inaccurate. There are some issues with formatting.
Does not meet expectations (no credit): Major elements missing. Work is not neatly formatted and would not be presentable in a professional setting.
It is critical to check feedback on your project draft. Even if you earn full credit, it may not mean that your draft is perfect.
Presentation + slides
Slides
In addition to the written report, your team will also create presentation slides and record a video presentation that summarize and showcase your project. Introduce your research question and data set, showcase visualizations, and discuss the primary conclusions. These slides should serve as a brief visual addition to your written report and will be graded for content and quality.
You can create your slides with any software you like (Keynote, PowerPoint, Google Slides, etc.). We recommend choosing an option that’s easy to collaborate with, e.g., Google Slides.
Here is a suggested outline as you think through the slides. Note that you should feel free to add slides to help better explain your research.
- Title Slide
- Slide 1: Introduce the topic and motivation
- Slide 2: Introduce the data
- Slide 3: Highlights from EDA
- Slide 4-5: Inference/modeling/other analysis
- Slide 6: Conclusions + future work
Presentation
Presentations will take place in class during our scheduled final exam time on June 29th from 9 AM to 12 PM. We will not need the full three hours. The presentation should be between 5 and 15 minutes. The expectation is that you present live in class (recommended). Additionally, you must attend the final session for the Q&A following your presentation. If you have an extreme circumstance that prohibits you from attending class, you must complete the following:
Communicate this with your group ASAP.
Pre-record your portion of the presentation to be played during presentation time.
Expect an email with 1-2 questions (from me) about the content from your pre-recorded section to ensure you get practice answering questions. Failure to answer these questions will impact your grade.
If you must pre-record your presentation, you may use any platform that works best for your group. Below are a few resources on recording videos:
- Recording presentations in Zoom
- Apple Quicktime for screen recording
- Windows 10 built-in screen recording functionality
- Kap for screen recording
Tips for Slides / Presentation
- When presenting, we should be focusing on YOU and not trying to read your slides.
- The fewer words, the better.
- Visualizations do a great job at helping the audience through your presentation.
- Visualizations can often appear too small when projected, making them more distracting than useful. Make sure that your visualizations are large enough for the audience to be able to read labels / understand what is going on.
- Tell a story. Nobody knows about the research you have done. Why are you interested in it? Why is this topic important? Quickly situating the audience here can help grab their attention prior to going into the details.
- Limitations: No research study is perfect, and it is important to recognize this (normally at the end of the presentation). What limitations did you face with data / statistical inference? If you could continue this project, what new steps might you take?
Report components
Introduction and data
This section includes an introduction to the project motivation, data, and research question. Describe the data and definitions of key variables. It should also include some exploratory data analysis. All of the EDA won’t fit in the paper, so focus on the EDA for the response variable and a few other interesting variables and relationships.
Grading criteria
The research question and motivation are clearly stated in the introduction, including citations for the data source and any external research. The data are clearly described, including a description about how the data were originally collected and a concise definition of the variables relevant to understanding the report. The data cleaning process is clearly described, including any decisions made in the process (e.g., creating new variables, removing observations, etc.)
Literature Review
This section includes an overview of one previously published work on your topic of interest. This includes a summary on the author’s study design, findings/results, as well as how your study is related to + adds to the existing literature on the topic. You may use footnote citations for referenced literature.
Grading criteria
The article is clearly referenced and the connection between the article you have chosen and your research is clear. The article is well summarized to the point that the reader understands the article “at a high level”, including how the study was designed and the results that came out of it. It is clearly articulated how your study pushes the existing literature on your topic of interest, answering the question “Why is your research important?”
Methodology
This section includes any exploratory data analysis and a brief description of your statistical procedure. If you are fitting a model, explain the reasoning for the type of model you’re fitting, predictor variables considered for the model including any interactions. Additionally, show how you arrived at the final model by describing the model selection process, interactions considered, variable transformations (if needed), assessment of conditions and diagnostics, and any other relevant considerations that were part of the model fitting process.
If you are conducting a hypothesis test or creating a confidence interval, justify your decision, specifically how it relates to your research question. Justify whether you will use theory-based or simulation-based techniques. Use the resulting output to help address your research question.
You may use more than one statistical method to answer your research question.
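To make the theory-based versus simulation-based distinction concrete, here is a minimal base R sketch that builds a 95% confidence interval for a mean both ways. The sample `x` is simulated purely for illustration, and the percentile bootstrap shown is one common simulation-based approach, not necessarily the exact workflow used in class.

```r
set.seed(123)  # fixed seed so the simulation is reproducible

# Hypothetical numeric sample, simulated here only for illustration.
x <- rnorm(50, mean = 10, sd = 2)

# Theory-based 95% confidence interval for the mean (t distribution).
theory_ci <- t.test(x, conf.level = 0.95)$conf.int

# Simulation-based 95% interval: percentile bootstrap of the sample mean.
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))

theory_ci
boot_ci
```

With a roughly symmetric sample like this one, the two intervals should be similar. When they differ noticeably, that is worth discussing when you justify your choice.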
Grading criteria
The analysis steps are appropriate for the data and research question. The group used a thorough and careful approach to determine analysis types and addressed any concerns over the appropriateness of the analyses chosen.
Results
This is where you will discuss your overall finding and describe the key results from your analysis. The goal is not to interpret every single element of an output shown, but instead to address the research questions, using the interpretations to support your conclusions. Focus on the variables that help you answer the research question and that provide relevant context for the reader.
Grading criteria
The analysis results are clearly assessed and interesting findings from the analysis are described. Interpretations are used to support the key findings and conclusions, rather than merely listing, e.g., the interpretation of every model coefficient.
Discussion
In this section you’ll include a summary of what you have learned about your research question along with statistical arguments supporting your conclusions. In addition, discuss the limitations of your analysis and provide suggestions on ways the analysis could be improved. Any potential issues pertaining to the reliability and validity of your data and appropriateness of the statistical analysis should also be discussed here. Lastly, this section will include ideas for future work.
Grading criteria
Overall conclusions from analysis are clearly described, and the analysis results are put into the larger context of the subject matter and original research question. There is thoughtful consideration of potential limitations of the data and/or analysis, and ideas for future work are clearly described.
Organization + formatting
This is an assessment of the overall presentation and formatting of the written report.
Grading criteria
The report is neatly written and organized with clear section headers and appropriately sized figures with informative labels. Numerical results are displayed with a reasonable number of digits, and all visualizations are neatly formatted. All citations and links are properly formatted. If there is an appendix, it is reasonably organized and easy for the reader to find relevant information. All code, warnings, and messages are suppressed. The main body of the written report (not including the appendix or plots) is recommended to be no longer than 10 pages. Going past this is okay, but you should reconsider the content of your report and whether there is “fluff” in it.
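One possible way to suppress code, warnings, and messages across the whole report is a setup chunk near the top of report.qmd, sketched below. This assumes the report is rendered with the knitr engine and that the knitr package is installed. Quarto's document-level execution options are another route to the same result.

```r
# Placed in a setup chunk near the top of report.qmd (assumes the knitr engine).
knitr::opts_chunk$set(
  echo = FALSE,     # hide code
  warning = FALSE,  # hide warnings
  message = FALSE   # hide messages
)
```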
Report grading
The written report is worth 40 points, broken down as follows
Total | 40 pts |
---|---|
Introduction/data | 5 pts |
Literature Review | 3 pts |
Methodology | 10 pts |
Results | 12 pts |
Discussion | 6 pts |
Organization + formatting | 4 pts |
Teamwork
You will be asked to fill out a survey where you rate the contribution and teamwork of each team member by assigning a contribution percentage for each team member. Filling out the survey is a prerequisite for getting credit on the team member evaluation. If you are suggesting that an individual did less than half the expected contribution given your team size (e.g., for a team of four students, if a student contributed less than 12.5% of the total effort), please provide some explanation. If any individual gets an average peer score indicating that this was the case, their grade will be assessed accordingly and penalties may apply beyond the teamwork component of the grade.
If this is not indicated on the survey, I will assume everyone on the team contributed equally, and everyone will receive full credit for the teamwork portion of the grade.
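The "less than half the expected contribution" threshold is simple arithmetic: an equal split of 100% across the team, divided by two. A small sketch, with the hypothetical helper name `half_expected_share()`:

```r
# Threshold below which a contribution counts as
# "less than half the expected contribution", in percent.
half_expected_share <- function(team_size) {
  expected_share <- 100 / team_size  # equal split, in percent
  expected_share / 2
}

half_expected_share(4)  # 12.5, matching the example in the text
```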
Overall grading
The grade breakdown is as follows:
Total | 100 pts |
---|---|
Project proposal | 15 pts |
Draft report | 15 pts |
Final report | 40 pts |
Slides + presentation | 20 pts |
Teamwork | 10 pts |
Grading summary
Grading of the project will take into account the following:
- Content - What is the quality of research and/or policy question and relevancy of data to those questions?
- Correctness - Are statistical procedures carried out and explained correctly?
- Writing and Presentation - What is the quality of the statistical presentation, writing, and explanations?
- Creativity and Critical Thought - Is the project carefully thought out? Are the limitations carefully considered? Does it appear that time and effort went into the planning and implementation of the project?
A general breakdown of scoring is as follows:
- 90%-100%: Outstanding effort. Student understands how to apply all statistical concepts, can put the results into a cogent argument, can identify weaknesses in the argument, and can clearly communicate the results to others.
- 80%-89%: Good effort. Student understands most of the concepts, puts together an adequate argument, identifies some weaknesses of their argument, and communicates most results clearly to others.
- 70%-79%: Passing effort. Student has misunderstanding of concepts in several areas, has some trouble putting results together in a cogent argument, and communication of results is sometimes unclear.
- 60%-69%: Struggling effort. Student is making some effort, but has misunderstanding of many concepts and is unable to put together a cogent argument. Communication of results is unclear.
- Below 60%: Student is not making a sufficient effort.
Late work policy
There is no late work accepted on this project. Be sure to turn in your work early to avoid any technological mishaps.