Lab 2 - Group Formation Data wrangling

Lab

Important

This lab is due Monday May 29th at 11:59pm.

Learning goals

In this lab, you will…

form teams for future labs + project
use data wrangling to extract meaning from data
continue developing a workflow for reproducible data analysis
continue working with data visualization tools

Getting started

Go to the sta199-summer-1 organization on GitHub. Click on the repo with the prefix lab-02. It contains the starter documents you need to complete the lab.
Clone the repo and start a new project in RStudio. See the Lab 0 instructions for details on cloning a repo and starting a new R project.
First, open the Quarto document lab-02.qmd and Render it.
Make sure it compiles without errors.

Group Formation

After lab-2, lab assignments will be group based, and we will shortly be starting the group project.

In this lab, you are going choose who you want to work with. Groups are to be made up of 3-4 students per group. If you would like to be assigned to a group, please reach out to the TA.

Why Teams: Rationale

In the real world, data scientists and statisticians often work in research teams. It is a skill to be able to communicate and work together on common projects. Thus, the remaining labs + your project will be team based.

Teams work is better when members have a common understanding of the team’s goals and expectations for collaboration. The purpose of this activity is to help your team making a plan for working together during lab and outside of the scheduled lab time.

Each team member will have some ideas about how a team should operate. These ideas may be very different. This is your opportunity to share your thoughts and ideas to promote optimal team function and prevent misunderstandings in the future.

Team Name

Discuss with your group a team name to be called. Your GitHub repos will be created for this team name moving forward. Report your team name to your Lab Leader before moving on.

Instructions

There are two items you need to complete when forming your group.

– Report your group members and team name to your TA to be recorded

– Fill out the team agreement

Team agreement: Discuss each of the items below with all in-person team members. If necessary, also follow up this week with any missing team members.

Have one person act as the recorder and type the team’s decisions in the team-agreement.qmd file.

Be sure the team agrees on an item before it is added to the document.

Once the document is complete, the recorder should render, commit, and push the team agreement to GitHub. All team members can refer to this document throughout the semester.

This is not graded for accuracy, and simply acts as a tool to facilitate good group work.

Team Agreement

Weekly meetings

Identify a 1 - 2 hour weekly block outside of lab where the team can meet to work on assignments. All team members should block off this time on their calendar in case the group needs to meet to finish lab or work on the project.

Meeting “location”

How the team will meet to work together (e.g. in-person, Zoom, Facetime, Google Hangouts). Be sure every member is able to access the virtual meeting space, if needed. If you are unable to find a weekly time when the team can meet, briefly outline a plan to work on assignments outside of lab. Otherwise, you can delete this item.

Primary method of communication

The team’s primary method of communication outside of meetings (e.g. Slack, text messages, etc.)

How should someone notify the other members if they are unable to attend lab or a scheduled team meeting?

By when should everyone have their portion of the lab completed?

Keep in mind your team may want to have time to review the lab before turning it in to make sure it is a cohesive write up.

Any other items the team would like to discuss or plan.

Missing Teammates

If someone is missing in your lab, and you would like them to be a part of your team, please communicate this information with both them and the TA so this can be documented.

Data Wrangling

Note: This lab still requires individual submissions!

Warm up

Before we introduce the data, let’s warm up with some simple exercises.

Update the YAML, changing the author name to your name, and render the document.
Commit your changes with a meaningful commit message.
Push your changes to GitHub.
Go to your repo on GitHub and confirm that your changes are visible in your `.qmd and .pdf files. If anything is missing, render, commit, and push again.

Packages

We’ll use the tidyverse package for much of the data wrangling. This package is already installed for you. You can load it by running the following in your Console:

library(tidyverse)

Data

The dataset for this assignment can be found as a CSV (comma separated values) file in the data folder of your repository. You can read it in using the following.

nobel <- read_csv("data/nobel.csv")

The descriptions of the variables are as follows:

id: ID number
firstname: First name of laureate
surname: Surname
year: Year prize won
category: Category of prize
affiliation: Affiliation of laureate
city: City of laureate in prize year
country: Country of laureate in prize year
born_date: Birth date of laureate
died_date: Death date of laureate
gender: Gender of laureate
born_city: City where laureate was born
born_country: Country where laureate was born
born_country_code: Code of country where laureate was born
died_city: City where laureate died
died_country: Country where laureate died
died_country_code: Code of country where laureate died
overall_motivation: Overall motivation for recognition
share: Number of other winners award is shared with
motivation: Motivation for recognition

In a few cases the name of the city/country changed after laureate was given (e.g. in 1975 Bosnia and Herzegovina was called the Socialist Federative Republic of Yugoslavia). In these cases the variables below reflect a different name than their counterparts without the suffix _original.

born_country_original: Original country where laureate was born
born_city_original: Original city where laureate was born
died_country_original: Original country where laureate died
died_city_original: Original city where laureate died
city_original: Original city where laureate lived at the time of winning the award
country_original: Original country where laureate lived at the time of winning the award

Get to know your data

How many observations and how many variables are in the dataset? Use inline code to answer this question. What does each row represent?

There are some observations in this dataset that we will exclude from our analysis to match the Buzzfeed results.

Create a new data frame called nobel_living that filters for

laureates for whom country is available
laureates who are people as opposed to organizations (organizations are denoted with "org" as their gender)
laureates who are still alive (their died_date is NA)

Confirm that once you have filtered for these characteristics you are left with a data frame with 228 observations, once again using inline code.

Most living Nobel laureates were based in the US when they won their prizes

… says the Buzzfeed article. Let’s see if that’s true.

First, we’ll create a new variable to identify whether the laureate was in the US when they won their prize. We’ll use the mutate() function for this. The following pipeline mutates the nobel_living data frame by adding a new variable called country_us. We use an if statement to create this variable. The first argument in the if_else() function we’re using to write this if statement is the condition we’re testing for. If country is equal to "USA", we set country_us to "USA". If not, we set the country_us to "Other".

nobel_living <- nobel_living |>
  mutate(
    country_us = if_else(country == "USA", "USA", "Other")
  )

Next, we will limit our analysis to only the following categories: Physics, Medicine, Chemistry, and Economics.

nobel_living_science <- nobel_living |>
  filter(category %in% c("Physics", "Medicine", "Chemistry", "Economics"))

For the following exercises, work with the nobel_living_science data frame you created above. This means you’ll need to define this data frame in your Quarto document, even though the next exercise doesn’t explicitly ask you to do so.

Create a faceted bar plot visualizing the relationship between the category of prize and whether the laureate was in the US when they won the nobel prize. Interpret your visualization, and say a few words about whether the Buzzfeed headline is supported by the data.
- Your visualization should be faceted by category.
- For each facet you should have two bars, one for winners in the US and one for Other.
- Flip the coordinates so the bars are horizontal, not vertical.

Now is a good time to render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

But of those US-based Nobel laureates, many were born in other countries

Create a new variable called born_country_us in nobel_living_science that has the value "USA" if the laureate is born in the US, and "Other" otherwise. How many of the winners are born in the US?

Note

You should be able to ~~cheat~~ borrow from code you used earlier to create the country_us variable.

Add a second variable to your visualization from Exercise 3 based on whether the laureate was born in the US or not. Create two visualizations with this new variable added:
- Plot 1: Segmented frequency bar plot
- Plot 2: Segmented relative frequency bar plot (Hint: Add position = "fill" to geom_bar().)
Here are some instructions that apply to both of these visualizations:
- Your final visualization should contain a facet for each category.
- Within each facet, there should be two bars for whether the laureate won the award in the US or not.
- Each bar should have segments for whether the laureate was born in the US or not.
Which of these visualizations is a better fit for answering the following question: “Do the data appear to support Buzzfeed’s claim that of those US-based Nobel laureates, most were born in other countries?” First, state which plot you’re using to answer the question. Then, answer the question, explaining your reasoning in 1-2 sentences.

Now is a good time to render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

In a single pipeline, filter the nobel_living_science data frame for laureates who won their prize in the US, but were born outside of the US, and then create a frequency table (with the count() function) for their birth country (born_country) and arrange the resulting data frame in descending order of number of observations for each country. Which country is the most common?

Now is a good time to render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.

Submission

Once you are finished with the lab, you will your final PDF document to Gradescope.

Warning

Before you wrap up the assignment, make sure all documents are updated on your GitHub repo. We will be checking these to make sure you have been practicing how to commit and push changes.

You must turn in a PDF file to the Gradescope page by the submission deadline to be considered “on time”.

Make sure your data are tidy! That is, your code should not be running off the pages and spaced properly. See: https://style.tidyverse.org/ggplot2.html.

To submit your assignment:

Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials \(\rightarrow\) Duke NetID and log in using your NetID credentials.
Click on your STA 199 course.
Click on the assignment, and you’ll be prompted to submit it.
Mark all the pages associated with exercise. All the pages of your lab should be associated with at least one question (i.e., should be “checked”). If you do not do this, you will be subject to lose points on the assignment.
Do not select any pages of your .pdf submission to be associated with the “Workflow & formatting” question.

Grading

Component	Points
Ex 1	5
Ex 2	6
Ex 3	7
Ex 4	5
Ex 5	10
Ex 6	7
Team Agreement	5
Workflow & formatting	5
Total	50

Note

The “Workflow & formatting” grade is to assess the reproducible workflow. This includes:

linking all pages appropriately on Gradescope
putting your name in the YAML at the top of the document
committing the submitted version of your .qmd to GitHub
Are you under the 80 character code limit? (You shouldn’t have to scroll to see all your code). Pipes %>%, |> and ggplot layers + should be followed by a new line
You should be consistent with stylistic choices, e.g. only use 1 of = vs <- and %>% vs |>
All binary operators should be surrounded by space. For example x + y is appropriate. x+y is not.