Beijing 2022 Olympics - Cleaning and Summarizing Data

Data cleaning
Summarization
Merging
Demonstrating how to clean data with different packages and summarise the data using dplyr.
Authors
Affiliation

Eric Seltzer

St. Lawrence University

Ivan Ramler

St. Lawrence University

Published

March 8, 2024

Module

Please note that these material have not yet completed the required pedagogical and industry peer-reviews to become a published module on the SCORE Network. However, instructors are still welcome to use these materials if they are so inclined.

Introduction

The Winter Olympics is an event hosted every 4 years, with the most recent one held in Beijing in 2022. In the 2022 Olympics, there were a total of 109 events across 15 disciplines.

For the Beijing 2022 Winter Olympics detailed information on athletes, events, and medal counts was collected. For example, attributes such as the athlete’s name, nationality, event specifics, and medal results are publicly available.

Analyzing data such as this can provide valuable insights into performance trends, country-wise achievements, and athlete statistics. Such analyses can help in identifying patterns, understanding factors contributing to success, and making informed predictions for future events.

In this module, we will explore some of these data sets and learn how to clean them into tidy formats. We will use these cleaned data sets to investigate questions such as which country won the most medals. Cleaning and manipulating data sets is a foundation skills in developing as a Data Scientist and opens many doors for what you can learn in the future.

By the end of this activity, you will be able to:

  1. Understand how to deal with columns that have incorrect variable types.

  2. Describe what is considered a tidy data set.

  3. Manipulate variables into either new variables, or changed versions of themselves.

  4. Summarize key points and provide interpretations for these points.

Technology requirement: The activity handout provides examples of how to clean and summarize data. For this, the use of R Studio is required and we will be using a quarto document. However, the activity could be modified to ask students to produce that information from the raw dataset and/or extend the activity to investigate other variables available in the data. The activity could also easily be adapted to use other popular programming languages such as Python.

  1. R Packages: Familiarity with the tidyverse suite of packages. In addition to the basic verbs in dplyr, students should be familiar with the separate command (tidyr) and joining commands (dplyr). Students will also need a small amount of lubridate and can benefit from some basic regular expressions (to be used with the separate function).

  2. Tidy Data: You should understand what it means for a data set to be tidy, there is a general definition of requirements for what this means.

  3. Knowledge of summarizing data, this can be done using dplyr functions. The outcomes of what these different functions do should be understood. For example, knowing how to calculate means, sample sizes, proportions and other summary statistics using the summarize function in dplyr.

  4. Students should have some basic understanding of the join functions in R.

Data

There are two data sets here, athletes and medals.The first athletes contains information on individual athletes. Each row in this data set represents a different athlete that competed at the Beijing 2022 Olympic Winter Games. This data set has 2897 rows and 14 columns.

Download non tidy data: athletes.csv

Athletes

Variable Descriptions
Variable Description
name Athlete Name
short_name Athlete Name(short)
gender Athlete Gender
birth_date Athlete Date of Birthday
birth_place Athlete Place of Birthday
birth_country Athlete Country of Birthday
country Athlete Country
country_code Athlete Country Code
discipline Discipline
discipline_code Discipline Code
residence_place Athlete Residence Place
residence_country Athlete Residence Country
height_m/ft Athlete Height (meters and feet)
url Athlete Profile Link

The second medals contains information on medal winners. Each row in this data set represents a medal won in the Beijing 2022 Olympic Winter Games. This can mean Gold, Bronze, or Silver medals. This data set has 694 rows and 12 columns.

Download non tidy data: medals.csv

Download tidy data: medalsTidy.csv

Medals

Variable Descriptions
Variable Description
medal_type Medal Type (Gold, Bronze, Silver)
medal_code Medal Number (1, 2, 3)
medal_date Date Medal Won
athlete_short_name Athlete Name (Short)
athlete_name Athlete Name
athlete_sex Athlete Gender
athlete_link Athlete Profile Link
event Event Competed In
country Country
country_code Country Code
discipline Discipline
discipline_code Discipline Code

Data Source

Beijing 2022 Winter Olympic Games

Materials

Class Handout

Class Handout - Solutions

By working through the different exercises, students will become aware of how data cleaning and preparation are a key component to analysis.