Beijing 2022 Olympics - Cleaning and Summarizing Data
Module
Please note that these material have not yet completed the required pedagogical and industry peer-reviews to become a published module on the SCORE Network. However, instructors are still welcome to use these materials if they are so inclined.
Introduction
The Winter Olympics is an event hosted every 4 years, with the most recent one held in Beijing in 2022. In the 2022 Olympics, there were a total of 109 events across 15 disciplines.
For the Beijing 2022 Winter Olympics detailed information on athletes, events, and medal counts was collected. For example, attributes such as the athlete’s name, nationality, event specifics, and medal results are publicly available.
Analyzing data such as this can provide valuable insights into performance trends, country-wise achievements, and athlete statistics. Such analyses can help in identifying patterns, understanding factors contributing to success, and making informed predictions for future events.
In this module, we will explore some of these data sets and learn how to clean them into tidy formats. We will use these cleaned data sets to investigate questions such as which country won the most medals. Cleaning and manipulating data sets is a foundation skills in developing as a Data Scientist and opens many doors for what you can learn in the future.
Data
There are two data sets here, athletes
and medals
.The first athletes contains information on individual athletes. Each row in this data set represents a different athlete that competed at the Beijing 2022 Olympic Winter Games. This data set has 2897 rows and 14 columns.
Download non tidy data: athletes.csv
Athletes
Variable Descriptions
Variable | Description |
---|---|
name | Athlete Name |
short_name | Athlete Name(short) |
gender | Athlete Gender |
birth_date | Athlete Date of Birthday |
birth_place | Athlete Place of Birthday |
birth_country | Athlete Country of Birthday |
country | Athlete Country |
country_code | Athlete Country Code |
discipline | Discipline |
discipline_code | Discipline Code |
residence_place | Athlete Residence Place |
residence_country | Athlete Residence Country |
height_m/ft | Athlete Height (meters and feet) |
url | Athlete Profile Link |
The second medals contains information on medal winners. Each row in this data set represents a medal won in the Beijing 2022 Olympic Winter Games. This can mean Gold, Bronze, or Silver medals. This data set has 694 rows and 12 columns.
Download non tidy data: medals.csv
Download tidy data: medalsTidy.csv
Medals
Variable Descriptions
Variable | Description |
---|---|
medal_type | Medal Type (Gold, Bronze, Silver) |
medal_code | Medal Number (1, 2, 3) |
medal_date | Date Medal Won |
athlete_short_name | Athlete Name (Short) |
athlete_name | Athlete Name |
athlete_sex | Athlete Gender |
athlete_link | Athlete Profile Link |
event | Event Competed In |
country | Country |
country_code | Country Code |
discipline | Discipline |
discipline_code | Discipline Code |