Beijing 2022 Olympics - Cleaning and Summarizing Data

Data Wrangling

Joining / Merging

Data Visualization

Demonstrating how to clean data with different packages and summarise the data using dplyr.

Authors

Affiliation

Eric Seltzer

St. Lawrence University

Ivan Ramler

St. Lawrence University

Published

March 8, 2024

Module

Please note that these material have not yet completed the required pedagogical and industry peer-reviews to become a published module on the SCORE Network. However, instructors are still welcome to use these materials if they are so inclined.

Introduction

The Winter Olympics is an event hosted every 4 years, with the most recent one held in Beijing in 2022. In the 2022 Olympics, there were a total of 109 events across 15 disciplines.

For the Beijing 2022 Winter Olympics detailed information on athletes, events, and medal counts was collected. For example, attributes such as the athlete’s name, nationality, event specifics, and medal results are publicly available.

Analyzing data such as this can provide valuable insights into performance trends, country-wise achievements, and athlete statistics. Such analyses can help in identifying patterns, understanding factors contributing to success, and making informed predictions for future events.

In this module, we will explore some of these data sets and learn how to clean them into tidy formats. We will use these cleaned data sets to investigate questions such as which country won the most medals. Cleaning and manipulating data sets is a foundation skills in developing as a Data Scientist and opens many doors for what you can learn in the future.

Learning Objectives

By the end of this activity, you will be able to:

Understand how to deal with columns that have incorrect variable types.
Describe what is considered a tidy data set.
Manipulate variables into either new variables, or changed versions of themselves.
Summarize key points and provide interpretations for these points.

Methods

Technology requirement: The activity handout provides examples of how to clean and summarize data. For this, the use of R Studio is required and we will be using a quarto document. However, the activity could be modified to ask students to produce that information from the raw dataset and/or extend the activity to investigate other variables available in the data. The activity could also easily be adapted to use other popular programming languages such as Python.

R Packages: Familiarity with the tidyverse suite of packages. In addition to the basic verbs in dplyr, students should be familiar with the separate command (tidyr) and joining commands (dplyr). Students will also need a small amount of lubridate and can benefit from some basic regular expressions (to be used with the separate function).
Tidy Data: You should understand what it means for a data set to be tidy, there is a general definition of requirements for what this means.
Knowledge of summarizing data, this can be done using dplyr functions. The outcomes of what these different functions do should be understood. For example, knowing how to calculate means, sample sizes, proportions and other summary statistics using the summarize function in dplyr.
Students should have some basic understanding of the join functions in R.

Data

There are two data sets here, athletes and medals.The first athletes contains information on individual athletes. Each row in this data set represents a different athlete that competed at the Beijing 2022 Olympic Winter Games. This data set has 2897 rows and 14 columns.

Download non tidy data: athletes.csv

Athletes

Variable Descriptions

Variable	Description
name	Athlete Name
short_name	Athlete Name(short)
gender	Athlete Gender
birth_date	Athlete Date of Birthday
birth_place	Athlete Place of Birthday
birth_country	Athlete Country of Birthday
country	Athlete Country
country_code	Athlete Country Code
discipline	Discipline
discipline_code	Discipline Code
residence_place	Athlete Residence Place
residence_country	Athlete Residence Country
height_m/ft	Athlete Height (meters and feet)
url	Athlete Profile Link

The second medals contains information on medal winners. Each row in this data set represents a medal won in the Beijing 2022 Olympic Winter Games. This can mean Gold, Bronze, or Silver medals. This data set has 694 rows and 12 columns.

Download non tidy data: medals.csv

Download tidy data: medalsTidy.csv

Medals

Variable Descriptions

Variable	Description
medal_type	Medal Type (Gold, Bronze, Silver)
medal_code	Medal Number (1, 2, 3)
medal_date	Date Medal Won
athlete_short_name	Athlete Name (Short)
athlete_name	Athlete Name
athlete_sex	Athlete Gender
athlete_link	Athlete Profile Link
event	Event Competed In
country	Country
country_code	Country Code
discipline	Discipline
discipline_code	Discipline Code

Data Source

Beijing 2022 Winter Olympic Games

Materials

Class Handout

Class Handout - Solutions

Conclusion

By working through the different exercises, students will become aware of how data cleaning and preparation are a key component to analysis.