Olympic Rowing Medals Between 1900 and 2022 - Data Wrangling

dplyr
filtering
grouping and summarizing
mutating
Arranging data to analyze the total number of medals and the weighted points for nations competing in rowing events in the Summer Olympic Games between 1900 and 2022.
Authors and Affiliation

Abigail Smith, St. Lawrence University
Robin Lock, St. Lawrence University
Ivan Ramler, St. Lawrence University

Published

June 13, 2024

Introduction

This activity organizes data on Olympic rowing between 1900 and 2022 so that the total medals and points for each country can be analyzed.

The Summer Olympic Games are an international sporting event held every four years and hosted in different countries around the world. Rowing was scheduled for the 1896 Olympics, but the races were cancelled due to weather. It has been contested at every Summer Olympics since 1900. Olympic rowing races are regatta style, meaning that multiple boats race head to head in separate lanes. Since 1912, the standard distance for Olympic regattas has been 2000 m; before then, distances varied. The first boat to cross the finish line is awarded a gold medal, the second a silver medal, and the third a bronze.

Over its time as an Olympic sport, rowing has included 25 different events. These events are separated by gender and vary by the number of rowers in the boat (1, 2, 4, 6, 8, 17), the rigging (inrigged or outrigged), the rowing style (sculling or sweeping), and whether or not the boat is coxed. In an inrigged shell, the riggers (where the oar is attached to the boat) are on the inside of the boat; in an outrigged shell, they are on the outside. In sculling, each rower has two oars, one on each side; in sweeping, each rower has a single oar on one side. The coxswain steers the boat and guides the rowers; some events use coxed boats and others do not.

The original data set (athlete_events.csv) covers every Olympic sport since 1896, not just rowing, so it must first be restricted to rowing. Moreover, the data record a medal for each athlete in a boat. In Olympic scoring, however, a medal counts once for the entire boat. The data must be arranged so that each medal-winning boat is counted only once.
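The two adjustments described above can be sketched as follows. This is a minimal illustration, not the worksheet's solution: the toy rows are hypothetical stand-ins for athlete_events.csv, and collapsing to one row per nation, year, event, and medal is an assumed way to count a boat once.

```r
library(dplyr)

# Hypothetical toy rows that reuse column names from athlete_events.csv
athletes <- data.frame(
  Name  = c("A", "B", "C", "D", "E", "F"),
  NOC   = c("GBR", "GBR", "FRA", "FRA", "USA", "USA"),
  Year  = c(2000, 2000, 2000, 2000, 2000, 2000),
  Sport = c("Rowing", "Rowing", "Rowing", "Rowing", "Rowing", "Swimming"),
  Event = c(rep("Rowing Men's Coxless Pairs", 5),
            "Swimming Men's 100 metres Freestyle"),
  Medal = c("Gold", "Gold", "Silver", "Silver", NA, "Gold"),
  stringsAsFactors = FALSE
)

boat_medals <- athletes %>%
  # Keep rowing only and drop athletes who did not medal (Medal is NA)
  filter(Sport == "Rowing", !is.na(Medal)) %>%
  # Assumption: one row per nation, year, event, and medal represents one boat
  distinct(NOC, Year, Event, Medal)

boat_medals
```

One caveat with this choice: if a nation ever had two boats earn the same medal in the same event and year (for example, in a tie), they would collapse into a single row, so the real data may deserve a closer look.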

Looking at the total medals and total points for each nation shows which nations dominate Olympic rowing. Additionally, the overall distribution of medals across all countries provides insight into just how lopsided medaling can be in rowing at the Olympic level. This effect could plausibly be attributed to how much funding nations put into their rowing teams.
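As a rough sketch of how totals per nation might be computed, the example below starts from a hypothetical boat-level table (one row per medal-winning boat). The 3-2-1 weighting for gold, silver, and bronze is an assumption made for illustration; the worksheet may define the weighted points differently.

```r
library(dplyr)

# Hypothetical boat-level medals: one row per medal-winning boat
boat_medals <- data.frame(
  NOC   = c("GBR", "GBR", "FRA", "USA", "USA", "USA"),
  Medal = c("Gold", "Bronze", "Silver", "Gold", "Gold", "Silver"),
  stringsAsFactors = FALSE
)

nation_totals <- boat_medals %>%
  # Assumed weights: gold = 3, silver = 2, bronze = 1
  mutate(points = case_when(
    Medal == "Gold"   ~ 3,
    Medal == "Silver" ~ 2,
    Medal == "Bronze" ~ 1
  )) %>%
  group_by(NOC) %>%
  summarize(
    total_medals = n(),          # number of medal-winning boats
    total_points = sum(points)   # weighted points under the assumed weights
  ) %>%
  arrange(desc(total_points))

nation_totals
```

Sorting by total_points rather than total_medals rewards nations whose medals skew toward gold, which is one reason the two rankings can differ.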

This activity could be used as an example or a short take-home assessment. It is best used as a precursor to the applied regression analysis worksheets in the other rowing module.

By the end of the activity, students will be able to:

  1. Filter a dataset to remove NA values
  2. Group data to produce summarized data
  3. Mutate data to create new variables

Students will use basic dplyr skills to structure data to match the descriptions listed in the questions. Basic ggplot2 functions are also needed to visualize the data.
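For the visualization step, a bar chart of medals per nation is one reasonable starting point. The sketch below uses a small hypothetical summary table; the column names NOC and total_medals are assumptions and would need to match whatever the wrangling step produces.

```r
library(ggplot2)

# Hypothetical summary table: one row per nation
nation_totals <- data.frame(
  NOC          = c("USA", "GBR", "FRA"),
  total_medals = c(3, 2, 1)
)

# Order the bars from most to fewest medals so any lopsidedness is easy to see
p <- ggplot(nation_totals,
            aes(x = reorder(NOC, -total_medals), y = total_medals)) +
  geom_col() +
  labs(x = "Nation (NOC)", y = "Total rowing medals")

p
```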

All the questions will require the use of R or a similar programming language.

Data

The data set contains 271,106 rows covering athletes from 230 countries competing in 66 sports at 35 different Olympics. Each row represents an individual athlete competing in one event at one Olympics. There are 15 variables in the dataset.

Download data:

athlete_events.csv
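Once the file is downloaded, a quick check of its size and the number of distinct athletes, nations, sports, and Games can confirm that it loaded as expected. The sketch below writes a tiny hypothetical sample to a temporary file so that it runs on its own; with the real data, read.csv() would point at the downloaded athlete_events.csv instead.

```r
library(dplyr)

# Tiny hypothetical sample standing in for the downloaded file
path <- tempfile(fileext = ".csv")
writeLines(c(
  "ID,Name,NOC,Games,Sport,Medal",
  "1,A,GBR,2000 Summer Olympics,Rowing,Gold",
  "2,B,FRA,2000 Summer Olympics,Rowing,NA",
  "2,B,FRA,2004 Summer Olympics,Rowing,Silver"
), path)

# "NA" in the Medal column is read as a missing value by default
athletes <- read.csv(path, stringsAsFactors = FALSE)

dim(athletes)  # number of rows and number of variables

overview <- athletes %>%
  summarize(
    rows     = n(),
    athletes = n_distinct(ID),
    nations  = n_distinct(NOC),
    sports   = n_distinct(Sport),
    games    = n_distinct(Games)
  )

overview
```

Because each row is one athlete in one event, the row count is generally larger than the number of distinct athletes.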

Variable Descriptions

| Variable | Description |
|----------|-------------|
| ID | The ID number assigned to the athlete. |
| Name | The name of the athlete in first name, last name order. |
| Sex | The sex of the athlete. |
| Age | The age of the athlete at the time of competing. |
| Height | The height of the athlete at the time of competing, measured in cm. |
| Weight | The weight of the athlete at the time of competing, measured in kg. |
| Team | The team the athlete is a member of. In some cases this is different from the NOC. |
| NOC | The nation the athlete is representing. Usually the best variable to use when analyzing nations. |
| Games | The Olympic Games the case is from, e.g. “1992 Summer Olympics”. |
| Year | The year the athlete was competing. |
| Season | The season the athlete was competing in, “Summer” or “Winter”. |
| City | The city that hosted the Olympics for the case. |
| Sport | The sport the athlete competed in. |
| Event | The event the athlete competed in. |
| Medal | Which medal the athlete won, if any (NA otherwise). |

Data Source

Kaggle: 120 years of Olympic history: athletes and results

Materials

We provide editable qmd files along with their solutions.

Worksheet

Worksheet Answers

This data preparation worksheet helps students become familiar with basic dplyr tools for structuring data in a way that is easier to analyze. In doing so, it enables students to draw conclusions about medaling patterns in rowing as an Olympic sport.