Clinton Canoe Regatta: Graphing and Data Manipulation

Data visualization
Data wrangling
Using data from the Clinton Canoe Regatta to explore ggplot and dplyr
Published

June 17, 2025

Module

Please note that these materials have not yet completed the required pedagogical and industry peer reviews to become a published module on the SCORE Network. However, instructors are still welcome to use these materials if they are so inclined.

Introduction

The Clinton Canoe Regatta (the 70-miler) is an annual event in the downriver canoeing community. This race is one leg of the Canoeing Triple Crown (a challenge made up of three marathon canoe races in North America) and spans about 63 miles on the Susquehanna River in New York, from Cooperstown to Bainbridge. Paddlers compete against each other to achieve the fastest time in their class.

This module can serve as an in-class activity and should take roughly half an hour to complete.

By the end of this activity, students will:

  1. Increase their ability to create different plots in R using ggplot.

  2. Learn more about filtering to include specific observations.

  3. Be able to notice and fix issues in data.

Students should have introductory experience with both ggplot and dplyr.

Students will need to use RStudio.

Data

This activity uses the data set: full_clinton.csv

The full_clinton.csv data set includes data from Paddlestats about the Clinton Canoe Regatta for the years 2014-2025 (excluding 2020 and 2021, when no race was held due to the COVID-19 pandemic). Each row represents one racer in one year of the race.
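As a first check on the description above, a minimal sketch of loading the file and counting rows per year might look like the following. The file path and the helper name rows_per_year are assumptions made for illustration; they are not part of the module materials.

```r
library(dplyr)

# Assumption: full_clinton.csv has been saved in the working directory.
# Adjust this hypothetical path to wherever the file actually lives.
clinton_path <- "full_clinton.csv"

# Hypothetical helper: count rows (one per racer per year) in each race year.
rows_per_year <- function(data) {
  data %>%
    count(year, name = "n_racers")
}

if (file.exists(clinton_path)) {
  clinton <- read.csv(clinton_path)

  # Number of racers recorded in each year
  print(rows_per_year(clinton))

  # Years in 2014-2025 with no rows; per the description above,
  # this is expected to list 2020 and 2021
  print(setdiff(2014:2025, clinton$year))
}
```

Wrapping the count in a small function keeps the same check reusable after later filtering steps, for example on a single boat type.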

Variable Descriptions
Variable      Description
year          Year raced
boatType      Type of boat raced (canoe, kayak, SUP)
bib           Bib number
classID       Class (oc4stock, oc2pro, etc.)
place         Where the boat placed in its class's results
time          Time to complete the race (in seconds)
overall       Overall rank (out of all boats in all years)
byBoat        Rank out of all boats of the same type
byClass       Rank out of all boats in class (all years)
byYrClass     Rank out of the boats in their class in their year
byYrOver      Rank out of all the boats that year (all classes)
byYrBoat      Rank out of all boats of the same type that year
courseID      Version of the course raced (all the same here)
notes         Extra information (usually about class)
personID      The person's PaddleStats ID
displayName   Name of the racer
sortName      Name of the racer with last name first
city          Racer's home city (as given at registration)
state         Racer's home state or province (as given at registration)
country       Racer's home country (as given at registration)
teamName      Racer's team name
status        Finishing status (DNF, DNS, DQ, etc.)
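
To illustrate how these variables might be combined, here is a rough sketch that uses dplyr to keep rows with a recorded time, converts time from seconds to hours, and draws a boxplot by boat type with ggplot. The rows in toy are made up purely for illustration and are not real race results. The sketch also assumes that boats that did not finish have a missing time, which may not match how the real file encodes them.

```r
library(dplyr)
library(ggplot2)

# Made-up rows for illustration only; these are NOT real race results.
# Column names follow the variable descriptions above.
toy <- data.frame(
  year     = c(2023, 2023, 2023, 2024, 2024),
  boatType = c("canoe", "kayak", "canoe", "canoe", "kayak"),
  time     = c(28800, 30600, NA, 27000, 32400),
  status   = c(NA, NA, "DNF", NA, NA)
)

finishers <- toy %>%
  filter(!is.na(time)) %>%        # assumption: non-finishers have no time
  mutate(time_hr = time / 3600)   # convert seconds to hours

# Compare finish times across boat types
p <- ggplot(finishers, aes(x = boatType, y = time_hr)) +
  geom_boxplot() +
  labs(x = "Boat type", y = "Finish time (hours)")
p
```

Converting to hours before plotting keeps the axis readable, since raw times in seconds run into the tens of thousands.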

Data Source

The data are compiled from the Paddlestats website.

Materials

Module Worksheet

Solutions to Worksheet

Upon conclusion of this module, students will have a greater understanding of the uses of ggplot and dplyr and will be better prepared to use both with less clear-cut data.