WNBA Data Cleaning Module

Data cleaning
Missing Data
Tables
This data set contains team game statistics for 16 WNBA teams and for 20 seasons.
Authors
Affiliation

Kristen Varin

St. Lawrence University

Ivan Ramler

St. Lawrence University

Published

May 15, 2024

Module

Please note that these material have not yet completed the required pedagogical and industry peer-reviews to become a published module on the SCORE Network. However, instructors are still welcome to use these materials if they are so inclined.

Introduction

The WNBA worksheet will introduce the idea of predicting playoff teams from their record earlier in the season. Most of the time, it is expected that teams with better records halfway through the season will have a higher chance of making it to the playoffs, but is this always the case? How do you know that the data you are using is a valid way of predicting this? By completing this worksheet, you will be guided through various data cleaning steps and will be able to answer these questions.

The Women’s National Basketball Association (WNBA) is a professional basketball league in the United States that was founded in 1996. It was established by the NBA as a counterpart to promote women’s basketball at the professional level. The league’s inaugural season began in 1997 with eight teams.

Throughout its history, the WNBA has been a pioneering force in women’s sports, providing a platform for talented athletes to showcase their skills and inspire fans globally. The league has expanded and contracted over the years and, as of 2024, consists of 12 teams.

WNBA expansion and contraction
Season(s) No. of teams
1997 8
1998 10
1999 12
2000–2002 16
2003 14
2004–2005 13
2006 14
2007 13
2008 14
2009 13
2010–2024 12
2025 13
2026–future 14

The WNBA’s playoff structure typically features the top eight teams from the regular season standings advancing to the postseason. The playoffs are organized into single-elimination rounds, culminating in the WNBA Finals, where the last two teams standing compete in a best-of-five series to determine the league champion.

Beyond its competitive play, the WNBA has been a leader in promoting social justice initiatives and advocating for equality both on and off the court. It continues to grow in popularity and influence, contributing significantly to the growth of women’s basketball worldwide.

By the end of this activity, you will be able to:

  • Use various dplyr, tidyr, and lubridate package functions to clean a data set for further use.

  • Learn the process of cleaning data, recognizing any problems within the data, and determining whether the data can be used for the given analysis problem.

  • Understand why missing data presents a problem for analysis and brainstorm ways to troubleshoot this problem.

Technology requirement: The activity handout requires knowledge of Quarto and the following tidyverse packages: dplyr, tidyr, lubridate. The activity will allow students to use these packages together to complete a more complex task.

Data

The wnba_data data set contains 8920 rows and 9 columns. Each row represents a game played by a WNBA team in one of the 2003 to 2022 regular seasons. Thus, each game is associated with two rows: one for each team. The columns are as follows:

Data: Variable Descriptions
Variable Description
game_id game id number
season season number
season_type binary predictor; 2 if regular season game; 3 if playoff game
game_date date of the game
team_id team id number
team_display_name full team name (name and city)
team_winner Boolean; True if the team won the game
opponent_team_id id number of the opponent
team_home_away Where the game was played; either “home” or “away”

Download data: wnba_data.csv

Data Source

Gilani S, Hutchinson G (2022). wehoop: Access Women’s Basketball Play by Play Data. R package version 1.5.0, https://CRAN.R-project.org/package=wehoop.

Materials

We offer worksheets (and their solutions) in Quarto (using R) format.

Class handout - Quarto

Class handout - Quarto - with solutions

Exploration of the WNBA data revealed that there are missing values for games throughout the 2003-2022 seasons. We found this by tallying the number of games played by each team within each season and noticing that the number of games played was inconsistent between teams. We then used the internet to find an example of a missing game within the dataset.

Reflecting on the challenges that missing values presented us, we came up with a few solutions to continue our analysis. Some of these include: filtering the data for seasons that have no missing data or using the internet to manually fill in data. While neither of these methods are perfect and present other challenges, they are valid ways to deal with missing values.

Once dealing with the seasons with imperfect data, students can continue the analysis to find a two way table depicting mid season record, playoff prediction, season record and whether a team made playoffs.