Merging Multiple Seasons of NCAA Div 1 Softball Statistics
Module
Please note that these material have not yet completed the required pedagogical and industry peer-reviews to become a published module on the SCORE Network. However, instructors are still welcome to use these materials if they are so inclined.
Introduction
Data for a particular sport is often stored across numerous locations. For example, in NCAA Division I Softball, batting statistics for each season are typically stored in separate tables. (See for example the statistics hosted by https://d1softball.com/.)
Suppose we are interested in tracking the statistics of players across multiple seasons. A common way to prep the data to do this is to use join statements to merge each seasons data into one table with one row per player (and columns associated with their different statistics for each season).
This module looks at some simple batting stats over two seasons through the use of joining functions for a small subset of NCAA Division 1 Softball players’ statistics for the 2021 and 2022 seasons. (This is only a small window of a much bigger dataset).
Data
There are four data tables associated with this module.
Two of them, batting2021
and batting2022
, contain statistics for all Division 1 Softball player who played in the 2021 and/or 2022 season.
The other two, batting2021_subset
and batting2022_subset
, represent non-random samples taken the full data. These are displayed below. R is the number of runs scored by the player, H is the number of hits.
Additional details for the full data (batting2021 and batting2022)
The 2021 data set has 2131 rows with 22 columns. The 2022 data set has 2275 rows with the same 22 columns. Each row represents a Division 1 Softball player who played in the 2021 and/or 2022 season.
Variable | Description |
---|---|
Player | Name of the softball player |
Team | Name of the team (school) for each player |
BA | Batting Average - The ratio of a player’s total base hits to their total number of at-bats, indicating their ability to make successful hits. |
OBP | On-Base Percentage - The percentage of times a player reaches base, either through a hit, walk, or hit-by-pitch, out of their total plate appearances. |
SLG | Slugging Percentage - The measure of a player’s power by calculating the total number of bases they accumulate per at-bat. |
OPS | On-Base Plus Slugging - The sum of a player’s on-base percentage and slugging percentage, providing a comprehensive measure of their offensive contribution. |
GP | Games Played - The total number of games in which a player has participated. |
PA | Plate Appearances - The total number of times a player has come up to bat, including at-bats, walks, and hit-by-pitches. |
AB | At-Bats - The number of times a player has officially faced a pitcher and had an opportunity to hit. |
R | Runs Scored - The number of times a player has crossed home plate and scored a run. |
H | Hits - The total number of successful hits made by a player. |
2B | Doubles - The number of hits resulting in the batter reaching second base. |
3B | Triples - The number of hits resulting in the batter reaching third base. |
HR | Home Runs - The number of hits resulting in the batter scoring a run by hitting the ball out of the park. |
RBI | Runs Batted In - The number of runs a player has driven in with a hit or sacrifice. |
HBP | Hit by Pitch - The number of times a player has been struck by a pitched ball and awarded first base. |
BB | Walks - The number of times a player has received a base on balls (four balls) and been awarded first base. |
K | Strikeouts - The number of times a player has been called out after accumulating three strikes. |
SB | Stolen Bases - The number of times a player successfully advances to the next base without a hit, during a pitch while the ball is in play. |
CS | Caught Stealing - The number of times a player is thrown out while attempting to steal a base. |
CSV format data files
- batting2021_subset.csv: Subset of full dataset containing the season total hits and runs for a non-random sample of six players from 2021
- batting2022_subset.csv: Subset of full dataset containing the season total hits and runs for a non-random sample of six players from 2022
- batting2021.csv: Dataset with the 2021 batting statistics for all 2131 Division 1 softball players
- batting2022.csv: Dataset with the 2022 batting statistics for all 2275 Division 1 softball players
Database (SQL) style
- softball_batting.duckdb: DuckDB database containing four table: the full set of players for each season (batting2021 and batting2022) and the two subsets (batting2021_subset and batting2022_subset).
- softball_batting.sqlite: SQLite database containing four table: the full set of players for each season (batting2021 and batting2022) and the two subsets (batting2021_subset and batting2022_subset).
Data Source
Statistics. D1Softball. (2019, October 24). https://d1softball.com/statistics/