Premier Hockey Federation Shots - Worksheet
In this worksheet, you will analyze play-by-play shot data from the 2021–2022 PHF season, including information about shooters, goalies, teams, and shot outcomes. A central focus will be shot efficiency and goaltending effectiveness, measured using statistics such as shooting percentage and save proportion.
Throughout this worksheet, you will clean, summarize, and visualize real hockey data using R, much like a sports data analyst working for a professional team. Using tools from the tidyverse, you will explore patterns in player and team performance.
Load the following libraries:
    library(tidyverse)
    library(readr)
A. Loading in the data
Download the CSV file containing data from the Premier Hockey Federation 2021–2022 season from the SCORE Network website and assign the data set a name. You can load the data set directly from the URL using read_csv(). URL: https://data.scorenetwork.org/data/phf-shots-2021.csv
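As a minimal sketch of the loading step: read_csv() accepts a local path, a URL, or literal CSV text. The names toy_csv, toy_shots, and phf_shots are assumptions (the worksheet lets you choose any name), and the two-row inline CSV is a made-up stand-in so the example runs without an internet connection.

```r
library(readr)

# Hypothetical two-row stand-in for the real file. I() tells read_csv()
# to treat the string as literal CSV text rather than as a file path.
toy_csv <- I("shooting_team,shot_result\nTeam A,made\nTeam B,saved\n")
toy_shots <- read_csv(toy_csv, show_col_types = FALSE)

# For the real data, pass the worksheet URL instead (requires internet):
# phf_shots <- read_csv("https://data.scorenetwork.org/data/phf-shots-2021.csv")
```

Setting show_col_types = FALSE only silences the column-type message; leave it out if you want to see how each column was parsed.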
B. Exploring the data
- How many rows and columns does the data set have? Hint: use the dim() function.
- Use head() or glimpse() to explore the structure of the data, then answer the questions that follow.
- Assuming that each team took at least one shot during the season (that is, each team is listed in the shooting_team column), use dplyr and tidyr functions to create a data frame showing every team that played this season and the total number of shots each team took throughout the season.
The column names of your table should be Team and Shots.
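One possible approach to a per-group tally, sketched on a made-up shot log rather than the PHF data. The object names toy_shots and team_shots and the team labels are placeholders; only the shooting_team column name comes from the worksheet.

```r
library(dplyr)

# Made-up shot log: one row per shot attempt (team names are placeholders).
toy_shots <- tibble(
  shooting_team = c("Team A", "Team B", "Team A", "Team C", "Team A", "Team B")
)

# count() tallies the rows for each team; name = "Shots" labels the tally
# column, and rename() uses the pattern new_name = old_name.
team_shots <- toy_shots %>%
  count(shooting_team, name = "Shots") %>%
  rename(Team = shooting_team)
```

The same result can be reached with group_by() followed by summarise(Shots = n()); count() is simply a shorthand for that pair.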
- Using the data frame from Question 3, make a lollipop plot to visualize the number of shots taken by each team, in descending order of shots. Include a title and axis labels on your plot.
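A lollipop plot is a line segment (the stick) topped with a point. The sketch below shows one way to build it in ggplot2, assuming a table in the Team/Shots layout; the numbers are invented for illustration and are not PHF totals.

```r
library(ggplot2)

# Made-up team totals (not real PHF values).
team_shots <- data.frame(
  Team  = c("Team A", "Team B", "Team C"),
  Shots = c(310, 275, 342)
)

# reorder(Team, -Shots) sorts the teams from most to fewest shots.
lollipop <- ggplot(team_shots, aes(x = reorder(Team, -Shots), y = Shots)) +
  # The stick: a segment from 0 up to the team's shot total.
  geom_segment(aes(xend = reorder(Team, -Shots), y = 0, yend = Shots)) +
  # The head of the lollipop.
  geom_point(size = 3) +
  labs(title = "Shots by team (toy data)", x = "Team", y = "Total shots")

lollipop
```

Dropping the minus sign in reorder() would sort the teams in ascending order instead.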
- Create a new variable called is_goal that is 1 when shot_result == "made" and 0 otherwise. Hint: use if_else().
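A small sketch of if_else() inside mutate(), using a made-up four-row table (toy_shots is a placeholder name). The three arguments are the condition, the value when it is TRUE, and the value when it is FALSE.

```r
library(dplyr)

# Made-up outcomes covering all three shot_result categories.
toy_shots <- tibble(shot_result = c("made", "saved", "blocked", "made"))

toy_shots <- toy_shots %>%
  mutate(is_goal = if_else(shot_result == "made", 1, 0))

# Because is_goal is coded 0/1, its mean is the share of rows in this
# toy table that were goals.
mean(toy_shots$is_goal)  # 0.5
```

Coding the indicator as 0/1 rather than TRUE/FALSE makes it easy to sum or average later on.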
- The name of the player who took the shot is recorded under the column player_name_1. Rename this column to shooter. Then create a bar plot to show, in descending order, the ten most frequent shooters (the ten players with the most shots in the season) and the total number of shots each of them took.
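One way the rename, tally, and top-n steps might fit together, sketched on made-up data. The player labels and the names toy_shots, top_shooters, and shooter_plot are placeholders, and n = 2 is used only because the toy table is tiny.

```r
library(dplyr)
library(ggplot2)

# Made-up shot log; player names are placeholders.
toy_shots <- tibble(
  player_name_1 = c("Player A", "Player B", "Player A",
                    "Player C", "Player A", "Player B")
)

top_shooters <- toy_shots %>%
  rename(shooter = player_name_1) %>%           # new_name = old_name
  count(shooter, name = "shots") %>%            # shots per shooter
  slice_max(shots, n = 2, with_ties = FALSE)    # use n = 10 on the real data

# reorder(shooter, -shots) puts the most frequent shooter first.
shooter_plot <- ggplot(top_shooters, aes(x = reorder(shooter, -shots), y = shots)) +
  geom_col() +
  labs(title = "Most frequent shooters (toy data)", x = "Shooter", y = "Shots")

shooter_plot
```

By default slice_max() keeps ties, which can return more than n rows; with_ties = FALSE forces exactly n. Which behaviour you want on the real data is a judgment call.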
- In hockey, save proportion measures how often a goalie successfully stops shots on goal. Shots on goal are shots recorded as “made” or “saved”; “blocked” shots are excluded because they are stopped by other defensive players, not the goalie. Save proportion is calculated for each goalie as (shots saved) / (total shots on goal).
- Using some dplyr and/or tidyr functions, create a data frame with each goalie’s save proportion.
Hint: The variables of interest here are goalie_involved (which shows the names of goalies) and shot_result, which has categories: blocked, made, saved. Since this exploration excludes blocked shots, start by filtering to exclude them.
- You should notice an NA value under goalie_involved after completing this. Remove the row with that NA value from the final data frame.
- You should also notice that the goalie_involved variable includes entries like “Allie Morse replaced by #”. Filter using goalie_involved %in% c(), where c() takes the names of goalies without “replaced…” entries.
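The save-proportion steps above can be sketched on a made-up table as follows. Goalie names, the "replaced by #" entry, and the object names toy_shots, keep_goalies, and toy_goalies are all placeholders; only the column names goalie_involved and shot_result come from the worksheet.

```r
library(dplyr)

# Made-up shots, including an NA goalie and a "replaced by" entry.
toy_shots <- tibble(
  goalie_involved = c("Goalie A", "Goalie A", "Goalie A", "Goalie B",
                      "Goalie B", NA, "Goalie A replaced by #", "Goalie A"),
  shot_result     = c("saved", "saved", "made", "saved",
                      "made", "made", "saved", "blocked")
)

# Goalie names to keep (no "replaced..." entries).
keep_goalies <- c("Goalie A", "Goalie B")

toy_goalies <- toy_shots %>%
  filter(shot_result != "blocked") %>%              # keep shots on goal only
  group_by(goalie_involved) %>%
  summarise(
    saves         = sum(shot_result == "saved"),    # shots the goalie stopped
    shots_on_goal = n(),                            # made + saved
    save_prop     = saves / shots_on_goal
  ) %>%
  filter(!is.na(goalie_involved)) %>%               # drop the NA goalie row
  filter(goalie_involved %in% keep_goalies)         # drop "replaced..." rows
```

In this toy table Goalie A faces three shots on goal and saves two, and Goalie B faces two and saves one. Because NA %in% keep_goalies evaluates to FALSE, the last filter alone would also remove the NA row; the explicit is.na() step is kept here to mirror the worksheet's order.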
- Using the data frame from your answer to Question 7c above, create a visualization of your choice to display goalies’ save proportions in ascending order. Use coord_flip() so that the goalie variable is on the y axis and save proportion is on the x axis. Remember to use descriptive axis labels and a title.
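A brief sketch of how coord_flip() interacts with ordering and labels, using a bar chart as one possible choice of visualization. The data frame, its column names goalie and save_prop, and the values are invented for illustration.

```r
library(ggplot2)

# Made-up save proportions (not real PHF values).
toy_goalies <- data.frame(
  goalie    = c("Goalie A", "Goalie B", "Goalie C"),
  save_prop = c(0.91, 0.88, 0.93)
)

goalie_plot <- ggplot(toy_goalies, aes(x = reorder(goalie, save_prop), y = save_prop)) +
  geom_col() +
  coord_flip() +  # swaps the axes: goalies now run along the vertical axis
  labs(title = "Save proportion by goalie (toy data)",
       x = "Goalie", y = "Save proportion")

goalie_plot
```

Note that labs() still refers to the original aesthetics: the x label ("Goalie") is drawn on the vertical axis once the coordinates are flipped.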
- So far, we have compared goalies using ordered plots. Another useful way to understand data is by examining its distribution.
- Using the phf_goalies data frame from Question 7c, create a histogram of goalie save proportions. Choose an appropriate bin width, and include a title and axis labels.
- Based on your histogram, describe the distribution of save proportions. In your answer, consider the following:
  - Is the distribution symmetric, left-skewed, or right-skewed?
  - Are there any noticeable outliers?
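A minimal histogram sketch on made-up save proportions. The values, the column name save_prop, and the bin width of 0.01 are assumptions chosen for illustration; on the real phf_goalies data it is worth trying several bin widths before settling on one.

```r
library(ggplot2)

# Made-up save proportions (not real PHF values).
toy_goalies <- data.frame(
  save_prop = c(0.86, 0.89, 0.90, 0.91, 0.91, 0.92, 0.93)
)

save_hist <- ggplot(toy_goalies, aes(x = save_prop)) +
  # binwidth sets the width of each bar in save-proportion units;
  # boundary = 0 aligns bin edges to multiples of the bin width.
  geom_histogram(binwidth = 0.01, boundary = 0, colour = "white") +
  labs(title = "Distribution of save proportions (toy data)",
       x = "Save proportion", y = "Number of goalies")

save_hist
```

A bin width that is too small leaves mostly one-goalie bars, while one that is too large hides skew and outliers, so the choice affects how you answer the questions about shape.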