Premier Hockey Federation Shots - Solutions

In this worksheet, you will analyze play-by-play shot data from the 2021–2022 PHF season, including information about shooters, goalies, teams, and shot outcomes. A central focus will be shot efficiency and goaltending effectiveness, measured using statistics such as shooting percentage and save proportion.

Throughout this worksheet, you will clean, summarize, and visualize real hockey data using R, much like a sports data analyst working for a professional team. Using tools from the tidyverse, you will explore patterns in player and team performance.

Load the following libraries:

library(tidyverse)
library(readr)

A. Loading in the data

Load the CSV file of shot data from the Premier Hockey Federation 2021-2022 season, hosted on the SCORE Network website, and assign the data set a name. URL: https://data.scorenetwork.org/data/phf-shots-2021.csv. You can read the file directly from the URL using read_csv().

Click for solution
url <- "https://data.scorenetwork.org/data/phf-shots-2021.csv"

phf_shots <- read_csv(url)

B. Exploring the data

  1. How many rows and columns does the data set have? Hint: use the dim() function.
Click for solution
dim(phf_shots)
[1] 1502   17
# 1502 rows and 17 columns
  2. Use the View() function (or head() / glimpse()) to explore the data, then answer the questions that follow. Note: View() only works interactively in RStudio and will not render in a Quarto document.
Click for solution
# View(phf_shots) # use interactively in RStudio
glimpse(phf_shots)
Rows: 1,502
Columns: 17
$ play_description <chr> "#26 Kiira Dosdall-Arena blocked by #44 Lindsay Eastw…
$ play_type        <chr> "Shot BLK", "Goal", "Shot", "Shot", "Shot", "Shot", "…
$ period_id        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ time_remaining   <chr> "18:25", "18:17", "17:59", "17:55", "14:33", "12:54",…
$ sec_from_start   <dbl> 95, 103, 121, 125, 327, 426, 445, 455, 515, 523, 556,…
$ home_team        <chr> "Metropolitan Riveters", "Metropolitan Riveters", "Me…
$ away_team        <chr> "Toronto Toronto", "Toronto Toronto", "Toronto Toront…
$ home_goals       <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ away_goals       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ shooting_team    <chr> "Metropolitan Riveters", "Metropolitan Riveters", "Me…
$ player_name_1    <chr> "Kiira Dosdall-Arena", "Leila Kilduff", "Allie Olnowi…
$ player_name_2    <chr> "Lindsay Eastwood", "Kelly Babsck", "Elaine Chuli", "…
$ goalie_involved  <chr> "Elaine Chuli", "Elaine Chuli", "Elaine Chuli", "Elai…
$ shot_result      <chr> "blocked", "made", "saved", "saved", "saved", "saved"…
$ on_ice_situation <chr> "Even Strength", "Even Strength", "Even Strength", "E…
$ home_score_total <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
$ away_score_total <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
  3. Assuming that each team took at least one shot during the season (that is, each team appears in the shooting_team column), use dplyr and/or tidyr functions to create a data frame listing every team that played this season and the total number of shots each team took.

The column names of your table should be: Team and Shots

Click for solution
phf_teams <- phf_shots |>
  group_by(shooting_team) |>
  summarise(Shots = n()) |>
  rename(Team = shooting_team)
  
phf_teams
# A tibble: 6 × 2
  Team                  Shots
  <chr>                 <int>
1 Boston Pride            423
2 Buffalo Beauts          166
3 Connecticut Whale       172
4 Metropolitan Riveters   124
5 Minnesota Whitecaps     229
6 Toronto Toronto         388
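
As an aside, the group_by() + summarise(n()) pattern in the solution can be written more compactly with dplyr's count(). The sketch below runs on a small made-up tibble; toy_shots and its team names are hypothetical and are not taken from the PHF data.

```r
library(dplyr)

# Hypothetical toy data standing in for phf_shots
toy_shots <- tibble(
  shooting_team = c("Team A", "Team B", "Team A", "Team C", "Team A", "Team B")
)

# count() groups, tallies, and ungroups in one step;
# `name` sets the name of the tally column
toy_teams <- toy_shots |>
  count(shooting_team, name = "Shots") |>
  rename(Team = shooting_team)

toy_teams
```

Both approaches should give the same table; count() simply saves a step and returns an ungrouped result.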
  4. Using the data frame from Question 3, make a lollipop plot of the number of shots taken by each team, ordered from most to fewest shots. Include a title and axis labels on your plot.
Click for solution
phf_teams <- phf_teams |>
  mutate(Team = fct_reorder(Team, Shots))

ggplot(data = phf_teams, aes(x = Team, y = Shots)) +
  geom_point() +
  geom_segment(aes(xend = Team, y = 0, yend = Shots)) +
  coord_flip() +
  theme_minimal() +
  labs(title = "Total Shots Taken by Each Team in the '21-'22 PHF Season",
       y = "Total Number of Shots",
       x = "Team")

  5. Create a new variable called is_goal that is 1 when shot_result == "made" and 0 otherwise. Hint: use if_else().
Click for solution
phf_shots <- phf_shots |>
  mutate(is_goal = if_else(shot_result == "made", 1, 0))
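
One reason to code is_goal as 0/1 is that its mean is the proportion of shots that were goals, which gives a quick route to a shooting percentage. The sketch below uses hypothetical toy data (toy_shots is not part of the worksheet). It also assumes that every recorded attempt, including blocked shots, counts in the denominator; a different definition of shooting percentage would need a filter first.

```r
library(dplyr)

# Hypothetical toy data standing in for phf_shots
toy_shots <- tibble(
  shooting_team = c("Team A", "Team A", "Team A", "Team A", "Team B", "Team B"),
  shot_result   = c("made", "saved", "blocked", "saved", "made", "saved")
)

toy_pct <- toy_shots |>
  mutate(is_goal = if_else(shot_result == "made", 1, 0)) |>
  group_by(shooting_team) |>
  summarise(
    shots = n(),                  # all recorded shot attempts
    goals = sum(is_goal),         # number of goals
    shooting_pct = mean(is_goal)  # goals / shots
  )

toy_pct
```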
  6. The name of the player who took the shot is recorded in the column player_name_1. Rename this column to shooter. Then create a bar plot showing, in descending order, the ten shooters with the most shots in the season and the total number of shots each one took.
Click for solution
top10_shooters <- phf_shots |>
  rename(shooter = player_name_1) |> 
  group_by(shooter) |>
  summarise(total_shots = n()) |>
  arrange(desc(total_shots)) |>
  slice(1:10) |>
  mutate(shooter = fct_reorder(shooter, total_shots))

ggplot(data = top10_shooters, aes(x = shooter, y = total_shots)) +
  geom_col(fill = "skyblue") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Top 10 Shooters of the '21-'22 PHF Season",
       y = "Total Number of Shots",
       x = "Shooter")
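
The arrange(desc()) plus slice(1:10) combination works; dplyr also provides slice_max(), which does the sorting and the cut in one call. A minimal sketch on hypothetical counts (toy_counts and the player labels are invented for illustration):

```r
library(dplyr)

# Hypothetical per-shooter totals
toy_counts <- tibble(
  shooter     = c("P1", "P2", "P3", "P4"),
  total_shots = c(12, 30, 7, 30)
)

# Keep the 2 rows with the largest total_shots.
# By default slice_max() keeps ties, so it can return more than n rows;
# with_ties = FALSE forces exactly n rows.
top2 <- toy_counts |>
  slice_max(total_shots, n = 2, with_ties = FALSE)

top2
```

Whether ties should be kept is a judgment call: slice(1:10) silently drops a shooter tied for tenth place, while the default slice_max() would keep them.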

  7. In hockey, save proportion measures how often a goalie stops shots on goal. Shots on goal are shots recorded as “made” or “saved”; “blocked” shots are excluded because they are stopped by a defending skater, not the goalie. For each goalie, save proportion is calculated as (shots saved) / (total shots on goal).
    a. Using dplyr and/or tidyr functions, create a data frame with each goalie’s save proportion.

Hint: The variables of interest here are goalie_involved (which shows the names of goalies) and shot_result, which has categories: blocked, made, saved. Since this exploration excludes blocked shots, start by filtering to exclude them.

Click for solution
phf_goalies <- phf_shots |>
  filter(shot_result != "blocked") |>
  group_by(goalie_involved) |>
  summarise(
    shots_on_goal = n(),
    total_goals = sum(shot_result == "made"),
    total_saves = sum(shot_result == "saved"),
    save_prop = total_saves / shots_on_goal
  )
  
phf_goalies
# A tibble: 18 × 5
   goalie_involved           shots_on_goal total_goals total_saves save_prop
   <chr>                             <int>       <int>       <int>     <dbl>
 1 Abbie Ives                           95          11          84     0.884
 2 Abbie Ives replaced by #1            18           2          16     0.889
 3 Abbie Ives replaced by #3             8           3           5     0.625
 4 Allie Morse replaced by #            30           3          27     0.9  
 5 Amanda Leveille                     202          12         190     0.941
 6 Brooke Wolejko                        6           2           4     0.667
 7 Brooke Wolejko replaced b             4           0           4     1    
 8 Carly Jackson                       186          18         168     0.903
 9 Carly Jackson replaced by            12           3           9     0.75 
10 Elaine Chuli                        147          14         133     0.905
11 Elaine Chuli replaced by              4           1           3     0.75 
12 Lovisa Selander                     188          12         176     0.936
13 Lovisa Selander replaced              4           0           4     1    
14 Samantha Ridgewell                   26           6          20     0.769
15 Sonjia Shelly                        77           1          76     0.987
16 Tera Hofmann                         36           3          33     0.917
17 Victoria Hanson                      58           5          53     0.914
18 <NA>                                  3           3           0     0    
    b. You should notice an NA value under goalie_involved in the result. Remove the row with that NA value from the final data frame.
Click for solution
phf_goalies <- phf_goalies |>
  filter(!is.na(goalie_involved))

phf_goalies
# A tibble: 17 × 5
   goalie_involved           shots_on_goal total_goals total_saves save_prop
   <chr>                             <int>       <int>       <int>     <dbl>
 1 Abbie Ives                           95          11          84     0.884
 2 Abbie Ives replaced by #1            18           2          16     0.889
 3 Abbie Ives replaced by #3             8           3           5     0.625
 4 Allie Morse replaced by #            30           3          27     0.9  
 5 Amanda Leveille                     202          12         190     0.941
 6 Brooke Wolejko                        6           2           4     0.667
 7 Brooke Wolejko replaced b             4           0           4     1    
 8 Carly Jackson                       186          18         168     0.903
 9 Carly Jackson replaced by            12           3           9     0.75 
10 Elaine Chuli                        147          14         133     0.905
11 Elaine Chuli replaced by              4           1           3     0.75 
12 Lovisa Selander                     188          12         176     0.936
13 Lovisa Selander replaced              4           0           4     1    
14 Samantha Ridgewell                   26           6          20     0.769
15 Sonjia Shelly                        77           1          76     0.987
16 Tera Hofmann                         36           3          33     0.917
17 Victoria Hanson                      58           5          53     0.914
    c. You should also notice that the goalie_involved variable includes entries like “Allie Morse replaced by #”. Filter using goalie_involved %in% c(), where c() contains the names of the goalies that appear without a “replaced…” suffix.
Click for solution
phf_goalies <- phf_goalies |>
  filter(goalie_involved %in% c(
    "Abbie Ives", "Amanda Leveille", "Brooke Wolejko",
    "Carly Jackson", "Elaine Chuli", "Lovisa Selander",
    "Samantha Ridgewell", "Sonjia Shelly", "Tera Hofmann",
    "Victoria Hanson"
  ))

phf_goalies
# A tibble: 10 × 5
   goalie_involved    shots_on_goal total_goals total_saves save_prop
   <chr>                      <int>       <int>       <int>     <dbl>
 1 Abbie Ives                    95          11          84     0.884
 2 Amanda Leveille              202          12         190     0.941
 3 Brooke Wolejko                 6           2           4     0.667
 4 Carly Jackson                186          18         168     0.903
 5 Elaine Chuli                 147          14         133     0.905
 6 Lovisa Selander              188          12         176     0.936
 7 Samantha Ridgewell            26           6          20     0.769
 8 Sonjia Shelly                 77           1          76     0.987
 9 Tera Hofmann                  36           3          33     0.917
10 Victoria Hanson               58           5          53     0.914
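
Hard-coding the goalie names matches what the question asks for, but it is easy to leave a name out. As an alternative sketch, the same rows can be selected by pattern with stringr's str_detect(). The tibble below is hypothetical toy data; toy_goalies and its names are not from the worksheet.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data standing in for phf_goalies
toy_goalies <- tibble(
  goalie_involved = c("Goalie A", "Goalie A replaced by #1",
                      "Goalie B", "Goalie C replaced by #"),
  save_prop       = c(0.90, 0.75, 0.92, 0.88)
)

# Keep only rows whose goalie_involved does NOT contain "replaced by"
toy_clean <- toy_goalies |>
  filter(!str_detect(goalie_involved, "replaced by"))

toy_clean
```

As with the hard-coded list, a goalie who appears only in “replaced by” rows (Goalie C here) is dropped entirely.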
    d. Using the data frame from your answer to Question 7c above, create a visualization of your choice to display goalies’ save proportions in ascending order. Use coord_flip(), or map the variables directly, so that the goalie variable is on the y axis and save proportion is on the x axis. Remember to use descriptive axis labels and a title.
Click for solution
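# A lollipop plot, bar plot, or dot plot would all work here.
# Mapping save_prop to x and the goalie to y directly gives the same
# layout as coord_flip(), so coord_flip() is not needed below.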

phf_goalies <- phf_goalies |>
  mutate(goalie_involved = fct_reorder(goalie_involved, save_prop))

ggplot(phf_goalies, aes(x = save_prop, y = goalie_involved)) +
  geom_point(size = 3, colour = "skyblue") +
  theme_minimal() +
  labs(title = "Goalie Save Proportions in the '21-'22 PHF Season",
       x = "Save Proportion",
       y = "Goalie")

  8. So far, we have compared goalies using ordered plots. Another useful way to understand data is by examining its distribution.
    a. Using the phf_goalies data frame from Question 7c, create a histogram of goalie save proportions. Choose an appropriate bin width, and include a title and axis labels.
Click for solution
ggplot(phf_goalies, aes(x = save_prop)) +
  geom_histogram(binwidth = 0.02, fill = "skyblue") +
  theme_minimal() +
  labs(title = "Distribution of Goalie Save Proportions",
       x = "Save Proportion",
       y = "Count")

    b. Based on your histogram, describe the distribution of save proportions. In your answer, consider the following:
  • Is the distribution symmetric, left-skewed, or right-skewed?

  • Are there any noticeable outliers?

Click for solution

Most goalies have a save proportion between about 0.87 and 0.95. The distribution is left-skewed: one goalie has a save proportion below 0.70, a potential outlier well below the rest of the group. That goalie (Brooke Wolejko) faced only 6 shots on goal, so the low value rests on very little data.
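
A visual read of skew and outliers can be backed up with a couple of numbers. The sketch below uses a hypothetical vector of save proportions (toy_save_prop, invented values) and two common rules of thumb: a mean below the median suggests left skew, and values beyond 1.5 × IQR from the quartiles are flagged as potential outliers. Both are heuristics, not formal tests.

```r
# Hypothetical save proportions, invented for illustration
toy_save_prop <- c(0.67, 0.77, 0.88, 0.90, 0.91, 0.91, 0.92, 0.94, 0.99)

# Rule of thumb: mean < median suggests a left-skewed distribution
toy_mean   <- mean(toy_save_prop)
toy_median <- median(toy_save_prop)
toy_mean < toy_median

# Rule of thumb: flag values more than 1.5 * IQR outside the quartiles
q <- quantile(toy_save_prop, c(0.25, 0.75))
low_fence  <- q[[1]] - 1.5 * IQR(toy_save_prop)
high_fence <- q[[2]] + 1.5 * IQR(toy_save_prop)
flagged <- toy_save_prop[toy_save_prop < low_fence | toy_save_prop > high_fence]
flagged
```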