Premier Hockey Federation Shots - Solutions
In this worksheet, you will analyze play-by-play shot data from the 2021–2022 PHF season, including information about shooters, goalies, teams, and shot outcomes. A central focus will be shot efficiency and goaltending effectiveness, measured using statistics such as shooting percentage and save proportion.
Throughout this worksheet, you will clean, summarize, and visualize real hockey data using R, much like a sports data analyst working for a professional team. Using tools from the tidyverse, you will explore patterns in player and team performance.
Load the following libraries:
library(tidyverse)
library(readr)
A. Loading in the data
Load the CSV file containing shot data from the Premier Hockey Federation 2021-2022 season from the SCORE Network website and assign the data set a name. URL: https://data.scorenetwork.org/data/phf-shots-2021.csv. You can read the file directly from the URL using read_csv().
Click for solution
url <- "https://data.scorenetwork.org/data/phf-shots-2021.csv"
phf_shots <- read_csv(url)
B. Exploring the data
1. How many rows and columns does the data set have? Hint: use the dim() function.
Click for solution
dim(phf_shots)
[1] 1502 17
# 1502 rows and 17 columns
2. Use the View() function (or head() / glimpse()) to explore the data, then answer the questions that follow. Note: View() only works interactively in RStudio and will not render in a Quarto document.
Click for solution
# View(phf_shots) # use interactively in RStudio
glimpse(phf_shots)
Rows: 1,502
Columns: 17
$ play_description <chr> "#26 Kiira Dosdall-Arena blocked by #44 Lindsay Eastw…
$ play_type <chr> "Shot BLK", "Goal", "Shot", "Shot", "Shot", "Shot", "…
$ period_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ time_remaining <chr> "18:25", "18:17", "17:59", "17:55", "14:33", "12:54",…
$ sec_from_start <dbl> 95, 103, 121, 125, 327, 426, 445, 455, 515, 523, 556,…
$ home_team <chr> "Metropolitan Riveters", "Metropolitan Riveters", "Me…
$ away_team <chr> "Toronto Toronto", "Toronto Toronto", "Toronto Toront…
$ home_goals <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ away_goals <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ shooting_team <chr> "Metropolitan Riveters", "Metropolitan Riveters", "Me…
$ player_name_1 <chr> "Kiira Dosdall-Arena", "Leila Kilduff", "Allie Olnowi…
$ player_name_2 <chr> "Lindsay Eastwood", "Kelly Babsck", "Elaine Chuli", "…
$ goalie_involved <chr> "Elaine Chuli", "Elaine Chuli", "Elaine Chuli", "Elai…
$ shot_result <chr> "blocked", "made", "saved", "saved", "saved", "saved"…
$ on_ice_situation <chr> "Even Strength", "Even Strength", "Even Strength", "E…
$ home_score_total <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
$ away_score_total <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
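After glimpse(), a natural next step is to tabulate a categorical column such as shot_result to see which categories occur and how often. The sketch below shows the idea with count() on a small made-up tibble; toy_shots and its values are hypothetical stand-ins, not the PHF data.

```r
library(dplyr)

# Hypothetical toy data standing in for phf_shots (values are made up)
toy_shots <- tibble(
  shot_result = c("saved", "made", "blocked", "saved", "saved", "made")
)

# Count the rows in each category, most frequent category first
result_counts <- toy_shots |>
  count(shot_result, sort = TRUE)

result_counts
```

Running the same count() call on phf_shots should list the blocked, made, and saved categories with their frequencies.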
3. Assuming that each team took at least one shot during the season (and so appears in the shooting_team column), use dplyr and/or tidyr functions to create a data frame showing every team that played this season and the total number of shots each team took.
The column names of your table should be Team and Shots.
Click for solution
phf_teams <- phf_shots |>
  group_by(shooting_team) |>
  summarise(Shots = n()) |>
  rename(Team = shooting_team)
phf_teams
# A tibble: 6 × 2
  Team                  Shots
  <chr>                 <int>
1 Boston Pride            423
2 Buffalo Beauts          166
3 Connecticut Whale       172
4 Metropolitan Riveters   124
5 Minnesota Whitecaps     229
6 Toronto Toronto         388
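The group_by() plus summarise(n()) pattern above can also be written with count(), which groups, counts, and ungroups in one step. A minimal sketch on made-up data follows; toy_shots and the team names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: one row per shot (team names are made up)
toy_shots <- tibble(
  shooting_team = c("Team A", "Team B", "Team A", "Team C", "Team A", "Team B")
)

# count() is shorthand for group_by() + summarise(n = n());
# the name argument sets the name of the count column
toy_teams <- toy_shots |>
  count(shooting_team, name = "Shots") |>
  rename(Team = shooting_team)

toy_teams
```

Either form should give the same Team and Shots table; count() is simply more compact when the only summary needed is the number of rows.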
4. Using the data frame from Question 3, make a lollipop plot to visualize the number of shots taken by each team in descending order of shots. Include a title and axis labels on your plot.
Click for solution
phf_teams <- phf_teams |>
  mutate(Team = fct_reorder(Team, Shots))
ggplot(data = phf_teams, aes(x = Team, y = Shots)) +
  geom_point() +
  geom_segment(aes(xend = Team, y = 0, yend = Shots)) +
  coord_flip() +
  theme_minimal() +
  labs(title = "Total Shots Taken by Each Team in the '21-'22 PHF Season",
       y = "Total Number of Shots",
       x = "Team")
5. Create a new variable called is_goal that is 1 when shot_result == "made" and 0 otherwise. Hint: use if_else().
Click for solution
phf_shots <- phf_shots |>
  mutate(is_goal = if_else(shot_result == "made", 1, 0))
6. The name of the player who took the shot is recorded in the column player_name_1. Rename this column to shooter. Then create a bar plot showing, in descending order, the ten most frequent shooters (the players with the ten highest shot totals in the season) and the total number of shots each took.
Click for solution
top10_shooters <- phf_shots |>
  rename(shooter = player_name_1) |>
  group_by(shooter) |>
  summarise(total_shots = n()) |>
  arrange(desc(total_shots)) |>
  slice(1:10) |>
  mutate(shooter = fct_reorder(shooter, total_shots))
ggplot(data = top10_shooters, aes(x = shooter, y = total_shots)) +
  geom_col(fill = "skyblue") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Top 10 Shooters of the '21-'22 PHF Season",
       y = "Total Number of Shots",
       x = "Shooter")
7. In hockey, save proportion measures how often a goalie successfully stops shots on goal. Shots on goal are shots recorded as “made” or “saved”; they do not include “blocked” shots, which are blocked by other defending players rather than by the goalie. Save proportion is calculated as (shots saved) / (total shots on goal).
a. Using dplyr and/or tidyr functions, create a data frame with each goalie’s save proportion.
Hint: The variables of interest here are goalie_involved (which shows the names of goalies) and shot_result, which has categories: blocked, made, saved. Since this exploration excludes blocked shots, start by filtering to exclude them.
Click for solution
phf_goalies <- phf_shots |>
  filter(shot_result != "blocked") |>
  group_by(goalie_involved) |>
  summarise(
    shots_on_goal = n(),
    total_goals = sum(shot_result == "made"),
    total_saves = sum(shot_result == "saved"),
    save_prop = total_saves / shots_on_goal
  )
phf_goalies
# A tibble: 18 × 5
   goalie_involved           shots_on_goal total_goals total_saves save_prop
   <chr>                             <int>       <int>       <int>     <dbl>
 1 Abbie Ives                           95          11          84     0.884
 2 Abbie Ives replaced by #1            18           2          16     0.889
 3 Abbie Ives replaced by #3             8           3           5     0.625
 4 Allie Morse replaced by #            30           3          27       0.9
 5 Amanda Leveille                     202          12         190     0.941
 6 Brooke Wolejko                        6           2           4     0.667
 7 Brooke Wolejko replaced b             4           0           4         1
 8 Carly Jackson                       186          18         168     0.903
 9 Carly Jackson replaced by            12           3           9      0.75
10 Elaine Chuli                        147          14         133     0.905
11 Elaine Chuli replaced by              4           1           3      0.75
12 Lovisa Selander                     188          12         176     0.936
13 Lovisa Selander replaced              4           0           4         1
14 Samantha Ridgewell                   26           6          20     0.769
15 Sonjia Shelly                        77           1          76     0.987
16 Tera Hofmann                         36           3          33     0.917
17 Victoria Hanson                      58           5          53     0.914
18 <NA>                                  3           3           0         0
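Because save proportion is saves divided by shots on goal, it can also be computed as the mean of a logical vector once blocked shots are filtered out: mean(shot_result == "saved") is the fraction of remaining rows that are saves. The sketch below checks this on made-up data; toy_shots and the goalie names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: one row per shot (names and results are made up)
toy_shots <- tibble(
  goalie_involved = c("Goalie A", "Goalie A", "Goalie A", "Goalie A",
                      "Goalie A", "Goalie B", "Goalie B"),
  shot_result     = c("saved", "saved", "made", "saved",
                      "blocked", "saved", "made")
)

toy_goalies <- toy_shots |>
  filter(shot_result != "blocked") |>   # keep shots on goal only
  group_by(goalie_involved) |>
  summarise(
    shots_on_goal = n(),
    # TRUE counts as 1 and FALSE as 0, so the mean is saves / shots on goal
    save_prop = mean(shot_result == "saved")
  )

toy_goalies
```

Here the hypothetical Goalie A stops 3 of 4 shots on goal (0.75) and Goalie B stops 1 of 2 (0.5), matching the saves / shots on goal formula.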
b. You should notice an NA value under goalie_involved after completing this. Remove the row with that NA value from the final data frame.
Click for solution
phf_goalies <- phf_goalies |>
  filter(!is.na(goalie_involved))
phf_goalies
# A tibble: 17 × 5
   goalie_involved           shots_on_goal total_goals total_saves save_prop
   <chr>                             <int>       <int>       <int>     <dbl>
 1 Abbie Ives                           95          11          84     0.884
 2 Abbie Ives replaced by #1            18           2          16     0.889
 3 Abbie Ives replaced by #3             8           3           5     0.625
 4 Allie Morse replaced by #            30           3          27       0.9
 5 Amanda Leveille                     202          12         190     0.941
 6 Brooke Wolejko                        6           2           4     0.667
 7 Brooke Wolejko replaced b             4           0           4         1
 8 Carly Jackson                       186          18         168     0.903
 9 Carly Jackson replaced by            12           3           9      0.75
10 Elaine Chuli                        147          14         133     0.905
11 Elaine Chuli replaced by              4           1           3      0.75
12 Lovisa Selander                     188          12         176     0.936
13 Lovisa Selander replaced              4           0           4         1
14 Samantha Ridgewell                   26           6          20     0.769
15 Sonjia Shelly                        77           1          76     0.987
16 Tera Hofmann                         36           3          33     0.917
17 Victoria Hanson                      58           5          53     0.914
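An equivalent way to drop the NA row is tidyr's drop_na(), which removes rows with missing values in the named columns. A minimal sketch on made-up data; toy_goalies and its values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy summary table with one missing goalie name
toy_goalies <- tibble(
  goalie_involved = c("Goalie A", "Goalie B", NA),
  save_prop       = c(0.90, 0.85, 0)
)

# Drop rows where goalie_involved is NA; other columns are not checked
toy_goalies_clean <- toy_goalies |>
  drop_na(goalie_involved)

toy_goalies_clean
```

Naming the column matters: drop_na() with no arguments would remove rows with an NA in any column, which may be more than intended.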
c. You should also notice that the goalie_involved variable includes entries like “Allie Morse replaced by #”. Filter using goalie_involved %in% c(), where c() takes the names of goalies without “replaced…” entries.
Click for solution
phf_goalies <- phf_goalies |>
  filter(goalie_involved %in% c(
    "Abbie Ives", "Amanda Leveille", "Brooke Wolejko",
    "Carly Jackson", "Elaine Chuli", "Lovisa Selander",
    "Samantha Ridgewell", "Sonjia Shelly", "Tera Hofmann",
    "Victoria Hanson"
  ))
phf_goalies
# A tibble: 10 × 5
   goalie_involved    shots_on_goal total_goals total_saves save_prop
   <chr>                      <int>       <int>       <int>     <dbl>
 1 Abbie Ives                    95          11          84     0.884
 2 Amanda Leveille              202          12         190     0.941
 3 Brooke Wolejko                 6           2           4     0.667
 4 Carly Jackson                186          18         168     0.903
 5 Elaine Chuli                 147          14         133     0.905
 6 Lovisa Selander              188          12         176     0.936
 7 Samantha Ridgewell            26           6          20     0.769
 8 Sonjia Shelly                 77           1          76     0.987
 9 Tera Hofmann                  36           3          33     0.917
10 Victoria Hanson               58           5          53     0.914
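Typing goalie names by hand works for a short list, but it is easy to leave one out. As an alternative, the “replaced” entries can be removed by pattern with stringr's str_detect(). The sketch below uses made-up data; toy_goalies and its names are hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical toy summary table mixing plain names and "replaced" entries
toy_goalies <- tibble(
  goalie_involved = c("Goalie A", "Goalie A replaced by #1",
                      "Goalie B", "Goalie C replaced by #"),
  save_prop       = c(0.90, 0.80, 0.85, 0.75)
)

# Keep only rows whose goalie name does not contain the text "replaced"
toy_starters <- toy_goalies |>
  filter(!str_detect(goalie_involved, "replaced"))

toy_starters
```

One design note: str_detect() returns NA for a missing name, and filter() drops rows where the condition is NA, so this approach would also remove an NA goalie row if one were still present.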
8. Using the data frame from your answer to Question 7c above, create a visualization of your choice to display goalies’ save proportions in ascending order. Use coord_flip() so that the goalie variable is on the y axis and save proportion is on the x axis. Remember to use descriptive axis labels and a title.
Click for solution
# A lollipop plot, bar plot, or dot plot would all work here.
phf_goalies <- phf_goalies |>
  mutate(goalie_involved = fct_reorder(goalie_involved, save_prop))
# Mapping save_prop to x and goalie to y gives the flipped layout directly,
# so coord_flip() is not needed in this version.
ggplot(phf_goalies, aes(x = save_prop, y = goalie_involved)) +
  geom_point(size = 3, colour = "skyblue") +
  theme_minimal() +
  labs(title = "Goalie Save Proportions in the '21-'22 PHF Season",
       x = "Save Proportion",
       y = "Goalie")
9. So far, we have compared goalies using ordered plots. Another useful way to understand data is by examining its distribution.
a. Using the phf_goalies data frame from Question 7c, create a histogram of goalie save proportions. Choose an appropriate bin width, and include a title and axis labels.
Click for solution
ggplot(phf_goalies, aes(x = save_prop)) +
  geom_histogram(binwidth = 0.02, fill = "skyblue") +
  theme_minimal() +
  labs(title = "Distribution of Goalie Save Proportions",
       x = "Save Proportion",
       y = "Count")
b. Based on your histogram, describe the distribution of save proportions. In your answer, consider the following:
- Is the distribution symmetric, left-skewed, or right-skewed?
- Are there any noticeable outliers?
Click for solution
Most goalies have a save proportion between about 0.87 and 0.95. The distribution is left-skewed: one goalie has a save proportion below 0.70 (Brooke Wolejko at 0.667, on only 6 shots on goal), a potential outlier well below the rest of the group.
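As a rough numerical check on the skew, the mean can be compared with the median: a mean noticeably below the median is consistent with a left tail pulling the average down. The sketch below applies this heuristic to nine save proportions typed in by hand from the tables above (an illustrative subset, not a re-run on the data); the variable name save_props is hypothetical.

```r
# Nine save proportions copied by hand from the printed tables above
save_props <- c(0.884, 0.941, 0.667, 0.903, 0.905, 0.769, 0.987, 0.917, 0.914)

mean_prop   <- mean(save_props)
median_prop <- median(save_props)

# A mean below the median suggests (but does not prove) left skew
mean_prop < median_prop

# Share of these goalies falling between 0.87 and 0.95
mean(save_props >= 0.87 & save_props <= 0.95)
```

With so few goalies, and with the lowest value based on very few shots, this comparison is only a sanity check on what the histogram shows, not a formal test.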