The Impact of Outliers: The Raisin River Canoe Race | SOLUTIONS

Read the raisin_river_DNF.csv dataset

library(tidyverse)

RR_dnf_data <- read_csv("raisin_river_DNF.csv")

Explore the relationship between flow rate and proportion of DNFs

1. Create a scatterplot with regression line that shows flow rate as a predictor for proportion of DNF.

1. Solution

ggplot(RR_dnf_data, aes(x = flow, y = prop_DNF)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Flow Rate (ft^3/sec)",
       y = "Proportion of DNFs",
       title = "Using flow rate to predict DNF proportion")

2. Assess the fitted model.

a) Interpret the value of the slope coefficient of flow in context.

2a. Solution

Fit the model to predict proportion DNF using flow

dnf_mod <- lm(prop_DNF ~ flow, RR_dnf_data)
summary(dnf_mod)


Call:
lm(formula = prop_DNF ~ flow, data = RR_dnf_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.06308 -0.04667 -0.01621  0.02705  0.09522 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  7.916e-02  3.810e-02   2.078    0.083 .
flow        -7.061e-06  3.273e-05  -0.216    0.836  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06784 on 6 degrees of freedom
Multiple R-squared:  0.007698,  Adjusted R-squared:  -0.1577 
F-statistic: 0.04654 on 1 and 6 DF,  p-value: 0.8363

INTERPRETATION: For every 1000 ft^3/sec increase in flow, we can expect a 0.00706 decrease in the proportion of DNFs.
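The rescaling behind this interpretation can be checked directly from the printed coefficient. This is a minimal sketch with the slope copied by hand from the summary output above; the variable names are illustrative and not part of the original solution.

```r
# Slope for flow, copied from the summary() output above.
# Units: change in proportion of DNFs per 1 ft^3/sec of flow.
slope_per_unit <- -7.061e-06

# Rescale to a 1000 ft^3/sec increase so the number is easier to read
slope_per_1000 <- slope_per_unit * 1000
slope_per_1000
```

Reporting the slope per 1000 ft^3/sec is only a change of units; the fitted line itself is unchanged.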

b) What does the p-value for that coefficient indicate about the relationship between flow rate and proportion of DNF?

2b. Solution

The p-value for the flow coefficient is 0.836, which means there is no convincing evidence for flow being a useful predictor of proportion DNF in the linear model made using this dataset.
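To see where this p-value comes from, the t statistic and two-sided p-value can be rebuilt from the estimate and standard error in the coefficient table. This is a sketch that uses the rounded values printed above (so the last digits may differ slightly from what `summary()` reports); the variable names are illustrative.

```r
# Estimate and standard error for flow, copied from the summary() output above
estimate  <- -7.061e-06
std_error <- 3.273e-05
df_resid  <- 6   # residual degrees of freedom: n - 2 = 8 - 2

# t statistic for testing whether the slope is zero
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with 6 degrees of freedom
p_value <- 2 * pt(-abs(t_value), df = df_resid)

round(t_value, 3)
round(p_value, 3)
```

A t statistic this close to zero means the estimated slope is small relative to its own uncertainty, which is why the p-value is so large.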

c) Comment on the appropriateness of the model.

2c. Solution

Based on the scatterplot and the model summary, a linear model fit to all eight races does not appear to be appropriate. One observation with a very high flow rate and a very low proportion of DNFs pulls the fitted line toward itself, so the fit does not reflect the pattern in the other seven races.

Investigate unusual/influential Observations

3. Identify any observations that have unusual leverage, standardized residuals, or influence.

3. Solution (leverage)

Leverage:

sort(hatvalues(dnf_mod), decreasing = TRUE)

        3         1         8         5         4         6         2         7 
0.8334269 0.2100166 0.1889992 0.1821738 0.1730529 0.1425218 0.1417639 0.1280448

With one predictor (k = 1) and n = 8 races, the usual cutoffs are 2(k+1)/n = 4/8 = 0.5 for mildly unusual leverage and 3(k+1)/n = 6/8 = 0.75 for extreme leverage. Case 3 (2017) has a leverage of 0.83, which is beyond even the extreme cutoff.
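The leverage cutoffs can be computed and applied in a few lines. The sketch below types in the leverages from the `hatvalues()` output above so that it runs without the CSV file; the names `lev`, `mild_cutoff`, and `extreme_cutoff` are illustrative.

```r
# Leverages copied from the hatvalues() output above (names are case numbers)
lev <- c("3" = 0.8334269, "1" = 0.2100166, "8" = 0.1889992, "5" = 0.1821738,
         "4" = 0.1730529, "6" = 0.1425218, "2" = 0.1417639, "7" = 0.1280448)

n <- length(lev)  # 8 races
p <- 2            # coefficients in the model: intercept and slope

mild_cutoff    <- 2 * p / n  # 4/8 = 0.5
extreme_cutoff <- 3 * p / n  # 6/8 = 0.75

# Cases beyond each cutoff
names(lev)[lev > mild_cutoff]
names(lev)[lev > extreme_cutoff]

# Sanity check: leverages sum to the number of coefficients (up to rounding)
sum(lev)
```

With the model object available, the same check could be written as `hatvalues(dnf_mod) > 3 * 2 / 8`.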

3. Solution (standardized residual)

Standardized residuals:

sort(rstandard(dnf_mod), decreasing = TRUE)

          5           7           6           1           4           2 
 1.54884164  1.50303894  0.06984195 -0.22293322 -0.30770634 -0.69642540 
          8           3 
-1.03244337 -1.99923075

None of the standardized residuals is beyond ±2, so there are no extreme standardized residuals, although the value for Case 3 (2017), -1.99923, is very close!

3. Solution (influence)

Cook’s D:

sort(cooks.distance(dnf_mod), decreasing = TRUE)

           3            5            7            8            2            4 
9.9990439892 0.2671831166 0.1658739120 0.1242056295 0.0400569681 0.0099070436 
           1            6 
0.0066062543 0.0004053789

Case 3 (2017) has a Cook’s D of about 10.0, far beyond the threshold of 1, indicating that this point has an unusually large influence on the regression fit.
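Cook's D combines the two diagnostics computed earlier: for a model with p coefficients, D = (r^2 / p) * h / (1 - h), where r is the standardized residual and h is the leverage. The sketch below rebuilds the Cook's D values from the printed `hatvalues()` and `rstandard()` output, which shows why Case 3 stands out: its residual is only moderately large, but its leverage is extreme. The values are typed in by hand, so small rounding differences are expected.

```r
# Values copied from the hatvalues() and rstandard() output above,
# rearranged into case order 1..8
h <- c(0.2100166, 0.1417639, 0.8334269, 0.1730529,
       0.1821738, 0.1425218, 0.1280448, 0.1889992)
r <- c(-0.22293322, -0.69642540, -1.99923075, -0.30770634,
        1.54884164,  0.06984195,  1.50303894, -1.03244337)

p <- 2  # coefficients in the model: intercept and slope

# Cook's D from standardized residuals and leverages
cooks_d <- (r^2 / p) * (h / (1 - h))
names(cooks_d) <- 1:8

round(sort(cooks_d, decreasing = TRUE), 4)

# Cases beyond the threshold of 1
names(cooks_d)[cooks_d > 1]
```

For Case 3, the factor h / (1 - h) is about 5, which multiplies an already sizeable r^2 / p of about 2.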

Consider the outlier

Your analysis from #3 should show that the 2017 race, with a flow rate of 2649 ft^3/sec and a DNF proportion of 0.0051, is unusual. While we shouldn’t automatically exclude any unusual case, we should investigate, when possible, whether there are circumstances that might warrant dropping it. In this situation, the water level was dangerously high in 2017, causing the race officials to move the race start to Delaney Road (roughly 4 miles further down the river; see the map). Based on that domain knowledge, we should consider redoing the analysis without the 2017 data.

4. Remove the influential (2017) observation from the data set.

4. Solution (remove outlier)

RR_dnf_data2 <- RR_dnf_data |> filter(year != 2017)

a) Draw a new scatterplot and show the revised line of best fit. How does this plot compare to the earlier fit?

4a. Solution

ggplot(RR_dnf_data2, aes(x = flow, y = prop_DNF)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Flow Rate (ft^3/sec)",
       y = "Proportion of DNFs",
       title = "Using flow rate to predict DNF proportion")

With the 2017 data point omitted, we see a positive association between the flow rate and proportion of DNF. There was a slight negative slope in the original plot.

b) Give a revised interpretation for the fitted slope (in context) and discuss the effectiveness of flow as a predictor for proportion DNF in the absence of the 2017 outlier.

4b. Solution

# Store the refit under a new name so the original model stays available for comparison
dnf_mod2 <- lm(prop_DNF ~ flow, RR_dnf_data2)
summary(dnf_mod2)


Call:
lm(formula = prop_DNF ~ flow, data = RR_dnf_data2)

Residuals:
         1          2          3          4          5          6          7 
 0.0265737 -0.0490948  0.0007913 -0.0133970 -0.0001275  0.0691113 -0.0338569 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) -1.338e-03  3.509e-02  -0.038   0.9710  
flow         1.279e-04  4.748e-05   2.693   0.0431 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.04294 on 5 degrees of freedom
Multiple R-squared:  0.592, Adjusted R-squared:  0.5104 
F-statistic: 7.254 on 1 and 5 DF,  p-value: 0.04312

INTERPRETATION: For every 100 ft^3/sec increase in flow, we expect about a 0.0128 increase in the proportion of DNFs. The p-value for testing this slope (0.0431) is less than 0.05, so we have mildly convincing evidence that a faster flow rate is associated with a higher proportion of boats not finishing the race.
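A confidence interval for the slope is one way to see how "mild" this evidence is. The sketch below builds a 95% interval by hand from the rounded estimate and standard error printed above; it is an optional extra check, not part of the original solution, and the variable names are illustrative.

```r
# Estimate and standard error for flow, copied from the summary() output above
estimate  <- 1.279e-04
std_error <- 4.748e-05
df_resid  <- 5   # residual degrees of freedom: n - 2 = 7 - 2

# Critical value for a 95% interval from the t distribution
t_crit <- qt(0.975, df = df_resid)

# Interval for the slope per 1 ft^3/sec, then rescaled to per 100 ft^3/sec
ci_per_unit <- estimate + c(-1, 1) * t_crit * std_error
ci_per_100  <- ci_per_unit * 100

signif(ci_per_100, 3)
```

The interval runs from roughly 0.0006 to 0.025 per 100 ft^3/sec, so it excludes zero only barely, which matches a p-value just under 0.05. With only seven races, the size of the effect is quite uncertain. Calling `confint()` on the fitted model gives the same interval without the rounding.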

c) If the flow level is 1100 ft^3/sec and there are 200 competitors about how many DNFs can we expect?

4c. Solution

-0.001338 + (0.0001279*1100)

[1] 0.139352

200 * 0.139352

[1] 27.8704

If flow rate is 1100 ft^3/sec and there are 200 competitors, we can expect about 28 DNFs.
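The same arithmetic can be wrapped in a small function so it is easy to try other flow rates or field sizes. This is a sketch: `expected_dnf()` is a hypothetical helper, the coefficients are the rounded values from the summary output, and clamping the predicted proportion to the range 0 to 1 is an added assumption, since a straight line can otherwise predict impossible proportions.

```r
# Coefficients copied (rounded) from the summary() output for the model without 2017
b0 <- -1.338e-03  # intercept
b1 <-  1.279e-04  # slope: change in prop_DNF per 1 ft^3/sec

# Hypothetical helper: expected number of DNFs for a given flow and number of boats
expected_dnf <- function(flow, n_boats) {
  prop <- b0 + b1 * flow
  prop <- pmin(pmax(prop, 0), 1)  # assumption: keep the proportion between 0 and 1
  n_boats * prop
}

expected_dnf(1100, 200)               # the case asked about in part (c)
expected_dnf(c(500, 800, 1100), 200)  # vectorized over flow
```

Predictions like these are only reasonable for flow rates within the range seen in the seven races used to fit the model; the 2017 race shows how different conditions can be at much higher flows.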