The Impact of Outliers: The Raisin River Canoe Race | SOLUTIONS
Read the raisin_river_DNF.csv dataset
library(tidyverse)
RR_dnf_data <- read_csv("raisin_river_DNF.csv")
Explore the relationship between flow rate and proportion of DNFs
1. Create a scatterplot with regression line that shows flow rate as a predictor for proportion of DNF.
ggplot(RR_dnf_data, aes(x = flow, y = prop_DNF)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Flow Rate (ft^3/sec)",
       y = "Proportion of DNFs",
       title = "Using flow rate to predict DNF proportion")
2. Assess the fitted model.
a) Interpret the value of the slope coefficient of flow in context.
Fit the model to predict proportion DNF using flow
dnf_mod <- lm(prop_DNF ~ flow, RR_dnf_data)
summary(dnf_mod)
Call:
lm(formula = prop_DNF ~ flow, data = RR_dnf_data)
Residuals:
     Min       1Q   Median       3Q      Max
-0.06308 -0.04667 -0.01621  0.02705  0.09522
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.916e-02  3.810e-02   2.078    0.083 .
flow        -7.061e-06  3.273e-05  -0.216    0.836
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.06784 on 6 degrees of freedom
Multiple R-squared: 0.007698, Adjusted R-squared: -0.1577
F-statistic: 0.04654 on 1 and 6 DF, p-value: 0.8363
INTERPRETATION: For every 1000 ft^3/sec increase in flow, we can expect a 0.00706 decrease in the proportion of DNFs.
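The rescaling behind this interpretation can be checked directly. This is a minimal sketch, assuming the slope is copied by hand from the printed summary; the names `slope_per_unit` and `slope_per_1000` are illustrative and not part of the original analysis.

```r
# Slope for flow, copied from the summary output (change in proportion per 1 ft^3/sec)
slope_per_unit <- -7.061e-06

# Rescale to a 1000 ft^3/sec change in flow
slope_per_1000 <- slope_per_unit * 1000
slope_per_1000
```

With the fitted model in memory, `coef(dnf_mod)["flow"] * 1000` should give the same quantity without the rounding in the printed table.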
b) What does the p-value for that coefficient indicate about the relationship between flow rate and proportion of DNF?
The p-value for the flow coefficient is 0.836, which means there is no convincing evidence for flow being a useful predictor of proportion DNF in the linear model made using this dataset.
c) Comment on the appropriateness of the model
Based on the scatterplot and the model summary, a linear model does not appear to be appropriate for the full dataset. One observation with a very high flow rate and a very low DNF proportion influences the fitted line so strongly that the fit does not make sense for the remaining points.
Investigate unusual/influential Observations
3. Identify any observations that have unusual leverage, standardized residuals, or influence.
leverage:
sort(hatvalues(dnf_mod), decreasing = TRUE)
        3         1         8         5         4         6         2         7
0.8334269 0.2100166 0.1889992 0.1821738 0.1730529 0.1425218 0.1417639 0.1280448
With a sample size of n = 8 and k = 1 predictor, leverage is mildly unusual beyond 2(k+1)/n = 4/8 = 0.5 and extreme beyond 3(k+1)/n = 6/8 = 0.75. Case 3 (2017) has a very high leverage (0.83), well past the extreme cutoff.
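The cutoff comparison can be reproduced from the printed leverages alone. The sketch below assumes the 4/8 and 6/8 cutoffs come from the common 2(k+1)/n and 3(k+1)/n rules of thumb with k = 1 predictor; the vector `lev` is typed in by hand from the output rather than taken from the model object.

```r
# Leverage values copied from the sorted hatvalues() output, named by case number
lev <- c("3" = 0.8334269, "1" = 0.2100166, "8" = 0.1889992, "5" = 0.1821738,
         "4" = 0.1730529, "6" = 0.1425218, "2" = 0.1417639, "7" = 0.1280448)

n <- length(lev)  # 8 races
k <- 1            # one predictor (flow)

# Assumed rule-of-thumb cutoffs
mild_cutoff    <- 2 * (k + 1) / n  # 4/8
extreme_cutoff <- 3 * (k + 1) / n  # 6/8

# Cases beyond each cutoff
names(lev)[lev > mild_cutoff]
names(lev)[lev > extreme_cutoff]

# Sanity check: leverages sum to the number of coefficients, k + 1
sum(lev)
```

Only Case 3 exceeds either cutoff, and the leverages sum to 2 up to rounding, as expected for a model with an intercept and one slope.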
standardized residuals:
sort(rstandard(dnf_mod), decreasing = TRUE)
          5           7           6           1           4           2
 1.54884164  1.50303894  0.06984195 -0.22293322 -0.30770634 -0.69642540
          8           3
-1.03244337 -1.99923075
None of the standardized residuals is beyond ±2, so there are no extreme standardized residuals, although the value of -1.99923 for Case 3 (2017) is very close!
Cook’s D:
sort(cooks.distance(dnf_mod), decreasing = TRUE)
           3            5            7            8            2            4
9.9990439892 0.2671831166 0.1658739120 0.1242056295 0.0400569681 0.0099070436
           1            6
0.0066062543 0.0004053789
Case 3 (2017) has a Cook’s D value of about 10.0, far beyond the threshold of 1, indicating that this point has an unusually large influence on the regression fit.
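Cook’s D combines the two earlier diagnostics, which helps explain why Case 3 stands out so sharply. As a rough check, the standard identity D = (r^2 / (k + 1)) * (h / (1 - h)), with r the standardized residual and h the leverage, can be applied to the printed values. The variable names below are illustrative, and the inputs are copied by hand from the output.

```r
# Values for Case 3 (2017), copied from the printed diagnostics
h <- 0.8334269     # leverage from hatvalues()
r <- -1.99923075   # standardized residual from rstandard()
k <- 1             # one predictor (flow)

# Cook's distance from the standardized residual and the leverage
cooks_d <- (r^2 / (k + 1)) * (h / (1 - h))
cooks_d
```

The result agrees with the `cooks.distance()` value for Case 3 to about three decimal places. A residual near -2 combined with a leverage above 0.8 is what pushes the value to roughly 10.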
Consider the outlier
Your analysis from #3 should show that the 2017 race with a flow rate of 2649 ft^3/sec and 0.0051 DNF proportion is unusual. While we shouldn’t automatically exclude any unusual case, we should investigate, when possible, to see if there are circumstances that might warrant dropping that case. In this situation, the water level was dangerously high in 2017, causing the race officials to move the race start to Delaney Road (roughly 4 miles further down the river – see the map). Based on that domain knowledge, we should consider re-doing the analysis without the 2017 data.
4. Remove the influential (2017) observation from the data set.
RR_dnf_data2 <- RR_dnf_data |> filter(year != 2017)
a) Draw a new scatterplot and show the revised line of best fit. How does this plot compare to the earlier fit?
ggplot(RR_dnf_data2, aes(x = flow, y = prop_DNF)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Flow Rate (ft^3/sec)",
       y = "Proportion of DNFs",
       title = "Using flow rate to predict DNF proportion")
With the 2017 data point omitted, we see a positive association between the flow rate and proportion of DNF. There was a slight negative slope in the original plot.
b) Give a revised interpretation for the fitted slope (in context) and discuss the effectiveness of flow as a predictor for proportion DNF in the absence of the 2017 outlier.
dnf_mod <- lm(prop_DNF ~ flow, RR_dnf_data2)
summary(dnf_mod)
Call:
lm(formula = prop_DNF ~ flow, data = RR_dnf_data2)
Residuals:
         1          2          3          4          5          6          7
 0.0265737 -0.0490948  0.0007913 -0.0133970 -0.0001275  0.0691113 -0.0338569
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.338e-03  3.509e-02  -0.038   0.9710
flow         1.279e-04  4.748e-05   2.693   0.0431 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.04294 on 5 degrees of freedom
Multiple R-squared: 0.592, Adjusted R-squared: 0.5104
F-statistic: 7.254 on 1 and 5 DF, p-value: 0.04312
INTERPRETATION: For every 100 ft^3/sec increase in flow, we expect about a 0.0128 increase in the proportion of DNFs. The p-value for testing this slope (0.0431) is less than 0.05, so we have mildly convincing evidence that a faster flow rate is associated with a higher proportion of boats not finishing the race.
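The t value and p-value in the coefficient table follow from the estimate, its standard error, and the residual degrees of freedom. The sketch below recomputes them from the rounded printed values, so small differences from the summary in the last digit are expected; the variable names are illustrative.

```r
# Slope estimate and standard error for flow, copied from the summary output
estimate  <- 1.279e-04
std_error <- 4.748e-05
df_resid  <- 5          # residual degrees of freedom (7 races - 2 coefficients)

# t statistic for testing a slope of zero
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

t_value
p_value
```

Both values should land close to the 2.693 and 0.0431 reported by `summary()`.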
c) If the flow level is 1100 ft^3/sec and there are 200 competitors, about how many DNFs can we expect?
-0.001338 + (0.0001279 * 1100)
[1] 0.139352
200 * 0.139352
[1] 27.8704
If flow rate is 1100 ft^3/sec and there are 200 competitors, we can expect about 28 DNFs.
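The same arithmetic can be wrapped in a small helper for trying other flow rates or field sizes. This is a sketch only: `predict_dnf_count` is a hypothetical function name, and its default coefficients are the rounded values from the summary of the model fitted without 2017.

```r
# Hypothetical helper: expected number of DNFs from the fitted line.
# Default intercept and slope are copied (rounded) from the model summary.
predict_dnf_count <- function(flow, competitors,
                              intercept = -0.001338, slope = 0.0001279) {
  predicted_prop <- intercept + slope * flow  # predicted proportion of DNFs
  predicted_prop * competitors                # expected count of DNFs
}

expected_dnfs <- predict_dnf_count(flow = 1100, competitors = 200)
expected_dnfs
```

With the fitted model in memory, `predict(dnf_mod, newdata = data.frame(flow = 1100))` would give the predicted proportion from the unrounded coefficients, which may differ slightly in the later decimal places.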