We speculate that the S-shaped sigmoid function is forgiving of outliers in x as long as they lie "on the right side", i.e. as long as the class label does not contradict the general trend of the variable. For example, in the Titanic data we have seen that survival probability tends to decline with increasing age. What if the data set contained a 500-year-old person who did not survive? In linear regression, such an outlier would likely distort the coefficient estimates significantly.
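The contrast with linear regression can be checked with a small simulation. The sketch below does not use the Titanic data: it assumes a simulated data frame `sim` whose intercept (0.8) and slope (-0.03) are made up for illustration, and `ageSlope` is a hypothetical helper. It appends one 500-year-old non-survivor and compares how much the Age slope changes under `lm` and under `glm`.

```r
# Simulated stand-in for the Titanic data; the coefficients 0.8 and -0.03
# are assumptions chosen for illustration, not estimates from the real data.
set.seed(1)
n <- 700
sim <- data.frame(Age = runif(n, min = 1, max = 80))
sim$Survived <- rbinom(n, size = 1, prob = plogis(0.8 - 0.03 * sim$Age))

# Append one 500-year-old who did not survive (label agrees with the trend).
simOutlier <- rbind(sim, data.frame(Age = 500, Survived = 0))

# Helper (hypothetical name): Age slope of a fitted model.
ageSlope <- function(model) unname(coef(model)["Age"])

# Ratio of the slope with the outlier to the slope without it;
# a ratio near 1 means the outlier had little influence.
lmRatio <- ageSlope(lm(Survived ~ Age, data = simOutlier)) /
  ageSlope(lm(Survived ~ Age, data = sim))
glmRatio <- ageSlope(glm(Survived ~ Age, data = simOutlier, family = binomial)) /
  ageSlope(glm(Survived ~ Age, data = sim, family = binomial))

round(c(linear = lmRatio, logistic = glmRatio), 3)
```

Under these assumptions the linear slope should shrink noticeably toward zero, because the straight line predicts a strongly negative "probability" at Age = 500 and the high-leverage point pulls it back up, while the logistic slope should barely move. The fits that follow examine the logistic case on the real data.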
# Baseline fit on the original data, without any outlier
fit = glm(Survived ~ Age, data = NoMissingAge, family = binomial(link = logit))
summary(fit)
##
## Call:
## glm(formula = Survived ~ Age, family = binomial(link = logit),
## data = NoMissingAge)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.1488  -1.0361  -0.9544   1.3159   1.5908
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.05672    0.17358  -0.327   0.7438
## Age         -0.01096    0.00533  -2.057   0.0397 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 964.52 on 713 degrees of freedom
## Residual deviance: 960.23 on 712 degrees of freedom
## AIC: 964.23
##
## Number of Fisher Scoring iterations: 4
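# Copy the data and overwrite the first passenger with a
# 500-year-old who did not survive (label agrees with the age trend)
dataWithOutlier = NoMissingAge
dataWithOutlier[1, c("Age", "Survived")] = c(500, 0)
fit = glm(Survived ~ Age, data = dataWithOutlier, family = binomial(link = logit))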
summary(fit)
##
## Call:
## glm(formula = Survived ~ Age, family = binomial(link = logit),
## data = dataWithOutlier)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.1515  -1.0373  -0.9545   1.3145   1.5928
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.050245   0.172124  -0.292   0.7704
## Age         -0.011100   0.005271  -2.106   0.0352 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 964.52 on 713 degrees of freedom
## Residual deviance: 959.12 on 712 degrees of freedom
## AIC: 963.12
##
## Number of Fisher Scoring iterations: 4
# What do the diagnostic plots tell us?
# plot(fit)
# Now flip the label so that the outlier contradicts the age trend
# (a large residual under the previous fit):
dataWithOutlier[1, "Survived"] = 1
fit = glm(Survived ~ Age, data = dataWithOutlier, family = binomial(link = logit))
summary(fit)
##
## Call:
## glm(formula = Survived ~ Age, family = binomial(link = logit),
## data = dataWithOutlier)
##
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max
## -1.039  -1.026  -1.015   1.336   1.631
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.332154   0.131543  -2.525   0.0116 *
## Age         -0.001383   0.003553  -0.389   0.6970
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 965.27 on 713 degrees of freedom
## Residual deviance: 965.11 on 712 degrees of freedom
## AIC: 969.11
##
## Number of Fisher Scoring iterations: 4
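Wow: a single data point changes the entire fit substantially in the second case but leaves it virtually unchanged in the first. With the label that agrees with the trend, the Age coefficient stays at about -0.011; with the contradicting label it shrinks to about -0.0014 and is no longer significant (p = 0.70).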
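One way to see why the two cases differ so much is to look at how each observation enters the gradient of the log-likelihood. For logistic regression, the contribution of a single observation to the Age component of the score is (y - p) * Age, where p is the predicted survival probability. The sketch below evaluates this at the baseline coefficients copied from the first summary; `scoreAge` is a hypothetical helper name, and this is only a back-of-the-envelope illustration, not a full influence analysis.

```r
# Coefficients copied from the baseline fit (no outlier) reported above
b0 <- -0.05672
b1 <- -0.01096

# Hypothetical helper: contribution of one observation to the Age
# component of the log-likelihood gradient, (y - p) * Age
scoreAge <- function(age, y) {
  p <- plogis(b0 + b1 * age)  # predicted survival probability
  (y - p) * age
}

scoreAge(500, 0)  # outlier whose label agrees with the trend
scoreAge(500, 1)  # outlier whose label contradicts the trend
scoreAge(30, 0)   # a typical 30-year-old non-survivor, for comparison
```

At Age = 500 the baseline model already predicts a survival probability close to zero, so a non-survivor contributes almost nothing to the gradient (less, in absolute value, than an ordinary 30-year-old passenger), and the estimates hardly need to move. A survivor at the same age contributes nearly the full 500, which can only be absorbed by flattening the Age slope.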