Is logistic regression resistant to outliers?

We speculate that the S-shaped sigmoid function is forgiving of outliers in x as long as the observation is “on the right side”, i.e. as long as its class label does not contradict the general trend of the variable. For example, in the Titanic data we have seen that survival probability tended to decline with increasing age. What if we added a \(500\)-year-old person to the data set who did not survive? In linear regression such a high-leverage point would likely distort the coefficient estimates severely.
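To see why such a point can be harmless, look at its contribution to the log-likelihood. With a negative Age coefficient the sigmoid is saturated at age 500: the predicted survival probability is essentially zero, so a label of 0 costs almost nothing, while a label of 1 would be maximally surprising. A minimal sketch, plugging in a slope of about \(-0.011\) (the estimate from the fit below) and ignoring the small intercept:

# plogis() is the logistic function 1/(1 + exp(-x)) from base R
p = plogis(-0.011 * 500)  # predicted P(Survived = 1) at age 500: ~0.004
-log(1 - p)               # log-likelihood penalty for a label of 0: ~0.004, negligible
-log(p)                   # penalty for a label of 1: ~5.5, a huge residual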

fit = glm(Survived ~ Age, data = NoMissingAge, family = binomial(link = logit))
summary(fit)
## 
## Call:
## glm(formula = Survived ~ Age, family = binomial(link = logit), 
##     data = NoMissingAge)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1488  -1.0361  -0.9544   1.3159   1.5908  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.05672    0.17358  -0.327   0.7438  
## Age         -0.01096    0.00533  -2.057   0.0397 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 960.23  on 712  degrees of freedom
## AIC: 964.23
## 
## Number of Fisher Scoring iterations: 4
# Copy the data and turn the first passenger into a 500-year-old non-survivor:
dataWithOutlier = NoMissingAge
dataWithOutlier[1, c("Age", "Survived")] = c(500, 0)
fit = glm(Survived ~ Age, data = dataWithOutlier, family = binomial(link = logit))
summary(fit)
## 
## Call:
## glm(formula = Survived ~ Age, family = binomial(link = logit), 
##     data = dataWithOutlier)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1515  -1.0373  -0.9545   1.3145   1.5928  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.050245   0.172124  -0.292   0.7704  
## Age         -0.011100   0.005271  -2.106   0.0352 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 959.12  on 712  degrees of freedom
## AIC: 963.12
## 
## Number of Fisher Scoring iterations: 4
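Note that the outlier barely registers in this fit: its label agrees with the fitted trend, so its deviance residual is tiny. A quick check (residuals() is the standard stats-package extractor for glm objects):

# The 500-year-old non-survivor sits on the "right side" of the sigmoid:
residuals(fit, type = "deviance")[1]  # close to zero -- the model predicts ~0 survival anyway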
# What do the diagnostic plots tell us?
# plot(fit)
# Now we flip the label, putting the outlier on the "wrong side",
# which leads to a large residual:
dataWithOutlier[1, "Survived"] = 1
fit = glm(Survived ~ Age, data = dataWithOutlier, family = binomial(link = logit))
summary(fit)
## 
## Call:
## glm(formula = Survived ~ Age, family = binomial(link = logit), 
##     data = dataWithOutlier)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.039  -1.026  -1.015   1.336   1.631  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.332154   0.131543  -2.525   0.0116 *
## Age         -0.001383   0.003553  -0.389   0.6970  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 965.27  on 713  degrees of freedom
## Residual deviance: 965.11  on 712  degrees of freedom
## AIC: 969.11
## 
## Number of Fisher Scoring iterations: 4
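The Age coefficient has collapsed toward zero and lost its significance. We can confirm that the flipped point is the culprit (a minimal sketch; residuals() and cooks.distance() are standard stats-package generics for glm fits):

residuals(fit, type = "deviance")[1]  # ~1.63, the Max deviance residual in the summary above
cooks.distance(fit)[1]                # far larger than for any ordinary observation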

Wow, one data point changed the entire fit significantly in the second case but left it virtually untouched in the first. This is exactly the asymmetry we speculated about: when the label agrees with the trend, the saturated sigmoid assigns the outlier a near-zero loss, but when the label contradicts it, the only way to shrink the huge penalty is to flatten the slope.


