Is logistic regression resistant to outliers?



We speculate that the S-shaped sigmoid function is forgiving of outliers in \(x\) as long as the observation is “on the right side”, i.e. as long as its class label does not contradict the general trend of the variable. For example, in the Titanic data we have seen that survival probability tended to decline with increasing age. What if we added a \(500\)-year-old person who did not survive to the data set? In linear regression, such a high-leverage outlier would likely distort the coefficient estimates significantly.
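As a quick sanity check of that claim about linear regression, here is a minimal sketch, assuming Survived is coded numerically as 0/1 so that lm fits a linear probability model:

# Linear probability model with and without the fabricated outlier
lmClean = lm(Survived ~ Age, data = NoMissingAge)
dataLPM = NoMissingAge
dataLPM[1, c("Age", "Survived")] = c(500, 0)  # same 500-year-old non-survivor
lmOutlier = lm(Survived ~ Age, data = dataLPM)
coef(lmClean)
coef(lmOutlier)  # expect a noticeable change in the slope driven by the single high-leverage point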

# Baseline logistic fit on the Titanic data (rows with missing Age removed):
fit = glm(Survived ~ Age, data = NoMissingAge, family = binomial(link = logit))
summary(fit)
## 
## Call:
## glm(formula = Survived ~ Age, family = binomial(link = logit), 
##     data = NoMissingAge)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1488  -1.0361  -0.9544   1.3159   1.5908  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.05672    0.17358  -0.327   0.7438  
## Age         -0.01096    0.00533  -2.057   0.0397 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 960.23  on 712  degrees of freedom
## AIC: 964.23
## 
## Number of Fisher Scoring iterations: 4
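Before inserting the outlier, it is instructive to see what this baseline fit predicts at the outlying age; the value quoted in the comment follows from the printed coefficients above:

predict(fit, newdata = data.frame(Age = 500), type = "response")
# about 0.004: the sigmoid has already saturated near zero, so a
# non-survivor of age 500 agrees perfectly with the fitted trend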
# Add an outlier whose label agrees with the trend: a 500-year-old non-survivor
dataWithOutlier = NoMissingAge
dataWithOutlier[1, c("Age", "Survived")] = c(500, 0)
fit = glm(Survived ~ Age, data = dataWithOutlier, family = binomial(link = logit))
summary(fit)
## 
## Call:
## glm(formula = Survived ~ Age, family = binomial(link = logit), 
##     data = dataWithOutlier)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1515  -1.0373  -0.9545   1.3145   1.5928  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.050245   0.172124  -0.292   0.7704  
## Age         -0.011100   0.005271  -2.106   0.0352 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 959.12  on 712  degrees of freedom
## AIC: 963.12
## 
## Number of Fisher Scoring iterations: 4
# What do the diagnostic plots tell us?
# plot(fit)
# Now change the label, which leads to a large residual:
dataWithOutlier[1, "Survived"] = 1
fit = glm(Survived ~ Age, data = dataWithOutlier, family = binomial(link = logit))
summary(fit)
## 
## Call:
## glm(formula = Survived ~ Age, family = binomial(link = logit), 
##     data = dataWithOutlier)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.039  -1.026  -1.015   1.336   1.631  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.332154   0.131543  -2.525   0.0116 *
## Age         -0.001383   0.003553  -0.389   0.6970  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 965.27  on 713  degrees of freedom
## Residual deviance: 965.11  on 712  degrees of freedom
## AIC: 969.11
## 
## Number of Fisher Scoring iterations: 4

Wow: one data point changed the entire fit in the second case but left it virtually untouched in the first. This is exactly the speculated behavior. As computed above, the fitted survival probability at \(x = 500\) is essentially zero, so a label of \(0\) agrees with the curve and contributes almost nothing to the maximum-likelihood score equations \(\sum_i (y_i - \hat{p}_i) x_i = 0\); a label of \(1\), however, produces a residual of nearly \(1\) that is multiplied by the huge \(x\) value and therefore drags the slope toward zero.
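One way to make this comparison quantitative is to inspect standard influence measures for the outlying row under both labelings. The following is a minimal sketch using base R's accessors for glm objects (resid and cooks.distance), reusing the dataWithOutlier data frame from above:

# Refit both versions and inspect the influence of row 1
dataWithOutlier[1, c("Age", "Survived")] = c(500, 0)
fitAgree = glm(Survived ~ Age, data = dataWithOutlier, family = binomial(link = logit))
dataWithOutlier[1, "Survived"] = 1
fitContra = glm(Survived ~ Age, data = dataWithOutlier, family = binomial(link = logit))
# Deviance residual and Cook's distance of the outlier in each fit;
# we would expect both to be far larger when the label contradicts the trend:
c(resid(fitAgree, type = "deviance")[1], cooks.distance(fitAgree)[1])
c(resid(fitContra, type = "deviance")[1], cooks.distance(fitContra)[1])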



kaggle@HWR

“Kaggling” has been used as a teaching tool at HWR since Fall 2014, when the first cohort of “Business Intelligence and Process Management” (BIPM) master's students entered two competitions in the course Learning from Structured Data, taught by Prof. Loecher.

While our participation was never meant to be a serious attempt at competing with the many hard-working and highly talented teams around the world, we are proud to announce that we actually made the top 10 of the San Francisco Crime Classification leaderboard!

[Screenshot: Kaggle San Francisco Crime Classification leaderboard, showing our top-10 submission]

UPDATE: The competition is officially over, and we are very happy to announce that we ended up in 2nd place:

[Screenshot: Kaggle San Francisco Crime Classification final leaderboard, showing our 2nd-place finish]