t-tests and classifiers









On March 10, 2016, Keyvan asked the following question on ResearchGate:

“Is it true to say that if Two-sample t-test shows that two data are significantly different, then they are good feature for classification purposes?”

Alas, the answer is: not necessarily (in fact, more likely not)! Well, it all depends on what you expect from a “good” feature.

The problem is the aggregate nature of the t-test: it simply detects differences in the estimated means of the two populations. For large enough sample sizes, the t-test will be significant even when the two distributions overlap heavily, i.e. even when the separation of the individual data points is poor!
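Here is a minimal sketch of this effect (the two normal populations below are made up and overlap heavily): as \(N\) grows, the p-value collapses while the accuracy of a simple midpoint rule stays barely above chance.

set.seed(1)
for (N in c(10^3, 10^4, 10^5)) {
  x <- rnorm(N, mean = 0.0, sd = 1)    # population 1
  y <- rnorm(N, mean = 0.1, sd = 1)    # population 2: tiny shift, heavy overlap
  pval <- t.test(x, y)$p.value
  cut  <- (mean(x) + mean(y))/2        # classify each point by the midpoint rule
  acc  <- mean(c(x < cut, y >= cut))   # overall accuracy of that rule
  cat("N =", N, "  p-value =", signif(pval, 2), "  accuracy =", round(acc, 3), "\n")
}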

Even though we are not in a hypothesis-testing setting, it is helpful to define the discriminative power of a feature by its respective type-I and type-II errors.

Remember, with any decision rule we can be wrong in two different ways:

  • Type I error: We reject the Null hypothesis, even though it is true. This error rate is denoted by \(\alpha\).
  • Type II error: We fail to reject the Null hypothesis, even though it is false. This error rate is denoted by \(\beta\).

A closely related concept is the power of a test, which is simply \(1-\beta\). All three measures are visualized in the following two plots, where we choose a symmetric classification rule (\(N=10^3\)):
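Since the actual distributions behind the plots are not reproduced here, the following sketch computes the three measures for two assumed normal populations and a symmetric cutoff halfway between their means; the parameters are purely illustrative.

# purely illustrative parameters; not the exact distributions behind the plots
mu1  <- 0.04; mu2 <- 0.05; sdev <- 0.003
cut  <- (mu1 + mu2)/2                            # symmetric classification rule

alpha <- 1 - pnorm(cut, mean = mu1, sd = sdev)   # type I error: point from population 1 falls right of the cut
beta  <- pnorm(cut, mean = mu2, sd = sdev)       # type II error: point from population 2 falls left of the cut
power <- 1 - beta
round(c(alpha = alpha, beta = beta, power = power), 4)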

I would expect us to conclude that the left panel shows a (very) good feature for classification purposes, whereas the right panel does not at all.
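The exact simulation settings are not shown, but a sketch along the following lines (sample size \(10^3\) per group and a guessed common spread, chosen to roughly match the printed t statistics) should reproduce results of the same flavor:

set.seed(123)
n    <- 10^3
sdev <- 0.003    # guessed spread, roughly consistent with the t statistics below

# left panel: well separated means
s1 <- rnorm(n, mean = 0.04, sd = sdev)
s2 <- rnorm(n, mean = 0.05, sd = sdev)
t.test(s1, s2)

# right panel: heavily overlapping means
s1 <- rnorm(n, mean = 0.040, sd = sdev)
s2 <- rnorm(n, mean = 0.041, sd = sdev)
t.test(s1, s2)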

Would you be surprised to learn that in both cases the t-tests are highly significant? Here it is for the left panel:

## 
##  Welch Two Sample t-test
## 
## data:  s1 and s2
## t = -74.234, df = 1995.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.010219987 -0.009693888
## sample estimates:
##  mean of x  mean of y 
## 0.03995317 0.04991011

and for the right panel:

## 
##  Welch Two Sample t-test
## 
## data:  s1 and s2
## t = -7.6674, df = 1997.2, p-value = 2.721e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.0013042349 -0.0007729414
## sample estimates:
##  mean of x  mean of y 
## 0.03982333 0.04086192

Now, do not conclude that the t-test gives useless or silly results. It simply answers a question very different from the classification one: “Could it be that the two population means are equal?”

This very nice post on the Win-Vector blog shows another example of the gap between prediction and aggregate inference.

And all of the above holds in the best case of perfect normality! Of course, as Fabrice already pointed out, with strong deviations from normality the results of the t-test are even less useful.



Categorical Variables in Trees I








I find it remarkable that very few of the current implementations of tree algorithms in R exploit an important “trick” that was already mentioned in the seminal CART paper of 1984!

“For a two-class problem, the search for the best categorical split can be reduced to M steps using the Gini criterion”

Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth International Group.

See also http://www.math.ccu.edu.tw/~yshih/papers/spl.pdf

and p. 11 in http://www.cs.ubbcluj.ro/~csatol/mestint/pdfs/Gehrke_Loh_DecTree.pdf

(The effect of dummifying categorical variables on performance is nicely elaborated upon in the post “Are categorical variables getting lost in your random forests?”)

This implies that instead of having to search all \(2^{k-1}-1\) possible partitions of the \(k\) levels, it is sufficient to try just \(k-1\) of them! The trick is to order the levels by their mean outcome (or by the proportion of one class) and to consider only splits along that ordering.
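Here is a minimal sketch of that linear search for a regression target; the data are made up purely for illustration.

# a made-up example: k levels whose effect on y we want to split on
set.seed(1)
k <- 8
x <- factor(sample(letters[1:k], 500, replace = TRUE))
y <- rnorm(500, mean = as.integer(x))     # level-dependent mean plus noise

# 1. order the levels by their mean response
ord <- names(sort(tapply(y, x, mean)))

# 2. score only the k-1 splits along this ordering
sse <- function(v) sum((v - mean(v))^2)
scores <- sapply(seq_len(k - 1), function(i) {
  left <- x %in% ord[1:i]
  sse(y[left]) + sse(y[!left])
})
ord[1:which.min(scores)]   # levels sent to the left branch of the best split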

The practical consequences of this exponential vs. linear scaling are quite severe, especially since modern datasets often contain many categorical variables with a large number of levels (it is not rare to see \(k > 1000\)).

Of the popular packages that I tried, only rpart, gbm and RWeka avoid the “exponential trap” and pose (almost) no limit on the number of levels:

#read in the Rossman data set from this kaggle competition:
#  https://www.kaggle.com/c/rossmann-store-sales
#train = read.csv('H:/kaggle/Rossmann/input/train.csv.gz',as.is=T)

#  oh well, fake data serve to illustrate the point just as well:
N = 5*10^5
# Sales depend linearly on the (numeric) store id, plus noise
train = cbind.data.frame(Sales = 0, Store = sample(1:1000, N, replace = TRUE))
train$Sales = 0.1*train$Store + rnorm(N)
train$Store = factor(train$Store)

cat("There are ", length(levels(train$Store)), "stores/levels in this dataset.\n")
## There are  1000 stores/levels in this dataset.

Libraries that fail to exploit the linear search strategy:

The tree package

library(tree, quietly=TRUE)
try({fit = tree(Sales ~ Store, data = train)})

The party package

library(party, quietly=TRUE)
try({fit = ctree(Sales ~ Store, data = train)})
detach("package:party", unload=TRUE)
## Warning: Error in matrix(0, nrow = mi, ncol = nl) : 
##   invalid 'nrow' value (too large or NA)
## In addition: Warning message:
## In matrix(0, nrow = mi, ncol = nl) :
##   NAs introduced by coercion to integer range

The partykit package

library(partykit, quietly=TRUE)
try({fit = lmtree(Sales ~ Store, data = train)})
detach("package:partykit", unload=TRUE)
## Warning: Error in matrix(0, nrow = mi, ncol = nl) : 
##   invalid 'nrow' value (too large or NA)
## In addition: Warning message:
## In matrix(0, nrow = mi, ncol = nl) :
##   NAs introduced by coercion to integer range

The randomForest package

library(randomForest, quietly=TRUE)
try({fit = randomForest(Sales ~ Store, data = train,  ntree=1)})
## Warning: Error in randomForest.default(m, y, ...) : 
##   Can not handle categorical predictors with more than 53 categories.

base lm runs out of memory

try({fit0 = lm(Sales ~ Store, data = train)})
## Warning: Cannot allocate vector of size...

Libraries that shine:

The rpart package

library(rpart, quietly=TRUE)
fit = rpart(Sales ~ Store, data = train)

The gbm package

Well, almost. At least the max number of levels is very high (1024):

library(gbm, quietly=TRUE)
fit = gbm(Sales ~ Store, data = train, interaction.depth = 8, n.trees=1)
## Distribution not specified, assuming gaussian ...

(It is not clear to me why there is a limit at all.)

The RWeka package

library(RWeka, quietly=TRUE)
# J48 handles only classification problems, so binarize the target:
train$Sales2 = factor(train$Sales > mean(train$Sales))
fit = J48(Sales2 ~ Store, data = train)

The h2o package