Pure Interaction Terms: The Witches

In his brilliant book The Witches, Roald Dahl gives a probabilistic description of witch features:

How to Recognise a Witch

The next evening, after my grandmother had given me my bath, she took me once again into the living-room for another story. “Tonight,” the old woman said, “I am going to tell you how to recognise a witch when you see one.” “Can you always be sure?” I asked. “No,” she said, “you can’t. And that’s the trouble. But you can make a pretty good guess.” She was dropping cigar ash all over her lap, and I hoped she wasn’t going to catch on fire before she’d told me how to recognise a witch. “In the first place,” she said, “a REAL WITCH is certain always to be wearing gloves when you meet her.”

“Surely not always,” I said. “What about in the summer when it’s hot?” “Even in the summer,” my grandmother said. “She has to. Do you want to know why?” “Why?” I said. “Because she doesn’t have fingernails. Instead of fingernails, she has thin curvy claws, like a cat, and she wears the gloves to hide them. Mind you, lots of very respectable women wear gloves, especially in winter, so this doesn’t help you very much.” “Mamma used to wear gloves,” I said. “Not in the house,” my grandmother said. “Witches wear gloves even in the house. They only take them off when they go to bed.”

And he goes on with wigs, scratching scalps, nose holes, their eyes, blue spit and the missing toes 🙂 Like an experienced data analyst, he points out the power of pooling evidence:

“None of these things is any good on its own,” my grandmother said. “It’s only when you put them all together that they begin to make a little sense.”

A splendid narration of this chapter can be found here:

I find the gloves a wonderful example of the natural absence of main effects in the presence of a pure interaction term.

This topic continues to receive a lot of attention in the statistical community, e.g.:

  • “Including the interaction but not the main effects in a model”
  • “I need some examples of data where there is an interaction and with/without main effects”

The general advice seems to be that “one should never include an interaction in a model without the corresponding main effects”.


Let us simulate data with the following parameters:

  • \(P(witch) = 0.1\) (10% of all women are witches)
  • \(P(summer) = P(winter) = 0.5\) (the seasons are balanced, and the season is independent of witch status)
  • \(P(glove | \neg witch \wedge summer) = 0.15\) (15% of normal women wear gloves in summer)
  • \(P(glove | \neg witch \wedge winter) = 0.9\) (90% of normal women wear gloves in winter)
  • \(P(glove | witch) = 1\) (all witches wear gloves in all seasons)

So observing a gloved woman in summer or winter results in the following probabilities of being a witch:

\[P(witch | glove \wedge summer) = \frac{P(witch \wedge glove \wedge summer)}{P(glove \wedge summer)}\] \[= \frac{P(witch \wedge glove \wedge summer)}{P(witch \wedge glove \wedge summer) + P(\neg witch \wedge glove \wedge summer)}\] \[=\frac{P(witch) \cdot P(summer | witch) \cdot P(glove | witch \wedge summer)}{P(witch) \cdot P(summer | witch) \cdot P(glove | witch \wedge summer) + P(\neg witch) \cdot P(summer | \neg witch) \cdot P(glove | \neg witch \wedge summer)}\] \[= \frac{0.1 \cdot 0.5 \cdot 1}{0.1 \cdot 0.5 \cdot 1 + 0.9 \cdot 0.5 \cdot 0.15} = 0.425\]

(Here we use the independence of season and witch status: \(P(summer | witch) = P(summer | \neg witch) = P(summer)\).)

\[P(witch | glove \wedge winter) = \frac{P(witch \wedge glove \wedge winter)}{P(glove \wedge winter)}\] \[= \frac{P(witch \wedge glove \wedge winter)}{P(witch \wedge glove \wedge winter) + P(\neg witch \wedge glove \wedge winter)}\] \[= \frac{0.1 \cdot 0.5 \cdot 1}{0.1 \cdot 0.5 \cdot 1 + 0.9 \cdot 0.5 \cdot 0.9} = 0.110\]

Observing a gloved woman in any season:

\(P(witch | glove) = 0.1/(0.1 + 0.9 \cdot 0.525) = 0.1746\)

(The probability of a non-witch wearing gloves in any season is \((0.15+0.9)/2=0.525\))
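These hand calculations are easy to verify in R. Here is a minimal helper (the function and argument names are mine, not part of the original analysis) that applies Bayes’ rule with the parameters above:

posterior_witch = function(p_witch, p_season, p_glove_normal) {
  # P(witch | glove, season): witches always wear gloves, so the
  # numerator is P(witch) * P(season) * 1
  num = p_witch * p_season * 1
  num / (num + (1 - p_witch) * p_season * p_glove_normal)
}
posterior_witch(0.1, 0.5, 0.15)  # summer: 0.425
posterior_witch(0.1, 0.5, 0.9)   # winter: 0.110
0.1 / (0.1 + 0.9 * 0.525)        # any season: 0.1746

Note that p_season cancels out of the ratio; it is kept only to mirror the derivation above.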

# generate data
set.seed(134)
N = 10000
x = cbind.data.frame(
  witch  = sample(c(1, 0), N, replace = TRUE, prob = c(0.1, 0.9)),
  season = factor(sample(c("summer", "winter"), N, replace = TRUE),
                  levels = c("winter", "summer")),
  gloves = 0)
# all witches wear gloves, in every season
w = which(x$witch == 1)
x$gloves[w] = 1
# 15% of normal women wear gloves in summer
ii = which(x$witch == 0 & x$season == "summer")
x$gloves[ii] = sample(c(1, 0), length(ii), replace = TRUE, prob = c(0.15, 0.85))
# 90% of normal women wear gloves in winter
ii = which(x$witch == 0 & x$season == "winter")
x$gloves[ii] = sample(c(1, 0), length(ii), replace = TRUE, prob = c(0.9, 0.1))
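As a quick sanity check (my addition, not in the original code), the empirical witch rates per cell should be close to the theoretical values: 0 for the gloveless, and roughly 0.110 (winter) and 0.425 (summer) for the gloved women.

# empirical P(witch | season, gloves) for each of the four cells
aggregate(witch ~ season + gloves, data = x, FUN = mean)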

The linear model is easier to interpret than the glm, since its coefficients live directly on the probability scale:

  1. Main effects only:
fit1 = lm(witch ~ season + gloves -1, data=x)
#summary(fit1)
print(summary(fit1)$coeff)
##                 Estimate  Std. Error   t value      Pr(>|t|)
## seasonwinter -0.19760411 0.007953069 -24.84627 2.767367e-132
## seasonsummer  0.02372498 0.004327188   5.48277  4.289022e-08
## gloves        0.32785212 0.007613676  43.06095  0.000000e+00
predict(fit1,new=cbind.data.frame(gloves=1,season="summer"), type="response")
##         1 
## 0.3515771
  2. Interaction only:
fit2 = lm(witch ~ season:gloves -1, data=x)
#summary(fit2)
print(summary(fit2)$coeff)
##                      Estimate  Std. Error  t value      Pr(>|t|)
## seasonwinter:gloves 0.1104267 0.004035316 27.36508 4.692681e-159
## seasonsummer:gloves 0.4266667 0.007854195 54.32341  0.000000e+00
predict(fit2,new=cbind.data.frame(gloves=1,season="summer"), type="response")
##         1 
## 0.4266667
  3. Interaction and one main effect (gloves is the sum of the two interaction columns, hence the rank-deficiency warning below):
fit3 = lm(witch ~ gloves:season + gloves-1, data=x)
#summary(fit3)
print(summary(fit3)$coeff)
##                       Estimate  Std. Error   t value      Pr(>|t|)
## gloves               0.4266667 0.007854195  54.32341  0.000000e+00
## gloves:seasonwinter -0.3162399 0.008830184 -35.81351 2.134732e-264
predict(fit3,new=cbind.data.frame(gloves=1,season="summer"), type="response")
## Warning in predict.lm(fit3, new = cbind.data.frame(gloves = 1, season =
## "summer"), : prediction from a rank-deficient fit may be misleading
##         1 
## 0.4266667
  4. Interaction and all main effects:
fit4 = lm(witch ~ season*gloves -1, data=x)
#summary(fit4)
print(summary(fit4)$coeff)
##                          Estimate Std. Error       t value     Pr(>|t|)
## seasonwinter         1.665221e-14 0.01274246  1.306828e-12 1.000000e+00
## seasonsummer        -1.973059e-14 0.00441528 -4.468706e-12 1.000000e+00
## gloves               1.104267e-01 0.01336628  8.261594e+00 1.619632e-16
## seasonsummer:gloves  3.162399e-01 0.01611995  1.961792e+01 4.103302e-84
predict(fit4,new=cbind.data.frame(gloves=1,season="summer"), type="response")
##         1 
## 0.4266667
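Why the pure-interaction model gets it exactly right is easy to see: all gloveless women are non-witches, so the no-intercept fit2 just estimates the witch rate among gloved women in each season, which is precisely the posterior probability computed above. A quick check (my addition):

# these two cell means reproduce the fit2 coefficients (0.110 and 0.427)
with(subset(x, gloves == 1), tapply(witch, season, mean))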

What do the trees look like?

First the inferential tree (witch as the response), then the generative tree (gloves as the response):

library(partykit)
ct_inf = ctree(factor(witch) ~ season + gloves, data = x)  # inferential direction
plot(ct_inf)

ct_gen = ctree(factor(gloves) ~ season + witch, data = x)  # generative direction
plot(ct_gen)
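The leaves of the inferential tree should again recover the conditional witch rates; as a check (my addition), we can ask the tree for the class probabilities of a gloved woman in summer:

# should be close to the 0.427 found above for the witch class
newd = data.frame(season = factor("summer", levels = c("winter", "summer")), gloves = 1)
predict(ct_inf, newdata = newd, type = "prob")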

Unbalanced Data (70% summer) AND low sample size:

Now, observing a gloved woman in summer or winter results in the following probabilities of being a witch:

\[P(witch | glove \wedge summer) = \frac{P(witch \wedge glove \wedge summer)}{P(witch \wedge glove \wedge summer) + P(\neg witch \wedge glove \wedge summer)}\] \[= \frac{0.1 \cdot 0.7 \cdot 1}{0.1 \cdot 0.7 \cdot 1 + 0.9 \cdot 0.7 \cdot 0.15} = 0.425\]

\[P(witch | glove \wedge winter) = \frac{P(witch \wedge glove \wedge winter)}{P(witch \wedge glove \wedge winter) + P(\neg witch \wedge glove \wedge winter)}\] \[= \frac{0.1 \cdot 0.3 \cdot 1}{0.1 \cdot 0.3 \cdot 1 + 0.9 \cdot 0.3 \cdot 0.9} = 0.110\]

(The season probability cancels out, so conditioning on the season yields the same values as in the balanced case.)

Observing a gloved woman in any season:

\(P(witch | glove) = 0.1/(0.1 + 0.9 \cdot 0.375) = 0.2286\)

(The probability of a non-witch wearing gloves in any season is \(0.15 \cdot 0.7 + 0.9 \cdot 0.3 = 0.375\).)
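With the helper defined earlier this is a one-liner. As noted above, the season probability cancels from the season-conditional posteriors, so only the all-season mixture changes:

posterior_witch(0.1, 0.7, 0.15)               # summer: still 0.425
posterior_witch(0.1, 0.3, 0.9)                # winter: still 0.110
0.1 / (0.1 + 0.9 * (0.15 * 0.7 + 0.9 * 0.3))  # any season: 0.2286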

# generate data: unbalanced seasons (70% summer), small sample
set.seed(134)
N = 400
x = cbind.data.frame(
  witch  = sample(c(1, 0), N, replace = TRUE, prob = c(0.1, 0.9)),
  season = factor(sample(c("summer", "winter"), N, replace = TRUE, prob = c(0.7, 0.3)),
                  levels = c("winter", "summer")),
  gloves = 0)
# all witches wear gloves, in every season
w = which(x$witch == 1)
x$gloves[w] = 1
# 15% of normal women wear gloves in summer
ii = which(x$witch == 0 & x$season == "summer")
x$gloves[ii] = sample(c(1, 0), length(ii), replace = TRUE, prob = c(0.15, 0.85))
# 90% of normal women wear gloves in winter
ii = which(x$witch == 0 & x$season == "winter")
x$gloves[ii] = sample(c(1, 0), length(ii), replace = TRUE, prob = c(0.9, 0.1))
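Before fitting, it is worth looking at the cell counts (my addition): with \(N = 400\) and only 30% winter, the winter cells are thin, which is what drives the noisier estimates below.

# cross-tabulate season and gloves; expect few winter observations
with(x, table(season, gloves))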
  1. Main effects only:
fit1 = lm(witch ~ season + gloves -1, data=x)
#summary(fit1)
print(summary(fit1)$coeff)
##                 Estimate Std. Error    t value     Pr(>|t|)
## seasonwinter -0.21824679 0.04089194 -5.3371595 1.591126e-07
## seasonsummer  0.01076554 0.01797279  0.5989908 5.495207e-01
## gloves        0.39225394 0.03506834 11.1854147 1.961967e-25
predict(fit1,new=cbind.data.frame(gloves=1,season="summer"), type="response")
##         1 
## 0.4030195
  2. Interaction only:
fit2 = lm(witch ~ season:gloves -1, data=x)
#summary(fit2)
print(summary(fit2)$coeff)
##                      Estimate Std. Error  t value     Pr(>|t|)
## seasonwinter:gloves 0.1500000 0.02699692  5.55619 5.055488e-08
## seasonsummer:gloves 0.4393939 0.03323091 13.22245 2.432912e-33
predict(fit2,new=cbind.data.frame(gloves=1,season="summer"), type="response")
##         1 
## 0.4393939
  3. Interaction and one main effect:
fit3 = lm(witch ~ gloves:season + gloves -1, data=x)
#summary(fit3)
print(summary(fit3)$coeff)
##                       Estimate Std. Error   t value     Pr(>|t|)
## gloves               0.4393939 0.03323091 13.222447 2.432912e-33
## gloves:seasonwinter -0.2893939 0.04281503 -6.759167 4.954192e-11
predict(fit3,new=cbind.data.frame(gloves=1,season="summer"), type="response")
## Warning in predict.lm(fit3, new = cbind.data.frame(gloves = 1, season =
## "summer"), : prediction from a rank-deficient fit may be misleading
##         1 
## 0.4393939
  4. Interaction and all main effects:
fit4 = lm(witch ~ season*gloves -1, data=x)
#summary(fit4)
print(summary(fit4)$coeff)
##                          Estimate Std. Error       t value    Pr(>|t|)
## seasonwinter        -3.161331e-16 0.08160406 -3.873987e-15 1.000000000
## seasonsummer         3.004133e-16 0.01812407  1.657538e-14 1.000000000
## gloves               1.500000e-01 0.08597522  1.744689e+00 0.081814826
## seasonsummer:gloves  2.893939e-01 0.09396856  3.079689e+00 0.002216654
predict(fit4,new=cbind.data.frame(gloves=1,season="summer"), type="response")
##         1 
## 0.4393939



t-Tests and Classifiers

On March 10th, 2016, Keyvan asked the following question on ResearchGate:

“Is it true to say that if Two-sample t-test shows that two data are significantly different, then they are good feature for classification purposes?”

Alas, the answer is: not necessarily (in fact, more likely not)! Well, it all depends on what you expect from a “good” feature.

The problem is the aggregate nature of the t-test: it simply detects differences in the estimated means of the two populations. For large enough sample sizes, even with very heavy overlap between the two distributions the t-test will be significant, yet the separation of the individual data points remains poor!
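The reason is apparent from the t statistic itself: for two samples of size \(N\) each, with common standard deviation \(s\),

\[t = \frac{\bar{x}_1 - \bar{x}_2}{s \sqrt{2/N}},\]

so for a fixed (even tiny) mean difference \(t\) grows like \(\sqrt{N}\), while the overlap of the two distributions, which is what limits the classification of individual points, does not change at all.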

Even though we are not in a hypothesis-testing setting, it is helpful to quantify the discriminative power of a feature by its respective type I and type II errors.

Remember, with any decision rule we can be wrong in two different ways:

  • Type I error: We reject the Null hypothesis, even though it is true. This error rate is denoted by \(\alpha\).
  • Type II error: We fail to reject the Null hypothesis, even though it is wrong. This error rate is denoted by \(\beta\).

A closely related concept is the power of a test, \(1 - \beta\). All three measures are visualized in the following two plots, where we choose a symmetric classification rule (\(N = 10^3\)):
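The simulation code behind the plots is not shown here; the following sketch computes the three measures analytically for the symmetric midpoint cutoff, with means and standard deviation roughly reconstructed from the two t-test outputs below (\(N = 10^3\) per group):

# alpha, beta and power of the rule "classify as population 2 above the
# midpoint of the two means"; mu1, mu2, s are reconstructed values
classifier_errors = function(mu1, mu2, s) {
  cutoff = (mu1 + mu2) / 2
  alpha = 1 - pnorm(cutoff, mean = mu1, sd = s)  # type I error
  beta  = pnorm(cutoff, mean = mu2, sd = s)      # type II error
  c(alpha = alpha, beta = beta, power = 1 - beta)
}
classifier_errors(0.040, 0.050, 0.003)  # left panel:  ~5% error rates
classifier_errors(0.040, 0.041, 0.003)  # right panel: ~43% error rates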

I would expect us to conclude that the left panel shows a (very) good feature for classification purposes, whereas the right panel does not at all.

Would you be surprised to learn that in both cases the t-tests are highly significant? Here is the result for the left panel:

## 
##  Welch Two Sample t-test
## 
## data:  s1 and s2
## t = -74.234, df = 1995.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.010219987 -0.009693888
## sample estimates:
##  mean of x  mean of y 
## 0.03995317 0.04991011

and for the right panel:

## 
##  Welch Two Sample t-test
## 
## data:  s1 and s2
## t = -7.6674, df = 1997.2, p-value = 2.721e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.0013042349 -0.0007729414
## sample estimates:
##  mean of x  mean of y 
## 0.03982333 0.04086192

Now do not conclude that the t-test gives useless or silly results. It simply answers a question very different from the classification task: “Could it be that the two population means are equal?”
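To make the gap concrete, here is a minimal simulation in the spirit of the right panel (the parameters are my reconstruction, not the original code): the t-test is overwhelmingly significant, yet the symmetric cutoff misclassifies roughly 43% of all points.

set.seed(1)
n = 1000
s1 = rnorm(n, mean = 0.040, sd = 0.003)
s2 = rnorm(n, mean = 0.041, sd = 0.003)
t.test(s1, s2)$p.value              # tiny: the means differ "significantly"
cutoff = (mean(s1) + mean(s2)) / 2
mean(c(s1 > cutoff, s2 <= cutoff))  # overall misclassification rate, ~0.43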

This very nice post on the Win-Vector blog shows another example of the gap between prediction and aggregate inference.

And all of the above assumes the best case of perfect normality! Of course, as Fabrice already pointed out, with strong deviations from normality the results of the t-test are even less useful.