Pure Interaction Terms: The Witches

In his brilliant book The Witches, Roald Dahl gives a probabilistic description of witch features:

How to Recognise a Witch

The next evening, after my grandmother had given me my bath, she took me once again into the living-room for another story. “Tonight,” the old woman said, “I am going to tell you how to recognise a witch when you see one.” “Can you always be sure?” I asked. “No,” she said, “you can’t. And that’s the trouble. But you can make a pretty good guess.” She was dropping cigar ash all over her lap, and I hoped she wasn’t going to catch on fire before she’d told me how to recognise a witch. “In the first place,” she said, “a REAL WITCH is certain always to be wearing gloves when you meet her.”

“Surely not always,” I said. “What about in the summer when it’s hot?” “Even in the summer,” my grandmother said. “She has to. Do you want to know why?” “Why?” I said. “Because she doesn’t have fingernails. Instead of fingernails, she has thin curvy claws, like a cat, and she wears the gloves to hide them. Mind you, lots of very respectable women wear gloves, especially in winter, so this doesn’t help you very much.” “Mamma used to wear gloves,” I said. “Not in the house,” my grandmother said. “Witches wear gloves even in the house. They only take them off when they go to bed.”

And he goes on with wigs, scratching scalps, nose holes, their eyes, blue spit and the missing toes 🙂 Like an experienced data analyst, he points out the power of pooling evidence:

“None of these things is any good on its own,” my grandmother said. “It’s only when you put them all together that they begin to make a little sense.”

A splendid narration of this chapter can be found here:

I find the gloves a wonderful example of the natural absence of main effects in the presence of a pure interaction term.

This topic continues to receive a lot of attention in the statistical community, e.g.:

  • “Including the interaction but not the main effects in a model”
  • “I need some examples of data where there is an interaction and with/without main effects”

The general advice seems to be that “one should never include an interaction in a model without the corresponding main effects”.


Let us simulate data with the following parameters:

  • \(P(witch) = 0.1\) (10% of all women are witches)
  • \(P(summer) = P(winter) = 0.5\) (the seasons are balanced, and the season is independent of witch status)
  • \(P(glove | \neg witch \wedge summer) = 0.15\) (15% of normal women wear gloves in summer)
  • \(P(glove | \neg witch \wedge winter) = 0.9\) (90% of normal women wear gloves in winter)
  • \(P(glove | witch) = 1\) (all witches wear gloves in all seasons)

So observing a gloved woman in summer or winter results in the following probabilities of being a witch:

\[P(witch | glove \wedge summer) = \frac{P(witch \wedge glove \wedge summer)}{P(glove \wedge summer)}\] \[= \frac{P(witch \wedge glove \wedge summer)}{P(witch \wedge glove \wedge summer) + P(\neg witch \wedge glove \wedge summer)}\] \[=\frac{P(witch) \cdot P(summer | witch) \cdot P(glove | witch \wedge summer)}{P(witch) \cdot P(summer | witch) \cdot P(glove | witch \wedge summer) + P(\neg witch) \cdot P(summer | \neg witch) \cdot P(glove | \neg witch \wedge summer)}\] \[= \frac{0.1 \cdot 0.5 \cdot 1}{0.1 \cdot 0.5 \cdot 1 + 0.9 \cdot 0.5 \cdot 0.15} = 0.425\]

(Here we use the independence of season and witch status: \(P(summer | witch) = P(summer | \neg witch) = P(summer)\).)

\[P(witch | glove \wedge winter) = \frac{P(witch \wedge glove \wedge winter)}{P(glove \wedge winter)}\] \[= \frac{P(witch \wedge glove \wedge winter)}{P(witch \wedge glove \wedge winter) + P(\neg witch \wedge glove \wedge winter)}\] \[= \frac{0.1 \cdot 0.5 \cdot 1}{0.1 \cdot 0.5 \cdot 1 + 0.9 \cdot 0.5 \cdot 0.9} = 0.110\]

Observing a gloved woman in any season:

\(P(witch | glove) = 0.1/(0.1 + 0.9 \cdot 0.525) = 0.1746\)

(The probability of a non-witch wearing gloves in any season is \((0.15+0.9)/2=0.525\))
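These hand calculations are easy to verify in R. Here is a minimal helper (the function and argument names are mine, not part of the original analysis) that applies Bayes’ rule with the parameters above:

posterior_witch = function(p_witch, p_season, p_glove_normal) {
  # P(witch | glove, season): witches always wear gloves, so the
  # numerator is P(witch) * P(season) * 1
  num = p_witch * p_season * 1
  num / (num + (1 - p_witch) * p_season * p_glove_normal)
}
posterior_witch(0.1, 0.5, 0.15)  # summer: 0.425
posterior_witch(0.1, 0.5, 0.9)   # winter: 0.110
0.1 / (0.1 + 0.9 * 0.525)        # any season: 0.1746

Note that p_season cancels out of the ratio; it is kept only to mirror the derivation above.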

# generate data
set.seed(134)
N = 10000
x = cbind.data.frame(
  witch  = sample(c(1, 0), N, replace = TRUE, prob = c(0.1, 0.9)),
  season = factor(sample(c("summer", "winter"), N, replace = TRUE),
                  levels = c("winter", "summer")),
  gloves = 0)
# all witches wear gloves, in every season
w = which(x$witch == 1)
x$gloves[w] = 1
# 15% of normal women wear gloves in summer
ii = which(x$witch == 0 & x$season == "summer")
x$gloves[ii] = sample(c(1, 0), length(ii), replace = TRUE, prob = c(0.15, 0.85))
# 90% of normal women wear gloves in winter
ii = which(x$witch == 0 & x$season == "winter")
x$gloves[ii] = sample(c(1, 0), length(ii), replace = TRUE, prob = c(0.9, 0.1))
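As a quick sanity check (my addition, not in the original code), the empirical witch rates per cell should be close to the theoretical values: 0 for the gloveless, and roughly 0.110 (winter) and 0.425 (summer) for the gloved women.

# empirical P(witch | season, gloves) for each of the four cells
aggregate(witch ~ season + gloves, data = x, FUN = mean)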

The linear model is easier to interpret than the glm, since its coefficients live directly on the probability scale:

  1. Main effects only:
fit1 = lm(witch ~ season + gloves -1, data=x)
#summary(fit1)
print(summary(fit1)$coeff)
##                 Estimate  Std. Error   t value      Pr(>|t|)
## seasonwinter -0.19760411 0.007953069 -24.84627 2.767367e-132
## seasonsummer  0.02372498 0.004327188   5.48277  4.289022e-08
## gloves        0.32785212 0.007613676  43.06095  0.000000e+00
predict(fit1,new=cbind.data.frame(gloves=1,season="summer"), type="response")
##         1 
## 0.3515771
  2. Interaction only:
fit2 = lm(witch ~ season:gloves -1, data=x)
#summary(fit2)
print(summary(fit2)$coeff)
##                      Estimate  Std. Error  t value      Pr(>|t|)
## seasonwinter:gloves 0.1104267 0.004035316 27.36508 4.692681e-159
## seasonsummer:gloves 0.4266667 0.007854195 54.32341  0.000000e+00
predict(fit2,new=cbind.data.frame(gloves=1,season="summer"), type="response")
##         1 
## 0.4266667
  3. Interaction and one main effect (gloves is the sum of the two interaction columns, hence the rank-deficiency warning below):
fit3 = lm(witch ~ gloves:season + gloves-1, data=x)
#summary(fit3)
print(summary(fit3)$coeff)
##                       Estimate  Std. Error   t value      Pr(>|t|)
## gloves               0.4266667 0.007854195  54.32341  0.000000e+00
## gloves:seasonwinter -0.3162399 0.008830184 -35.81351 2.134732e-264
predict(fit3,new=cbind.data.frame(gloves=1,season="summer"), type="response")
## Warning in predict.lm(fit3, new = cbind.data.frame(gloves = 1, season =
## "summer"), : prediction from a rank-deficient fit may be misleading
##         1 
## 0.4266667
  4. Interaction and all main effects:
fit4 = lm(witch ~ season*gloves -1, data=x)
#summary(fit4)
print(summary(fit4)$coeff)
##                          Estimate Std. Error       t value     Pr(>|t|)
## seasonwinter         1.665221e-14 0.01274246  1.306828e-12 1.000000e+00
## seasonsummer        -1.973059e-14 0.00441528 -4.468706e-12 1.000000e+00
## gloves               1.104267e-01 0.01336628  8.261594e+00 1.619632e-16
## seasonsummer:gloves  3.162399e-01 0.01611995  1.961792e+01 4.103302e-84
predict(fit4,new=cbind.data.frame(gloves=1,season="summer"), type="response")
##         1 
## 0.4266667
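Why the pure-interaction model gets it exactly right is easy to see: all gloveless women are non-witches, so the no-intercept fit2 just estimates the witch rate among gloved women in each season, which is precisely the posterior probability computed above. A quick check (my addition):

# these two cell means reproduce the fit2 coefficients (0.110 and 0.427)
with(subset(x, gloves == 1), tapply(witch, season, mean))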

What do the trees look like?

First the inferential tree (witch as the response), then the generative tree (gloves as the response):

library(partykit)
ct_inf = ctree(factor(witch) ~ season + gloves, data = x)  # inferential direction
plot(ct_inf)

ct_gen = ctree(factor(gloves) ~ season + witch, data = x)  # generative direction
plot(ct_gen)
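The leaves of the inferential tree should again recover the conditional witch rates; as a check (my addition), we can ask the tree for the class probabilities of a gloved woman in summer:

# should be close to the 0.427 found above for the witch class
newd = data.frame(season = factor("summer", levels = c("winter", "summer")), gloves = 1)
predict(ct_inf, newdata = newd, type = "prob")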

Unbalanced Data (70% summer) AND low sample size:

Now, observing a gloved woman in summer or winter results in the following probabilities of being a witch:

\[P(witch | glove \wedge summer) = \frac{P(witch \wedge glove \wedge summer)}{P(witch \wedge glove \wedge summer) + P(\neg witch \wedge glove \wedge summer)}\] \[= \frac{0.1 \cdot 0.7 \cdot 1}{0.1 \cdot 0.7 \cdot 1 + 0.9 \cdot 0.7 \cdot 0.15} = 0.425\]

\[P(witch | glove \wedge winter) = \frac{P(witch \wedge glove \wedge winter)}{P(witch \wedge glove \wedge winter) + P(\neg witch \wedge glove \wedge winter)}\] \[= \frac{0.1 \cdot 0.3 \cdot 1}{0.1 \cdot 0.3 \cdot 1 + 0.9 \cdot 0.3 \cdot 0.9} = 0.110\]

(The season probability cancels out, so conditioning on the season yields the same values as in the balanced case.)

Observing a gloved woman in any season:

\(P(witch | glove) = 0.1/(0.1 + 0.9 \cdot 0.375) = 0.2286\)

(The probability of a non-witch wearing gloves in any season is \(0.15 \cdot 0.7 + 0.9 \cdot 0.3 = 0.375\).)
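With the helper defined earlier this is a one-liner. As noted above, the season probability cancels from the season-conditional posteriors, so only the all-season mixture changes:

posterior_witch(0.1, 0.7, 0.15)               # summer: still 0.425
posterior_witch(0.1, 0.3, 0.9)                # winter: still 0.110
0.1 / (0.1 + 0.9 * (0.15 * 0.7 + 0.9 * 0.3))  # any season: 0.2286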

# generate data: unbalanced seasons (70% summer), small sample
set.seed(134)
N = 400
x = cbind.data.frame(
  witch  = sample(c(1, 0), N, replace = TRUE, prob = c(0.1, 0.9)),
  season = factor(sample(c("summer", "winter"), N, replace = TRUE, prob = c(0.7, 0.3)),
                  levels = c("winter", "summer")),
  gloves = 0)
# all witches wear gloves, in every season
w = which(x$witch == 1)
x$gloves[w] = 1
# 15% of normal women wear gloves in summer
ii = which(x$witch == 0 & x$season == "summer")
x$gloves[ii] = sample(c(1, 0), length(ii), replace = TRUE, prob = c(0.15, 0.85))
# 90% of normal women wear gloves in winter
ii = which(x$witch == 0 & x$season == "winter")
x$gloves[ii] = sample(c(1, 0), length(ii), replace = TRUE, prob = c(0.9, 0.1))
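Before fitting, it is worth looking at the cell counts (my addition): with \(N = 400\) and only 30% winter, the winter cells are thin, which is what drives the noisier estimates below.

# cross-tabulate season and gloves; expect few winter observations
with(x, table(season, gloves))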
  1. Main effects only:
fit1 = lm(witch ~ season + gloves -1, data=x)
#summary(fit1)
print(summary(fit1)$coeff)
##                 Estimate Std. Error    t value     Pr(>|t|)
## seasonwinter -0.21824679 0.04089194 -5.3371595 1.591126e-07
## seasonsummer  0.01076554 0.01797279  0.5989908 5.495207e-01
## gloves        0.39225394 0.03506834 11.1854147 1.961967e-25
predict(fit1,new=cbind.data.frame(gloves=1,season="summer"), type="response")
##         1 
## 0.4030195
  2. Interaction only:
fit2 = lm(witch ~ season:gloves -1, data=x)
#summary(fit2)
print(summary(fit2)$coeff)
##                      Estimate Std. Error  t value     Pr(>|t|)
## seasonwinter:gloves 0.1500000 0.02699692  5.55619 5.055488e-08
## seasonsummer:gloves 0.4393939 0.03323091 13.22245 2.432912e-33
predict(fit2,new=cbind.data.frame(gloves=1,season="summer"), type="response")
##         1 
## 0.4393939
  3. Interaction and one main effect:
fit3 = lm(witch ~ gloves:season + gloves -1, data=x)
#summary(fit3)
print(summary(fit3)$coeff)
##                       Estimate Std. Error   t value     Pr(>|t|)
## gloves               0.4393939 0.03323091 13.222447 2.432912e-33
## gloves:seasonwinter -0.2893939 0.04281503 -6.759167 4.954192e-11
predict(fit3,new=cbind.data.frame(gloves=1,season="summer"), type="response")
## Warning in predict.lm(fit3, new = cbind.data.frame(gloves = 1, season =
## "summer"), : prediction from a rank-deficient fit may be misleading
##         1 
## 0.4393939
  4. Interaction and all main effects:
fit4 = lm(witch ~ season*gloves -1, data=x)
#summary(fit4)
print(summary(fit4)$coeff)
##                          Estimate Std. Error       t value    Pr(>|t|)
## seasonwinter        -3.161331e-16 0.08160406 -3.873987e-15 1.000000000
## seasonsummer         3.004133e-16 0.01812407  1.657538e-14 1.000000000
## gloves               1.500000e-01 0.08597522  1.744689e+00 0.081814826
## seasonsummer:gloves  2.893939e-01 0.09396856  3.079689e+00 0.002216654
predict(fit4,new=cbind.data.frame(gloves=1,season="summer"), type="response")
##         1 
## 0.4393939



t-Tests and Classifiers

On March 10th, 2016, Keyvan asked the following question on ResearchGate:

“Is it true to say that if Two-sample t-test shows that two data are significantly different, then they are good feature for classification purposes?”

Alas, the answer is: not necessarily (in fact, more likely not)! Well, it all depends on what you expect from a “good” feature.

The problem is the aggregate nature of the t-test: it simply detects differences in the estimated means of the two populations. For large enough sample sizes, even with very heavy overlap between the two distributions the t-test will be significant, yet the separation of the individual data points remains poor!
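The reason is apparent from the t statistic itself: for two samples of size \(N\) each, with common standard deviation \(s\),

\[t = \frac{\bar{x}_1 - \bar{x}_2}{s \sqrt{2/N}},\]

so for a fixed (even tiny) mean difference \(t\) grows like \(\sqrt{N}\), while the overlap of the two distributions, which is what limits the classification of individual points, does not change at all.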

Even though we are not in a hypothesis-testing setting, it is helpful to quantify the discriminative power of a feature by its respective type I and type II errors.

Remember, with any decision rule we can be wrong in two different ways:

  • Type I error: We reject the Null hypothesis, even though it is true. This error rate is denoted by \(\alpha\).
  • Type II error: We fail to reject the Null hypothesis, even though it is wrong. This error rate is denoted by \(\beta\).

A closely related concept is the power of a test, \(1 - \beta\). All three measures are visualized in the following two plots, where we choose a symmetric classification rule (\(N = 10^3\)):
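The simulation code behind the plots is not shown here; the following sketch computes the three measures analytically for the symmetric midpoint cutoff, with means and standard deviation roughly reconstructed from the two t-test outputs below (\(N = 10^3\) per group):

# alpha, beta and power of the rule "classify as population 2 above the
# midpoint of the two means"; mu1, mu2, s are reconstructed values
classifier_errors = function(mu1, mu2, s) {
  cutoff = (mu1 + mu2) / 2
  alpha = 1 - pnorm(cutoff, mean = mu1, sd = s)  # type I error
  beta  = pnorm(cutoff, mean = mu2, sd = s)      # type II error
  c(alpha = alpha, beta = beta, power = 1 - beta)
}
classifier_errors(0.040, 0.050, 0.003)  # left panel:  ~5% error rates
classifier_errors(0.040, 0.041, 0.003)  # right panel: ~43% error rates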

I would expect us to conclude that the left panel shows a (very) good feature for classification purposes, whereas the right panel does not at all.

Would you be surprised to learn that in both cases the t-tests are highly significant? Here is the result for the left panel:

## 
##  Welch Two Sample t-test
## 
## data:  s1 and s2
## t = -74.234, df = 1995.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.010219987 -0.009693888
## sample estimates:
##  mean of x  mean of y 
## 0.03995317 0.04991011

and for the right panel:

## 
##  Welch Two Sample t-test
## 
## data:  s1 and s2
## t = -7.6674, df = 1997.2, p-value = 2.721e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.0013042349 -0.0007729414
## sample estimates:
##  mean of x  mean of y 
## 0.03982333 0.04086192

Now do not conclude that the t-test gives useless or silly results. It simply answers a question very different from the classification task: “Could it be that the two population means are equal?”
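To make the gap concrete, here is a minimal simulation in the spirit of the right panel (the parameters are my reconstruction, not the original code): the t-test is overwhelmingly significant, yet the symmetric cutoff misclassifies roughly 43% of all points.

set.seed(1)
n = 1000
s1 = rnorm(n, mean = 0.040, sd = 0.003)
s2 = rnorm(n, mean = 0.041, sd = 0.003)
t.test(s1, s2)$p.value              # tiny: the means differ "significantly"
cutoff = (mean(s1) + mean(s2)) / 2
mean(c(s1 > cutoff, s2 <= cutoff))  # overall misclassification rate, ~0.43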

This very nice post on the Win-Vector blog shows another example of the gap between prediction and aggregate inference.

And all of the above assumes the best case of perfect normality! Of course, as Fabrice already pointed out, with strong deviations from normality the results of the t-test are even less useful.