I learned about Simpson’s paradox fairly recently, and I found it quite disturbing, not because of the “paradox” itself, but mainly because I felt it was something I should have known already.

In case you haven’t heard about it, one instance of the paradox is a real-world medical study comparing the success rates of two treatments for kidney stones (from Wikipedia):

| | Treatment A | Treatment B |
|---|---|---|
| Small stones | Group 1: 93% (81/87) | Group 2: 87% (234/270) |
| Large stones | Group 3: 73% (192/263) | Group 4: 69% (55/80) |
| Both | 78% (273/350) | 83% (289/350) |

Overall, Treatment B looks better: its success rate is 83%, compared to 78% for Treatment A. However, when the patients are split into two groups, those with small stones and those with large stones, Treatment A is better than Treatment B in **both** subgroups. Paradoxical enough?

Well, it’s not. It turns out that for severe cases (large stones), doctors tend to give the more effective Treatment A, while for milder cases with small stones, they tend to give the inferior Treatment B. The totals are therefore dominated by groups 2 and 3 (Treatment B on small stones and Treatment A on large stones), while the other two groups contribute little to the final sums. So the results can be interpreted more accurately as: when Treatment B is more frequently applied to less severe cases, it can appear to be more effective.
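To make the reversal concrete, here is a small sketch that recomputes the rates from the raw counts in the Wikipedia table (successes and patients per treatment and stone size). The `counts` dictionary and the `success_rate` helper are names I made up for illustration.

```python
# (successes, patients) per (treatment, stone size),
# as listed in the Wikipedia table for the kidney stone study.
counts = {
    ("A", "small"): (81, 87),
    ("B", "small"): (234, 270),
    ("A", "large"): (192, 263),
    ("B", "large"): (55, 80),
}


def success_rate(treatment, size=None):
    """Success rate of a treatment, optionally restricted to one stone size."""
    cells = [
        cell
        for (t, s), cell in counts.items()
        if t == treatment and (size is None or s == size)
    ]
    successes = sum(c[0] for c in cells)
    patients = sum(c[1] for c in cells)
    return successes / patients


for size in ("small", "large", None):
    label = size or "overall"
    a, b = success_rate("A", size), success_rate("B", size)
    print(f"{label:>7}: A = {a:.0%}, B = {b:.0%}")
```

Treatment A wins in each row taken separately, yet loses in the pooled row, simply because the two treatments were given to very different mixes of patients.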

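Now, knowing that *Treatment* and *Stone size* are not independent, this should not come across as a paradox. In fact, we can visualize the problem as a graphical model in which *Stone size* (St) influences both *Treatment* (T) and *Success* (S), and *Treatment* in turn influences *Success*.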

All the numbers in the table above can be expressed as conditional probabilities like so:

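- Group 1: $P(S \mid T = A, \mathit{St} = \text{small}) = 93\%$
- Group 2: $P(S \mid T = B, \mathit{St} = \text{small}) = 87\%$
- Group 3: $P(S \mid T = A, \mathit{St} = \text{large}) = 73\%$
- Group 4: $P(S \mid T = B, \mathit{St} = \text{large}) = 69\%$
- Overall, Treatment A: $P(S \mid T = A) = 78\%$
- Overall, Treatment B: $P(S \mid T = B) = 83\%$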

For anyone who has studied probability, it is no surprise that probabilities can flip upside down whenever some conditioning variables are dropped from the equations. In this particular case, since S (success) depends on both St (stone size) and T (treatment), the overall rates, which condition on T alone, do not bring any new knowledge about S.
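To see where the flip comes from, one can expand each overall rate with the law of total probability. This is just a worked sketch using the counts from the Wikipedia table; the weights $P(\mathit{St} \mid T)$ are read off the group sizes.

```latex
\begin{aligned}
P(S \mid T) &= \sum_{st} P(S \mid T, \mathit{St} = st)\, P(\mathit{St} = st \mid T) \\
P(S \mid T = A) &= \tfrac{81}{87} \cdot \tfrac{87}{350} + \tfrac{192}{263} \cdot \tfrac{263}{350} = \tfrac{273}{350} \approx 0.78 \\
P(S \mid T = B) &= \tfrac{234}{270} \cdot \tfrac{270}{350} + \tfrac{55}{80} \cdot \tfrac{80}{350} = \tfrac{289}{350} \approx 0.83
\end{aligned}
```

Treatment A's average puts about 75% of its weight (263/350) on the hard large-stone cases, while Treatment B's puts about 77% (270/350) on the easy small-stone cases. The two overall rates are averages taken with very different weights, so comparing them says little about the treatments themselves.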

So what is this “paradox” about? Isn’t it nothing more than the problem of confounding/lurking variables, something that most people in probability/statistics already know? In this particular case, *Stone size* is the lurking variable that influences both *Treatment* and *Success*, therefore the scientists who designed the experiment should have taken it into account from the beginning. It is well known among statistics practitioners that they must try their best to identify and eliminate the effect of any lurking variables in their experiments, or at least keep them fixed, before drawing any meaningful conclusion.
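One standard way to remove the effect of a known lurking variable after the fact is to standardize: average each treatment's per-group success rate using the same stone-size distribution for both treatments. The sketch below does that with the Wikipedia counts. It assumes stone size is the only confounder, which the data alone cannot confirm, and `adjusted_rate` is a hypothetical helper name.

```python
# (successes, patients) per (treatment, stone size), from the Wikipedia table.
counts = {
    ("A", "small"): (81, 87),
    ("B", "small"): (234, 270),
    ("A", "large"): (192, 263),
    ("B", "large"): (55, 80),
}

sizes = ("small", "large")
total_patients = sum(n for _, n in counts.values())

# Share of each stone size in the whole study, regardless of treatment.
p_size = {
    s: sum(n for (_, sz), (_, n) in counts.items() if sz == s) / total_patients
    for s in sizes
}


def adjusted_rate(treatment):
    """Success rate if both treatments faced the same mix of stone sizes.

    Assumption: stone size is the only variable that affects both the
    choice of treatment and the outcome.
    """
    return sum(
        p_size[s] * counts[(treatment, s)][0] / counts[(treatment, s)][1]
        for s in sizes
    )


print(f"A: {adjusted_rate('A'):.1%}, B: {adjusted_rate('B'):.1%}")
```

With the same patient mix for both, Treatment A comes out ahead (roughly 83% versus 78%), in line with what the subgroups already suggested.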

From a slightly different perspective, the paradox can be understood once we understand the human bias toward drawing causal relations. Humans, perhaps for the sake of survival, constantly look for causal relations and often tend to ignore rates or proportions. Once we have conceived of something as causal (Treatment B gives a higher success rate than Treatment A in general), which might be wrong, we continue to assume a causal relation and proceed with that assumption in mind. Obviously, with this assumption, we will find the success rates for the subgroups of patients to be highly counter-intuitive, or even paradoxical.

In fact, the connection of this *paradox* to human intuition is so important that Judea Pearl dedicated a whole section of his book to it. Modern statistics textbooks and curricula, however, don’t even mention it by name. Instead, they generally present the topic along with lurking/confounding variables.

Therefore, if you haven’t heard about this, it is probably for a good reason, or perhaps you are simply too young.

Nicely done. Causality can be a very tricky thing, as can simplistic summaries of policies.

Ha, I think I have a “theory” about simplistic summaries (or simple stories in general) 😉 I don’t know if anyone has mentioned it before, but here it goes anyway.

In general I think natural processes are all complicated. Things that can be summarized in nice-looking equations (i.e. have closed-form solutions) are rare and limited. In fact I am tempted to think that perhaps almost all of those “simple” processes have already been discovered, and we are left with the messy, complicated, irregular ones.

Luckily we have some theories to deal with random processes too, but hell, our theories are way simplified. Random processes that can be captured as mathematical models need certain properties/approximations to be tractable, since our tools, be they computers or mathematical systems, are limited in their power. Unfortunately, Nature does not seem to care whether her processes are tractable for humans 😉

So, finding cases where intractable models become tractable is perhaps where most of the interesting work will be carried out.

Of course that is at a very high level of abstraction. In Machine Learning, it manifests itself in several different ways, but a somewhat clear example is the “dominance” of probabilistic approaches (compared to rule-based systems), or of strong/complicated function approximators like deep neural nets.

Humans in general are limited in processing power too. It is impossible for a teacher to memorize the marks of all 100 of her students in the final exam, but instead she can compute the mean score and say: yeah, on average my class scored 80. That is a very crude approximation of reality, but it is simple to communicate. Now she can extend that and say the marks have a mean of 80 and a standard deviation of 10. It gives a clearer picture of the students, but it is a slightly more complicated model.
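As a toy illustration of how much a one-number summary can hide, here are two classes with the same mean but very different spreads. The marks are made up for the example and are not from any real class.

```python
import statistics

# Hypothetical marks for two small classes (made up for illustration).
class_a = [80, 80, 80, 80, 80, 80]
class_b = [60, 70, 80, 80, 90, 100]

for name, marks in (("class_a", class_a), ("class_b", class_b)):
    mean = statistics.mean(marks)
    # Population standard deviation: the marks are the whole class, not a sample.
    spread = statistics.pstdev(marks)
    print(f"{name}: mean = {mean:.1f}, std = {spread:.1f}")
```

Reporting only the mean makes the two classes look identical; adding the standard deviation is the "slightly more complicated model" that tells them apart.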

So it is all about finding a trade-off somewhere.

We also do that for most other stuff we encounter in life. It is surely interesting to learn about the policies of 20 different American presidents and draw some decent insights from that. But it would take a lot of time and effort. Most of us wouldn’t really care about the minuscule details though, and will be happy moving forward with some form of crude approximation.

The key is to acknowledge the fact that all summaries are approximations. Different people with different motivations/viewpoints are gonna take different summaries of the same facts, but that in general is not too big of a problem anyway. In fact I enjoy looking at the same facts from different viewpoints, e.g. looking at the same mathematical model from different approaches. That’s where most of my learning happens.

Another point is that we as humans have a limited attention span. We can’t pay attention to all the stimuli around us. Therefore we have to pick which ones to respond to (or to build a fairly complicated model/summary for). For the rest, a crude approximation is most of the time enough for us to get through the day.

Damn, I should’ve made this into a post. Didn’t anticipate my 2-cent philosophy would take this long.