Simpson's paradox arises in statistics when a trend appears in different groups of data (e.g., surveys) but disappears or reverses when these groups are combined.
Here is a great baseball example (with actual data) of Simpson's Paradox, taken from Wikipedia:
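To make the numbers concrete, here is a quick Python sketch using the hits and at-bats reported in the Wikipedia article (worth double-checking there before reusing the figures):

# Hits and at-bats for Derek Jeter and David Justice, 1995-1997,
# as given in the Wikipedia "Simpson's paradox" article.
jeter = {1995: (12, 48), 1996: (183, 582), 1997: (190, 654)}
justice = {1995: (104, 411), 1996: (45, 140), 1997: (163, 495)}

def average(hits, at_bats):
    # Batting average: hits divided by times at bat.
    return hits / at_bats

# Per-year averages: Justice wins each season.
for year in (1995, 1996, 1997):
    jh, jab = jeter[year]
    dh, dab = justice[year]
    print(f"{year}: Jeter {average(jh, jab):.3f}  Justice {average(dh, dab):.3f}")

# Combine the three seasons: the ranking flips and Jeter comes out ahead.
j_hits, j_ab = map(sum, zip(*jeter.values()))
d_hits, d_ab = map(sum, zip(*justice.values()))
print(f"Total: Jeter {average(j_hits, j_ab):.3f}  Justice {average(d_hits, d_ab):.3f}")

Running it prints Justice ahead in each of 1995, 1996, and 1997, but Jeter ahead (.300 vs. .298) on the combined totals.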
In case the "paradox" isn't clear: for three years David Justice had a higher batting average (hits divided by times at bat) than Derek Jeter. If the Yankees' General Manager of the mid-'90s were going to base the players' next contracts on their hitting over these three years, he might say to himself:
"Hmm. Justice hit better three years in a row. He gets more money. "
Jeter might even concede the reasonableness of the decision.
It would be the wrong decision. Even though Jeter lost the comparison every single year, his cumulative average is higher. He deserves the bigger contract.
Again: Every year Justice had the higher average, but when the years are combined Jeter had the higher average. Most people find that counter-intuitive. (And those who don't, lie.)
This is a real problem in statistics. The medical profession, for example, must steadfastly guard against drawing false conclusions when combining results from different sets of data.
Mathematically it is clear why it happens. Given

A1/B1 > C1/D1

and

A2/B2 > C2/D2,

it does not follow that

(A1 + A2)/(B1 + B2) > (C1 + C2)/(D1 + D2)

even though it seems as if it should!
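One way to see the mechanism: the combined ratio is just a weighted average of the per-group ratios,

(A1 + A2)/(B1 + B2) = [B1/(B1 + B2)]*(A1/B1) + [B2/(B1 + B2)]*(A2/B2)

and the weights B1, B2 on one side need not match the weights D1, D2 on the other. In the baseball case (using the figures above), Jeter's weak 1995 came in only 48 at-bats, so it barely drags his combined average down, while Justice's weak 1995 came in 411 at-bats, enough to dominate his total.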