Lying With Statistics
How to Lie with Statistics
Just for interest, we looked at some ways to deceive using statistics. The little book "How to Lie With Statistics" was published in 1954. You can still get it today! It may be 50 years old, but the funny business that Darrell Huff described in the 1950s is still going on today. The book is just as useful now as it was in 1954. Everyone ought to read it. It's listed in the auxiliary reading page. A number of class members write their first book reviews about it.
You are used to seeing graphs illustrating numbers. Done right, they are VERY helpful in interpreting data. Done in a fishy fashion, they can deceive. The first basic trick is to have the Y axis (vertical) start at some value above zero. This allows you to magnify (vertically) the display. Very small variations (with time, for example) can be made to look very large. Technically the plot is legitimate as long as the axes are correctly labeled, but a reader who does not catch the labeling will still be misled by the visual appearance! There's even one example in the book that has no labeling on the Y axis at all! Your guess is as good as anyone else's what it means.
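To put a number on the trick, here is a minimal Python sketch. The values 100 and 102 and the axis start of 99 are made up for illustration, and visual_ratio is a hypothetical helper name, not something from Huff's book. It compares how tall two bars look relative to each other when the axis starts at zero and when it starts just below the data.

```python
def visual_ratio(low, high, axis_start=0.0):
    """Ratio of the drawn bar heights when the Y axis begins at axis_start."""
    if axis_start >= low:
        raise ValueError("axis_start must be below the smaller value")
    return (high - axis_start) / (low - axis_start)


# Hypothetical data: a 2% change from 100 to 102.
honest = visual_ratio(100.0, 102.0)                      # axis starts at 0
truncated = visual_ratio(100.0, 102.0, axis_start=99.0)  # axis starts at 99

print(f"axis from 0:  taller bar looks {honest:.2f}x the shorter one")     # 1.02x
print(f"axis from 99: taller bar looks {truncated:.2f}x the shorter one")  # 3.00x
```

Same data, same labels, but the second picture makes a 2% change look like a tripling.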
There's another neat trick - displaying a one-dimensional value in two dimensions. Huff uses the example of moneybags, one twice as tall as the other. The picture is ostensibly representing one person having twice as much moola as another. The deception is visual; although the larger bag is correctly twice as tall as the smaller one, the AREA you see in the picture is 4 times larger! And if you think about the volume it is 8 times larger! Your eye will pick up the area or volume and deliver a misleading impression.
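The moneybag arithmetic can be checked in a few lines of Python. This is only a sketch; apparent_ratios is a hypothetical name chosen here, and it assumes the picture is scaled by the same factor in every direction.

```python
def apparent_ratios(height_ratio):
    """How much bigger a picture looks when every dimension is scaled equally."""
    return {
        "height": height_ratio,       # what the artist intended to show
        "area": height_ratio ** 2,    # what the eye sees on the page
        "volume": height_ratio ** 3,  # what the eye imagines for a 3-D bag
    }


ratios = apparent_ratios(2)
print(ratios)  # {'height': 2, 'area': 4, 'volume': 8}
```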
As part of your life education, read this neat little book. Keep it in your library.
A semiattached number is a number which looks really interesting but actually is not as relevant as it appears. It looks good but means very little (or nothing).
Consider the claim that some product is "25% better." You ask "25% better than what?" Aaahh - you don't know that; the ad doesn't tell you. Some digging will be required to sort it out.
Now consider profit percentage. Would you rather run a business that turns a profit of 20% or one that returns 1%? You would, of course, choose the 20% return - unless you looked very closely at the claims. Suppose that the 20% return is an annual ROI (Return on Investment) and the 1% is a Return on Sales. They are not the same.
Suppose you are an enterprising kid selling newspapers. You buy the newspapers for 98 cents each, pay your selling costs and keep 1 cent per paper. That's the 1% return on sales. Sounds lousy? Imagine that you start a year with $98; that buys you 100 newspapers, which you sell for $1 each. With 100 newspapers you make 1 cent times 100, or $1 for the day. Doesn't sound good till you realize that you can do this 365 days per year. You make $365 per year. That's a total annual return of 372%!!! This is a result of inventory turnover.
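Here is the newspaper arithmetic as a short Python sketch. The numbers come straight from the example above; the variable names are just for illustration, and it assumes the kid sells out every day and keeps reinvesting the same $98. Amounts are in cents to avoid rounding trouble.

```python
capital_cents = 9800          # the $98 you start the year with
cost_per_paper_cents = 98     # what you pay for each newspaper
price_per_paper_cents = 100   # what you sell each newspaper for
profit_per_paper_cents = 1    # what is left after selling costs
selling_days = 365

papers_per_day = capital_cents // cost_per_paper_cents        # 100 papers
daily_profit_cents = papers_per_day * profit_per_paper_cents  # 100 cents = $1

return_on_sales = profit_per_paper_cents / price_per_paper_cents
annual_profit_cents = daily_profit_cents * selling_days
annual_roi = annual_profit_cents / capital_cents

print(f"return on sales: {return_on_sales:.0%}")             # 1%
print(f"annual profit:   ${annual_profit_cents / 100:.2f}")  # $365.00
print(f"annual ROI:      {annual_roi:.0%}")                  # 372%
```

The same $98 is turned over 365 times, which is why a 1% margin on each sale becomes a 372% return on the money invested.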
Digression About Elections
Recall the difficulties with the election in 2000. This is a different problem. It is acknowledged that all vote-counting systems will lose some votes. Not many, but some. This means that any vote count has a small amount of uncertainty in it. This is not normally a problem, as the uncertainty amounts to a fraction of a percent. If the margin for some race is 55% to 45%, a fraction of a percent is of no consequence. What happened in Florida was that the margin between Bush and Gore was smaller than the uncertainty! The only thing to do was take the numbers from the official count and go with them. Three recounts could produce three different results. This was, fortunately, a rare occurrence. The only way to deal with it is to use vote counting systems that lose or confuse the smallest number of votes.
Proving That a Coin is Biased
We will do something really interesting - prove that a coin is biased! It's easy to do. Everyone in the class gets out a coin, a pen and paper. Everyone then flips their coin ten times and counts the heads and tails. We then count who got what combinations.
The results were interesting; we found several biased coins. Look at the results of the flipping.
H     T     Count
0     10    0
1     9     0
2     8     0
3     7     0
4     6     0
5     5     0
6     4     0
7     3     0
8     2     0
9     1     0
10    0     0
One person got 8 tails and 2 got 8 heads! Biased coins for sure! Or are they??
What have we done? We've ignored the mass of data which indicates that coin flipping produces a binomial distribution; notice that the mode is 6 heads and 4 tails (14 occurrences). That's in the range of chance. The 8/2, 2/8 and 10/0 results are in the tails of the distribution - infrequent, but EXACTLY what you expect. By cherry-picking one or two samples from the larger mass of data we can "show" that a coin is biased.
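You can work out how often a FAIR coin should give these "biased" results. The sketch below uses the binomial formula for 10 flips. The class size of 30 is an assumption made here for illustration (the notes do not give the actual headcount), and prob_heads is a hypothetical helper name.

```python
from math import comb


def prob_heads(k, n=10, p=0.5):
    """Binomial probability of exactly k heads in n flips of a coin with P(head) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of each possible outcome for one student with a fair coin.
for k in range(11):
    print(f"{k:2d} heads, {10 - k:2d} tails: {prob_heads(k):.4f}")

# Chance that one student sees a lopsided result: 8 or more of either face.
p_extreme = sum(prob_heads(k) for k in (0, 1, 2, 8, 9, 10))

class_size = 30  # ASSUMPTION: illustrative only, not the real class size
expected_extreme = class_size * p_extreme

print(f"P(8 or more of one face) = {p_extreme:.3f}")
print(f"expected such students in a class of {class_size}: {expected_extreme:.1f}")
```

Under these assumptions roughly one student in nine gets 8 or more of one face, so a classroom full of fair coins should turn up a few "biased" ones every time.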
When talking about a group of values you hear of the "average" value. You might assume one thing but, on occasion, someone with an ulterior motive will use something different. The natural assumption is the arithmetic mean, which is the sum of the numbers divided by the count. There is another "average" that is sometimes used, namely the median. If you take all the values and arrange them in ascending order, then take the middle one, you have the median. It means that half of the values were lower and half were higher.
Example: Suppose you want to buy a home in an area where the average income is high. You might make some good connections among your neighbors. You find a nice development and Rhonda RealEstate tells you that the average income is over $1,000,000 annually. That sounds really good, so you buy.
Time passes. You notice that, although the average income is supposed to be over $1,000,000, your neighbors don't quite fall into that category. In fact, their incomes, while good, are nothing extraordinary. They're up to $100,000 or so, but no really high ones (that you can find, anyway). Did the real estate agent deceive you? Yes, in a way. It turns out that there is a tiny little cul-de-sac where 4 or 5 really highly paid executives live, over in the high-rent corner of the development. They pull down $2,000,000 or more. The rest of the residents are in the $60,000 to $100,000 range. So what happened? The few VERY LARGE incomes pulled the average WAY UP. You'd get a better picture of the income distribution by using the median. You might find that the median is $75,000 or so. A few large values will skew the distribution and drag the arithmetic average up but will not have much effect on the median.
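A quick Python sketch shows how far apart the two "averages" can be. The income list is invented for illustration: 15 households in the $60,000 to $100,000 range plus 5 executives at $4,000,000 each (consistent with "$2,000,000 or more", but not figures from any real development).

```python
from statistics import mean, median

# ASSUMPTION: made-up incomes for a hypothetical 20-household development.
ordinary = [60_000, 62_000, 65_000, 68_000, 70_000, 72_000, 74_000, 75_000,
            76_000, 78_000, 80_000, 85_000, 90_000, 95_000, 100_000]
executives = [4_000_000] * 5

incomes = ordinary + executives

mean_income = mean(incomes)      # pulled way up by the five big incomes
median_income = median(incomes)  # the middle of the sorted list

print(f"mean:   ${mean_income:,.0f}")    # $1,057,500
print(f"median: ${median_income:,.0f}")  # $79,000
```

Rhonda's "average income over $1,000,000" is true of the mean, yet three quarters of these households earn $100,000 or less.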
There's one other parameter of a distribution worth mentioning - the mode. The mode is the peak of the distribution, or the value that occurs most often. If the numbers follow a symmetric, single-peaked distribution (the classic bell curve), the average, median and mode will coincide. If the distribution is skewed, the three will separate.
Post Hoc Thinking
Darrell Huff, in his neat little book, notes the results of a survey of Cornell graduates (in the 1950s). The survey showed that 93% of the middle-aged male graduates were married but only 65% of the women were. One popular magazine writer quickly concluded that going to college seriously reduced a woman's chances of marriage. Or did it??
The correlation is real - the women did indeed marry at a lower rate. But - implying a causation is risky. Remember - correlation does not necessarily mean causation. Consider the following alternative explanation: the young women who go to Cornell are those who are more likely to delay marriage in favor of a career. A career-oriented woman would be more likely to attend a university and then head into a career than a marriage-oriented woman. If so, the observed correlation is the result of a single underlying factor that is producing BOTH results.
Suppose you tossed a coin and got 8 consecutive tails. What would be the probability of getting a tail the next time? Answer: 0.5. Just as in all the other tosses. The fact that you have gotten 8 tails in a row does NOT mean that you are "due" for a head. Before you start tossing the coin, however, the probability of getting 8 tails in a row is VERY small. It won't happen very often. Once it HAS happened, though, the probability of a ninth tail is still 0.5.
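If you don't believe it, simulate it. The sketch below is one way to check, assuming a fair coin modeled with Python's pseudo-random generator; tails_after_streak is a hypothetical helper name. It flips a million times and looks only at flips that come right after 8 or more tails in a row.

```python
import random


def tails_after_streak(n_flips=1_000_000, streak=8, seed=0):
    """Fraction of tails among flips that immediately follow `streak` or more tails."""
    rng = random.Random(seed)
    run = 0          # length of the current run of consecutive tails
    followed = 0     # flips that came right after a long enough run
    tails_next = 0   # how many of those flips were tails
    for _ in range(n_flips):
        is_tail = rng.random() < 0.5
        if run >= streak:
            followed += 1
            tails_next += is_tail
        run = run + 1 if is_tail else 0
    return tails_next / followed


p_streak = 0.5 ** 8  # chance of 8 tails in a row, computed BEFORE any flips
print(f"P(8 tails in a row) = {p_streak}")  # 0.00390625
print(f"fraction of tails right after such a streak: {tails_after_streak():.3f}")
```

The streak itself is rare (about 1 chance in 256), but the flip that follows it comes up tails about half the time, just like any other flip.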
Same thing goes for baseball. Baseball?? Have you ever heard the announcer say that the batter has struck out 10 times in a row and is "due" for a hit? Same fallacy. Each at-bat is an independent trial; previous attempts have no influence on the current one. The probability of a hit is just the same as before. If this batter is striking out a lot, maybe the batting coach had better get busy!
Extrapolation is an attempt to predict some phenomenon that lies outside the basis of experience. We looked at a table of record times for running the mile. Since 1913 there has been a steady downward trend. Prof. Scalise has plotted the times against the year, and from that we can see a roughly linear function. Now for the fun. We extrapolate and extend this linear function into the future and see that, in about 2500, someone will run the mile in zero seconds! Obviously, the extrapolation is not valid.
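Here is a rough Python sketch of that kind of extrapolation. The class table is not reproduced here, so the sketch draws a straight line through just two well-known records added for illustration: 4:14.4 in 1913 and 3:43.13 in 1999. That is a two-point line, not a fit to the full table, so the year it produces will not match the figure from the class plot; zero_crossing_year is a hypothetical helper name.

```python
def zero_crossing_year(year_a, secs_a, year_b, secs_b):
    """Year at which the straight line through two (year, seconds) points hits zero."""
    slope = (secs_b - secs_a) / (year_b - year_a)  # seconds per year (negative)
    return year_b - secs_b / slope


# ASSUMPTION: two endpoints chosen for illustration, not the class data set.
record_1913 = 4 * 60 + 14.4    # 4:14.4 in seconds
record_1999 = 3 * 60 + 43.13   # 3:43.13 in seconds

slope = (record_1999 - record_1913) / (1999 - 1913)
year_zero = zero_crossing_year(1913, record_1913, 1999, record_1999)

print(f"trend: {slope:.3f} seconds per year")
print(f"zero-second mile predicted around the year {year_zero:.0f}")
```

Whichever records you pick, the line eventually crosses zero, and then goes negative. The arithmetic is fine; the assumption that the trend continues forever is not.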
An extrapolation figured into the analysis of the foam strike that resulted in the destruction of the shuttle Columbia in 2003. Data about the piece of foam that was observed to strike Columbia were fed into the "Crater" model that NASA engineers used to evaluate the effect of foam strikes. Given the size of the piece and the impact velocity, the model would return a damage value. When parameters for the observed foam strike were fed to Crater, it indicated that, while the damage would be significant, it was not a real hazard. This was an extrapolation, as the piece that hit Columbia was 400 times larger than any piece ever seen. Operating outside of the experience base, the model returned an incorrect estimate.
The Geometric Mean
Although Huff didn't mention this one, we added it for completeness. A good example of the use of a geometric mean is in figuring average return on investment. Suppose that, over a period of years, an investment returned 5%, 8%, 12%, 8%, -3%, 1% and 4%. What was the average annual return??
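Here, you must MULTIPLY the growth factors (1 plus each return), not add the returns.
1.05 * 1.08 * 1.12 * 1.08 * 0.97 * 1.01 * 1.04
and then take the 7th root of the resulting product and subtract 1. The product is about 1.398 and its 7th root is about 1.049, so the average annual return is about 4.9%, a little less than the 5% you get from the arithmetic mean.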
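The calculation is easy to check in Python. This is a minimal sketch using the seven returns from the example; the variable names are chosen here for illustration, and math.prod needs Python 3.8 or later.

```python
from math import prod

# Yearly returns from the example, as fractions (5% -> 0.05).
returns = [0.05, 0.08, 0.12, 0.08, -0.03, 0.01, 0.04]

# Multiply the growth factors (1 + return), then take the n-th root.
total_growth = prod(1 + r for r in returns)
geometric_mean = total_growth ** (1 / len(returns)) - 1

# For comparison: the plain arithmetic mean of the returns.
arithmetic_mean = sum(returns) / len(returns)

print(f"total growth over 7 years: {total_growth:.4f}")     # 1.3976
print(f"geometric mean return:     {geometric_mean:.2%}")   # 4.90%
print(f"arithmetic mean return:    {arithmetic_mean:.2%}")  # 5.00%
```

The gap is small here because the returns are all fairly close together; the more the yearly returns bounce around, the more the arithmetic mean overstates what the investment actually earned.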
Buy This Book
Everyone should read "How To Lie With Statistics" by Darrell Huff. The little insights it gives you may help you avoid being deceived.
Links to Related Stuff
- U. Conn. notes