Lying With Statistics

How to Lie with Statistics

Just for interest, we looked at some ways to deceive using statistics. The little book "How to Lie With Statistics" by Darrell Huff was published in 1954, and you can still get it today! It may be 50 years old, but the funny business that Huff described in the 50's is still going on, and the book is just as useful now as it was then. Everyone ought to read it; it's listed in the auxiliary reading page. As part of your life education, read this neat little book and keep it in your library. A number of class members write their first book reviews about it.

Deceptive Presentation

You are used to seeing graphs illustrating numbers. Done right, they are VERY helpful in interpreting data. Done in a fishy fashion, they can deceive. The first basic trick is to have the Y axis (vertical) start at some value above zero. This allows you to magnify (vertically) the display. Very small variations (with time, for example) can be made to look very large. The plot is legitimate as long as the axes are correctly labeled, but a reader who does not catch the labeling will still be misled by the visual appearance! There's even one example in the book that has no labeling on the Y axis at all! Your guess is as good as anyone else's what it means.
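To put a number on the magnification, here is a small Python sketch. The values (a quantity rising from 100 to 102) and the function name `apparent_ratio` are made up for illustration; they are not from Huff's book.

```python
def apparent_ratio(low, high, baseline=0.0):
    """Ratio of the drawn bar heights when the Y axis starts at `baseline`."""
    if baseline >= low:
        raise ValueError("baseline must be below the smaller value")
    return (high - baseline) / (low - baseline)

# Hypothetical values: a quantity that rises from 100 to 102 (a 2% change).
honest = apparent_ratio(100, 102)                  # Y axis starts at zero
stretched = apparent_ratio(100, 102, baseline=99)  # Y axis starts at 99

print(f"axis from 0:  taller bar looks {honest:.2f}x the shorter one")
print(f"axis from 99: taller bar looks {stretched:.2f}x the shorter one")
```

The data have not changed at all; only the starting point of the axis has.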

There's another neat trick - displaying a one-dimensional value in two dimensions. Huff uses the example of moneybags, one twice as tall as the other. The picture is ostensibly representing one person having twice as much moola as another. The deception is visual; although the larger bag is correctly twice as tall as the smaller one, the AREA you see in the picture is 4 times larger! And if you think about the volume it is 8 times larger! Your eye will pick up the area or volume and deliver a misleading impression.
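The moneybag arithmetic is easy to check. This is only a sketch of the scaling rule, assuming the picture is enlarged by the same factor in every direction; `visual_scale` is a name invented here.

```python
def visual_scale(height_ratio):
    """Return (area_ratio, volume_ratio) for a picture scaled uniformly."""
    area_ratio = height_ratio ** 2    # width and height both grow
    volume_ratio = height_ratio ** 3  # depth grows too, if you imagine a solid bag
    return area_ratio, volume_ratio

area, volume = visual_scale(2)
print(f"twice as tall -> {area}x the area, {volume}x the volume")
```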

Semiattached Numbers

A semiattached number is a number which looks really interesting but actually is not as relevant as it appears. It looks good but means very little (or nothing).

Consider the claim that some product is "25% better." You ask "25% better than what?" Aaahh - you don't know that; the ad doesn't tell you. Some digging will be required to sort it out.

Now consider profit percentage. Would you rather run a business that turns a profit of 20% or one that returns 1%? You would, of course, choose the 20% return - unless you looked very closely at the claims. Suppose that the 20% return is an annual ROI (Return on Investment) and the 1% is a Return on Sales. They are not the same.

Suppose you are an enterprising kid selling newspapers. You buy the newspapers for 98 cents each, sell them for $1 each, pay your selling costs and keep 1 cent per paper. That's the 1% return on sales. Sounds lousy? Imagine that you start the year with $98; that buys you 100 newspapers. Selling all 100, you make 1 cent times 100, or $1 for the day. Doesn't sound good till you realize that you can do this 365 days per year, putting the same $98 back to work every day. You make $365 per year on a $98 investment. That's a total annual return of 372%!!! This is a result of inventory turnover.
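The newspaper arithmetic can be laid out in a few lines of Python. The numbers are the ones in the example; the variable names are just for this sketch, and everything is kept in cents to avoid rounding trouble.

```python
cost_per_paper = 98     # cents paid for each paper
price_per_paper = 100   # cents received for each paper
profit_per_paper = 1    # cents kept after selling costs
papers_per_day = 100
days_per_year = 365

capital = cost_per_paper * papers_per_day        # the $98 you start with
daily_sales = price_per_paper * papers_per_day
daily_profit = profit_per_paper * papers_per_day

return_on_sales = daily_profit / daily_sales     # profit per dollar of sales
annual_profit = daily_profit * days_per_year
annual_roi = annual_profit / capital             # profit per dollar invested

print(f"return on sales: {return_on_sales:.0%}")
print(f"annual return on investment: {annual_roi:.0%}")
```

Same business, same profit, two very different percentages, depending on what you divide by.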

Digression About Elections

Recall the difficulties with the election in 2000. This is a different problem. It is acknowledged that all vote-counting systems will lose some votes. Not many, but some. This means that any vote count has a small amount of uncertainty in it. This is not normally a problem, as the uncertainty amounts to a fraction of a percent. If the margin for some race is 55% to 45%, a fraction of a percent is of no consequence. What happened in Florida was that the margin between Bush and Gore was smaller than the uncertainty! The only thing to do was take the numbers from the official count and go with them. Three recounts could produce three different results. This was, fortunately, a rare occurrence. The only way to deal with it is to use vote counting systems that lose or confuse the smallest number of votes.

Distributions: Mean, Median, Mode

When talking about a group of values you hear of the "average" value. You might assume one thing but, on occasion, someone with an ulterior motive will use something different. The natural assumption is the arithmetic mean, which is the sum of the numbers divided by the count. There is another "average" that is sometimes used, namely the median. If you take all the values and arrange them in ascending order, then take the middle one, you have the median. It means that half of the values were lower and half were higher.

Example: Suppose you want to buy a home in an area where the average income is high (you might make some good connections among your neighbors). Your agent shows you a development in which the average income is over $1,000,000 annually. That sounds really good, so you buy. However, over time, you notice that few of your neighbors actually seem very wealthy. In fact, you can't seem to find anyone with an income over $100,000. Did the real estate agent deceive you?

Yes, in a way. It turns out that there is a tiny little cul-de-sac where 4 or 5 really highly paid executives live. They pull down $10,000,000 or more. The rest of the residents are in the $60,000 to $100,000 range. The few VERY LARGE incomes pulled the average WAY UP. You'd get a better picture of the income distribution by using the median, which might be $75,000 or so. A few large values skewing the distribution will drag the arithmetic average up but will not have much effect on the median.
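Here is a sketch of such a neighborhood in Python. The household counts and incomes are invented to match the story (5 executives at $10,000,000 and 40 ordinary households between $60,000 and $99,000); only the standard `statistics` module is used.

```python
import statistics

# Hypothetical development: 5 executives in the cul-de-sac ...
executives = [10_000_000] * 5
# ... and 40 households earning $60,000, $61,000, ..., $99,000.
everyone_else = [60_000 + 1_000 * i for i in range(40)]

incomes = executives + everyone_else

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(f"mean income:   ${mean_income:,.0f}")
print(f"median income: ${median_income:,.0f}")
```

The agent's "average over $1,000,000" is true of the mean, yet the median household is nowhere near it.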

This dynamic plays out globally, as well. For instance, the "average" (mean) household net worth in America is around $500,000 -- one of the highest levels of average wealth in the world. However, because America suffers from such high levels of wealth inequality, our *median* household net worth is only around $50,000 -- one of the lowest levels in the developed world! In fact, the ratio of mean to median net worth is one measure of wealth inequality, and by this metric the United States is one of the most unequal countries in the world.

There's one other parameter of a distribution worth mentioning - the mode. The mode is the peak of the distribution, or the value that occurs most often. If the numbers follow a symmetric distribution with a single peak (the classic bell curve is the best-known example), the average, median and mode will coincide. If the distribution is skewed, as in the income example, the three will separate.

Post Hoc Thinking

Darrell Huff, in his neat little book, notes the results of a survey of Cornell graduates (in the 1950's). The survey showed that 93% of the middle-aged male graduates were married but only 65% of the women were. One popular magazine writer quickly concluded that going to college seriously reduced a woman's chances of marriage. Or did it??

The correlation is real - the women did indeed marry at a lower rate. But inferring causation is risky. Remember: correlation does not necessarily mean causation. Consider the following alternative explanation: the young women who go to Cornell are those who are more likely to delay marriage in favor of a career. A career-oriented woman would be more likely to attend a university and then head into a career than a marriage-oriented woman. The observed correlation may be the result of a single factor (career orientation) that is producing BOTH results.

Extrapolation

Extrapolation is an attempt to predict some phenomenon that lies outside the basis of experience. We looked at a table of record times for running the mile. Since 1913 there has been a steady downward trend. Prof. Scalise has plotted the times against the year, and from that we can see a roughly linear function. Now for the fun. We extend this linear function into the future and see that, a few centuries from now, someone will run the mile in zero seconds! Obviously, the extrapolation is not valid.
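A rough Python sketch of the same exercise follows. It does not use the class table; as a stand-in it assumes just three record marks (4:14.4 in 1913, 3:59.4 in 1954, 3:43.13 in 1999), so the year it prints depends entirely on that choice of points. The function `fit_line` is an ordinary least-squares fit written out by hand.

```python
# Assumed stand-in data: (year, mile record in seconds).
records = [(1913, 254.4), (1954, 239.4), (1999, 223.13)]

def fit_line(points):
    """Least-squares straight line through (x, y) points: returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(records)

# Year at which the fitted line reaches a mile time of zero seconds.
zero_year = -intercept / slope

print(f"trend: {slope:.3f} seconds per year")
print(f"the line hits zero seconds around the year {zero_year:.0f}")
```

The line fits the past nicely and still gives nonsense once it is pushed far enough beyond the data.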

An extrapolation figured into the analysis of the foam strike that resulted in the destruction of the shuttle Columbia in 2003. Data about the piece of foam that was observed to strike Columbia were fed into the "Crater" model that NASA engineers used to evaluate the effect of foam strikes. Given the size of the piece and the impact velocity, the model would return a damage value. When parameters for the observed foam strike were fed to Crater, it indicated that, while the damage would be significant, it was not a real hazard. This was an extrapolation, as the piece that hit Columbia was 400 times larger than any piece in the test data behind the model. Operating outside of the experience base, the model returned an incorrect estimate.

Simpson's Paradox

Sometimes two or more studies can individually support one conclusion, but the combined statistics support the opposite conclusion.
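A small Python sketch makes the reversal concrete. The counts below are illustrative numbers chosen for this sketch: two studies each compare treatments A and B, recording (successes, patients).

```python
# Illustrative counts: (successes, patients) for each treatment in each study.
data = {
    "A": {"study 1": (81, 87), "study 2": (192, 263)},
    "B": {"study 1": (234, 270), "study 2": (55, 80)},
}

def rate(successes, patients):
    return successes / patients

def pooled_rate(studies):
    """Success rate after lumping all studies for one treatment together."""
    total_successes = sum(s for s, _ in studies.values())
    total_patients = sum(n for _, n in studies.values())
    return total_successes / total_patients

for study in ("study 1", "study 2"):
    a = rate(*data["A"][study])
    b = rate(*data["B"][study])
    print(f"{study}: A = {a:.1%}, B = {b:.1%}")

print(f"combined: A = {pooled_rate(data['A']):.1%}, B = {pooled_rate(data['B']):.1%}")

a_wins_each_study = all(
    rate(*data["A"][s]) > rate(*data["B"][s]) for s in ("study 1", "study 2")
)
b_wins_combined = pooled_rate(data["B"]) > pooled_rate(data["A"])
```

A does better in each study, yet B does better in the combined totals. The trick is in the group sizes: most of A's patients sit in the study where success rates were lower for everyone.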

Non-transitive Paradox

If A is better than B and B is better than C, then how is A related to C? Surprisingly, A is not always better than C! Remember the kid's game Rock-Paper-Scissors: Rock breaks Scissors, Scissors cut Paper, but Paper covers Rock.
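The Rock-Paper-Scissors case can be checked mechanically. This is a minimal sketch; the set `beats` and the helper `is_transitive` are names made up here. A relation is transitive if, whenever A beats B and B beats C, A also beats C.

```python
# Each pair (winner, loser) says the first item beats the second.
beats = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}

def is_transitive(relation):
    """True if (a, b) and (b, c) in the relation always imply (a, c)."""
    return all(
        (a, d) in relation
        for a, b in relation
        for c, d in relation
        if b == c
    )

# An ordinary "greater than" ordering for comparison.
bigger_than = {(3, 2), (2, 1), (3, 1)}

print("rock-paper-scissors transitive?", is_transitive(beats))
print("3 > 2 > 1 transitive?", is_transitive(bigger_than))
```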

The Geometric Mean

Although Huff didn't mention this one, we added it for completeness. A good example of the use of a geometric mean is in figuring average return on investment. Suppose that, over a period of years, an investment returned 5%, 8%, 12%, 8%, -3%, 1% and 4%. What was the average annual return?

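Here, you must MULTIPLY the growth factors (1 plus each return), not add the returns:

1.05 * 1.08 * 1.12 * 1.08 * 0.97 * 1.01 * 1.04

and then take the 7th root of the resulting product. The product is about 1.398 and its 7th root is about 1.049, so the average annual return was about 4.9%, slightly less than the 5% you would get by simply averaging the seven numbers.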
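The same calculation in Python, as a sketch using only the standard `math` module (`math.prod` needs Python 3.8 or later). The returns are the seven from the example; the variable names are just for illustration.

```python
import math

returns = [0.05, 0.08, 0.12, 0.08, -0.03, 0.01, 0.04]

# Multiply the growth factors (1 + return) for all seven years.
growth = math.prod(1 + r for r in returns)

# The 7th root gives the average growth factor; subtract 1 to get a rate.
geometric_mean_return = growth ** (1 / len(returns)) - 1

# For comparison, the plain arithmetic average of the returns.
arithmetic_mean_return = sum(returns) / len(returns)

print(f"total growth factor:    {growth:.4f}")
print(f"geometric mean return:  {geometric_mean_return:.2%}")
print(f"arithmetic mean return: {arithmetic_mean_return:.2%}")
```

The arithmetic average overstates what the investment actually earned per year; the gap grows as the returns become more variable.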

Buy This Book

Everyone should read "How To Lie With Statistics" by Darrell Huff. The little insights it gives you may help you avoid being deceived.

Links to Related Stuff

How to Lie With Statistics full text.
