Tuesday, October 29, 2013

T'row Da Bumps Out! --Error Bars, Fluctuation, Don't let your eyes deceive you.

When particle physicists wanted to claim that they had found the Higgs particle at the Large Hadron Collider at CERN in Geneva, Switzerland, they had to find a 5-sigma signal before the claim would be credible. Moreover, two different experiments had found the same result at the same mass (the mass being the "first name" of the particle). The particle also needed to have a particular angular momentum (spin) and mirror symmetry (parity), which could only be ascertained by studying particular modes of decay, and that took much more experimental data.

Put differently, just because you see a bump, or just because you see a trend, does not mean it is significant and real. It might just be a fluctuation.
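To get a feel for what "n-sigma" buys you, here is a minimal sketch that converts a number of sigmas into the chance that a pure fluctuation would be at least that large. It assumes Gaussian noise and a one-sided test, which is the usual convention behind the 5-sigma rule; the function name is mine, for illustration only.

```python
import math

def one_sided_p_value(n_sigma):
    """Chance that Gaussian noise alone lands at least n_sigma above the mean."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2))

for n in (1, 2, 3, 5):
    print(f"{n}-sigma: p = {one_sided_p_value(n):.2e}")
```

Under these assumptions a 1-sigma bump shows up by chance about 16 percent of the time, while a 5-sigma bump shows up about once in 3.5 million tries. That gap is why a bump you can see by eye is not yet a result.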

When we make empirically grounded claims about society, in public policy or social science, we'll rarely get 5-sigma quality (too few observations, too little theory, too little precision). But, in general, you want to be assured that the claims make sense. Hence you must always attach error bars to your points or claims, where the bars might be plus or minus 1 sigma. Moreover, if you are claiming a trend or a shape, you need to fit the data to see whether constancy, or else a straight line, is a reasonable zeroth-order assumption. And if you are making a claim about when something began or the like, there are subtle tests for that in the statistical literature.
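The fitting advice above can be sketched as follows: fit the points, with their error bars, first to a constant and then to a straight line, and compare the chi-square of each fit to its degrees of freedom. This is only a sketch using standard weighted least squares; the data are invented for illustration and the errors are assumed independent and Gaussian.

```python
import math

def fit_constant(y, sigma):
    """Weighted mean of y; returns (mean, chi2, degrees_of_freedom)."""
    w = [1.0 / s ** 2 for s in sigma]
    mean = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    chi2 = sum(((yi - mean) / s) ** 2 for yi, s in zip(y, sigma))
    return mean, chi2, len(y) - 1

def fit_line(x, y, sigma):
    """Weighted least-squares line; returns (slope, slope_err, intercept, chi2, dof)."""
    w = [1.0 / s ** 2 for s in sigma]
    S = sum(w)
    Sx = sum(wi * xi for wi, xi in zip(w, x))
    Sy = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    delta = S * Sxx - Sx ** 2
    slope = (S * Sxy - Sx * Sy) / delta
    intercept = (Sxx * Sy - Sx * Sxy) / delta
    slope_err = math.sqrt(S / delta)
    chi2 = sum(((yi - (intercept + slope * xi)) / s) ** 2
               for xi, yi, s in zip(x, y, sigma))
    return slope, slope_err, intercept, chi2, len(y) - 2

# Invented yearly values with 1-sigma error bars (not real data).
years = [0, 1, 2, 3, 4, 5]
values = [0.52, 0.55, 0.51, 0.57, 0.54, 0.58]
errors = [0.03] * len(values)

mean, chi2_c, dof_c = fit_constant(values, errors)
slope, slope_err, intercept, chi2_l, dof_l = fit_line(years, values, errors)
print(f"constant: mean={mean:.3f}, chi2/dof={chi2_c:.2f}/{dof_c}")
print(f"line: slope={slope:.4f} +/- {slope_err:.4f}, chi2/dof={chi2_l:.2f}/{dof_l}")
```

If the constant already gives a chi-square near its degrees of freedom, or the slope is within a sigma or two of zero, the "trend" may be nothing more than fluctuation.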

Moreover, Bayesian ideas should be on your mind. Even if you have rough measures and less-than-ideal statistics, can your measurements be seen in the light of what we take as priors, and used to revise them? Often, in the policy arena, poor data may still allow you to improve practice, albeit not with the assurance you would like. At least you are now doing better than you would with no data at all, only your presumptions and priors.
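As a minimal sketch of revising a prior with a rough measurement, take the simplest textbook case: a Gaussian prior and a Gaussian measurement error, where the update is a precision-weighted average. The numbers and the function name are invented for illustration; real policy data will rarely be this tidy.

```python
def update_normal(prior_mean, prior_sd, measurement, measurement_sd):
    """Combine a Gaussian prior with one Gaussian measurement.

    Returns (posterior_mean, posterior_sd).
    """
    w_prior = 1.0 / prior_sd ** 2
    w_meas = 1.0 / measurement_sd ** 2
    post_mean = (w_prior * prior_mean + w_meas * measurement) / (w_prior + w_meas)
    post_sd = (w_prior + w_meas) ** -0.5
    return post_mean, post_sd

# Invented example: a vague prior belief and a noisy measurement.
post_mean, post_sd = update_normal(prior_mean=0.50, prior_sd=0.10,
                                   measurement=0.60, measurement_sd=0.08)
print(f"posterior: {post_mean:.3f} +/- {post_sd:.3f}")
```

The posterior spread is always narrower than both the prior and the measurement alone, which is the formal version of "poor data still beats no data."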

Also, never draw a line connecting points unless it is a "fit" to the data. Surely in the case of railroads you can link stations with lines, since you know that trains go from A to B to C to...  And even here they may not follow straight lines between stations. However, in studying time-dependent data, your straight lines presume trends when what you may have is random fluctuation.

Finally, if you want to claim changes from one time to another, be sure to normalize those changes by the standard deviations of the data, so that, again, fluctuations are more apparent. And if you plot data whose scale begins at, say, zero, do not show just the range from 0.5 to 0.6; present the axis as 0 to 0.7. If you cannot, put a zig-zag on the y-axis to indicate that you are skipping a large part of it: that is, the y-axis begins at zero, you put in a zig-zag at, say, 0.1, and you resume at 0.4 in the above case.
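Normalizing a change by the standard deviations can be sketched in a few lines. This assumes the two measurements have independent errors, so their variances add; the numbers and the function name are invented for illustration.

```python
import math

def change_in_sigmas(before, sd_before, after, sd_after):
    """Size of the change in units of its own uncertainty (independent errors)."""
    return (after - before) / math.sqrt(sd_before ** 2 + sd_after ** 2)

# Invented example: a rise from 0.5 to 0.6 with error bars of 0.03 and 0.04.
z = change_in_sigmas(0.5, 0.03, 0.6, 0.04)
print(f"change = {z:.1f} sigma")
```

A rise from 0.5 to 0.6 looks dramatic on a truncated axis, but with these error bars it is about a 2-sigma change: suggestive, not settled.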


What motivated the above: I wrote this post a propos of yesterday’s seminar on economic conditions and social capital. I enjoyed the talk, and, unusually for me, I was not so much concerned with what the punchline was. It seemed clear: to provide some evidence about a common belief. Jenny Schuetz asked incisive questions about causation. The speaker responded that he was trying to find out the facts of the situation, and that the connection seemed to be causal given the time frames and some of the trends in the disaggregated data. I woke up this morning thinking some more. None of what I say here diminishes my interest in the seminar, but all of these practices are needed to calm various objections, none of which is necessarily fatal but all of which need to be dealt with. A seminar may not be the place to make sure all is perfect, but there is no reason to leave out obvious practices even if it is “just” a talk. You want people to concentrate on your substance, not go crazy over your statistics.