Sunday, 15 July 2012

Bunhill Fields

Carlisle Rainey and David Spiegelhalter, among others, have been chiding the press for sloppy reporting on the search for the Higgs Boson.  The context is the recent experiments using the Large Hadron Collider which seem to have identified a particle consistent with the expected properties of the Higgs Boson, with only a very small probability - one in 3.5 million - that the observed results would be seen if we were just observing random noise.  Rainey and Spiegelhalter's point is that this small probability is not at all the same as the probability that we have not discovered some new particle, or the Higgs Boson in particular. They cite sundry reports which fail to appreciate the difference.

Whereas I am not much concerned about the general grasp of the Standard Model of particle physics which predicts the existence of the Higgs Boson, I think it quite important that as many people as possible should understand the scientific notion of hypothesis testing.  So I'm going to make my own attempt to convey the message.

Suppose we have a hypothesis A which we wish to test.  We devise an experiment which will always give result B if A is true, but will sometimes give result B even if A is not true.  We can calculate the probability of getting B when A is not true; call it p.  We conduct the experiment and get result B.  What now is the probability that A is true?

The answer is that there's not enough information.  If we had been almost sure that A were true before we did the experiment, we'd be even more sure now.  Whereas if we'd thought it fantastically unlikely that A were true, we'd think it only somewhat less unlikely now.

The extra information we need is an estimate, before doing the experiment, of the probability that A is true.  Call that probability q.  Now we have the information we need.  Once we get result B, we are interested in the relative likelihood of getting the result because A is true - probability q - or because A is not true but the random numbers lined up that way - probability (1-q)p.  Our new estimate of the probability that A is true, in the light of the experimental result, is therefore q / (q + (1-q)p).  Note that if we were already certain that A was true - q = 1 - then our new estimate is still 1.  And if we were already certain that A was false - q = 0 - then our new estimate is still 0.
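As a sanity check, the update rule is a one-liner (a minimal sketch; the function name is mine):

```python
def posterior(q, p):
    """New probability that A is true, after observing result B.

    q: prior probability that A is true.
    p: probability of seeing result B when A is false.
    """
    return q / (q + (1 - q) * p)

# Certainty is immovable: priors of 1 and 0 stay at 1 and 0.
print(posterior(1.0, 0.05))  # 1.0
print(posterior(0.0, 0.05))  # 0.0
```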

The important thing to grasp is that the probability, in isolation, of getting a particular result from random data is not the same as the probability, in a given experiment that gave that result, that the data were random.  The latter probability depends on what competing, non-random explanations are available and how likely they are to be true.

Let's take a concrete example.  We take a penny coin at random from the change we're given in a high-street shop.  We wish to test the hypothesis that it has heads on both sides.  So we toss it ten times and observe whether we get ten heads.  (Yes, I know, it would be easier just to look at both sides, but bear with me.)  Suppose that we'd estimated the probability of having a double-headed penny as one in ten million.  The probability of tossing ten heads and no tails with a standard coin is one in 1024, which is about ten thousand times as probable as the double-headed penny, so our new estimate of the probability of having a double-headed penny is about one in ten thousand - the formula gives one in 9767 to the nearest whole number.  Although the likelihood in isolation of getting ten heads was small, we are forced by the result to believe something unlikely has happened, and we prefer the very unlikely explanation - ten heads out of ten by chance - to the extremely unlikely explanation - a double-headed penny.
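The penny arithmetic is easy to check numerically (a quick sketch of the calculation just described):

```python
q = 1e-7          # prior: one in ten million pennies is double-headed
p = 0.5 ** 10     # ten heads from a fair coin: 1/1024
post = q / (q + (1 - q) * p)
print(round(1 / post))  # 9767, i.e. about one in ten thousand
```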

Let's take another example: a proposed drug.  We employ a team of expert scientists to research into a particular biochemical pathway and to devise a drug to interfere with it in some desirable way.  We find that the drug works perfectly in vitro.  Then we test it in humans, using a carefully devised double-blind trial, and find that it outperforms a placebo to an extent we would expect to equal or exceed with another placebo one time in twenty (i.e. a p-value of 0.05).  If our prior estimate of the probability of the drug's working was 40%, our new estimate will be 93%.

Alternatively, we might pick a naturally occurring substance by guesswork, dilute it out of existence with water, drip the water onto a sugar tablet, and allow the tablet to dry.  We observe no in vitro activity with this tablet beyond the effect of the sugar.  But suppose that when we conduct a similar double-blind trial we get a positive result meeting the same p-value of 0.05.  If our prior estimate of the probability of the drug's working was one in a billion, our new estimate is one in 50 million.  Homeopathists might see this stark difference in interpretation of the data as unfair: a scientist would merely observe that the experiment wasn't nearly powerful enough to give useful support to such a fantastically unlikely hypothesis.  (In practice, drug investigators never publish their own estimates of prior probabilities: instead they give qualitative reasons why they thought the drug worth testing; the reader can form a view of their own.)
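Both trial calculations run through the same update rule (a sketch; the helper function name is mine):

```python
def posterior(q, p):
    """Updated probability that the drug works, given a positive trial."""
    return q / (q + (1 - q) * p)

# Designed drug: plausible prior of 40%, p-value 0.05.
print(posterior(0.40, 0.05))          # about 0.93

# Homeopathic tablet: same p-value, prior of one in a billion.
post = posterior(1e-9, 0.05)
print(1 / post)                       # about 5e7, i.e. one in fifty million
```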

Back to the Higgs Boson.  Physicists generally expected the Higgs Boson to be there in the energy range they were looking at, say with probability 75%.   If we ignore the possibility that they may have discovered a different particle in that range, and accept the announced value of the probability of getting the result from observing noise, then the probability now that the observed results are not due to the Higgs Boson is a bit less than one in 10 million.  We'd get a different estimate if we started from something other than 75%: the point is that only if we started from about 50% would we be able to say that the probability that the Higgs Boson has not been found is now one in three and a half million.
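The Higgs arithmetic works the same way (the 75% prior is the illustrative figure used above, not a measured quantity):

```python
q = 0.75         # illustrative prior that the Higgs Boson is in the range
p = 1 / 3.5e6    # announced chance of getting the signal from noise alone
p_not_higgs = (1 - q) * p / (q + (1 - q) * p)
print(1 / p_not_higgs)   # about 10.5 million: a bit less than 1 in 10 million
```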

What I've been writing about is a simple application of Bayes' Theorem.  It's less well known than it might be that Thomas Bayes himself is buried at Bunhill Fields in the City of London (as are Daniel Defoe, William Blake, and John Bunyan, among others).  If you're a quant in the City and you're finding it difficult to think clearly about some question of probability, I recommend a walk there.  Not out of any mystical faith in the powers of the bones of long-dead nonconformists, but because it's there and the walk will do you good.


  1. Please explain this to me as a non-statistician... is this only true for Bayesian statistics (which I know about, but have never used) or also for "normal" statistics?

    E.g. If I have a population of wild animals in two different habitats, I want to find out if the observed "preference" for one habitat rather than the other is likely (probable) to be by chance alone (my null hypothesis) or if they actually prefer that habitat. I use my observations and a permutation test to randomly distribute the observations between the two habitats 10 000 times. I see how many times the difference in density numbers between the simulated distributions is larger than or equal to the observed difference in densities. If it happens less than 5% of the time (p < 0.05), I can safely say that the chance (probability) that the observed difference in density happened by chance alone (randomly) is less than 0.05 and that it is therefore likely that the animals actually have a preference for the habitat in which they occurred in greater density. I do not get what is wrong with this logic?
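For concreteness, the permutation test described in this comment might look like this (a sketch with invented counts; the variable names are mine):

```python
import random

random.seed(1)
habitat_a = [8, 9, 7, 10, 9, 8]   # animals counted per survey plot (made up)
habitat_b = [5, 6, 4, 7, 5, 6]
observed = abs(sum(habitat_a) / len(habitat_a) - sum(habitat_b) / len(habitat_b))

# Shuffle the habitat labels many times and count how often the shuffled
# difference in mean density is at least as large as the observed one.
pooled = habitat_a + habitat_b
n = len(habitat_a)
extreme = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)
    diff = abs(sum(pooled[:n]) / n - sum(pooled[n:]) / (len(pooled) - n))
    if diff >= observed:
        extreme += 1

p_value = extreme / trials
print(p_value)   # fraction of shuffles at least as extreme as the observed split
```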

  2. This comment has been removed by the author.

  3. This comment has been removed by the author.

  4. I am not Plumbum, but I respond anyway. One of the main differences between classical statistics and Bayesian is that the latter needs a starting point: a prior. The final estimate (called the posterior) is based on both the prior and the data of your experiment.

    A very 'sure' prior (say you are almost 100% sure that things are a certain way) plus not very convincing data (say: a small sample, with results all over the map) would mean that the posterior will closely resemble the prior. In other words: the new evidence has had just a little influence. On the other hand: if your prior is very wide (meaning you only have a vague idea of what reality is like) and your data is very convincing (large sample, low variance) then your data will determine most of the posterior.

    Classical statistics does not use a prior. A null hypothesis is not a prior. Priors are the state of the art: our present knowledge of the world before the new data come. It is a distribution. The null is most of the time a straw man. You know the null cannot be true. Also, the null is not a distribution but a point (say: effect = 0). You know that the effect is not 0, since effects are rarely exactly null. The H1 can be undefined: "anything but the null". So here we have a point hypothesis (effect = 0) competing with all other outcomes. Now guess which hypothesis will win ;-)

    I understand your case as:
    H0: animals have on average no preference for one place or another
    H1: animals do have on average some preference

    You rewrite this to:
    H0: the difference in animal density is 0
    H1: the difference in animal density is not 0

    These types of hypotheses are used a lot but are not very strong. You let a difference of exactly 0 compete with all other differences. So H0 will lose.

    But let's get to the test. You do observations (say 10,000 data points) and you find that on average there is a difference of .... animals.

    From the observations a statistic is calculated (say a t value) and you look at the distribution of the statistic. Very often this is a normal distribution. You found an "extreme" value that is in the tails of this distribution. The integral of this tail (or these tails, in the case of a 2-sided test) is your p-value.

    Now you can say: IF H0 were true, I would find this type of result (or a more extreme one) in only 3% of all cases. That is a very unlikely outcome. Therefore, I assume that H0 is not true, I reject H0. And therefore I accept H1.
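That tail integral can be computed directly; a sketch for a standard normal null, using only the standard library (the z values are illustrative, with 2.17 roughly matching the 3% in the comment):

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic under a standard normal null.

    P(Z >= z) = erfc(z / sqrt(2)) / 2, so erfc alone gives both tails.
    """
    return math.erfc(abs(z) / math.sqrt(2))

print(round(two_sided_p(1.96), 3))   # 0.05
print(round(two_sided_p(2.17), 3))   # 0.03
```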

    You have calculated p = P(data | H0): what is the chance of finding this sort of result if H0 were true?

    This is NOT the same as P(H0| data ) = what is the chance of H0 being true, if I have found these data?

    To make this point clear, let's say an average woman is pregnant 2% of her lifetime. Now I look at my partner. She is pregnant. As a statistician, I think: what is the chance of finding a pregnant person, given that she is female?
    P (pregnant | female) = 0.02
    My partner is pregnant.
    This is a very unlikely state, if she were female.
    Conclusion: my partner is not female.

    My mistake is that I did not look at H1. In classical statistics you don't have to, and that's what's wrong. OK, the chance of a woman being pregnant is small, but the chance of a man being pregnant is 0. So yes, H0 was unlikely, but H1 was impossible. Compared to H1, H0 is much, much, much more likely!

    You see that
    P (pregnant | female) = 0.02
    P (female | pregnant) = 1.00

    So, based on the chance of finding certain data if H0 were true, you can say NOTHING about the chance that H0 is true.
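The pregnancy example in numbers (the population shares are illustrative):

```python
# Bayes' theorem turns P(pregnant | female) into P(female | pregnant).
p_female = 0.5                      # share of the population that is female
p_pregnant_given_female = 0.02
p_pregnant_given_male = 0.0         # H1 is impossible, not just unlikely

p_pregnant = (p_pregnant_given_female * p_female
              + p_pregnant_given_male * (1 - p_female))
p_female_given_pregnant = p_pregnant_given_female * p_female / p_pregnant
print(p_female_given_pregnant)      # 1.0
```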

    1. Woah, I think I understand this after trying to understand it for like three years. Great job 👋👋!