Tuesday, 21 June 2011

Bayesian vs Frequentist. A pragmatic bioinformatician's view

Ian Holmes, with whom I did my graduate studies at the Sanger (two biologists/coders in a room with happy house blaring out in the late 90s. Makes me feel old just thinking about it), was chastising me because my last post about statistical methods didn't have anything about Bayesian statistics.

Which, let's face it, was an oversight, because Bayesian statistics totally rocks in some scenarios in bioinformatics. At one level Bayesian statistics is straightforward (just write down as a probability model what you believe, sum over any uncertainty, et voilà, your answer), but at another level it is profound; almost a way of life for some statisticians.

If you haven't met Bayesian statistics, it is really worth learning about (I highly recommend the "Biological Sequence Analysis" book by Durbin, Eddy, Krogh and Mitchison, which introduces probabilistic modelling in sequence bioinformatics, if you are a bioinformatician used to sequences). The rather profound viewpoint of Bayesian statistics is that your confidence that something is true is related to the probability of observing the data given a model. By either having alternative models, or by having different parameters inside one model, one can choose which model "is most likely" (ie, fits the observed data the best).
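To make the model-comparison view concrete, here is a minimal sketch in Python. The coin, its two candidate head probabilities, and the counts are all invented for illustration - the point is just Bayes' rule picking between two explicit models:

```python
from math import comb

# Hypothetical example: two competing models for a coin.
# Model A: fair coin (p = 0.5); Model B: biased coin (p = 0.8).
# Invented data: 8 heads out of 10 flips.
heads, flips = 8, 10

def likelihood(p, k, n):
    """Probability of observing k heads in n flips given head probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

lik_fair = likelihood(0.5, heads, flips)
lik_biased = likelihood(0.8, heads, flips)

# Equal prior belief in each model; posterior by Bayes' rule.
prior = 0.5
evidence = prior * lik_fair + prior * lik_biased
post_fair = prior * lik_fair / evidence
post_biased = prior * lik_biased / evidence

print(f"P(fair | data)   = {post_fair:.3f}")
print(f"P(biased | data) = {post_biased:.3f}")
```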

In contrast, the other view of statistics - so called "Frequentist" statistics - sets up the problem as a multiple trial problem with a null hypothesis. The null hypothesis is usually some interpretation of "random chance". By examining the probability of a particular data point under the null hypothesis one either accepts it (ie, it was reasonably likely to occur by chance) or rejects it; if you reject the null hypothesis there is normally an alternative which represents "means are different" or "correlation exists". The Frequentist viewpoint is to focus on modelling the "null hypothesis".
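The frequentist version of the same kind of question can be sketched as a one-sided binomial test (again with made-up numbers): how likely is a result at least as extreme as the one observed, under the null hypothesis of "random chance"?

```python
from math import comb

# Hypothetical frequentist test of a coin: the null hypothesis is
# "the coin is fair" (p = 0.5). We ask how likely a result at least
# as extreme as the observed 8 heads in 10 flips is under that null.
heads, flips, p_null = 8, 10, 0.5

# One-sided P-value: probability of 8, 9 or 10 heads under the null.
p_value = sum(comb(flips, k) * p_null**k * (1 - p_null)**(flips - k)
              for k in range(heads, flips + 1))

print(f"one-sided P-value = {p_value:.4f}")
if p_value < 0.05:
    print("reject the null hypothesis at the 5% level")
else:
    print("fail to reject the null hypothesis")
```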

Better people than I have written about the difference between Bayesian and Frequentist viewpoints - and have provided arguments that the two unify conceptually. Previously this was an almost religious schism in statistics, but my feeling is that there is now more agreement that both perspectives are valid. People often get distracted by the philosophical aspects of Bayesian statistics - for example, Bayesian statistics insists you provide a "prior belief" in your model before you see any data. This sounds like something you are not asked for on the Frequentist side, though Bayesians would argue that the choice of null and alternative hypotheses, and in particular the distribution of the null hypothesis, is basically all "prior belief". In Bayesian statistics there is a particularly neat business of being able to factor out variables which you know are important but don't know how to measure or estimate - you can "just" sum over all possible values of that variable, weighted by your prior belief. (The "just" in the previous sentence often hides quite a bit of magic - doing the straightforward "sum" is often hard or impossible, and much of the magic in Bayesian statistics is working out a way to do this sum in maths - ie, as a calculable integral - rather than explicitly. However, Bayesians can always fall back on Markov Chain Monte Carlo - which is in effect randomly sampling and walking around the space of possibilities. It's a bit like the frequentist approach of using permutation to get an effective null.)
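That "sum over all possible values, weighted by your prior belief" can be sketched very simply. Here the unknown variable is the coin's head probability, the prior is flat over a discrete grid, and all the numbers are invented for illustration:

```python
from math import comb

# Marginalising out a variable you can't measure directly: suppose the
# coin's head probability p is unknown. Put a flat prior over a grid of
# possible values and sum the likelihood of the (invented) data over all
# of them, weighted by that prior.
heads, flips = 8, 10
grid = [i / 100 for i in range(101)]   # candidate values of p
prior = 1 / len(grid)                  # flat prior belief

def likelihood(p, k, n):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Marginal likelihood: the "sum" over the nuisance variable p.
marginal = sum(prior * likelihood(p, heads, flips) for p in grid)

# Posterior over p, again by Bayes' rule.
posterior = [prior * likelihood(p, heads, flips) / marginal for p in grid]
p_mode = grid[posterior.index(max(posterior))]

print(f"marginal likelihood = {marginal:.4f}")
print(f"posterior mode of p = {p_mode}")
```

The grid sum is the crude, explicit version of the integral; the "magic" in real Bayesian methods is doing this analytically or by MCMC when no grid is feasible.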

But the real benefit of Bayesian statistics is that it is fully compatible with models of the data. And those models can be as sophisticated as you like, as long as you can calculate the probability of observing a dataset given the model (this is called the "likelihood"). In sequence based bioinformatics this is invaluable - for all our understanding of sequence evolution on phylogenetic trees, or base substitutions, or codon behaviour, or exon/intron rules, we can write down sensible models of the behaviour. If we then had to work out some "frequentist" null model as background for this, it is just plain... weird... to think about, and almost impossible beyond generating "random" sequence (itself a rather non-trivial thing to do right) and feeding it into metrics. Basically Bayesian statistics, because it is totally at home with explicit models of the data, is a great fit to sequence analysis, and frequentist statistics just doesn't work as well.
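A tiny sketch of "probability of the sequence given an explicit model": two zeroth-order base-composition models compared by log-likelihood ratio. The particular frequencies and the sequence are made up for illustration - real sequence models (HMMs, phylogenetic models) are far richer, but they score sequences in exactly this spirit:

```python
from math import log

# Two hypothetical zeroth-order models of base composition.
uniform_model = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
gc_rich_model = {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}

def log_likelihood(seq, model):
    """Log probability of the sequence, assuming independent bases."""
    return sum(log(model[base]) for base in seq)

seq = "GCGCGGCATGCCGC"   # invented sequence
llr = log_likelihood(seq, gc_rich_model) - log_likelihood(seq, uniform_model)
print(f"log-likelihood ratio (GC-rich vs uniform) = {llr:.3f}")
```

A positive ratio means the GC-rich model explains this sequence better than the uniform one - no null model of "random sequence" needed, just two explicit generative models.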

In my PhD and the early days of running Ensembl, everything I did had to do with one or other aspect of sequence analysis. In particular Genewise, which is a principled combination of homology matching and gene prediction, was impossible to think about outside of Bayesian style statistics. Genewise was at the core of both the Ensembl and the Celera Genomics models of the human and mouse genomes, and this rather venerable algorithm still happily chugs away at the core of a variety of genome annotation tools (including still some parts of Ensembl), although newer methods - such as Exonerate - work far faster with almost no loss in power.

This got me thinking about why I am not using Bayesian methods (or, let's face it, their poor cousin, maximum likelihood; most "Bayesian" methods actually don't do much summing over uncertainty but rather take a maximum likelihood point. Whenever you get out an alignment, you are doing maximum likelihood) so much at the moment. And that's really because I've shifted into more open analysis of ENCODE data, where I don't trust my own internal models of what is going on - can anyone really "know" what the right model of a promoter or an enhancer is? Or whether even these words "enhancer" and "promoter" are the right things? And when we compare experimental results to predictions, I want to make the fewest assumptions and just ask the simple question - did my prediction process "predict" the outcome of the experiment? Even though those predictions are sometimes set up with a Bayesian style formalism, though often with a very naive model, I want to be as naive as possible in the next part of the test. So I find myself most comfortable with non-parametric statistics (either Chi-sq - category vs category, or Wilcoxon/Kruskal - category vs continuous, or Spearman's Rho - continuous vs continuous). As a good Bayesian, perhaps one could write down the "most naive model of association possible" - itself something I don't quite know how to do - and then do a likelihood/posterior calculation, but... it feels like this is going to be the same as the non-parametric statistics anyway.
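For the category-vs-category case, the Chi-sq test on a 2x2 table needs nothing beyond counts. A small self-contained sketch (the counts are invented; with one degree of freedom the P-value has a closed form via the complementary error function):

```python
from math import erfc, sqrt

# Hypothetical 2x2 contingency table: predicted element class vs
# outcome in some assay (all counts invented for illustration).
#                success  failure
table = [[30, 10],   # predicted positives
         [12, 28]]   # controls

def chi_square_2x2(t):
    """Pearson chi-square statistic and P-value for a 2x2 table (1 d.f.)."""
    n = sum(sum(row) for row in t)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under the null of independence.
            expected = sum(t[i]) * (t[0][j] + t[1][j]) / n
            stat += (t[i][j] - expected) ** 2 / expected
    # Survival function of chi-square with 1 degree of freedom.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value

stat, p = chi_square_2x2(table)
print(f"chi-square = {stat:.2f}, P-value = {p:.5f}")
```

In practice one would reach for a stats library (eg scipy.stats.chi2_contingency), but the test really is this simple underneath.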
And it's far easier to say "Success in the Medaka fish enhancer assays was significantly associated with the prediction class of element (Chi-sq test, P-value 0.0007)" rather than "We then set up a naive Bayesian model for the association of the Medaka fish enhancer assays with the prediction class of elements, with flat prior beliefs of association or not, and considering all possible proportions of association; the Bayes factor of 78.9 indicates high confidence in the model with association" - not least because I think it would give the same result.
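For what it's worth, that "naive Bayesian model with flat priors over all possible proportions" can be sketched too. One common way (an assumption on my part - not necessarily the exact model meant above) is a beta-binomial Bayes factor: compare a model where both groups share one success probability (no association) against one where each group has its own, with flat Beta(1,1) priors throughout. Counts are invented:

```python
from math import lgamma, exp

# Hypothetical 2x2 table: prediction class vs assay outcome.
#                success  failure
table = [[30, 10],   # predicted positives
         [12, 28]]   # controls

def log_beta(a, b):
    """Log of the Beta function, via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

(a, b), (c, d) = table

# Marginal likelihood ratio with flat Beta(1,1) priors; the binomial
# coefficients are identical in both models and cancel.
log_bf = (log_beta(a + 1, b + 1) + log_beta(c + 1, d + 1)
          - log_beta(a + c + 1, b + d + 1))
bayes_factor = exp(log_bf)
print(f"Bayes factor (association vs no association) = {bayes_factor:.1f}")
```

For a table where the Chi-sq P-value is tiny, the Bayes factor comes out strongly in favour of association too - the same conclusion, stated in a different dialect.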

Perhaps in the future this area of research will come back to Bayesian statistics, but I think this will be because we have become confident in our models. Once you want more sophisticated models, you are almost forced to be more Bayesian.

But, I should have had Bayesian statistics in my top 5 statistical techniques. Or said the top 6. Because they do totally rock in the right place.

So - you are right Ian :)