Saturday, 9 June 2012

Thinking, Fast and Slow - Scientists are human too

I've just finished reading the excellent book "Thinking, Fast and Slow" by Daniel Kahneman, a psychologist who has had a profound impact on economics; he won the Nobel Prize in economics in 2002 for prospect theory, which aims to provide a reliable model of how humans actually make choices - for example, over-weighting low-probability events, and in particular treating gains and losses differently. We are all loss-averse, putting more negative weight on losing something than positive weight on gaining the same thing.

The book is great - not only are the multiple ideas he presents important, he presents them in a way that is both scientifically rigorous and considerably playful and humble - making the reader realise that he or she is just as prone to the same "mistakes" as anyone else. He describes some of his own behaviour in these terms, showing that he himself (just like basically every human...) has the same blind spots. I strongly recommend it.

Reading this book made me reflect on scientific practice - there are too many ideas to explore each one, but I'd like to focus here on his first theme: the difference between the fast-thinking, intuitive "System 1" and the slow-thinking, deliberative, work-it-out-with-pencil-and-paper "System 2". System 1 is your "gut instinct" and actually makes most decisions in your life - it evaluates everything constantly, notes things which are surprising, and provides some explanation for anything that needs a decision, often drawing on previous experience and relying a lot on the ability to create a narrative that fits the situation one sees. System 2 - the deliberate, conscious system - takes more effort (elegant studies have actually quantified this effort in a variety of ways) and is only activated when needed, most obviously for clearly complex problems, such as a non-trivial mental calculation (what is 17 x 33? You can't do this via System 1, so you have to activate System 2 and work it out deliberately: 17 x 30 + 17 x 3 = 510 + 51 = 561). System 2 also keeps a sort of lazy oversight of the suggestions emerging from System 1, but - and this is the critical point - System 1 is incapable of distinguishing the times when its narrative explanations are truly a good fit for what is going on from the times when it is just winging it completely, often subtly shifting the posed question into an answerable one. System 2 therefore catches only a small proportion of the errors thrown up by System 1, applying a cursory "does it feel right?" check and then moving on. How one poses (or, in the jargon, frames) a question is often the largest predictor of how people answer it. Again, Daniel walks through numerous fallacies - about betting, happiness, risk-taking, trade-offs - which everyone, including professionals, statisticians and scientists, makes. It is a humbling experience being forced to realise that you are actually as "irrational" as anyone else.

Many people think of scientists as some of the more quantitative, rational people, and indeed the process of science prizes rationality above perhaps anything else. But - as all scientists know - we are very human, and there is a considerable amount of on-the-hoof thinking: most obviously when we are (in effect) betting on which sets of experiments or analyses will give us the most insight into a problem, but, if we're being honest, in lots of other ways too - judging how good a piece of science is by its spoken or written presentation, using the reaction of other scientists to shape our own judgement, following fashions in science. The power of narrative is particularly strong in science. This reminds me of a recent(ish) paper of ours which got slaughtered on the first review, with the reviewers saying it was "too dense" and "just about statistics" with no impact. In the revision, for each graph showing a genome-wide trend, we also picked out, on the left, an example genomic location - a sort of "ideal" situation illustrating that trend. The change in the reviewers' reaction was striking, even though the genome is such a large place that, with a short Perl script, you can find nearly any configuration of events you might want. These examples established a narrative and appealed to the readers' System 1 - so when we invited their System 2 to assess the P-values of the Spearman correlations (which, to be fair to us, were all significant), it was presented with a worked-out story to confirm, rather than having to construct the story itself.
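To make that "short Perl script" point concrete, here is a minimal sketch of the idea in Python (toy interval data and a hypothetical wants_pattern predicate, not our actual analysis): across three billion bases, some locus will satisfy nearly any predicate you care to write.

```python
# A minimal sketch of "find me an ideal example locus": scan annotated
# regions and keep the first few matching whatever configuration we
# want to display. The data and the predicate here are entirely made up.

def wants_pattern(region):
    # Hypothetical "ideal" configuration: overlaps a promoter and has
    # a signal strong enough to make a pretty figure panel.
    return region["overlaps_promoter"] and region["signal"] > 10.0

def find_examples(regions, predicate, n=5):
    """Return up to n regions satisfying the predicate."""
    return [r for r in regions if predicate(r)][:n]

# Toy stand-in for a genome-wide annotation set; a real script would
# stream millions of intervals from a BED or GFF file.
regions = [
    {"chrom": "chr1", "start": 1000, "end": 2000,
     "overlaps_promoter": True, "signal": 12.5},
    {"chrom": "chr2", "start": 500, "end": 900,
     "overlaps_promoter": False, "signal": 30.0},
]

for r in find_examples(regions, wants_pattern):
    print(f"{r['chrom']}:{r['start']}-{r['end']} signal={r['signal']}")
```

The point, of course, is that such examples illustrate a trend rather than demonstrate it - the genome-wide statistics have to do the demonstrating.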

But - just as in personal life - the emphasis on narrative can be very misleading. The focus on creating clear-cut stories encourages people to leave out inconsistent results, or to avoid exploring places where their models (or narratives) don't provide an answer. In many ways we have the process of competition between scientists in a field to counteract this, though there is the danger of an entire field creating a self-sustaining narrative in which alternative scenarios are not explored. This sort of "narrative buy-in" is probably a feature of all human endeavours - businesses, financial markets, intelligence and government services - and science is not immune from it. This thinking has also led me to understand the age-old idea of thesis-antithesis-synthesis. I had previously thought this was sloppy thinking - people accepting a compromise position (the synthesis) - whereas in science it was more likely that one approach is right and one is wrong. However, this System 1, narrative-based thinking suggests that most scientific positions contain well-thought-through pieces, with a good number of consistent observations, held in a narrative web whose other strands rest on weak or even contradictory evidence. When two such positions meet, the narratives might be inherently opposed (thesis and antithesis), but by examining their differences one hopefully creates a new narrative (a synthesis) which preserves most of the hard, supported evidence in each thread. By having to reconcile viewpoints, scientists have to engage their System 2 brains, and break things down to the areas they really know and understand. It would be a fallacy to think that the resulting new narrative is perfect, but it will be better than the previous two. As human systems go, the fact that science values rational thought backed up by observation, reproducible experiments and analysis means, I think, that we reach more rational understandings faster than many other fields; and this has helped me understand the importance of adversarial views (narratives), which may easily contain consistent components despite their apparent contradictions. But we should not kid ourselves that science is perfect, or somehow free of this intuitive, loose, System 1 thinking.


Indeed, thinking about System 1 and System 2 processes has made me reflect on organisational and community processes in science. I've always implicitly known that bouncing ideas off people is good, and that meetings - even ones which seem to be just going through the motions to reach an obvious answer - are in fact worth it for the deliberation. Interestingly, we are all very bad at acknowledging that we make these (frankly) awful, irrational mistakes; our tendency when proved wrong is to create a narrative about why our gut instinct got it wrong in this particular scenario, not to question the whole process. Much of what people describe as wisdom is really about being more deliberate, and allowing more viewpoints to co-exist in apparent conflict, rather than reaching for snap judgements. Reading this book has made me (when I think about it!) trust my gut instincts less, and appreciate the importance of process and deliberation.


The other thing this book has brought home to me is that I am, clearly, a human being - with all the loss-aversion and narrative-building fallacies that our brains have. As is nearly every other scientist. Acknowledging this is, I think, the first step towards catching ourselves - as individuals, and as whole fields - making these "System 1" errors.

Friday, 1 June 2012

Data curation; the power of observation.

Biology - like all sciences - is an observation-based science, but perhaps more so than many others. Life is so diverse that the first task of any investigation is simply looking at and recording biological phenomena. Very often even the simple process of observation will lead to a profound understanding of a biological component or system; perhaps more importantly, it is observation and measurement which form the raw material for coming up with new hypotheses of how things work. Usually these are then tested by perturbing the system in some experimental way and repeating the observation or measurement (rarely does one rely only on observation - I've blogged about this earlier). Many of the advances in biology have come from the process of observing and cataloguing, and then asking how to explain the catalogue.


Because of the mind-boggling detail of biology, these sets of observations usually have to be stored in some systematic computational way - anything from supplementary material in a paper to highly structured, globally accessible databases (such as the suite run by the EBI). It is this electronic representation of data which is the dominant input for biology, though like every science we also need narrative text that explains both the details of the data and how it fits relative to other datasets. The combination of the actual data itself and the narrative explaining it allows another scientist to build upon it, and so science progresses.


Compared to physics in particular, and to some extent chemistry, biology has less explicit theory - certainly nothing of the precision of the standard model in physics, with its ability to predict whole hosts of behaviours with incredible accuracy (though not, of course, all of them - and my understanding is that the standard model itself does not "feel" complete). There are some semi-strong theories in biology, such as evolution, the behaviour of the coalescent in understanding populations of individuals, or the precise biochemistry of the Krebs cycle or the Calvin cycle, but even these rarely (if ever) make specific predictions about particular base pairs, precise chemical concentrations in living organisms, or the details of individuals in a population; rather, they provide a framework in which to understand collections of observations. (Incidentally, probably the closest we have to fully working, precise predictive models are the biochemical models from classic biochemistry, or of a limited number of signalling molecules.)

Far more common is what one might call soft theories, or working models, in which one writes down a model for the majority of a particular phenomenon. Take, for example, pre-mRNA splicing. One can write down a biochemical model of parts of splicing (such as the recognition of the 5' and 3' splice sites, the presence of a lariat A base, the formation of the intron lariat). This model is used in three ways. Firstly, one can use it to reconcile one set of observations (cDNA sequences or RNA-seq data) against another (the genome sequence). Secondly, the process of reconciliation gives you insight into aspects of the mechanism which are otherwise unmeasurable - such as a good estimate of the biochemical recognition process at the splice sites, or constraints on how the definition of exons works. Thirdly, the "unreconciled" components of the data - where the data refuse to fit the model - are the way to discover new phenomena. In this area I really recommend reading the excellent work from Chris Burge, who for me is both an innovative, great scientist and has taken this approach further than anyone else; and the EBI research groups are full of this sort of mentality - Janet Thornton on structure/function relationships, or Anton Enright on microRNA function...
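To make the reconciliation idea concrete, here is a minimal sketch (a toy sequence and a deliberately over-simplified model - real splice-site recognition is far richer) of checking observed introns against the canonical GT..AG rule, with misfits flagged for a closer look:

```python
# A deliberately simplified splicing "working model": introns start
# with GT and end with AG. Reconcile observed introns (e.g. from
# spliced RNA-seq alignments) against the genome; anything that does
# not fit is either an artefact or a new phenomenon (e.g. the rarer
# AT..AC introns of the minor spliceosome).

genome = "CCAGGTAAGT" + "T" * 20 + "TACTAACCCTTTTAG" + "GAACC"  # toy sequence

def splice_sites(genome, start, end):
    """Donor and acceptor dinucleotides of an intron with 0-based
    half-open coordinates [start, end) on the genome."""
    return genome[start:start + 2], genome[end - 2:end]

observed_introns = [(4, 45), (4, 30)]  # toy coordinates from "alignments"

for start, end in observed_introns:
    donor, acceptor = splice_sites(genome, start, end)
    if (donor, acceptor) == ("GT", "AG"):
        print(f"intron {start}-{end}: fits the GT..AG model")
    else:
        print(f"intron {start}-{end}: unreconciled ({donor}..{acceptor}) - worth a look")
```

The real models are of course statistical (position weight matrices and beyond), but the pattern is the same: fit the data to the model, learn from the fit, and chase the misfits.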

Interestingly, the combination of data held together by these models is itself sometimes treated as a data point - a first-class observation - although formally it's really an amalgam of data with model. I think of data as the bricks, and these models as the cement with which to build more complex structures. So, for example, we are pretty confident about the amino acid sequences inferred from cDNA sequences for most organisms, because we have a confident model of translation. In fact, direct measurement of amino acid sequences occasionally throws up surprises (such as the discovery of selenocysteine). Nevertheless, the treatment of working model + data type A => data type B is a common transformation in biology, and it is much of what the EBI services do - Ensembl, for example, transforms a genome sequence (with many other inputs) into protein-coding genes, non-coding RNAs and regulatory information for a species.
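As a minimal sketch of that model + data A => data B pattern (a toy cDNA, with the standard genetic code as the "model"), here is translation in a few lines - note how the codon table quietly assumes TGA is always a stop, which is exactly the assumption selenoproteins violate:

```python
# Working model (the standard genetic code) + data type A (an in-frame
# cDNA) => data type B (a protein sequence). The table below encodes
# the model; selenocysteine, which recodes TGA in special contexts,
# is invisible to it.

CODON_TABLE = {
    "TTT": "F", "TTC": "F", "TTA": "L", "TTG": "L",
    "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
    "ATT": "I", "ATC": "I", "ATA": "I", "ATG": "M",
    "GTT": "V", "GTC": "V", "GTA": "V", "GTG": "V",
    "TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S",
    "CCT": "P", "CCC": "P", "CCA": "P", "CCG": "P",
    "ACT": "T", "ACC": "T", "ACA": "T", "ACG": "T",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "TAT": "Y", "TAC": "Y", "TAA": "*", "TAG": "*",
    "CAT": "H", "CAC": "H", "CAA": "Q", "CAG": "Q",
    "AAT": "N", "AAC": "N", "AAA": "K", "AAG": "K",
    "GAT": "D", "GAC": "D", "GAA": "E", "GAG": "E",
    "TGT": "C", "TGC": "C", "TGA": "*", "TGG": "W",
    "CGT": "R", "CGC": "R", "CGA": "R", "CGG": "R",
    "AGT": "S", "AGC": "S", "AGA": "R", "AGG": "R",
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
}

def translate(cdna):
    """Translate an in-frame cDNA, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cdna) - 2, 3):
        aa = CODON_TABLE[cdna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCCAAGTGA"))  # -> MAK
```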

Now, life is so rich and diverse that you might think the only scalable way to handle this is to have increasingly cheap and accurate processes of observation (at multiple levels), and to couple those with increasingly sophisticated working models (created by innovators such as Chris, and developed and engineered in places like the service teams at the EBI), the combination providing a steadily improving understanding of life. This is all true, and it is what I believe, but it glosses over a third critical component: the direct storage of experimental observations, faithful to the individual experiment, and - furthermore - the scientist-based (rather than working-model-based) synthesis of data, which again is directly stored. All of this is what we call data curation at the EBI.

Indeed, it's been a bit of a journey for me to understand this, as I definitely come from the large-scale data-gathering side of biology, and my research background is basically in working-model creation, algorithm development, and using these as tools to provide understanding (I still take this viewpoint in my current research). Ten years ago I might have said that in the future we would only be doing specific (almost sampling) experiments to challenge such working models, and that, while of course discrepancies help one understand biology, large-scale data curation databases were not really the future. I now know better. It is of course impossible to curate everything, but for the most important things (like human protein-coding gene structures, or the function of proteins) it is both (a) worth it in a hard-nosed, value-for-money way and (b) completely necessary for us to gain more understanding of biology.


Both of these statements are worth unpacking. First, the value-for-money one. In cases where a lot of research will go on - and this includes basically all of human biology, in terms of the chain of information from genome to genes, transcripts, proteins, structures and functional understanding - every error (whether false positive or false negative) will lead to either mistaken conclusions or missed opportunities. As the reach of this data is global and pervasive, even small error rates get greatly magnified: to illustrate, a 1% error rate across roughly 20,000 human protein-coding genes means around 200 mis-annotated genes, each potentially consulted by thousands of downstream analyses. It's important to realise that medical actions (for example, helping inform a clinical diagnosis using genetics) will have a level of scrutiny and lock-down beyond the "research" level (and we have increasingly sophisticated mechanisms to capture this lock-down, like the LRG project), but the effectiveness of transfer into the medical domain will depend on the accuracy of annotation. And it's worth noting that although the main driver is human (>60% of research, judged either by papers or by web hits, is on human, and the next species is mouse), very often the place to curate the information is away from these species - in particular when one comes to gene/protein function or pathways, as other systems are far more tractable; one then projects this information onto human.

The second point is deeper, and relates to my opening point: that the first process in biology is often collecting observations. The fact that one has databases, computation and large-scale facilities does not change the hard work of good observation. Take, for example, understanding pathways. We can create less or more sophisticated models of a given pathway (Boolean logic through to stochastic, spatially aware models, if you like), but no piece of modelling, theory or large-scale data gathering will give you the framework of the pathway itself. These are the "hard miles" of molecular biology, where individual dissection of a particular process covers both the mundane business of enumerating the players in a system and usually goes far further, providing some understanding of how it works. If you want a database which captures this, then one must simply be prepared to curate the information into a structure, using human brains to transform the current understanding in people's heads and in publications into some formal data structure (which is what the Reactome project does in this specific case). This story is repeated in multiple domains - no theory will tell you how small molecules interact with proteins, but many people have done many experiments, and with good curation you can build a database like chEMBL that collates them.
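To show what the simplest end of that modelling spectrum looks like - and why the wiring itself has to come from curated knowledge - here is a minimal Boolean-logic pathway sketch, with an entirely invented three-component circuit:

```python
# A minimal Boolean-logic pathway model: each component is ON or OFF,
# and its next state is a Boolean function of the current state. The
# wiring below is invented for illustration - discovering the real
# wiring is precisely the hard-won knowledge curation has to capture.

def step(state):
    """One synchronous update of a toy signalling circuit."""
    return {
        "receptor": state["receptor"],  # held ON/OFF by an external ligand
        "kinase": state["receptor"],  # activated by the receptor
        "tf": state["kinase"] and not state["inhibitor"],  # transcription factor
        "inhibitor": state["tf"],  # negative feedback from the TF's target
    }

state = {"receptor": True, "kinase": False, "tf": False, "inhibitor": False}
for t in range(6):
    print(t, state)
    state = step(state)  # the negative feedback makes "tf" oscillate
```

Even this toy makes the point: the update rules encode knowledge that no amount of raw data gathering hands you for free - someone had to work out who activates whom.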


So, at the EBI, there is a lot of data curation: in the UniProt team (along with colleagues at SIB and PIR) on protein function; in the Reactome team (along with colleagues in NYC and at OICR) on pathways; in chEMBL on small-molecule/protein interactions; in HGNC on gene names. And data curation is involved in nearly every aspect - the submissions to ENA, ArrayExpress, PDBe, IntAct and PRIDE all involve an interaction with the submitter, trying to capture as much of the information as we know can be used downstream - a sort of "direct from the scientist" curation process. The Ensembl team has a very close relationship with the direct curation of human, mouse and zebrafish gene structures by the Havana group. And even the distinction between "data curation" and "specific programming" becomes blurred - much of what happens in the UniProt and InterPro groups is really a sort of biologically aware, specialised programming, as curators create, validate and deploy specific automatic annotation schemes (such as UniRule), whereas the complexity of handling the Ensembl and Ensembl Genomes builds, in terms of input datasets and species-specific rules and schemes, is almost data curation from the opposite direction. In the end, all of this has to be focused on the goal of providing useful information to practising biologists and (increasingly) research clinicians.

This is not to say there are not very complex decisions about how to structure this data curation - what is worth capturing, and how efficient you can make the capture process. New datasets change the game, in both scale and accuracy; new technologies, such as wikis, change the capture process. Every group needs to find the right balance across their different inputs to achieve their desired output, staying on top of the whole process and knowing when to hand off to another group. Data curation will therefore continue to evolve. But what is clear is that data curation has not only been a big part of bioinformatics; it will remain so for the foreseeable future.