Friday, 1 June 2012
Data curation; the power of observation.
Biology - like all sciences - is an observation-based science, but perhaps more so than many others. Life is so diverse that the first task of any investigation is simply looking at and recording biological phenomena. Very often even the simple process of observation will lead to a profound understanding of a biological component or system. Perhaps more importantly, it is observation and measurement which form the raw material for coming up with new hypotheses of how things work; usually these are then tested by perturbing the system in some experimental way and repeating the observation or measurement (rarely does one rely only on observation - I've blogged about this earlier). Many of the advances in biology came from the process of observing and cataloging, and then asking how to explain the catalog.
Because of the mind-boggling detail of biology, the set of observations these days usually has to be stored in some systematic computational way - anything from supplementary material in a paper to highly structured, globally accessible databases (such as the suite run by the EBI). It is this electronic representation of data which is the dominant input for biology, though like every science we need narrative text that can explain both the details of the data and how it might fit relative to other datasets. The combination of the data itself and the narrative explaining the data and how it might fit allows another scientist to build upon it, and so science progresses.
Compared to physics in particular, and to some extent chemistry, biology has less explicit theory - certainly nothing with the precision of the standard model in physics, with its ability to predict whole hosts of behaviours with incredible accuracy (though not of course all - and my understanding is that the standard model itself does not "feel" complete). There are some semi-strong theories in biology, such as evolution, the behaviour of the coalescent in understanding populations of individuals, or the precise biochemistry of the Krebs cycle or the Calvin cycle, but even these rarely (if ever) make specific predictions of particular base pairs, precise chemical concentrations in living organisms or the details of individuals in a population; rather they provide a framework in which to understand collections of observations. (Incidentally, probably the closest things to fully working, precise predictive models are the biochemical models, either from classic biochemistry or for a limited number of signalling molecules.)
Far more common are what one might call soft theories, or working models, in which one writes down a model for the majority of a particular phenomenon. Take, for example, pre-mRNA splicing. One can write down a biochemical model of parts of splicing (such as the recognition of the 5' and 3' splice sites, the presence of a lariat branch point A base, the formation of the intron lariat). This model is used in three ways. Firstly, one can use it to reconcile one set of observations (cDNA sequences or RNA-seq data) against another (genome sequence). Secondly, the process of reconciliation gives you insight into aspects of the mechanism which are otherwise unmeasurable - such as a good estimate of the biochemical recognition process at the splice sites, or constraints on how the definition of exons works. Thirdly, the "unreconciled" components of the data, where the data refuse to fit the model, are the way to discover new phenomena. In this case, I really recommend reading the excellent work from Chris Burge, who for me is both an innovative, great scientist and someone who has taken this approach further than anyone else. The EBI research groups are full of this sort of mentality - Janet Thornton about structure/function relationships, or Anton Enright about microRNA function ...
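To make the "reconcile, then look at what refuses to fit" idea concrete, here is a minimal sketch in Python. It assumes a deliberately crude working model (every intron starts with GT and ends with AG on the forward strand), a toy genome string, and intron coordinates that are imagined to come from a cDNA-to-genome alignment. The function name `classify_introns` and all the data are hypothetical; real splice site models are far richer than a two-base check.

```python
# Illustrative sketch only: a toy "working model" of splicing in which every
# intron is expected to begin with GT and end with AG on the forward strand.
CANONICAL_DONOR = "GT"
CANONICAL_ACCEPTOR = "AG"


def classify_introns(genome, introns):
    """Split introns into those that fit the simple GT-AG model and those that do not.

    genome  -- forward-strand DNA sequence as a string
    introns -- list of (start, end) pairs, 0-based, end exclusive
    """
    reconciled, unreconciled = [], []
    for start, end in introns:
        donor = genome[start:start + 2].upper()      # first two bases of the intron
        acceptor = genome[end - 2:end].upper()       # last two bases of the intron
        if donor == CANONICAL_DONOR and acceptor == CANONICAL_ACCEPTOR:
            reconciled.append((start, end))
        else:
            # Keep the offending dinucleotides: this is the interesting residue.
            unreconciled.append((start, end, donor, acceptor))
    return reconciled, unreconciled


# Toy genome: exon, intron (GT...AG), exon, intron (GC...AG), exon.
genome = "ATGGCC" + "GTAAGTTTTCAG" + "GATTGG" + "GCAAGTTTTCAG" + "TAA"
introns = [(6, 18), (24, 36)]  # hypothetical coordinates from an alignment

reconciled, unreconciled = classify_introns(genome, introns)
print("fits the model:", reconciled)
print("does not fit:  ", unreconciled)
```

The second intron falls out as unreconciled because it starts with GC. In this toy case that is the point: the leftover set is where either an alignment error or a genuinely different class of intron shows up, and a curator or a better model has to decide which.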
Interestingly, the combination of data held together by these models is itself sometimes treated as a data point - a first-class observation, although formally it's really an amalgam of data with model. I think of data as the bricks, and these models as the cement to build more complex structures. So - for example - we are pretty confident about the amino acid sequences derived from cDNA sequences for most organisms, as we have a confident model of translation. In fact direct measurement of amino acid sequences occasionally throws up some surprises (such as the discovery of selenocysteines). Nevertheless the treatment of working model + data type A => data type B is a common transformation in biology, and is much of what the EBI services do - for example Ensembl transforms genome sequence, together with many other inputs, into protein-coding genes, non-coding RNAs and regulatory information for a species.
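The "working model + data type A => data type B" pattern can be sketched with translation itself. The Python below assumes the standard genetic code, reads a coding sequence from its first base in frame, and stops at the first stop codon; the function name `translate` and the example sequences are made up for illustration. It is a sketch of the idea, not how any particular EBI service does it.

```python
from itertools import product

# Standard genetic code, written in the conventional TCAG ordering of the
# first, second and third codon positions. "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cdna):
    """Apply the working model: coding DNA in, predicted protein out.

    Assumes the sequence starts in frame. Translation stops at the first
    stop codon; codons with ambiguous bases become "X".
    """
    seq = cdna.upper()
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCTGGTAA"))  # a made-up in-frame coding sequence
print(translate("ATGTGAGCC"))     # TGA is read as a stop by this model
```

The second call shows where the model and the biology can part company: the table treats TGA as a stop, whereas in a selenoprotein that codon is read as selenocysteine. A derived protein sequence is only as good as the model that produced it, which is why it is an amalgam of data and model rather than a raw observation.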
Now, life is so rich and diverse that you might think that the only scalable way to handle this is to have increasingly cheap and accurate processes of observation (at multiple levels) and couple those with increasingly sophisticated working models (created by innovators such as Chris, and developed and engineered in places like the service teams at EBI), and that the combination should provide a steadily improving understanding of life. This is all true, and it is what I believe, but it glosses over a third critical component: direct storage of direct experimental observations, faithful to the individual experiment, and - furthermore - the scientist-based (rather than working-model-based) synthesis of data, which again is directly stored. All of this we call data curation at the EBI.
Indeed, it's been a bit of a journey for me to understand this, as I definitely come from the large-scale data-gathering side of biology, and my research background is basically in working model creation, algorithm development and using these as tools to provide understanding (I still take this viewpoint in my current research). Ten years ago I might have said that in the future we would only be doing specific (almost sampling) experiments to challenge such working models, and that of course discrepancies help one understand biology, but that large-scale databases built by data curation are not really the future. I now know better. It is of course impossible to curate everything, but for the most important things (like human protein-coding gene structures, or the function of proteins) it is both (a) worth it in a hard-nosed value-for-money way and (b) completely necessary for us to gain more understanding of biology.
Both of these statements are worth unpacking. First, the value-for-money one. In cases where a lot of research will go on - and this includes basically all of human biology along the chain of information (genome, genes, transcripts, proteins, structures, functional understanding) - every error (whether false positive or false negative) will lead to either mistaken conclusions or missed opportunities. As the reach of this data is global and pervasive, even small error rates get magnified to a large extent. It's important to realise that medical actions (for example, helping inform a clinical diagnosis using genetics) will have a level of scrutiny and lock-down beyond the "research" level (and we have increasingly sophisticated mechanisms to capture this lock-down, like the LRG project), but the effectiveness of transfer into the medical domain will depend on the accuracy of annotation. And it's worth noting that although the main driver is human (>60% of research, as judged either by papers or by web hits, is human, and the next species is mouse), very often the place to curate the information is away from these species. This is particularly the case for gene/protein function or pathways, as other systems are far more tractable, and one then projects this information onto human.
The second one is deeper, and relates to my opening point - that the first process in biology is often collecting observations. The fact that one has databases and computation and large-scale facilities does not change the hard work of good observation. Take, for example, understanding pathways. We can create less sophisticated or more sophisticated models of a given pathway (boolean logic through to stochastic, spatially aware models if you like), but no piece of modelling, theory or large-scale data gathering will give you the framework of the pathway. These are the "hard miles" of molecular biology, where individual dissection of a particular process leads to the mundane business of enumerating the players in a particular system, and usually goes far further into providing some understanding of how it works. If you want to have a database which understands this, then one must simply be prepared to curate this information into a structure, using human brains to transform the current understanding in people's heads and in publications into some formal data structure (which is what the Reactome project does in this specific case). This story is repeated in multiple domains - no theory will tell you how small molecules interact with proteins, but many people have done many experiments, and with good curation you can build a database like ChEMBL that collates these.
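As a rough illustration of the simplest end of that modelling range, here is a boolean-logic sketch in Python. Everything in it is hypothetical: the component names, the wiring in `RULES`, and the helper functions are invented for the example and do not describe a real pathway or any Reactome data structure. The thing to notice is that the wiring is an input. The code can only simulate a framework that somebody has already worked out and written down.

```python
# Hypothetical, hand-written wiring for an imaginary signalling pathway.
# In practice this is the part that comes from experiments and curation.
RULES = {
    "ligand": lambda s: s["ligand"],          # external input, held fixed
    "inhibitor": lambda s: s["inhibitor"],    # external input, held fixed
    "receptor": lambda s: s["ligand"],
    "kinase": lambda s: s["receptor"] and not s["inhibitor"],
    "target_gene": lambda s: s["kinase"],
}


def step(state):
    """Synchronous update: every component is recomputed from the old state."""
    return {name: bool(rule(state)) for name, rule in RULES.items()}


def run_to_steady_state(state, max_steps=20):
    """Iterate the boolean model until nothing changes any more."""
    for _ in range(max_steps):
        next_state = step(state)
        if next_state == state:
            return state
        state = next_state
    raise RuntimeError("no steady state reached within max_steps")


start = {"ligand": True, "inhibitor": False,
         "receptor": False, "kinase": False, "target_gene": False}

signalling_on = run_to_steady_state(start)
signalling_blocked = run_to_steady_state({**start, "inhibitor": True})

print("without inhibitor:", signalling_on["target_gene"])
print("with inhibitor:   ", signalling_blocked["target_gene"])
```

Swapping the boolean rules for rate equations or a stochastic simulation would make the model more sophisticated, but it would still start from the same curated list of players and interactions.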
So, at the EBI, there is a lot of data curation: in the UniProt team (along with colleagues in SIB and PIR) on protein function, in the Reactome team (along with colleagues in NYC and OICR) on pathways, in ChEMBL on small molecule/protein interactions, and in HGNC on gene names. And data curation is involved in nearly every aspect - the submissions to ENA, ArrayExpress, PDBe, IntAct and PRIDE all have an interaction process with the submitter, trying to capture as much as possible of the information we know we can use downstream - a sort of "direct from the scientist" curation process. The Ensembl team has a very close relationship with the Havana group, which directly curates human, mouse and zebrafish gene structures. And even the distinction between "data curation" and "specific programming" becomes blurred - much of what happens in the UniProt and InterPro groups is really a sort of biologically aware specialised programming, as curators create, validate and deploy specific automatic annotation schemes (such as UniRule), whereas the complexity of handling the input datasets and species-specific rules and schemes in the Ensembl and Ensembl Genomes builds is almost data curation from the opposite direction. In the end, all of this has to be focused on the end goal of providing useful information to practicing biologists and (increasingly) research clinicians.
This is not to say there are not very complex decisions about how to structure this data curation - what is worth capturing, and how efficient you can make this capture process. The presence of new datasets changes the game, both in scale and in accuracy, and the presence of new technologies, such as wikis, changes the capture process. Every group needs to find the right balance across their different inputs to achieve their desired output, stay on top of the whole process and know when to hand off to another group. Data curation will therefore continue to evolve. But what is clear is that not only has data curation been a big part of bioinformatics, but it will remain so for the foreseeable future.