In describing what the EBI does, it is
sometimes hard to provide a feel for the complexity and detail of our
processes. Recently I have been using the analogy of the EBI as a “data refinery”: it takes in raw data
streams ("feedstocks"), combines and refines them, and transforms
them into multi-use outputs ("products"). I see it as a pristine,
state-of-the-art refinery with large, imposing central tanks (perhaps with
steam venting here and there for effect) from which massive pipes emerge, covered
in reflective cladding and connected in refreshingly sensible ways. In the
foreground are arrayed a series of smaller tanks and systems, interconnected in
more complex ways. Surrounding the whole are workshops, and if you look closely
enough you can see a group of workers decommissioning one part of the system
whilst another builds a new one.
[Image: Oil refinery on the north end of March Point; Mount Baker]
I find this analogy useful for a number of
reasons. First, a "product" is often itself a feedstock, which is why
the EBI has so many complex cycles of information. For example, InterPro member
database models and patterns are feedstocks for the InterPro entries; during
refinement they become associated with one another, with documentation and with
Gene Ontology (GO) assignments. InterPro takes in UniProt (UniParc) protein sequences
and combines them with models to provide boundaries on proteins; these in turn
allow the ‘InterPro2GO’ GO assignment process to occur. This automatic GO
annotation is then applied to the UniProtKB entries along with experimentally
defined GO annotations which come from GO curators worldwide, and include many
entries about model organisms. The InterPro entries additionally provide raw
information (feedstock) for the UniRule automatic annotation, where InterPro
matches are the mainstay condition of a particular rule. The UniProt curator
combines this with other conditions, such as taxonomic restrictions and
sequence properties, to ensure the most accurate application of annotation
extracted from experimentally proven UniProtKB entries to proteins of unknown
function.
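For readers who think in code, the flow just described can be sketched as a small directed graph of (feedstock, product) pairs. This is only an illustration under stated assumptions: the node labels are simplified shorthand for the resources named above, not real EBI identifiers, and the edges are a coarse reading of the paragraph rather than a description of the production pipelines. At this granularity the sketch finds the nodes that are both a product and a feedstock, and it finds one cycle (annotated UniProtKB entries feed UniRule, whose output is applied back to UniProtKB entries).

```python
from collections import defaultdict

# Hypothetical, simplified (feedstock, product) pairs read off the text above.
FLOWS = [
    ("member database models", "InterPro entries"),
    ("member database models", "InterPro matches"),
    ("UniParc sequences", "InterPro matches"),
    ("InterPro entries", "InterPro2GO"),
    ("InterPro matches", "InterPro2GO"),
    ("InterPro2GO", "UniProtKB entries"),
    ("curated GO annotations", "UniProtKB entries"),
    ("InterPro matches", "UniRule"),
    ("UniProtKB entries", "UniRule"),
    ("UniRule", "UniProtKB entries"),
]


def build_graph(flows):
    """Map each feedstock to the set of products it feeds."""
    graph = defaultdict(set)
    for feedstock, product in flows:
        graph[feedstock].add(product)
    return graph


def intermediates(flows):
    """Nodes that are both the product of one step and the feedstock of another."""
    feedstocks = {f for f, _ in flows}
    products = {p for _, p in flows}
    return feedstocks & products


def find_cycle(flows):
    """Return one cycle as a list of nodes (first == last), or None."""
    graph = build_graph(flows)
    nodes = sorted({n for edge in flows for n in edge})
    state = {}  # node -> "visiting" or "done"
    path = []   # current depth-first path

    def visit(node):
        state[node] = "visiting"
        path.append(node)
        for nxt in sorted(graph.get(node, ())):
            if state.get(nxt) == "visiting":
                # Back edge: the cycle is the tail of the current path.
                return path[path.index(nxt):] + [nxt]
            if nxt not in state:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = "done"
        return None

    for node in nodes:
        if node not in state:
            found = visit(node)
            if found:
                return found
    return None


if __name__ == "__main__":
    print(sorted(intermediates(FLOWS)))
    print(find_cycle(FLOWS))
```

In reality the entries that supply annotation and the entries that receive it are different subsets of UniProtKB, so the cycle here reflects how coarse the labels are. Even so, it shows why the same resource appears on both sides of the refinery.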
This is a complex network of inputs and
outputs (just writing it down and trying to keep it all straight is exhausting
unless you are part of it – I went through a couple of rounds with Claire
O’Donovan and Sarah Hunter to get the above flow absolutely straight), but the
main input – bare protein sequences (coming from internal feedstocks including
ENA and Ensembl) – is being converted into the main output: annotated protein
entries, with human-readable annotation and a careful audit trail of its
'refinement'. This is what the user sees as the output of the refinery, and the
user understandably does not want to spend too much time worrying about the
details of pipe connectivity inside the refinery.
Another reason I find the refinery analogy
useful is that volume can be deceptive. The biggest, most impressive tanks
in this refinery are filled with DNA sequence data, but for the refinery to work
as a whole it needs many "specialist" chemicals, in lower volumes, to
serve as critical catalytic components. It might be necessary for the refinery
to make and store some components in order to streamline a more complex flow of
information. The EBI works with key "catalyst" streams of information
that have a disproportionate impact relative to their volume (e.g. the
assignment of experimentally defined annotation described above).
A deceptive view of this refinery would
focus exclusively on the final outputs and the most recent refinement process,
without taking in the intricate web of components behind them. People might use
Reactome or IntAct to understand a particular functional dataset, but the
protein information in these resources depends on UniProt to track and annotate
the underlying sequences. The protein information in UniProtKB is dependent on the ENA
database smoothly accepting submissions with annotated CDS proteins present. In
this way, asking to visualise, say, phosphoprotein results on a pathway diagram
is not as simple as it might seem. It implicitly draws on many of the tanks in
the EBI refinery. This larger network actually goes beyond the EBI's borders to its
worldwide collaborators (e.g. wwPDB, or the INSDC’s GenBank/ENA/DDBJ).
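The chain of dependencies in this paragraph can be made concrete with a short sketch. It assumes a hypothetical `DEPENDS_ON` table containing only the links stated above (Reactome and IntAct on UniProtKB, UniProtKB on ENA). The real network is far larger and extends to the external collaborators, which are left out here. The function walks the table to list everything a given resource implicitly draws on.

```python
from collections import deque

# Hypothetical table: only the dependencies named in the text above.
DEPENDS_ON = {
    "Reactome": ["UniProtKB"],
    "IntAct": ["UniProtKB"],
    "UniProtKB": ["ENA"],
    "ENA": [],
}


def upstream(resource, depends_on=DEPENDS_ON):
    """Return every resource that `resource` relies on, directly or indirectly,
    in breadth-first order (nearest dependencies first)."""
    seen = []
    queue = deque(depends_on.get(resource, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already recorded via another route
        seen.append(current)
        queue.extend(depends_on.get(current, []))
    return seen


if __name__ == "__main__":
    print(upstream("Reactome"))  # ['UniProtKB', 'ENA']
```

Even in this toy table, a question asked of Reactome quietly depends on ENA accepting submissions two steps upstream.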
The final "product" that the user
sees often has a local manufacturer (i.e., bioinformatician/computational
biologist) who pulls in information from the large tanks and combines it with
local data to provide an overall picture and give context. Often, the research
group querying EBI data does not worry too much about the details of how the
refinery works, or about the complex inter-dependencies of the refinery; they
just want easy access to a product they can rely on. It is the job of the EBI,
and in future will be the job of ELIXIR, to satisfy this desire.
A refinery does not stay still. In each
process, engineers (in our case bioinformaticians and software engineers) work
to improve minor, everyday things and to carry out major retooling. New types of
experimental information might require a new tank and pipelines, or become
cheap enough to replace older feedstocks, in both cases opening up potential
for new, useful products. New discoveries might change the way processes or
transformations are handled, perhaps by adding a certain catalyst at a
particular stage to improve the products.
Clearly the EBI is not the only refinery. Our
European partners, such as SIB and Sanger, collaborate so closely with us on
key projects that it’s hard to work out where one refinery stops and the other begins.
We exchange data and expertise regularly with large refineries in the US and
Japan, such as NCBI, UCSC, NIG and RCSB. It is exciting to see all of the
proto-refineries in Europe, which offer different core competencies and are
coalescing into a single, robust refinery: ELIXIR.
Like all analogies, this is not perfect.
The concept of free data sharing, which is at the heart of molecular biology,
does not fit well with this analogy. Although the complex process of providing
the necessary CPU, disk and network has some resonance with the internal “plant”
infrastructure, the fact that it is so generic and tradable does not. The EBI's
products are also directly used via the web, often without much intermediation
(no need for a network of gas stations, etc.). Nevertheless, the picture of a
complex interplay of inputs being progressively refined is helpful when trying
to disentangle some of our trickier problems.
I welcome feedback on this analogy, and on
the extent to which it helps one understand the EBI.
