This September
I visited CERN again, this time with a rather technical delegation from the EBI
to meet with their ‘big physics data’ counterparts. Our generous hosts Ian Bird,
Bob Jones and several experimental scientists showed us a great day, and gave
us an extended opportunity to understand their data flow in detail. We also got
a tour of the anti-matter experiments, which was very cool (though, sadly, it
did not include a trip down to the main tunnels of the LHC).
CERN is a marvellous place, and it stirs up some latent physics I learnt as an undergraduate. The data challenges at CERN are sometimes used as a template for the data challenges across all of the sciences in the future; I have come to learn that these analogies – unsurprisingly – are sometimes useful and sometimes not. To understand the data flow behind CERN, though, one needs to understand CERN itself in more detail.
CERN basics
Like EMBL-EBI,
CERN is an international treaty organisation that performs research. It is
right on the Swiss/French border (you can enter from either side). On the
hardware side, CERN has been building, running and maintaining particle accelerators
since its founding in 1954. These are powerful collections of magnets,
radio-wave producers and other whiz-bang things that can push particles (often
protons) to very high energies.
CERN’s main accelerators are circular, which is a good design for proton accelerators. To get the particles to high speed you need to have them in a vacuum, circulating at close to the speed of light. Because this is done in a circular loop you need to keep them constantly turning, which means you need some really, really BIG magnets. This means using superconductors and, accordingly, keeping everything extremely cold (as superconductivity only works at very low temperatures). Just building all this requires the application of some serious physics (for example, they actively use the quantum properties of super-cold liquid helium in their engineering), so that other people can explore some profoundly interesting physics.
CERN’s ‘big daddy’ accelerator these days is the Large Hadron Collider (LHC), which produced the very fast protons that led researchers to the Higgs boson. The previous generation of accelerators (such as the SPS) is not only still active but crucial for the smooth running of the LHC. Basically, protons get a running start in the SPS before they start their sprint in the LHC, and ‘fast protons’ are also used in other experiments around CERN.
Research at CERN
When you visit CERN, a healthy percentage of the people you see don’t actually work for CERN – they are just conducting their research there. At first it seems a bit chaotic, because not everything fits nicely into a formal ‘line management’ organisation. But, like other science consortia, including those in biology, it is held together by the fact that everyone is focused on the science. The main thing is that it does actually work.
‘Experiments’ at CERN refer to complex, multi-year projects. They are actually run by a series of international consortia – of which CERN is always a member. An ‘experiment’ at CERN can be anything from a small-scale, five-year project to simulate upper-atmosphere cloud formation to the very long-term projects of building and running detectors on the LHC, like the CMS experiment. For biologists, a small-scale CERN experiment maps to a reasonably large molecular biology laboratory, and a large-scale project dwarfs the biggest of big genomics consortia. High-energy physics does not really have the equivalent of an individual experiment as used in laboratory science (or at least I have not come across an analogue).
Experimental consortia operating at CERN usually have charters, and are coordinated by an elected high-energy physics professor, often from another institute (e.g. Imperial College London). In addition to the major experiments on the LHC (ATLAS, CMS, LHCb and ALICE), there are three anti-matter experiments and an atmospheric cloud experiment. They (nearly) all have one thing in common: they need a source of high-energy particles, which can be supplied by one of the accelerator rings.
Really, really big data
CERN experiments
produce a huge amount of data, and to understand this better it’s best to think
of the data flow in five phases:
Phase 1. The detector. In most cases one has to convert the paths of particles coming from high-energy collisions into some sort of data that can be captured. This is an awesome feat of measurement using micro-electronics, physics and engineering. The primary measurements are the momentum and direction of each particle coming from collisions inside the detector, though each detector is quite bespoke in precisely what it measures.
Phase 2. Frame-by-frame selection of data. The decision-making process for this selection has to be near instantaneous, as the data rate from the detector is far too high to capture everything, so there has to be a process to decide which events are interesting. This is a mixture of triggering on certain characteristic patterns (e.g. the Higgs boson will often decay via a path that releases two Z bosons, which themselves release two muons in opposite directions – spotting such paired muons is an ‘interesting event’). On top of this there is a very thin random sampling of general events. How many ‘interesting events’ are selected depends on two things: what the experiment is looking for, and the data rate that can be ingested. Thresholds are set to optimise the number of interesting events collected given the data rate the next phase can handle.
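To make this concrete, here is a toy sketch of that kind of trigger logic. The event structure, the opposite-sign muon pattern and the sampling rates are all invented for illustration; real trigger systems run in custom hardware and firmware at rates no script could touch.

```python
import random

# Toy trigger: keep events matching a characteristic pattern (here, a pair of
# opposite-charge muons), plus a very thin random sample of everything else.
# Both rates below are assumptions for illustration, not real trigger settings.
PATTERN_KEEP_RATE = 1.0    # keep all pattern matches
RANDOM_SAMPLE_RATE = 1e-4  # thin random sampling of general events

def has_opposite_sign_muon_pair(event):
    """Event is a list of (particle_type, charge) tuples in this toy model."""
    charges = [charge for kind, charge in event if kind == "muon"]
    return (+1 in charges) and (-1 in charges)

def trigger(event, rng=random.random):
    """Decide (near-instantaneously, in the real system) whether to keep this event."""
    if has_opposite_sign_muon_pair(event):
        return rng() < PATTERN_KEEP_RATE
    return rng() < RANDOM_SAMPLE_RATE

# An event with two opposite-sign muons is kept; most other events are dropped.
interesting = [("muon", +1), ("muon", -1), ("electron", -1)]
boring = [("pion", +1), ("pion", -1)]
print(trigger(interesting), trigger(boring))
```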
Phase 3. Capture. The resulting filtered/sampled data are streamed to disk (ultimately tape; here’s an interesting piece by The Economist about tape at CERN). These data are then distributed from CERN (Tier-0) to other sites worldwide (Tier-1). People then crunch the collection of events to try to understand what was going on in the detector:
Phase 4: Basic data crunching. This first involves standard stuff, like creating particle tracks from the set of detector readouts. A lot of detailed background knowledge is needed to do this effectively – for example, you can’t just look up detector specifications in the blueprints at the level of accuracy needed for these experiments, and at the desired level of accuracy the detector will shift subtly from month to month. A whole data collection needs to be calibrated. Intriguingly, they do this by leaving the detector on with no protons going through the ring and no bending magnets on the detector, so that cosmic rays, entering the detector from random directions, provide calibration paths for the spatial detector components.
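As a flavour of what that cosmic-ray trick exploits – with the field off, cosmic rays travel in straight lines, so systematic residuals against a straight-line fit point to misaligned components – here is a toy sketch. The layer positions, misalignments and noise levels are made up, and real detector alignment is vastly more sophisticated than this.

```python
import numpy as np

# Hypothetical detector: a few parallel sensor layers at known nominal positions.
rng = np.random.default_rng(0)
layer_z = np.array([0.0, 1.0, 2.0, 3.0])                     # nominal layer z (m)
true_misalignment = np.array([0.0, 0.002, -0.001, 0.0015])   # unknown x offsets (m)

def simulate_cosmic_track():
    """One cosmic ray: a straight line x = a + b*z, measured with noise and misalignment."""
    a, b = rng.uniform(-0.1, 0.1), rng.uniform(-0.5, 0.5)
    return a + b * layer_z + true_misalignment + rng.normal(0, 0.0003, size=len(layer_z))

# Accumulate per-layer residuals against a straight-line fit, track by track.
residual_sums = np.zeros(len(layer_z))
n_tracks = 5000
for _ in range(n_tracks):
    hits = simulate_cosmic_track()
    b, a = np.polyfit(layer_z, hits, 1)        # fit x = a + b*z
    residual_sums += hits - (a + b * layer_z)  # leftover per-layer offset

# Only *relative* misalignment is observable this way: an overall shift or tilt
# of all layers looks like a perfectly good straight track.
print("estimated relative misalignment (m):", np.round(residual_sums / n_tracks, 4))
```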
Phase 5: High-end data crunching. Once the basic data crunching is sorted, it’s time for the more ‘high-end’ physics. They now look at questions like: what events did this collision produce? For example, the Higgs boson decays into two Z bosons and then into muons, and the momenta of these muons specify the properties of the boson. By having a collection of such events one can start to infer things such as the mass (or energy) of the boson. (By the way, high-energy physics is a fearsome world, but if you want a good introduction, Richard Feynman’s QED is very readable and informative; though it is not about this level of quantum physics, it’s a good place to start.)
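For the curious, the core of that inference is relativistic four-momentum addition: the invariant mass of the decay products, m^2 = E^2 - |p|^2 in natural units, reconstructs the mass of whatever produced them. Here is a minimal sketch; the muon momenta below are invented purely for illustration and are not real measurements.

```python
import math

MUON_MASS = 0.1057  # GeV

def four_momentum(px, py, pz, mass=MUON_MASS):
    """Return (E, px, py, pz) in GeV for a particle of given mass (natural units, c = 1)."""
    E = math.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return (E, px, py, pz)

def invariant_mass(particles):
    """Invariant mass m^2 = E_tot^2 - |p_tot|^2 of a set of four-momenta."""
    E, px, py, pz = (sum(p[i] for p in particles) for i in range(4))
    return math.sqrt(max(E**2 - px**2 - py**2 - pz**2, 0.0))

# Made-up momenta (GeV) for four muons from a hypothetical H -> ZZ -> 4mu candidate.
muons = [four_momentum(*p) for p in [(20.0, 35.0, 15.0),
                                     (-18.0, -25.0, 20.0),
                                     (10.0, -15.0, -17.0),
                                     (-12.0, 8.0, -16.0)]]
print(f"four-muon invariant mass ~ {invariant_mass(muons):.1f} GeV")
```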
Parallels with biology...
At the moment, the main large-data-volume detectors in biology are sequencing machines, mass spectrometry devices, image-collection systems at synchrotrons and a vast array of microscopes. In all these cases there are institutions with larger or smaller concentrations of devices. I am pretty sure the combined raw ‘data rates’ of all these detectors are well in excess of the CERN experiments, though of course they are distributed worldwide rather than concentrated at a single location. (It would be interesting to estimate the worldwide raw data rates from these instruments.) The phases listed above have surprisingly close analogies in biology.
Some have drawn parallels between the ‘interesting event’ capture in high-energy physics and things like variant calling in DNA sequence analysis. But I think this is not quite right. The low-level ‘interesting event’ calling is far more analogous to the pixels => spots => read-calling process that happens in a sequencing machine, or to the noise suppression that happens on some imaging devices. These are very low-level, sometimes quite crude processes to ensure that good signal is propagated further without having to export the full data volume, most of which is ‘noise’ or ‘background’.
We don’t usually build our own detectors in biology – these are usually purchased from machine vendors, and the ‘raw data’ in biology is not usually the detector readout. Take, for example, the distinctly different outputs of the first-generation Solexa/Illumina machines. The image processing happened on a separate machine, and you could do your own image/low-level processing (here’s a paper from my colleague Nick Goldman doing precisely that). But most biologists did not go that deep, and the more modern HiSeqs now do the image processing in-line inside the machine.
The next step of the pipeline – the standardised processing followed by the more bespoke processing – seems to match up well to the more academic features of a genomics or proteomics experiment. Interestingly, the business of calibrating a set of events (experiments, in molecular-biology speak) in a deliberate manner is really quite similar in terms of data flow. At CERN, it was nice to see Laura Clarke (who heads up our DNA pipelines) smile knowingly as the corresponding CMS data manager described the versioning problems associated with the analysis for the Higgs boson.
...and some important differences
These, then, are the similarities between large-scale biology projects and CERN experiments: the large-scale data flow, the many stages of analysis, and the need to keep the right metadata around and propagate it through to the next stage of analysis. But the differences are considerable. For example, the LHC data is about one order of magnitude (×10) larger than molecular biology data – though our data doubling time (~1 year) is shorter than their basic data doubling time (~2 years).
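Taking those figures at face value – a ×10 gap, our data doubling every year, theirs every two – a back-of-envelope calculation (illustrative only, the numbers are obviously rough) suggests the volumes would cross over in roughly six to seven years:

```python
import math

# Back-of-envelope crossover, using the rough figures above as assumptions.
ratio_now = 10.0      # LHC volume / molecular biology volume today (assumed)
t_double_lhc = 2.0    # years per doubling, LHC data (assumed)
t_double_bio = 1.0    # years per doubling, molecular biology data (assumed)

# Solve ratio_now * 2^(t/t_lhc) = 2^(t/t_bio) for t.
t_crossover = math.log2(ratio_now) / (1 / t_double_bio - 1 / t_double_lhc)
print(f"volumes would match in ~{t_crossover:.1f} years")   # ~6.6 years
```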
Another difference is that the high-energy-physics data flow is more ‘starburst’ in shape, emanating from a few central sites out to progressively broader groups. Molecular biology data has a more ‘uneven bow-tie’ topology: a reasonable number of data-producing sites (thousands of them) feeding a small number of global archive sites (which interchange the data), which then distribute to 100,000s of institutions worldwide. For both inputs (data producers) and outputs (data users), the ‘long tail’ of all sorts of wonderful species, experiments and usage is more important in biology.
The user community for HEP data is smaller than in the life sciences (10,000s of scientists compared to the millions of life-science and clinical researchers), and more technically minded. Most high-energy physicists would prefer a command-line interface to a graphical one. Although high-energy physics is not uniform – the results for each experiment are different – there is a far more limited repertoire of types-of-things one might want to catch. In molecular biology, in particular as you head towards cells, organs and organisms, the incredible heterogeneity of life is simply awe-inspiring. So in addition to the data-volume tasks in molecular biology, we also have fundamental, large-scale annotation and metadata-structuring tasks.
What we can learn from CERN
There is a lot more we can learn in biology from high-energy physics than one might expect. Some of it relates to pragmatic information engineering and some to deeper scientific aspects. There is certainly much to learn from how the LHC handles its data storage (for example, they are still quite bullish about using magnetic-tape archives). We should also look carefully at how they have created portable compute schemes, including robust fail-over for data access (i.e. an attempt to find the data locally first, with fall-back to global servers).
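As a very rough sketch of what that local-first, global-fall-back pattern looks like (the cache path and mirror URL below are hypothetical, and real grid middleware also handles replica catalogues, authentication and integrity checks, which this toy ignores):

```python
from pathlib import Path
from urllib.request import urlopen

LOCAL_CACHE = Path("/data/local_cache")               # hypothetical local store
GLOBAL_MIRROR = "https://example.org/archive"          # placeholder, not a real endpoint

def fetch(dataset_name: str) -> bytes:
    """Return the dataset bytes, preferring a local copy, falling back to a remote mirror."""
    local_copy = LOCAL_CACHE / dataset_name
    if local_copy.exists():
        return local_copy.read_bytes()
    # Fall back to the global server and cache the result locally for next time.
    with urlopen(f"{GLOBAL_MIRROR}/{dataset_name}") as response:
        payload = response.read()
    LOCAL_CACHE.mkdir(parents=True, exist_ok=True)
    local_copy.write_bytes(payload)
    return payload
```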
There is a lot of knowledge we can share as well, for example in ontology engineering. The Experimental Factor Ontology’s ability to deal with hundreds of component ontologies without exploding could well be translated to other areas of science, and I think they were quietly impressed with the way we are still able to make good use of archived experimental data from the 1970s and 1980s in analysis and rendering schemes. In molecular biology, I think this ongoing use of data is something to be proud of.
Engaging further with our counterparts in the high-energy physics fields, at both the information-engineering and analysis levels, is something I am really looking forward to. It will be great to see Ian, Bob and the team at EMBL-EBI next year. CERN is a leader in data-intensive science, but its science does not map one-to-one onto everything else; we will need to adopt, adapt and sometimes create custom solutions for each data-intensive science.