This September
I visited CERN again, this time with a rather technical delegation from the EBI
to meet with their ‘big physics data’ counterparts. Our generous hosts Ian Bird,
Bob Jones and several experimental scientists showed us a great day, and gave
us an extended opportunity to understand their data flow in detail. We also got
a tour of the anti-matter experiments, which was very cool (though, sadly, it
did not include a trip down to the main tunnels of the LHC).
CERN is a marvellous place, and it stirs up some latent physics I learnt as an undergraduate. The data challenges at CERN are sometimes used as a template for the data challenges across all of the sciences in the future; I have come to learn that these analogies – unsurprisingly – are sometimes useful and sometimes not. To understand the data flow behind CERN, though, one needs to understand CERN itself in more detail.
CERN basics
Like EMBL-EBI,
CERN is an international treaty organisation that performs research. It is
right on the Swiss/French border (you can enter from either side). On the
hardware side, CERN has been building, running and maintaining particle accelerators
since its founding in 1954. These are powerful collections of magnets,
radio-wave producers and other whiz-bang things that can push particles (often
protons) to very high energies.
CERN’s main accelerators are circular, which is a good design for proton accelerators. To get the particles to high speed you need to have them in a vacuum, circulating at close to the speed of light. Because this is done in a circular loop you need to keep them constantly turning, which means you need some really, really BIG magnets. This means using superconductors and, accordingly, keeping everything extremely cold (as superconductivity only works at very low temperatures). Just building all this requires the application of some serious physics (for example, they actively use the quantum properties of super-cold liquid helium in their engineering), so that other people can explore some profoundly interesting physics.
CERN’s ‘big daddy’ accelerator these days is the Large Hadron Collider (LHC), which produced the very fast protons that led researchers to the Higgs boson. The previous generation of accelerators (such as the SPS) is not only still active but crucial for the smooth running of the LHC. Basically, protons get a running start in the SPS before they start their sprint in the LHC, and ‘fast protons’ are also used in other experiments around CERN.
Research at CERN
When you visit CERN, a healthy percentage of the people you see don’t actually work for CERN – they are just conducting their research there. At first it seems a bit chaotic, because not everything fits nicely into a formal ‘line management’ organisation. But, like other science consortia, including those in biology, it is held together by the fact that everyone is focused on the science. The main thing is that it does actually work.
‘Experiments’ at CERN refer to complex, multi-year projects. They are actually run by a series of international consortia – of which CERN is always a member. An ‘experiment’ at CERN can be anything from a small-scale, five-year project to simulate upper-atmosphere cloud formation to the very long-term projects of building and running detectors on the LHC, like the CMS experiment. For biologists, a small-scale CERN experiment maps to a reasonably large molecular biology laboratory, and a large-scale project dwarfs the biggest of big genomics consortia. High-energy physics does not really have the equivalent of an individual experiment as used in laboratory science (or at least I have not come across an analogue).
Experimental consortia operating at CERN usually have charters, and are coordinated by an elected high-energy physics professor, often from another institute (e.g. Imperial College London). In addition to the major experiments on the LHC (ATLAS, CMS, LHCb and ALICE), there are three anti-matter experiments and an atmospheric cloud experiment. They (nearly) all have one thing in common: they need a source of high-energy particles, which can be supplied by one of the accelerator rings.
Really, really big data
CERN experiments
produce a huge amount of data, and to understand this better it’s best to think
of the data flow in five phases:
Phase 1. The detector. In most cases one has to convert the paths of particles coming from high-energy collisions into some sort of data that can be captured. This is an awesome feat of measurement using micro-electronics, physics and engineering. The primary measurements are the momentum and direction of each particle coming from collisions inside the detector, though each detector is quite bespoke in precisely what it measures.
Phase 2. Frame-by-frame selection of data. The decision-making process for this selection has to be near instantaneous, as the data rate from the detector is far too high to capture everything, so there has to be a process to decide which events are interesting. This is a mixture of triggering on certain characteristic patterns (e.g. the Higgs boson will often decay via a path that releases two Z bosons, which themselves release two muons in opposite directions – spotting such paired muons is an ‘interesting event’). On top of this there is a very thin random sampling of general events. How many ‘interesting events’ are selected depends on two things: what the experiment is looking for, and the data rate that can be ingested. Thresholds are set to optimise the number of interesting events collected given the data rate the next phase can handle.
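To make this concrete, here is a toy sketch of that kind of trigger logic. The event structure, the opposite-sign muon pattern and the sampling rates are all invented for illustration; real trigger systems run in custom hardware and firmware at rates no script could touch.

```python
import random

# Toy trigger: keep events matching a characteristic pattern (here, a pair of
# opposite-charge muons), plus a very thin random sample of everything else.
# Both rates below are assumptions for illustration, not real trigger settings.
PATTERN_KEEP_RATE = 1.0    # keep all pattern matches
RANDOM_SAMPLE_RATE = 1e-4  # thin random sampling of general events

def has_opposite_sign_muon_pair(event):
    """Event is a list of (particle_type, charge) tuples in this toy model."""
    charges = [charge for kind, charge in event if kind == "muon"]
    return (+1 in charges) and (-1 in charges)

def trigger(event, rng=random.random):
    """Decide (near-instantaneously, in the real system) whether to keep this event."""
    if has_opposite_sign_muon_pair(event):
        return rng() < PATTERN_KEEP_RATE
    return rng() < RANDOM_SAMPLE_RATE

# An event with two opposite-sign muons is kept; most other events are dropped.
interesting = [("muon", +1), ("muon", -1), ("electron", -1)]
boring = [("pion", +1), ("pion", -1)]
print(trigger(interesting), trigger(boring))
```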
Phase 3. Capture. The resulting filtered/sampled data are streamed to disk (ultimately tape; here’s an interesting piece by The Economist about tape at CERN). These data are then distributed from CERN (Tier-0) to other sites worldwide (Tier-1). People then crunch the collection of events to try to understand what was going on in the detector:
Phase 4: Basic data crunching. This first involves standard stuff, like creating particle tracks from the set of detector readouts. A lot of detailed background knowledge is needed to do this effectively – for example, you can’t just look up detector specifications in the blueprints at the level of accuracy needed for these experiments, and at the desired level of accuracy the detector will shift subtly from month to month. A whole data collection needs to be calibrated. Intriguingly, they do this by leaving the detector on with no protons going through the ring and no bending magnets on the detector, so that cosmic rays, entering the detector from random directions, provide calibration paths for the spatial detector components.
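As a flavour of what that cosmic-ray trick exploits – with the field off, cosmic rays travel in straight lines, so systematic residuals against a straight-line fit point to misaligned components – here is a toy sketch. The layer positions, misalignments and noise levels are made up, and real detector alignment is vastly more sophisticated than this.

```python
import numpy as np

# Hypothetical detector: a few parallel sensor layers at known nominal positions.
rng = np.random.default_rng(0)
layer_z = np.array([0.0, 1.0, 2.0, 3.0])                     # nominal layer z (m)
true_misalignment = np.array([0.0, 0.002, -0.001, 0.0015])   # unknown x offsets (m)

def simulate_cosmic_track():
    """One cosmic ray: a straight line x = a + b*z, measured with noise and misalignment."""
    a, b = rng.uniform(-0.1, 0.1), rng.uniform(-0.5, 0.5)
    return a + b * layer_z + true_misalignment + rng.normal(0, 0.0003, size=len(layer_z))

# Accumulate per-layer residuals against a straight-line fit, track by track.
residual_sums = np.zeros(len(layer_z))
n_tracks = 5000
for _ in range(n_tracks):
    hits = simulate_cosmic_track()
    b, a = np.polyfit(layer_z, hits, 1)        # fit x = a + b*z
    residual_sums += hits - (a + b * layer_z)  # leftover per-layer offset

# Only *relative* misalignment is observable this way: an overall shift or tilt
# of all layers looks like a perfectly good straight track.
print("estimated relative misalignment (m):", np.round(residual_sums / n_tracks, 4))
```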
Phase 5: High-end data crunching. Once the basic data crunching is sorted, it’s time for the more ‘high-end’ physics. They now look at questions like: what events did this collision produce? For example, the Higgs boson decays into two Z bosons and then into muons, and the momenta of these muons specify the properties of the boson. By having a collection of such events one can start to infer things such as the mass (or energy) of the boson. (By the way, high-energy physics is a fearsome world, but if you want a good introduction, Richard Feynman’s QED is very readable and informative; though it is not about this level of quantum physics, it’s a good place to start.)
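For the curious, the core of that inference is relativistic four-momentum addition: the invariant mass of the decay products, m^2 = E^2 - |p|^2 in natural units, reconstructs the mass of whatever produced them. Here is a minimal sketch; the muon momenta below are invented purely for illustration and are not real measurements.

```python
import math

MUON_MASS = 0.1057  # GeV

def four_momentum(px, py, pz, mass=MUON_MASS):
    """Return (E, px, py, pz) in GeV for a particle of given mass (natural units, c = 1)."""
    E = math.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return (E, px, py, pz)

def invariant_mass(particles):
    """Invariant mass m^2 = E_tot^2 - |p_tot|^2 of a set of four-momenta."""
    E, px, py, pz = (sum(p[i] for p in particles) for i in range(4))
    return math.sqrt(max(E**2 - px**2 - py**2 - pz**2, 0.0))

# Made-up momenta (GeV) for four muons from a hypothetical H -> ZZ -> 4mu candidate.
muons = [four_momentum(*p) for p in [(20.0, 35.0, 15.0),
                                     (-18.0, -25.0, 20.0),
                                     (10.0, -15.0, -17.0),
                                     (-12.0, 8.0, -16.0)]]
print(f"four-muon invariant mass ~ {invariant_mass(muons):.1f} GeV")
```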
Parallels with biology...
At the moment, the main large-data-volume detectors in biology are sequencing machines, mass spectrometry devices, image-collection systems at synchrotrons and a vast array of microscopes. In all these cases there are institutions with larger or smaller concentrations of devices. I am pretty sure the combined raw ‘data rates’ of all these detectors are well in excess of the CERN experiments, though of course they are distributed worldwide rather than concentrated at a single location. (It would be interesting to estimate the worldwide raw data rates from these instruments.) The phases listed above have surprisingly close analogies in biology.
Some have drawn parallels between the ‘interesting event’ capture in high-energy physics and things like variant calling in DNA sequence analysis. But I think this is not quite right. The low-level ‘interesting event’ calling is far more analogous to the pixels => spots => read-calling process that happens in a sequencing machine, or to the noise suppression that happens on some imaging devices. These are very low-level, sometimes quite crude processes to ensure that good signal is propagated further without having to export the full data volume, most of which is ‘noise’ or ‘background’.
We don’t usually build our own detectors in biology – these are usually purchased from machine vendors, and the ‘raw data’ in biology is not usually the detector readout. Take, for example, the distinctly different outputs of the first-generation Solexa/Illumina machines. The image processing happened on a separate machine, and you could do your own image/low-level processing (here’s a paper from my colleague Nick Goldman doing precisely that). But most biologists did not go that deep, and the more modern HiSeqs now do the image processing in-line inside the machine.
The next step of the pipeline – the standardised processing followed by the more bespoke processing – seems to match up well to the more academic features of a genomics or proteomics experiment. Interestingly, the business of calibrating a set of events (experiments, in molecular-biology speak) in a deliberate manner is really quite similar in terms of data flow. At CERN, it was nice to see Laura Clarke (who heads up our DNA pipelines) smile knowingly as the corresponding CMS data manager described the versioning problems associated with the analysis for the Higgs boson.
...and some important differences
These, then, are the similarities between large-scale biology projects and CERN experiments: the large-scale data flow, the many stages of analysis, and the need to keep the right metadata around and propagate it through to the next stage of analysis. But the differences are considerable. For example, the LHC data is about one order of magnitude (×10) larger than molecular biology data – though our data doubling time (~1 year) is shorter than their basic data doubling time (~2 years).
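Taking those figures at face value – a ×10 gap, our data doubling every year, theirs every two – a back-of-envelope calculation (illustrative only, the numbers are obviously rough) suggests the volumes would cross over in roughly six to seven years:

```python
import math

# Back-of-envelope crossover, using the rough figures above as assumptions.
ratio_now = 10.0      # LHC volume / molecular biology volume today (assumed)
t_double_lhc = 2.0    # years per doubling, LHC data (assumed)
t_double_bio = 1.0    # years per doubling, molecular biology data (assumed)

# Solve ratio_now * 2^(t/t_lhc) = 2^(t/t_bio) for t.
t_crossover = math.log2(ratio_now) / (1 / t_double_bio - 1 / t_double_lhc)
print(f"volumes would match in ~{t_crossover:.1f} years")   # ~6.6 years
```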
Another difference is that the high-energy-physics data flow is more ‘starburst’ in shape, emanating from a few central sites out to progressively broader groups. Molecular biology data has a more ‘uneven bow-tie’ topology: a reasonable number of data-producing sites (thousands of them) feeding a small number of global archive sites (which interchange the data), which then distribute to 100,000s of institutions worldwide. For both inputs (data producers) and outputs (data users), the ‘long tail’ of all sorts of wonderful species, experiments and usage is more important in biology.
The user community for HEP data is smaller than in the life sciences (10,000s of scientists compared to the millions of life-science and clinical researchers), and more technically minded. Most high-energy physicists would prefer a command-line interface to a graphical one. Although high-energy physics is not uniform – the results for each experiment are different – there is a far more limited repertoire of types-of-things one might want to catch. In molecular biology, in particular as you head towards cells, organs and organisms, the incredible heterogeneity of life is simply awe-inspiring. So in addition to the data-volume tasks in molecular biology, we also have fundamental, large-scale annotation and metadata-structuring tasks.
What we can learn from CERN
There is a lot more we can learn in biology from high-energy physics than one might expect. Some of it relates to pragmatic information engineering and some to deeper scientific aspects. There is certainly much to learn from how the LHC handles its data storage (for example, they are still quite bullish about using magnetic-tape archives). We should also look carefully at how they have created portable compute schemes, including robust fail-over for data access (i.e. an attempt to find the data locally first, with fall-back to global servers).
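As a very rough sketch of what that local-first, global-fall-back pattern looks like (the cache path and mirror URL below are hypothetical, and real grid middleware also handles replica catalogues, authentication and integrity checks, which this toy ignores):

```python
from pathlib import Path
from urllib.request import urlopen

LOCAL_CACHE = Path("/data/local_cache")               # hypothetical local store
GLOBAL_MIRROR = "https://example.org/archive"          # placeholder, not a real endpoint

def fetch(dataset_name: str) -> bytes:
    """Return the dataset bytes, preferring a local copy, falling back to a remote mirror."""
    local_copy = LOCAL_CACHE / dataset_name
    if local_copy.exists():
        return local_copy.read_bytes()
    # Fall back to the global server and cache the result locally for next time.
    with urlopen(f"{GLOBAL_MIRROR}/{dataset_name}") as response:
        payload = response.read()
    LOCAL_CACHE.mkdir(parents=True, exist_ok=True)
    local_copy.write_bytes(payload)
    return payload
```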
There is a lot of knowledge we can share as well, for example in ontology engineering. The Experimental Factor Ontology’s ability to deal with hundreds of component ontologies without exploding could well be translated to other areas of science, and I think they were quietly impressed with the way we are still able to make good use of archived experimental data from the 1970s and 1980s in analysis and rendering schemes. In molecular biology, I think this ongoing use of data is something to be proud of.
Engaging further with our counterparts in the high-energy physics fields, at both the information-engineering and analysis levels, is something I am really looking forward to. It will be great to see Ian, Bob and the team at EMBL-EBI next year. CERN is a leader in data-intensive science, but its science does not map one-to-one onto everything else; we will need to adopt, adapt and sometimes create custom solutions for each data-intensive science.