Friday, 14 June 2013

The symbiosis of engineering and research


Symbiosis is the biological process where two organisms cooperate to such an extent that they become so co-dependent it is best to think of them as one. The specific orchids flowering at the right time for a specific butterfly, or the interweaving of algae and fungi that make lichens, or all eukaryotes with one of the most successful symbiotic deals done on this planet - between our bacterial mitochondrial ancestors and our nucleated ancestors. At the core of any successful symbiosis is complementarity of functionality, meaning the partnership is far more successful than each player on their own.
Champagne for Vadim from EBI (left) and James from Sanger (right), and the two teams

In bioinformatics we need creative and dedicated work - of different sorts - from researchers and engineers. I was reminded of this in the latest saga of sequence compression. Last month, the first CRAM-format submission to the European Nucleotide Archive (ENA) by a team at the Wellcome Trust Sanger Institute sparked a small celebration here on the Genome Campus (i.e., my handing over a bottle of champagne to the Sanger sequence core team, as promised), and our novel compressed-sequence format formally entered full production mode with the CRAM 2.0 specification, released this week.

A long time ago...


Three years ago, my student Markus Hsi-yang Fritz and I were kicking around ideas about DNA storage and rapid retrieval. EMBL-EBI as a whole was confronting the sharp rise in DNA sequencing production rates, largely because of next-generation sequencing. A common refrain was, “I paid more for my hard drive than for my sequencing.” Another was, “the rate of improvement in DNA sequencing technology is easily out-pacing disk drives”. People had started to despair that archiving DNA sequence was a hopeless case – obviously so in the long term but also, quite possibly, in the short term.

Necessity is the mother of invention, and we knew we had to find a practical alternative. Markus and I talked about representing DNA sequence in an abbreviated form, in relation to a reference sequence (i.e., storing only the differences from a known sequence). This would obviously be far, far more compact than raw sequence, particularly once one accepts that the order in which reads are represented in a file is not relevant. 
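
To make the idea of storing only the differences concrete, here is a minimal Python sketch. It is an illustration under simplifying assumptions (each read is already aligned to a known position, with substitutions only and no insertions or deletions), not the CRAM encoding itself; the names `encode_read` and `decode_read` are invented for this example.

```python
from typing import List, Tuple

# (offset within the read, base that differs from the reference)
Diff = Tuple[int, str]
# (start position on the reference, read length, list of differences)
Record = Tuple[int, int, List[Diff]]


def encode_read(reference: str, position: int, read: str) -> Record:
    """Describe a read as a position, a length and its mismatches.

    Assumes the read aligns to the reference without indels.
    """
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs past the end of the reference")
    diffs = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if base != ref_base
    ]
    return position, len(read), diffs


def decode_read(reference: str, record: Record) -> str:
    """Rebuild the original read from the reference and its record."""
    position, length, diffs = record
    bases = list(reference[position:position + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGCCGATAGCT"
read = "ACGTAAGC"  # matches reference[4:12] except for one base
record = encode_read(reference, 4, read)
restored = decode_read(reference, record)
```

A read that matches the reference perfectly reduces to a position and a length. Because the order of reads in the file does not matter, records like these can also be sorted by position, which tends to make them easier to compress further.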


From Research to Service, and back again.


But this only takes care of compression on a single and fairly superficial level. The far harder problem was how to handle quality information, and we started to chew away at that. This led to a Genome Research paper in 2011, outlining the CRAM concept. (I have blogged about the issue a few times since then.)

Now that the ENA is accepting CRAM-format submissions, the original idea has moved from the arena of research and proof-of-principle into the domain of production service. Looking back at the original Genome Research paper, it has taken both a lot of work at the EBI and input from the community to make the CRAM format and toolkit viable. The final specification is tightly defined – and yet flexible – and the associated infrastructure is firmly in place to make CRAM work. The ‘proof of principle’ implementation described in the paper was written in Python, and we did not provide a separate definition for a compression format. It worked, but in no sense could it be used in any production environment, not least due to speed issues. We now have a separate, detailed specification and two different code bases: one in Java (from EMBL-EBI) and one in C (from the Sanger Institute).

Deeply geeky


The fact that the Sanger Institute committed to writing the C code was a critical step in the development of CRAM. James Bonfield, the lead developer, is one of the true “deep geeks” of sequencing informatics. He started his career using the Staden package as a sequence-handling package for the MRC Laboratory of Molecular Biology (LMB), where Fred Sanger originally developed his breakthrough sequencing technique. James moved to Sanger and has been chipping away at the ‘coal face’ of sequencing ever since: Sanger-style, BAC finishing, Next Generation Sequencing (NGS) and beyond. He has probably grappled with every ‘edge case’ of sequence processing at least once, and most are old friends to him. When he won the Sequence Squeeze competition last year, he nobly – and sensibly – said that the best use of his code was to feed ideas into frameworks like CRAM.

Having James write a C read/write layer into CRAM tightened up the specification considerably. It also subtly shifted CRAM towards being more of a framework in which multiple compression routines can co-exist, rather than being fixated on one compression routine. This makes CRAM more similar to video codecs, where the format (e.g. H.264) acts as a container that can have a variety of different compression schemes. It was a pleasure watching James and the ENA’s Vadim Zalunin discuss the finer details of byte definitions, indexing and coding trade-offs as the CRAM 2.0 specification took shape.
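
The "framework in which multiple compression routines can co-exist" idea can be sketched in a few lines of Python. This is a toy layout of my own for illustration (one byte naming the codec, followed by the compressed payload), using only codecs from the Python standard library; it is not the byte layout or the codec set defined in the CRAM 2.0 specification, and `pack_block` and `unpack_block` are hypothetical names.

```python
import bz2
import lzma
import zlib

# Codec id -> (name, compress function, decompress function).
# The ids and the choice of codecs are assumptions made for this sketch.
CODECS = {
    0: ("raw", lambda data: data, lambda data: data),
    1: ("zlib", zlib.compress, zlib.decompress),
    2: ("bz2", bz2.compress, bz2.decompress),
    3: ("lzma", lzma.compress, lzma.decompress),
}


def pack_block(data: bytes) -> bytes:
    """Compress with every registered codec and keep the smallest result.

    The first byte of the block records which codec was used.
    """
    candidates = (
        (codec_id, compress(data))
        for codec_id, (_name, compress, _decompress) in CODECS.items()
    )
    best_id, best_payload = min(candidates, key=lambda item: len(item[1]))
    return bytes([best_id]) + best_payload


def unpack_block(block: bytes) -> bytes:
    """Read the codec id from the first byte and decompress the rest."""
    codec_id, payload = block[0], block[1:]
    _name, _compress, decompress = CODECS[codec_id]
    return decompress(payload)


bases = b"ACGT" * 1000
block = pack_block(bases)
restored = unpack_block(block)
```

The point of the design is that a reader only needs the table of codecs: a new compression routine can be added under a new id without changing how older blocks are read.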

Making it work


The commitment of the Sanger Institute to using CRAM means that the entire ecosystem of using reference-based compression must work for many use cases of the format. We need to have lightweight tools for users to specify and register references, and acknowledge that sometimes the reference will originate in-house. To that end, Guy Cochrane’s ENA team worked with a Sanger team led by Tony Cox to develop a pragmatic ‘hash-lookup’ scheme that we believe will scale appropriately, as it is very compatible with local caching of information.
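
As a rough sketch of why a hash-lookup scheme sits well with local caching, consider the Python below. Everything specific in it is an assumption for illustration: the use of MD5 over the upper-cased sequence as the key, the one-file-per-checksum cache layout, and the `fetch_remote` callable standing in for whatever central registry is used. It is not a description of the ENA or Sanger implementation.

```python
import hashlib
from pathlib import Path
from typing import Callable


def normalise(sequence: str) -> str:
    """Strip whitespace and upper-case, so formatting does not change the key."""
    return "".join(sequence.split()).upper()


def sequence_checksum(sequence: str) -> str:
    """Content-derived key for a reference (MD5 is an assumption of this sketch)."""
    return hashlib.md5(normalise(sequence).encode("ascii")).hexdigest()


class ReferenceCache:
    """Look references up by checksum, consulting a local directory first."""

    def __init__(self, cache_dir: Path, fetch_remote: Callable[[str], str]):
        self.cache_dir = cache_dir
        self.fetch_remote = fetch_remote  # hypothetical remote registry lookup
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def register(self, sequence: str) -> str:
        """Store an in-house reference locally and return its checksum."""
        checksum = sequence_checksum(sequence)
        (self.cache_dir / checksum).write_text(normalise(sequence))
        return checksum

    def lookup(self, checksum: str) -> str:
        """Return the sequence for a checksum, fetching it at most once."""
        path = self.cache_dir / checksum
        if path.exists():
            return path.read_text()
        sequence = normalise(self.fetch_remote(checksum))
        if sequence_checksum(sequence) != checksum:
            raise ValueError("remote sequence does not match its checksum")
        path.write_text(sequence)
        return sequence
```

Because the key is derived from the sequence itself, any copy found in a local cache can be verified and trusted without asking a central service, and an in-house reference gets a usable key as soon as it is registered.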

Markus actually came back for a reprise – and provided the (rather unglamorous but much-needed) test suites used by the Java (cramtools) and C (scramble – soon to be samtools) codebases. Good test suites that explicitly try to re-create all the annoying edge cases are critical for robust engineering – so a big thanks to Markus.

The invisible format


The Sanger Institute has an on-going commitment to developing sequence-level tools. In taking on the development leadership for samtools (originally developed by Li Heng at the Sanger), they are planning to add CRAM reading/writing as a backend. The Java-based cramtools is already compatible with Picard, and we worked with the Broad Institute to make sure there was no show-stopper for integration into GATK - we're hopeful that CRAM reading/writing will also be integrated into GATK (I have a promise of beer or chocolate for the GATK team).

So, using CRAM will be as simple as upgrading samtools, or in the future, other toolkits. The vast majority of users will never have to know about the details of the compression format – just as we casually throw around video files between the Internet, laptops and mobile phones without worrying about formats.

Fit for purpose


The trajectory from research to production-level service has been (relatively) smooth but steep. The reference-based compression scheme in CRAM is what Markus and I published in Genome Research, but there is a world of difference between the paper and the specification, code and ecosystem of CRAM. Vadim and James, two skilled programmers, have spent between them over four years working on specification and code bases. After being parsed by two different brains and going through independent implementations, CRAM has arrived at a robust and practical specification. The CRAM format is extensible, and some of the niggly implementation quirks of SAM/BAM have been cleared up (e.g. the requirement for reference sequences to be smaller than 512 Megabases, even though we know of a number of larger sequences).

Research and service - a great symbiosis


It would be simplistic to see the original research as the ‘breakthrough’ and the next steps as ‘implementation’. If anything, the engineering details are more complex, more involved and require more nous than the research. All these components, working together, have been critical. If we took out any one of the people in this chain – Markus and me at the start; Vadim, Rasko and James engineering; Guy (EMBL-EBI) and Tony (Sanger) taking decisions and dedicating resource – it may never have happened. 


There is a big difference between how research functions and how infrastructures operate. Sometimes the engineering hits a problem that cannot be simply "engineered around" using existing tools. Good, applied, computer-science research might find an in-theory solution, but that solution needs to be folded back into the engineering. All of this, of course, is to support biological research with a minimum of technical fuss.

Good research infrastructure pushes technical boundaries – and CRAM does just that. I am, needless to say, really proud to have been a part of it – but James and Vadim really earned those bottles of champagne.

Tuesday, 9 April 2013

Structural Biology - the business end of life.


As part of my Biochemistry degree at Oxford, I had to spend a year focusing on a single research project. My obsession with bioinformatics was already firmly established when Iain Campbell, a leading NMR spectroscopist and structural biologist, took me under his wing. At the time, structural biology was definitely the most computational area of molecular biology, so I was looking forward to getting stuck into a computational project.
Type III fibronectin determined by NMR, from I. Campbell's group
It was great to be immersed in this technical world – the COSY and NOESY spectra, triggered by a series of radio pulses to link up atoms; the rather impressive cooling process, with liquid helium being poured into huge superconducting magnets sunk into the ground; the somewhat scary signs warning people with pacemakers to turn back (the magnetic fields are insanely strong in and around an NMR machine)...
I learned a lot about NMR and structural biology, from the technical aspects of chemical shifts, coupling constants, distance restraints, disordered regions and hydrophobic cores to the elegance of protein structures that manage to fold perfectly to do something so absolutely specific.

Seeing is believing

But … I had already heard the siren call of simpler, linear protein and DNA sequences, and there was this wonderful new institute going to be set up – the Sanger Centre – and I had a chance to work there on sequencing the human genome…
… fast forward 20 years, when I had the pleasure of sitting in the back of the room for the Protein Data Bank in Europe (PDBe) Scientific Advisory Board, now as Associate Director of the EBI. I was still just in awe of the incredible beauty and precision of protein structures, and the skills of structural biologists in uncovering their details.
Cryo electron tomography of sensory cilia 
Some things had not changed in 20 years: dihedral angles are still important, transitions from ordered to disordered are still being explored and the methods are still extremely technically detailed. But other areas have progressed so much they are almost unrecognisable, such as the ability to look at larger complexes with electron microscopy (EM) techniques – single-particle averaging and, even more impressive to me, electron tomography. Electron tomography allows you to reconstruct a single 3D sample to ~40 Å from images taken at a series of sample tilts – no crystal, no averaging, just this particular sample – like a high-resolution 3D microscopy image. These are spine-tingling images.
So often we have to conceptualise and imagine what is going on in cells. Electron tomography is the closest thing I’ve seen to actually seeing molecular biology in action. One can see little ribosomes, microtubules and proteasomes and complex membrane-associated structures in a bacterial cell, in a single 3D volume.

Keeping up

The wealth of structural data has grown incredibly over the past decade. New techniques such as EM are constantly emerging and structural biology’s workhorse, X-ray crystallography, is continually being refined with better production and crystallisation techniques and tuneable high-energy X-rays from synchrotrons. Light microscopy has also improved vastly, with techniques such as super-resolution technology.
Integrating all the data being produced with these techniques to gain an overall view is an impressive task. It involves fitting X-ray structures into EM maps and then into tomograms, with NMR measurements to provide the dynamics at the atomic level and light microscopy to illuminate the dynamics at the complex and organelle level. There are still so many more protein structures to determine and integrate, and endless discoveries to be made.

Bringing structure to genomics?

All this progress is not just for the benefit of structural biologists. Gerard Kleywegt, who leads PDBe, has a passion for making this information accessible to the broader biology community. Molecular biologists, developmental biologists, geneticists and systems biologists can all make use (or more use...) of structural data.
All too often we forget that linear sequence shows only how information is encoded, not how it is used. The majority of things that happen outside of the nucleus, and certainly the vast majority of the “doing” of life, are executed by either proteins or RNAs folded up into specific structures and collaborating in specific complexes. We know a lot about these structures and complexes – 4,717 proteins (23% of protein-coding genes) have at least one structure (many proteins have far more than one structure), and this accounts for 42% of residues in these proteins (around 11% of protein residues overall). When we expand this to things we can confidently model, this goes up.
I am sure that in my own research area – genomics – we’re not taking enough advantage of this information. We might think about structural biology as the final mechanistic determination of why one allele has an effect or not, but can we integrate structural information to make our statistical genetic tests more powerful? Can we use the collection of protein structures of transcription factors (often bound to DNA) to help interpret DNaseI footprinting results? Or use protein-complex information to inform epistasis models, potentially at the level of a residue or patch of protein, not just at the gene level?
Many fields use structural information in all sorts of ways but I am sure the integration of different structural techniques, and the integration of that structural information with other experiments and knowledge – chemistry, pathways, gene expression, proteomics – is going to be amazing.
Part of me wonders why I chose to stick with the “boring” world of linear, four-base DNA sequence some 20 years ago. I guess there’s always time to learn some new tricks…