Friday, 14 June 2013

The symbiosis of engineering and research

Symbiosis is the biological process where two organisms cooperate to such an extent that they become so co-dependent it is best to think of them as one. The specific orchids flowering at the right time for a specific butterfly, or the interweaving of algae and fungi that make lichens, or all eukaryotes with one of the most successful symbiotic deals done on this planet - between our bacterial mitochondrial ancestors and our nucleated ancestors. At the core of any successful symbiosis is complementarity of functionality, meaning the partnership is far more successful than each player on their own.
Champagne for Vadim from EBI (left) and James from Sanger (right)
 and the two teams

In bioinformatics we need creative and dedicated work - of different sorts - from researchers and engineers. I was reminded of this in the latest saga of sequence compression. Last month, the first CRAM-format submission to the European Nucleotide Archive (ENA) by a team at the Wellcome Trust Sanger Institute sparked a small celebration here on the Genome Campus (i.e., my handing over a bottle of champagne to the Sanger sequence core team, as promised),  and our novel compressed-sequence format formally entered full production mode with the CRAM 2.0 specification released this week.

A long time ago...

Three years ago, my student Markus Hsi-yang Fritz and I were kicking around ideas about DNA storage and rapid retrieval. EMBL-EBI as a whole was confronting the sharp rise in DNA sequencing production rates, largely because of next-generation sequencing. A common refrain was, “I paid more for my hard drive than for my sequencing.” Another was, “the rate of improvement in DNA sequencing technology is easily out-pacing disk drives”. People had started to despair that archiving DNA sequence was a hopeless case – seemingly obviously in the long term but also, but also, quite possibly, in the short term. 

Necessity is the mother of invention, and we knew we had to find a practical alternative. Markus and I talked about representing DNA sequence in an abbreviated form, in relation to a reference sequence (i.e., storing only the differences from a known sequence). This would obviously be far, far more compact than raw sequence, particularly once one accepts that the order in which reads are represented in a file is not relevant. 

From Research to Service, and back again.

But this only takes care of compression on a single and fairly superficial level. The far harder problem was how to handle quality information, and we started to chew away at that. This led to a Genome Research paper in 2011, outlining the CRAM concept. (I have blogged about the issue a few times since then.)

Now that the ENA is accepting CRAM-format submissions, the original idea has moved from the arena of research and proof-of-principle into the domain of production service.  Looking at the original Genome Research paper, there has both been a lot of work at the EBI and input from the community to making the CRAM format and toolkit viable. The final specification is tightly defined – and yet flexible – and the associated infrastructure is firmly in place to make CRAM work. The ‘proof of principle’ implementation described in the paper was written in Python, and we did not provide a separate definition for a compression format. It worked, but in no sense could it be used in any production environment, not least due to speed issues. We now have a separate, detailed specification and two different code bases: one in Java (from EMBL-EBI) and one in C (from the Sanger Institute).

Deeply geeky

The fact that the Sanger Institute committed to writing the C code was a critical step in the development CRAM. James Bonfield, the lead developer, is one of the true “deep geeks” of sequencing informatics. He started his career using the Staden package as a sequencing handling package for the MRC Laboratory of Molecular Biology (LMB), where Fred Sanger originally developed his breakthrough sequencing technique. James moved to Sanger and has been chipping away at the ‘coal face’ of sequencing ever since: Sanger-style, BAC finishing, Next Generation Sequencing (NGS) and beyond. He has probably grappled with every ‘edge case’ of sequence processing at least once, and most are old friends to him. When he won the Sequence Squeeze competition last year, he nobly – and sensibly – said that the best use of his code was to feed ideas into frameworks like CRAM. 

Having James write a C read/write layer into CRAM tightened up the specification considerably. It also subtly shifted CRAM towards being more of a framework in which multiple compression routines can co-exist, rather than being fixated on one compression routine. This makes CRAM more similar to the video codecs, where the format (e.g. H.264) acts as a container that can have a variety of different compression schemes.  It was a pleasure watching James and the ENA’s Vadim Zalunin discuss the finer details of byte definitions, indexing and coding trade offs as the CRAM 2.0 specification took shape. 

Making it work

The commitment of the Sanger Institute to using CRAM means that the entire ecosystem of using reference based compression must work for many use cases of the format. We need to have lightweight tools for users to specify and register references, and acknowledge that sometimes the reference will originate in-house.  To that end, Guy Cochrane’s ENA team worked with a Sanger team led by Tony Cox to develop a pragmatic ‘hash-lookup’ scheme that we believe will scale appropriately, as it is very compatible with local cacheing of information. 

Markus actually came back for a reprise – and provided the (rather unglamorous but much-needed) test suites used by the Java (cramtools) and C (scramble – soon to be samtools) codebases. Good test suites that explicitly try to re-create all the annoying edge cases are critical for robust engineering – so a big thanks to Markus.

The invisible format

The Sanger Institute has an on-going commitment to developing sequence-level tools. In taking on the development leadership for samtools (originally developed by Li Heng at the Sanger), they are planning to put CRAM read/writing as a backend. The Java based cramtools is already compatible with Picard, and we worked with the Broad Institute such there was no show stopper in integration into GATK - we're hopeful that CRAM read/writing will also be integrated into GATK (I have a promise of beer or chocolate for the GATK team).

So, using CRAM will be as simple as upgrading samtools, or in the future, other toolkits. The vast majority of users will never have to know about the details of the compression format – just as we casually throw around video files between the Internet, laptops and mobile phones without worrying about formats.

Fit for purpose

The trajectory from research to production-level service has been (relatively) smooth but steep. The reference-based compression scheme in CRAM is what Markus and I published in Genome Research but there is a world of difference between the paper and the specification, code and ecosystem of CRAM. Vadim and James, two skilled programmers, have spent between them over four years working on specification and code bases. After being parsed by two different brains and going through independent implementations, CRAM has arrived at a robust and practical specification. The CRAM format is extensible, and some of the niggly implementation quirks of SAM/BAM have been cleared up (e.g. the requirement for reference sequences to be smaller than 512 Megabases, even though we know of a number larger sequences). 

Research and service - a great symbiosis

It would be simplistic to see the original research as the ‘breakthrough’ and the next steps as ‘implementation’. If anything, the engineering details are more complex, more involved and require more nous than the research. All these components, working together, have been critical. If we took out any one of the people in this chain – Markus and me at the start; Vadim, Rasko and James engineering; Guy (EMBL-EBI) and Tony (Sanger) taking decisions and dedicating resource – it may never have happened. 

There is a big difference between how research functions and how infrastructures operate. Sometimes the engineering hits a problem that cannot be simply "engineered around" using existing tools. Good, applied, computer-science research might find an in-theory solution, but that solution needs to be folded back into the engineering. All of this of course is to support biological research with a minimum of technical fuss. 

Good research infrastructure pushes technical boundaries – and CRAM does just that. I am, needless to say, really proud to have been a part of it – but James and Vadim really earned those bottles of champagne.

1 comment:

Shiya Song said...

I don't see a big difference in saving read in such format with restoring vcf files. I think sequencing file can be viewed as raw data. As long as the subsequent analysis is accepted, we can store other results instead of raw data. Just like we won't store the original image file from HiSeq machine.