Two weeks ago there was the announcement from John Marshall from Sanger for SAMtools 1.0 - one of the two most widely used Next Generation Sequencing (NGS) variant-calling tools embedded in hundreds if not thousands of bioinformatics pipelines worldwide. (The majority of germline variant calling happens either through SAMtools or the Broad's GATK toolkit.) SAMtools was started at the Sanger Institute by Li Heng when he was in Richard Durbin's group, and stayed at Sanger now under the watchful eye of Thomas Keene.
The 1.0 release of SAMtools has all sorts of good features, including more robust variant calling. But a real delight for me is the inclusion of CRAM read/write, which is a first-class format similar to SAM or BAM. SAM started this all off (hence the name samtools), and BAM was a binary and "standard" compressed form of SAM. CRAM was written to be compatible with the data model from SAM/BAM, and its main purpose is to leverage data-specific compression routines. These include reference-based compression of base sequences, based on work done by Markus Hsi-Yang Fritz and myself (blogged here, 3 years ago).
CRAM also brings in a whole set of compression tricks from James Bonfield (Sanger) and Vadim Zalunin (EMBL-EBI) that go beyond reference-based compression. James was the winner of the Pistoia Alliance's "Sequence Squeeze" competition, but sensibly said that his techniques would be best used as part of CRAM. James was instrumental in splitting out his C code into a reusable library (htslib), which is part of what SAMtools is built on. There is a whole mini-ecosystem of tools and hooks that enable reference-based compression to work, including a lightweight reference sequence server developed by Rasko Lenionen at EMBL-EBI.
SAMtools is basically about a series of good engineers (John, James, Vadim and others) each working on different components for NGS processing, under the bonnet - with the Sanger and EMBL-EBI investing considerable effort into making the machinery come together. Just as an internal combustion engine requires more complex engineering than mixing fuel, igniting it and making a car go, good engineering takes more than a proof of principle. Really elegant engineering is invisible, and that is what SAMtools offers: it just works. It has been great to see John and James work CRAM into this indispensable piece of software, and for Sanger coordinate the project so well.
Space saving and flexibilityWith this release CRAM becomes part of the natural SAMtools upgrade cycle, so when people upgrade their installation they will immediately see at least a 10% saving on disk space - if not better. If they allow the new Illumina machines to output the lower entropy quality values (this is the default), the savings will be more like 30-40%.
Another practical benefit of CRAM is its future proofing: CRAM comes "out of the box" with a variety of lossy compression techniques, and the format is quite flexible about potential new compression routines. We've seen that the main source of entropy is not the bases themselves, rather quality values on DNA sequence bases. CRAM provides options for controlled loss of precision on these qualities (something Markus and I explored in the original 2011 paper). It's important to stress that the right decision for the right level of lossy compression is best done by the scientist using and submitting the data. It might be that community standards grow up about lossy compression levels - its important to realise that in effect Illumina is already make a huge host of appropriate "data loss" decisions in their processing pipeline, most recently the shift to a reduced entropy quality score scheme. The first upgrade cycle will allow groups to mainline this and give them the option to explore appropriate levels of compression.
The Sanger Institute has been submitting CRAM since February 2014. And in preparation for the widespread use of CRAM - with the option of lossy compression - the European Nucleotide Archive has already announced that submitters can choose the level of compression and the service will archive data accordingly. Guy Cochrane, Chuck Cook and I also explored the potential community usage of compression for DNA sequencing in this paper. So we have solid engineering, a flexible technical ecosystem and have prepared for the social systems in which it will all work.
R&D ...&D ...&D
When the first CRAM submission from Sanger to EBI happened about a year ago, I blogged about how well this illustrates the difference between research and development. Our research an proof of principle implementation on data-specific compression of DNA took perhaps 1.5 FTE years of work, but to get from there to the release of SAMtools 1.0 has taken at least 8 FTE years of development. In modern biology, I think we still underestimate the amount of effort needed for effective development of a research idea for infrastructure deployment, but I sincerely hope this is changing - this wont be the last time we will need to roll in a more complex piece of engineering.
SAM, CRAM and the Global Alliance for Genomics and HealthCRAM development has come under the Global Alliance for Genomics and Health (GA4GH), a world-wide consortium of 200 organisations (including Sanger and the EBI) that aims to develop technologies for the secure sharing of genomic data. CRAM is part of the rather antiquated world in which specific file formats are needed for everything (all the cool kids focus on APIs these days). But also represents an efficient storage scheme for the more modern API vision of GA4GH. APIs need to have implementations, and the large scale "CRAM store" at the EBI provides one of the largest single instances of DNA sequence worldwide. Interestingly the container based format of CRAM has strong similarities with row-grouped, column orientated data store implementations, common in a variety of modern web technology. We will be exploring this more in the coming months.
Upgrade now, please.
The main thing for the genomics community is to now upgrade to SAMtools 1.0. This will give you many benefits in the whole short read calling and management, and one of those will be being able to read/write CRAM directly. If you do this it will have a considerable saving in disk for your institute and for us at the EBI. It will also help ensure we're all prepared for whatever volume of data you might be producing in the future.