Saturday, 12 May 2012

DNA Compression: Reprise




    Around a year ago I wrote three blog posts about DNA sequence compression, based on the paper my student and I published early in 2011 and on the start of cramtools by Guy Cochrane's ENA team, under the engineering lead of Rasko Leinonen, which is the practical implementation of these ideas. Those posts first described the thought process of compression - in particular the shift from understanding what information you wish to keep to understanding what information you are happy to discard - then the practicalities of compression and the balancing act between the different communities in the tool chain that need to work with it, and finally whether compression is a credible answer for a DNA sequencing technology that doubles its cost effectiveness somewhere between every 6 and 12 months.


    Those posts have held up over this year, though there are some things I did not realise. The first was the conservatism of the DNA processing community - they have been far more interested in lossless compression at first, perhaps understandably, as lossless compression implies no change in the scientific impact. The useful thing about shifting to lossless compression first is that one doesn't even need to debate whether the compression will affect the science; that later question takes time to frame and for the community to buy into. By moving to formats that support both efficient lossless and lossy compression modes, one can ask people to switch the toolchain with no impact on the science, and then we can, separately, explore the impact of lossy compression. In December of last year Vadim optimised cramtools to also support lossless compression (though of a specified list of attributes - for example, read names themselves are not supported). Lossless compression achieves a compression factor of about 2 (ie, files are 50% of the size) compared to input BAMs.


    There have been strong arguments, perhaps best put forward by Klaus Maisinger's team at Illumina, that reducing the entropy of the quality score distribution - for example, by binning the quality values into 8 or 4 bins - does not make a substantial difference to analysis. Currently there are 100 bins, but it is not the number of bins of quality values which affects the compressed storage footprint - rather it is the entropy of the signal distribution. The current 100-bin scheme takes between 3 and 4 bits per quality score when reasonably optimally compressed; an 8- or 4-bin scheme, assuming the bins are chosen to retain most of the information, takes around 1.5 to 2 bits per quality score. This stresses that the trick in compression is to understand the underlying entropy of your data, which sets the limit of what can be compressed losslessly; lossy compression schemes can then be considered as transformations of those distributions into new approximations with lower entropy.
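To make the entropy point concrete, here is a minimal Python sketch comparing the Shannon entropy of a finely-resolved quality stream with the same stream collapsed into 8 bins. The distribution is simulated and the bin boundaries are assumptions for illustration, not any vendor's actual scheme.

```python
import math
import random
from collections import Counter

def entropy_bits(values):
    """Shannon entropy, in bits per symbol, of the empirical distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Simulated quality stream spread over many distinct Phred values
random.seed(0)
raw = [min(40, max(2, round(random.gauss(32, 6)))) for _ in range(10_000)]

# Hypothetical 8-bin scheme: each quality is replaced by its bin's floor value
BIN_EDGES = [2, 10, 15, 20, 25, 30, 35, 38]

def bin_quality(q):
    return max(e for e in BIN_EDGES if e <= q)

binned = [bin_quality(q) for q in raw]

# Binning cannot raise entropy; 8 bins caps it at log2(8) = 3 bits/quality
print(f"raw entropy:    {entropy_bits(raw):.2f} bits/quality")
print(f"binned entropy: {entropy_bits(binned):.2f} bits/quality")
```

The entropy of the raw stream is the lossless lower bound on storage per quality score; the binning transform lowers that bound, which is exactly why it is the lossy step with the biggest payoff.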

    Even when lossy quality score compression is brought up, there has been an unfortunate assumption that the phrase "reference based compression" implies that you make decisions about the loss of precision on qualities based on a reference alignment. Although this is a pragmatic way to do it, the decision-making process is separate from the storage format - for example, one could evenly degrade the entropy of all the qualities, or one could use a reference-free way to select, say, particular kmers or particular bases where precision is best kept. Those decisions can then be transmitted into cramtools. It is clear, though, that the first level of lossy compression will be quality score binning, as this is intuitively unbiased, and it would probably save another factor of around 2, meaning one can imagine a path up to a factor of around 4 - files at 25% of the size of the current originals. At the scale that DNA sequencing works, this is a substantial saving in disk and bandwidth for both long and short transfers. However, with a multi-year viewpoint, I am still convinced we will need to achieve factors closer to 20 or 100. But getting the tool chain to use these compressed formats is the starting point.
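The separation of decision from storage can be sketched as follows: any policy, reference-based or reference-free, simply yields a per-base "keep precision" mask, and only unmasked qualities are degraded before being handed to the encoder. All names and the bin width here are hypothetical illustrations, not cramtools' API.

```python
def degrade_qualities(quals, keep_mask, bin_width=8):
    """Round unmasked qualities down to coarse bins; keep masked ones exact."""
    assert len(quals) == len(keep_mask)
    return [q if keep else (q // bin_width) * bin_width
            for q, keep in zip(quals, keep_mask)]

# Policy 1 (reference-based, hypothetical): keep precision where read != reference
read, ref = "ACGTACGT", "ACGAACGT"
mask_ref = [r != f for r, f in zip(read, ref)]

# Policy 2 (reference-free, hypothetical): keep precision inside a watched kmer
watched = "GTAC"
mask_kmer = [False] * len(read)
i = read.find(watched)
if i >= 0:
    for j in range(i, i + len(watched)):
        mask_kmer[j] = True

quals = [37, 35, 12, 28, 40, 33, 9, 31]
print(degrade_qualities(quals, mask_ref))   # exact quality kept only at the mismatch
print(degrade_qualities(quals, mask_kmer))  # exact qualities kept only in the kmer
```

Either mask produces the same kind of degraded stream for the format to store, which is the point: the format does not care how the decision was made.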

   The engineering aspects of switching over to a CRAM-based tool chain are another sticking point, and one we anticipated. Our first target was to create the tools bam2cram and cram2bam to allow interconversion with the older format, and thus provide a path to flip back into the established toolchain at any point; we released these in September of last year. We knew that CRAM could not be considered a replacement for BAM unless it supported similar functionality, in particular the ability to index so that genome-orientated slices can be retrieved, and the ability to store arbitrary read-based tags. Given equivalent functionality, we can then look at integrating CRAM readers/writers into the main libraries for SAM/BAM, making it a SAM/BAM/CRAM trio. We achieved much of this, with CRAM indexing arriving in December (.crai files), and the March 0.7 release supporting arbitrary tags in CRAM files. Here we have to make a clear distinction between CRAM as a general format, in which people can do whatever they like with these tags, and CRAM as an archive format, in which we want to control what is stored. The "compromise" is that there will be a (small) set of archive tags which, if present, are stored (for example, mapping quality). As the whole point is to reduce the footprint when archived, leaving an arbitrary tag scheme open for archiving is just inviting it to be taken advantage of in an uncontrolled way. One can get into a very long argument about who should make these decisions, and how much flexibility submitters should have versus what archives should enforce, but this position - a fully flexible format, where the archive indicates which parts of that flexibility are archived - seems a good approach that most people are comfortable with, and it gives options for the analysis community without asking the archives to promise to store everything.
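The archive-tag compromise amounts to a whitelist applied at archival time, which can be sketched in a few lines. The whitelist contents here are assumptions for illustration; the real set is decided by the archive, not fixed by the format.

```python
# Hypothetical archive whitelist - the format itself accepts any tags,
# but the archival copy keeps only these if they happen to be present.
ARCHIVE_TAGS = {"MQ", "NM"}

def tags_to_archive(read_tags):
    """Filter a read's tag dictionary down to the archive whitelist."""
    return {k: v for k, v in read_tags.items() if k in ARCHIVE_TAGS}

submitted = {"MQ": 60, "NM": 1, "XZ": "pipeline-specific blob"}
print(tags_to_archive(submitted))  # the pipeline-specific tag is dropped
```

Submitters keep full flexibility in the general format; the archive's footprint stays bounded because only the whitelisted tags survive archival.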


   The ENA team under Guy Cochrane and Rasko Leinonen has been making a slow but steady transition from a very mixed language environment (C++, Perl, Java, odd bits of PL/SQL) to unifying on Java. This made the choice to use Java a natural one, and the Picard BAM reader/writer library from the Broad defines some clean Java interfaces around the implementation of BAM readers/writers. This has allowed cramtools not only to have a classic command-line structure (as now seems to be common in samtools, vcftools, git etc) but also to comply with the Picard interfaces. For some programs the cramtools library can be used as a drop-in replacement with no changes in the calling code (which is a tribute to the Java interface system as well as the Picard engineers), and in others - for example, integration into IGV - only minimal changes were required. This has allowed Illumina to build a compressed view of sequence data using CRAM, and shows the sense of both having defined libraries and coordinating development. Sadly, though, one of the most important Java toolkits that uses BAM - GATK (also from the Broad) - only uses the Picard interfaces for some of its accesses to BAM; others go through "round-the-back" code paths to allow (I think) for things like highly parallel access. This would be possible in CRAM as well, but we need to work with the GATK/Picard groups to work out what GATK needs, and ideally create appropriate Java interfaces so that the implementation of DNA sequence access can be separated from the details of calling etc in a way which is still performant. Time to buy Mark DePristo and Toby Bloom beer/chocolate ...

   The other big code base for SAM/BAM is the original samtools from Heng Li, written in C. We deliberately concentrated on one implementation (Java) first to shake out all the bugs and issues, but we are now in a position to also look at a C layer. Hopefully Markus - the original developer of this reference-based compression - is going to write an initial C version; there is a host of format details that must be locked down when creating two independent code bases, but experience is that two code bases also flush out otherwise hard-to-find bugs in the schemes. I am hopeful that once we have an initial C library, integration into samtools - either elegantly or in a more "hacker"-like fashion - will be possible. I might even offer to dust off my C fingers and do it myself, though having senior management on the critical path of low-level code is probably not ideal, and I am not totally confident that I can get my head around the samtools internals. Anyway, this is due to be in "alpha" in the "summer of 2012", which is one of those gloriously ambiguous software timelines. Time to buy Markus beer/chocolate. I also had a good chat on the stairs at Cold Spring Harbor with Gabor Marth about the C++ bamtools library, and hopefully this will be the third library targeted, between Gabor's team and Guy's team - though I think it is an open question whether this is done at the pure C++ level as a third implementation or by calling out to the C routines.

   The ENA has had a three-decade-long collaboration (INSDC) with the GenBank project, run by the NCBI. If compression is to become a standard part of the way we think about archives, then both sides of INSDC have to support it in the same manner. Last year the NCBI announced their compression scheme (cSRA, standing for compressed SRA), which has many similarities to CRAM but is built inside their own in-house toolkit scheme, in which they have invested considerable effort and engineering. After a little friendly jostling, it is clear that we have compatible lossy compression schemes - ie, the loss of precision one can represent in CRAM can be losslessly and efficiently stored in cSRA and vice versa. We still need to do some round-tripping to ensure this is all ok, but there is no conceptual block here.

   Alongside all of this, we at the EBI - and the ENA team specifically in this case - have really benefited from being able to talk to Sanger production people and developers directly over coffee (the Sanger Institute and the EBI are on the same physical campus in south Cambridgeshire). Sanger produces a substantial amount of the world's total DNA output, and is the largest submitter (alongside the BGI) to the ENA directly; the ENA then synchronises with GenBank and DDBJ, with the world's capture of DNA sequence running across the Atlantic every night. Not only is Sanger a big submitter, they are also big players in the downstream analysis of DNA sequence, such as the 1000 Genomes Project and the ICGC project. This means the full tool chain of sequence generation, archiving and analysis is a shared coffee away in the Pebbles cafe at Hinxton. We know it is not easy for production groups to change existing, working code, and one has to make strategic arguments, understand scientific details (such as the impact of lossy compression) and get through a considerable number of engineering details. We have had good discussions strategically with Tim Hubbard and Richard Durbin, scientifically with James Bonfield, Richard Durbin and others, and on engineering with Tony Cox and his team. A particularly useful interaction has been with James Bonfield, who has pushed sequence compression far further, using both elegant and sometimes somewhat inelegant remodelling of the data. James won the recent Sequence Squeeze competition, but rather nobly he does not see his code base as production-level engineering; it is more a test bed for new compression schemes, some of which will be of use for CRAM. If we can get the big projects (1000 Genomes, ICGC, TCGA) using compression, and thus the big production centres, then that is (a) most of the volume and (b) will solidify the use of compression for other groups to pick up.

    And the reassuring thing is that these large projects - where disk cost becomes a substantial item - have been assessing this from the turn of the year through to now, both in TCGA and 1000 Genomes, with many individual ICGC groups looking on with interest. TCGA focused on a pure base+quality lossless goal, which helped us define precisely what we meant by lossless, and is useful in that any difference in analysis of the data has to be explained either by the definition of lossless or by the implementation. The 1000 Genomes Project would like to explore both a similar lossless representation and a uniform loss of entropy on qualities (an 8-bin implementation).


    So - looking back over this last year, we perhaps misprioritised some things at the start (eg, the early emphasis needed on lossless compression), but overall we are on track. This year is the year where we should finally see compression implemented in the big projects and the big centres - from there we can see the roll-out to smaller groups and, over time, more sophistication in the controlled loss of precision to achieve even more compression.


   Not quite mission accomplished, but substantial progress, and on the right track.


   Hats off to the main people at the EBI on this: Vadim, Rasko and Guy.