(First off, I am back to blogging, and this is the first of three blogs
I'm planning about DNA compression)
Markus Hsi-yang Fritz - a student in my group - came up with a reference sequence compression scheme for next generation sequencing data. This has just come out in "proper" print format this month (it was previously fully available on line...) (Full open access). Markus not only handled the bases, but also the key other components in a sequencing experiment - read pairing, quality values and unaligned sequence. For bases and read pair information, the compression is lossless. For the quality information and unaligned sequence one has to be lossy - if not, we can't see a way of making an efficient compression scheme. But one can quite carefully control the where one keeps and loses data ("controlled loss of precision" in the terminology of compression) meaning that one can save information around important regions (eg, SNPs, CNVs etc).
In our paper we show that one can achieve 10-fold to 100-fold better compression than current methods (gzipped FASTQ, or BAM files), and rather brilliantly not only does this compression scheme gets better with read length, but also the rate of improvement with respect to read length can be tuned by the amount of precision being held on the quality information. As increase in read length is a reasonably large component of why next generation sequencing is improving so fast, this makes this compression partially mitigate the gap between next generation sequence technology improvement and disk drive improvement. By judicious tuning of the "precision" at which we hold quality data (progressively tightening this over time, presumably because we will understand where to keep precision better) we believe there is a sustainable storage model (sustainable for us means a constant money spend each year on disk) for next generation sequence data for at least 5 years, if not 10.
Due to this paper and thinking about this problem more generally, over this last year I've learnt a lot more about compression. Video compression - for the entertainment industry (all sorts!) - is a strong driver for compression technologies; not all of this work is accessible (both in a publication sort of way and in a easy-to-understand way) but nevertheless reading about video compression has radically changed my thinking about how to compress DNA data. There are three, interlinked themes I've learnt.
1. Lossy compression makes you think about what information you can lose, rather than what information you want to keep. You might think that as this is just partitioning the information total information into these two categories, and as there is just one line to draw, it doesn't actually matter which way around you define this. But that's not actually the case - it's clear that one's going to have different levels of certainity about how important particular bits of information are. So really you end up ranking all the information from "I am sure this is important" to "I am sure this is unimportant". Lossy compression basically is about approaching this as discarding from the "I am sure this is unimportant" end and working towards the more uncertain "I am not clear whether it's important or not", leaving the "I am sure this important" as the last to go. This is the absolute opposite from assessing information in terms of how well it fits a biological model.
2. By thinking about this problem as a lossy compression problem, and by realising that one can tune the "aggressiveness" of the compression, one naturally thinks about the budget tradeoffs - how much space/bandwidth is available, rather than how much space does this particular compression take. In skype video calls, and some of the more sophisticatd internet video codecs, the system deliberately estimates the bandwidth and then dynamically adjusts the compression scheme to discard as a little as possible consistent with the bandwidth constraints. This is also closer to what we have; a certain budget for storage, and we want to fit DNA sequence into that budget, though one has to always be careful to understand how degraded the resulting signal is. For skype, low bandwidth means dropping the call; for storing cancer genomes, we can't be quite so trigger happy... Still - the basic set up is "what should I fit into this relatively fixed envelope of storage/bandwidth" not "how much storage is this going to cost me".
3. It is far easier to model and understand systematic noise processes introduced by measuring procedures than modelling the "real world" of things you don't necessarily have full models for. You'll be surprised about how much bandwidth on a high image definition TV is going on color differences in which people _know_ can't be a good representation of the image as the original s sensor of the image has a poorer color resolution than the output device. In other words, sometimes, people are effectively adding known noise to image systems. In the case of DNA, we (... or at the very least vendors) know alot about the different error modes in the physical process of sequencing. It is precisely these errors which lead to the majority of the information stored for a read set after reference based compression (without using reference based compression, most of the information is indeed in the fundamental bases and qualities, but reference based compression effectively subtracts out repeated, redundant base information).
So - these three themes - we're trying to rank "confident" noise information, that we're working to fit data inside fixed (or at the best, less flexible) storage budgets and that we're more confident about our "noise" models than our "biology" models means that lossy compression - of which our proposed reference based compression is one sort - is very, very different from thinking about this problem as (say) an optimal analysis problem, or how does one store analysis results.
More on the practicalities compression next week....