Wednesday, 23 January 2013

The 10,000 year archive


The task: store a substantial amount of digital information for a future civilization to access
DNA has a good chance of lasting 10 000 (or more) years so long as it is kept cold, dark and dry. And of course, DNA is incredibly dense: at least 1 petabyte can be stored in 1 gram of DNA, and that includes a lot of built-in redundancy. It’s a very good information storage molecule, and Nature has been pretty clever in choosing it.
Ten thousand years takes us way back into human prehistory, before the earliest recorded writing appears on the scene (around 3000 BC, with proto-writing stretching back towards 6000 BC). If we could use DNA to create a digital record that’s good for 10 000 years, I’d say that should take care of most of our archiving needs.
First, a patron
So imagine some billionaire – let’s call her Patricia – who wants her name to live on until the end of time. She is a regular on Sir Richard’s space shuttle, likes to fund new vaccines and lives in a House of the Future™. As her parting gift to the world, Patricia would like to capture much of the wisdom and culture – as well as the follies and foibles – of modern civilization and preserve them in perpetuity.
There are groups – such as the Long Now Foundation – that think explicitly about this; perhaps the whole thing is as simple as handing the technology over to such a group. But Patricia didn’t get her billions without reading the business plan, so let’s think through the details, just for fun (and, of course, for Pat).
The right stuff
We know we can’t just take a Nature paper, a tar-ball of Wikipedia and an Agilent synthesis machine and preserve all of history and contemporary culture in a minivan-sized piece of DNA for all time. We need to tackle quite a few practical issues first.
What to use for the storage? Glass? Metal? Ceramics? Or what? And should the DNA be double-stranded or single-stranded? Perhaps we should try to mimic the environments around known, recoverable ancient DNA? There is plenty of research available on synthesis and storage, but details that don't matter on a timescale of a few years might make a huge difference over 10 000 years. It is a bit tricky to test something out on that kind of timescale; Pat won’t be having it.
Probably we should work out how to simulate 5000- or 10 000-year damage (e.g. from accumulated cosmic-ray hits), using the 10 000-year-old Cave Bear sequence as a comparator. (An aside: Cave Bears are extinct, so we’d need to sequence some of the current bear species to reconstruct a good ancestral Cave Bear sequence if we want an accurate view of what ancient DNA error profiles look like in detail.)
This is actually a well-explored area for scientists involved in ancient DNA. Although one could never be sure of the precise physical storage method, I’m pretty confident we’d have a “reasonably good” approach.
Location, location, location
Where are we going to put the archive? How can we be sure that future civilizations will know that it’s there, if it gets buried? How can we be sure someone won’t just drill right through it, bulldoze it for a new spacepad? (Nuclear waste storage researchers have been grappling with these questions for a while, so we should probably ask them for some input.)
If we can make one archive, we can probably make six. We could distribute them in a few places, same as we do with our data centres now. Let’s pick some cold, dark, dry mountain areas that aren’t going to melt and slide off into a pile of rubble, and which have no immediate geological plans to relocate. 
All we need is for one of these mountains to stand the test of time, and for a few lucky things to happen so the archive can be discovered. It should work; after all, one of the best-preserved bits of ancient DNA (the tip of the little finger from a Denisovan) was found in a cave in the Altai Mountains.
Read the instruction manual – in Maths
Now for the fun part: designing the bootstrapping procedure. Someone 5000 or 10 000 years from now is going to unearth our precious archive, and deserves a reasonable chance of retrieving and understanding what’s in it.
We can’t just have the DNA minivan lined with instructions in English and Chinese – languages are too fluid for us to bank on their being intact thousands of years in the future. What’s more, there could well be a scenario in which all of history has been lost, and our explorer (let’s call him Joe-Ug) might not have the cultural references needed to deal with a set of ancient languages.
We’ll need to explain the DNA storage technique well enough that Joe-Ug and his friends can output a bitstream that describes the system in a universally comprehensible way. Later on, we’ll also need to provide the necessary cross-referencing language so that they can interpret the full archive. But first, we need to make a diagram to help them translate the DNA to a bitstream.
What’s DNA, anyway?
We will need to imprint the diagram on something very robust – a material that can remain intact for thousands of years. We should probably go with Nickel, which the Long Now group has settled on as a good choice – or gold, as with the Pioneer plaques and Voyager records (might be too expensive though – not sure how many billions Pat made). To make sure we get the process right we will need to consult with materials scientists – a well-populated field.
For the diagram itself, we have to show the molecular structure of DNA – there is no guarantee that people will even know what that is anymore. We can use symbols alongside the relevant number of protons – for example, C6 for carbon and N7 for nitrogen. (Joe-Ug had better have clever friends. We should probably throw in a periodic table of the elements as well.)
The next part of the diagram could be an atomic diagram of the four DNA monomers mapped to the symbols. We’ll need to describe the codec, probably writing out a complete example in longhand: a binary message, the resulting theoretical complete DNA string, every fragment we synthesised, and so on.
Testing, testing, one, two…
But we need to be careful here: if we leave out one detail that we take for granted, a future scholar will be hosed. So it’s going to be important to provide lots of cross-referencing information – rather like the message sent in Contact, but with considerably more detail.
If we have enough space (and why not?) we can do the same. How about writing out the first 100 numbers in binary, and circling the prime numbers? If Joe-Ug and his friends have any mathematical nous, they’ll be able to see that they are on the right track. Internal checks on the chemical symbols might include the orbital structure of electrons, and perhaps highlighting Nickel (or whatever material we choose) in the periodic table. The more cross-references we provide, the harder it will be to leave out a key concept.
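To make this concrete, here is a little Python mock-up (purely illustrative – the real thing would be engraved, not printed) of what the primes-in-binary panel could look like:

```python
# Mock-up of the proposed engraving: the first 100 numbers in binary,
# with the primes marked.
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

for n in range(1, 101):
    mark = "(*)" if is_prime(n) else "   "
    print(f"{mark} {n:07b}")
```

Anyone who notices that the marked patterns are exactly the primes knows they are reading the bits in the right order.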
We would of course consult experts in pedagogy, but it seems reasonable to include a series of test DNA phials for future explorers to decode. These phials could contain multiple copies of the same message, and would be nestled securely in the engraving. The instructions would give the input message, the longer encoded message as DNA and the short fragments, all written out in longhand, such that every decoding step could be double-checked physically and mathematically.
I would include four or five test messages, some longer than others. 
Testing, testing…
It would be fun to test this. We could take a bunch of enthusiastic students who haven’t read the study in detail, give them the Nickel diagrams (perhaps printed on paper, for the test) and so forth – sneaking in non-current symbols throughout – and a few phials of the DNA.
We’d set up a competition between groups. They could suggest experiments (e.g., can we put the material in the phial into a mass spec machine?) and hopefully they would ultimately work out that it’s DNA in there, and ask for it to be sequenced. Patricia would bestow the prize, of course.
Don’t forget the Rosetta Stone
So, that’s the first task of bootstrapping sorted. The second bit is more complex, as it involves creating a sort of Rosetta Stone for future civilizations that may lack any good knowledge of current languages or notation – but hopefully they will still have some records of ancient texts (Ancient Greek was the key to using the Rosetta Stone to understand Egyptian hieroglyphics). Following that model, we’d need to create the same message in multiple languages (the Rosetta Stone had three – but there is no need to be limited by that). Note that we would need the symbol-to-bits mapping (Unicode) engraved somewhere.
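That engraving only needs to pair each glyph with its bit pattern. A toy sketch of the idea in Python (using today’s Unicode code points, which is of course an assumption about what we would choose to engrave):

```python
# Toy symbol-to-bits panel: each glyph next to its bit pattern
# (today's Unicode code points, shown as 16 bits).
for glyph in ["a", "Ω", "人"]:
    print(glyph, f"{ord(glyph):016b}")
```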
The archive could include, say, the UN Declaration of Human Rights, or Magna Carta, or the first chapter of The Art of War in multiple languages, with careful mapping of the words. Since we’re not too worried about the size of the archive anymore, it shouldn’t be such a big deal to overlap several texts digitally.
I’m also going to vote for having some key messages re-encoded in modern languages every century or so, though I’d be surprised if this happened more than 10 times (and even that would be a stretch for modern civilisation) – hopefully Pat can set up some sort of endowment. If it worked, it would increase the chances that Joe-Ug’s clever friends will be able to find a starting point.
UNIX, naturally
Now to the structure of the archive itself: specifically, the layout of the files, directories and other formats. I think it’s safe to say that thousands of years from now people will need a description of UNIX file-system conventions, tar, and image and audio formats. Since one can, in theory, describe images and sounds mathematically without reference to a language, image and audio formats can be treated as specialised aspects of the more generic bitstream process.
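To illustrate just how language-free an image format can be, here is a minimal convention of my own invention (a sketch, not anything from the paper): two integers for width and height, followed by width × height brightness values in row-major order – no words anywhere:

```python
# A deliberately language-free image convention (illustrative sketch):
# [width, height, pixel, pixel, ...] in row-major order.
def encode_image(width, height, pixels):
    assert len(pixels) == width * height
    return [width, height] + list(pixels)

def decode_image(stream):
    width, height, values = stream[0], stream[1], stream[2:]
    return [values[r * width:(r + 1) * width] for r in range(height)]

# A 2x2 checkerboard survives the round trip:
assert decode_image(encode_image(2, 2, [0, 255, 255, 0])) == [[0, 255], [255, 0]]
```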
Wrapping up
I hope Patricia would be satisfied with our plan, which can be described in three parts: DNA to bitstream, language bootstrap and final archive. 
I visualise this rather like the tombs of ancient Egypt, with strong symbolism that you need to resolve before moving from one room to the next. I imagine Joe-Ug, 45 000 years from now, stepping through these dusty rooms and being confronted with all this ancient symbolism on some painstakingly engraved metal. I see him bringing in his clever friend, who invites his clever friends, and then a decade or so of multinational (assuming they have nations) efforts to decode the panels. Then, the big breakthrough that takes them into the second room and, finally, the delight and scholarship of future generations reading texts – some funny, some sad, some ingenious – from the distant past.
Who knows? Maybe this blog post will make it into the archive (I’d better keep Pat happy) and it will make a future scholar smile.


Using DNA as a digital archive medium


Today sees the publication in Nature of “Towards practical, high-capacity, low-maintenance information storage in synthesized DNA,” a paper spearheaded by my colleague Nick Goldman and in which I played a major part, in particular in the germination of the idea.
This is one of the moments in science that I love: an idea over a beer that ends up as a letter to Nature. 
Preserving the scientific record
About three years ago, I was spending a lot of time working out how the EBI would cope with the onslaught of DNA data, which has been growing at exponential rates. Worryingly, that rate has been much greater than the (also exponential) growth of disk space. Without a solution to this problem, our DNA archive could become unsustainable. 
An immediately practicable solution we developed was efficient, data-specific compression – something I’ve blogged about extensively. But in the long run, more dramatic measures might be needed to sustain the life science archives. Which got us thinking…
Where would science be without a pub?
At the end of a particularly long day, Nick and I were having a beer and talking about the need for dense, low-cost storage. We joked that of course the densest, most stable way to store digital information was in DNA, so we should just store sequenced DNA information in … DNA. We realised that this would only work if the method could handle arbitrary digital information – but was it possible? Still in a playful frame of mind, we got another drink and started to sketch out a solution.
There are two challenges when trying to use DNA as a general digital medium. First, you can only synthesise small fragments of DNA, so making a long “idealised” string of DNA from fragments is a question of getting the assembly right. Synthesising DNA to our own design makes this trivial – in fact, it doesn't have to be an assembly "problem" at all: you can give each fragment an index (encoded as bases) that provides instructions for how to re-assemble it with the surrounding fragments.
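Here is a toy version of that indexing trick in Python (simplified: in the real design the index is itself encoded as bases inside each fragment, but the 100-base fragments offset by 25 bases echo the staggered layout described below):

```python
# Chop a long designed sequence into overlapping, indexed fragments;
# re-assembly is then just sorting, not overlap-finding.
FRAGMENT_LEN = 100   # payload bases per synthesised fragment
STEP = 25            # offset between fragments -> fourfold coverage

def fragment(sequence):
    return [(i, sequence[start:start + FRAGMENT_LEN])
            for i, start in enumerate(
                range(0, len(sequence) - FRAGMENT_LEN + 1, STEP))]

def reassemble(fragments):
    ordered = [payload for _, payload in sorted(fragments)]
    out = ordered[0]
    for payload in ordered[1:]:
        out += payload[FRAGMENT_LEN - STEP:]   # keep only the new bases
    return out

assert reassemble(fragment("ACGT" * 50)) == "ACGT" * 50
```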
The second challenge is that both writing (synthesis) and reading (sequencing) DNA are prone to errors, particularly when there are long stretches of repeating letters. Creating an error-tolerant design is absolutely essential. There are plenty of error-correcting codes available from signal processing, but we needed one that could handle common DNA errors (homopolymers – runs of the same letter – cause a real headache for both synthesis and sequencing).
No repeats
We realised that we could fairly easily create a codec (jargon for coder–decoder) that guarantees the elimination of homopolymers. We were also aware that synthesis (writing) errors were going to be far more damaging than sequencing (reading) errors, as a writing error is likely to affect a large proportion of the molecules of a particular design.
So the codec we developed translates everything into base 3, and uses a transition rule that maps each base-3 digit to a DNA base depending on the base written before it – which guarantees that no letter is ever repeated. Each base is covered by four different synthesised designs, staggered in a tile-path fashion, and we would be generating millions of molecules per design. As an extra precaution against errors, we made sure alternating fragments ran in different directions (strands), because when things go wrong it’s often in a strand-specific manner.
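In Python, the heart of the no-repeats rule might look like this (a minimal sketch: the rotation table below is illustrative, and the real codec also specifies the byte-to-trit conversion, indexing and so on):

```python
# Each base-3 digit picks one of the three bases that differ from the
# previously written base, so no base ever appears twice in a row.
NEXT_BASE = {
    'A': ('C', 'G', 'T'),
    'C': ('G', 'T', 'A'),
    'G': ('T', 'A', 'C'),
    'T': ('A', 'C', 'G'),
}

def encode(trits, prev='A'):
    bases = []
    for t in trits:
        prev = NEXT_BASE[prev][t]
        bases.append(prev)
    return ''.join(bases)

def decode(bases, prev='A'):
    trits = []
    for b in bases:
        trits.append(NEXT_BASE[prev].index(b))
        prev = b
    return trits

message = [0, 2, 1, 1, 0, 2, 2, 1]
dna = encode(message)
assert decode(dna) == message                      # round trip works
assert all(a != b for a, b in zip(dna, dna[1:]))   # no homopolymers
```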
The next round
Another beer. A bit more serious now, we were determined to be very “belt and braces” about our new code. We called for more napkins and a new pen and tried to see how far we could push the idea. “Why not do it for real?” one of us asked. “Because it’s too expensive,” the other replied, naturally.
So in the bright light of day, we looked for an efficient (read: cheap) synthesis mechanism, and managed to talk to the research group at Agilent in California, headed up by Emily LeProust. Excited by the idea, Emily’s group agreed to let us use their latest synthesis machine for this project and asked us to send them some stuff to store.
What to pick? We wanted to show off the fact that our codec could be used on anything at all, so we picked the following items to send over to Agilent:
  • a .jpg file of a photograph of the EBI
  • an .mp3 file of a portion of Martin Luther King’s speech, “I have a dream” 
  • an .ascii file of all 154 of Shakespeare’s sonnets
  • a .pdf file of the Watson and Crick paper describing the structure of DNA
  • a self-referential pseudo-code description of the codec used for the DNA encoding.

Nick made the DNA designs. To double check, he simulated reads and tested to make sure he could recover all the files (all went fine). Then he checked again. Then he ftp’ed it all to Agilent.
Data in a speck of dust
About a month later a small box arrived at the EBI with six tubes. Nick – being a mathematician – had to be persuaded there was actually stuff in the tubes (DNA is very, very small). I assured him that the little speck in the tube must be bona fide DNA. He remained sceptical.
We brought in some experimental colleagues, including Paul Bertone, who helped sequence the speck. We even have a picture of Nick in a lab coat, pipetting (shocking, believe me). We did manage to recover the actual DNA sequence for all six files: five with absolutely no trouble at all, and the last with one “gotcha”. We didn’t fully think everything through (despite all our pains) and a tiny amount of data was missing – but we were able to recover the entire file with some detective work (check out the supplementary information).
We had done it! We had encoded arbitrary digital information in DNA, manufactured it, and read it back. But we had to wonder whether our result was actually useful. DNA synthesis at this scale is still more of a research endeavour: volumes are going up but the price is still very high (certainly much higher than hard disk or tape). 
If you can read woolly mammoth DNA…
DNA has a wonderful property: it can be stored stably without electricity, and needs no management beyond keeping it cold and dark. It is remarkably dense, even with the rather insane over-production of molecules (we calculated that we could easily have gotten away with using a tenth of the DNA). Given all the design redundancy, we calculated that one gram of DNA would (easily) store one petabyte of information.
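A back-of-envelope check of that figure, with the net coding rate and the physical copy number as loudly assumed round numbers (this is not the paper’s exact calculation):

```python
# Rough sanity check of the petabyte-per-gram claim.
AVOGADRO = 6.022e23
GRAMS_PER_MOL_NT = 330.0        # approx. mass of one ssDNA nucleotide

bases_per_gram = AVOGADRO / GRAMS_PER_MOL_NT       # ~1.8e21 nucleotides
net_bits_per_base = 0.5         # after the base-3 code, indexing and
                                # the fourfold fragment overlap (assumed)
physical_copies = 1e5           # assumed copies of each fragment design

bytes_per_gram = bases_per_gram * net_bits_per_base / 8 / physical_copies
print(f"~{bytes_per_gram / 1e15:.1f} PB per gram")  # ~1.1 PB
```

Even allowing a hundred thousand physical copies of every design, the petabyte-per-gram figure holds up.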
We wrote a letter to Nature describing our codec and exploring some of the salient characteristics of DNA as a storage medium. Our encoding scheme can actually scale up to a zettabyte of information, although given current prices this would be prohibitively expensive. More interestingly, because it costs almost nothing to maintain DNA storage (beyond the physical security of a dark, dry and cold facility like the Svalbard Global Seed Vault), at some point DNA storage becomes the cheaper option.
But can you afford it?
Using tape storage for comparison, we estimate that it is currently cheaper to store digital information as DNA only if you plan to store the information for a long time (in our model, between 600 and 5000 years). If the cost of DNA synthesis falls by two orders of magnitude (which is roughly what happened over the past decade), it will become a sensible way to store data for the medium term (below 50 years). Which is not so mad.
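The underlying reasoning is simple enough to put into a toy model (the numbers here are invented for illustration; the model in the paper is more careful):

```python
# Tape must be migrated to fresh media every few years; DNA is written
# once and then just sits in the cold and dark.
def tape_cost(years, write=1.0, refresh_every=5):
    return write * (1 + years // refresh_every)

def break_even(dna_write_cost):
    years = 0
    while tape_cost(years) < dna_write_cost:   # DNA upkeep assumed ~free
        years += 5
    return years

print(break_even(250.0))   # synthesis 250x a tape write: 1245 years
print(break_even(2.5))     # after a 100-fold price drop: 10 years
```

The break-even horizon collapses from millennia to decades as soon as synthesis gets a couple of orders of magnitude cheaper – which is exactly the point.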
There are some codas to this story. Zeitgeist will be zeitgeist, and since that fateful first beer a DNA-based digital storage method has been proposed by George Church and his colleagues. They used a similar indexing trick, but their method does not address error correction (indeed, they comment that most of their errors occurred at homopolymer runs). They submitted their paper around the same time as we did, and it was published as a Brevia in Science in 2012 (shucks – another one of those science moments).
The 10,000-year archive
Nick and I have one more thought experiment to play out. Could we build an archive that stored a substantial amount of information for a future civilization to read?