## Tuesday, 3 July 2012

### Galois and Sequencing

It is not often anyone will hear the phrase "Galois field" and "DNA" together, but this paper from my colleagues, Tim Massingham and Nick Goldman provide a great link between these topics. Some other authors have used Galois fields in DNA analysis, but this is the first time I have seen a practical application of this level of mathematics in bioinformatics. It's a tour de force by Tim, and although only in a lowly BMC Bioinformatics journal I think should be celebrated for its sheer chuptaz in cross scientific - indeed academic - domains.

If you have not met Galois field, then a small crash course in some pretty dense pure maths. Fields are the rather glorious, whimsical world of pure maths where one gets to fool around with the fundamental of maths - in this case redefining addition and multiplication; if certain conditions are met (that one can "add" and "multiply" in any order, and that multiplying an addition is the same as adding together the multiplication of each element, and a couple of other requirements), one has a "Ring", (with the wonderful Tolkein like world of ring theory). With a couple more criteria to meet, in particular that multiplication doesn't care about the order one has a Field. Numbers are fields - real numbers, complex numbers, rational numbers... but so are all sorts of other things - vectors, and exotic beasts like p-adic numbers which somehow involve primes in a suitably Alice-in-wonderland like way. Importantly, one can also have finite element fields, in which there is a limited number of elements. A gifted, young frenchman, Galois, explored the properties of these finite elment fields and showed that there is a limited number of fields - in fact, for finite elements with 2, 3 or 4 members there is only one field - ie, only one way to define "addition" and "multiplication" and satisify the criteria of a Field. (If you are curious, the 2 element finite field is like an XOR and an AND in logic for "add" and "multiply"). Many wonderful and deep things have been proved using Galois Fields, in both pure and applied maths.

So much for Maths. Now onto sequencing chemistry.

At the EBI we were funded to explore Lifetech's new "Exact Call Chemistry" (ECC). Normal Lifetech Solid chemistry reads bases in pairs, where two adjacent bases gives one read out. Because there are 16 possibilities for two adjacent bases, but only 4 fluorophore read outs, each read out is ambigous, representing two possible scenarios. The sequence of these ambigous calls is called "Colorspace", with the set {0,1,2,3} to distinguish it from the underlying bases ("Basespace" in Solid-speak, with {A,C,T,G} ). As the very first, primer, base is known due to the way the Solid chemistry works, this means one can in theory work out the next base (first in the read), and chain down the entire read. But people rarely do this because if you make an error in one position, the error propagates throughout the rest of the read. It is far more appropriate to do all the calculations (such as alignment, snp calling and even assembly) in colorspace, and then "key" the answer into basespace right at the end. A whole host of tools have sprung up around the colorspace world.

Exact Call Chemistry added another ligation-read step where rather than interrogating the sequence in adjacent pairs, interrogated it in a series of read-2, skip 1, read 1, skip 1 - a complex overlay. This pattern was chosen for its error correcting properties, and had the useful side effect that one could easily map the read directly into basespace. Using this though much of the sequencing errors can be corrected, meaning that one could get to something like 10-6 error rate on the chemistry. But this poses lots of questions - the error model is no longer a simple process associated with each base, rather one has to take a rather gestalt view of the error process across the entire read.

But how does one do this? How does one represent the combination of this colorspace plus this 2on-1off-1on-1off read? Here Tim rather beautifully brings in a Galois Field - remember that there is only one 4 element Galois field, and the design of colorspace allows for a one to one mapping of the 4 elements, traditionally called {0,1,a,b} (or sometimes alpha and beta) to both colors {0,1,2,3} and bases {A,T,G,C}. The color that occurs between two bases is just "addition" in the Galois field. This is a very elegant way to consider colorspace, but really comes into its own for ECC space. The additional 7th read of ECC space with this complex structure is a type of matrix multiplication in the Galois field. (at this point by the way my mathematically abilities have been stretched to breaking point by Tim and Nick and I just have trust them). By using this transformation therefore a lot of the things you might want to do with ECC that can be now written down as "straightforward" maths in this transformed space.

So now Tim can explore the impact say of an error model considering all the separate reads independently (what he calls a "trellis" model), or other representations of the data, most obviously a base-space model of the data with independent error - ie, the traditional "fastq" model. Unsurprisingly the trellis model gives you far more information in, say, calling SNPs than the "fastq" model, because the underlying data has complex interdependancies between the errors - for example, a mistaken T for a G in position 1 implies also a particular error in position 3. However, a different representation, corrected colorspace, in which one stores an errorcorrected colour space read in fact maintains the majority of the information, but is far more compact. Furthermore, as it is error corrected, it will compress better into reference based schemes (something we're pretty obessed with at the EBI: see previous blog posts), and, indeed this compression is best thought of as a compression of the Galois field elements on the reference, which "naturally" compresses well. There are a bunch of other things that Tim can do in this framework, for example understand the number of errors that can be detected (up to 2) and the number that can be definitely corrected (only 1).

It's not clear how much of a future ECC chemistry has out there. Lifetech recently bought ion torrent (with a completely different, 454 like chemistry, with very different error properties). It's clear to me that if colorspace and friends (like ECC) was the only way to sequence DNA, we'd all know this backwards, and "Galois field" would become as commonplace as "Dynamic programming" in bioinformatics circles. But whatever the future of the chemistry, it's great to see a relatively deep piece of maths (and pretty modern - from the 17th Centuary - in terms of pure maths) being used in a totally profound way in bioinformatics. I have no doubt that over time biology and bioinformatics will end up using more and more of the mathematical toolkit developed over the years.

Who knows, we (bioinformatics) might even inspire the development of new maths sometime in the future.

Subscribe to:
Post Comments (Atom)

## 5 comments:

That's excellent. Sounds like a cool paper. Minor correction - Galois lived in the 19th century. And of course Galois fields are even more celebrated amongst mathematicians due to the tragic nature of his death.

Ooops. Not quite sure how this ended up as 17thC in my brain. Will fix.

It's nice to see that trellis codes and error-correcting codes in general have come to DNA sequencing. I'd often wondered why Solid used such an error-prone, fragile coding scheme.

I think that it may be too late for error-correcting codes to save the Solid technology, but it would be interesting to see whether error-correcting codes can be applied to other sequencing technology.

Multiplication can also be visualized as counting objects arranged in a rectangle (for whole numbers) or as finding the area of a rectangle whose sides have given lengths (for numbers generally).Properties of Multiplication

Post a Comment