Friday, 21 November 2014

Another step into Quantitative Genetics...

Today sees the publication of the paper by Zhihao Ding, YunYun Ni, Sander Timmer  and colleagues (including myself) on local sequence effects and different modes of X-chromosome association as revealed by the quantitative genetics of CTCF binding. This paper represents the joint work of three main groups: Richard Durbin's at the Sanger Institute, Vishy Iyer's at U. Texas, Austin and my own at EMBL-EBI. I'm delighted that this work from Zhihao, YunYun and Sander (the three co-first authors) that it's finally come out, and want to share some aspects of the work that were particularly interesting to me.

Stepping back into quantitative genetics

This is the second time I've shifted research direction. Even though it's still broadly in the area of bioinformatics, quantitative genetics is a discipline I hadn't explored very deeply before embarking on this paper. Quantitative genetics is a very old field - arguably one that lit the spark that set off molecular biology as a science - and the birthplace of statistical methods for life science. 
RNA Fish for X chromosome inactivation

Legends of frequentist statistics - Pearson (of Pearson's correlation) and Fisher (of Fisher's exact test) were motivated by biological problems around variation and genetics starting in the 1890s through the 1930s/1940s. This was also the time of the rediscovery of Mendel, and the fusing of Mendel's laws with evolution and a whole host of fundamental discoveries: for example, Morgan's discovery that chromosomes carry the units of hereditary information. My personal favourite is one on X-Y crossovers discovered in my favourite fish, Medaka in 1921, highlighted by a great editor's comment. Reading some of these papers can be spine tingling. These scientists did not know about DNA, DNA polymorphisms as we understand them, or genes as RNA transcription units. Still, they could figure out many features of how genetics effected traits.

Quantitative genetics fell out of fashion in the 1970s and 1980s as the new, wonderful world of molecular biology sprang up. The new capability to isolate, manipulate and use nucleic acids seemed to bypass the need to infer idealised things statistically from whole organismal systems. There was also a push to use forward genetics (i.e. large genetic screens) in model organisms - in fact, some of the best work was based on the same organism (Drosophila) on which the original genetics was worked out. 

In humans, the most systematic discovery of single locus genetic diseases - Mendelian disorders - led to epic, systematic "mapping" trips for these (CFTR being perhaps the most iconic). Only the plant and animal breeders kept the flame of quantitative genetics alive during this time.

Human genome + Dolly the sheep + transparent pigs = great time to be in genetics

The process of mapping human disease-causing genes was a big part of the motivation for the human genome project. At some point it became clear that rather than being done as a collection of rather individual laboratory efforts, this mapping would be far better off taking a systematic approach. John Sulston and Bob Waterston surprised everyone by showing that sequencing a large complex genome (in this case, C. eleganscould be done by scaling up the process of sequencing in a rather factory-like way - and this was just in time for the larger-scale human genome project. 

(At the time all this was getting underway, I was an undergraduate transitioning to a PhD student at the Sanger Institute. I came into it during a rather exciting three-year period when the public project grappled with the entry of a privately funded competitor, Celera. It was a fascinating time but I should perhaps write about it in another blog post. Back to quantitative genetics!)

I was introduced properly to quantitative genetics when I became part of the governance board of the Roslin Institute. (I still don't know who suggested me as a Board member - Alan Archibald? Dave Burt?) Scientifically and politically, it was a real eye-opener. I learned (rather too much) about how BBSRC Institutes are run and constituted, and started to appreciate the tremendous power of genetics. 

The Roslin was in the news for the successful nuclear transfer from a somatic cell to make a new embryo, Dolly the sheep, and this was great work. (They also made transgenic chickens and pigs - a glow-in-the-dark pig is quite a thing to see, and surprisingly useful). Together with the Genetics department in Edinburgh, the Roslin was one of the few places that still took quantitative genetics seriously. From talks with Chris Haley and Alan Archibald (and beers/scotch) I started to realise the awesome end-run that quantitative genetics does around problems. 

The great thing about genetics is that if you can measure something reliably and thus find the variance between individuals that is due to genetics, you can identify genetic components with the usual triad of good sample size, good study design and good analysis. How well you can do this requires understanding architecture of the trait (e. g. number of contributing loci, distribution of effect sizes) - itself something non trivial to work out, but in theory all measurable things with some genetic variance are discoverable, given enough samples. 

No guesswork required

Let me stress: you don't need to guess where to look in the genome to associate genetic variants to some you measure, so long as you test the entire genome sensibly. So - you needn't have a profound mechanistic understanding of what you want to measure,  - you just need the confidence that there is some genetic component there to and the ability to measure it well. Step back for a moment and think about all the things you want to know about an organism on the genetic level that you could measure. It could be something simple to measure, like height (which hides a lot of complexity), or something more complex, like metabolites (which are often driven by far simpler genetics), or something really complex, like mathematical ability in humans.

Chris, Alan and others really knew that quantitative genetics worked. For their own part, they were often paid by animal breeders (companies breeding chickens, pigs and cows) to improve rather complex traits such as milk yield or growth rate. These companies may have appreciated that good science was nice to have, but they were pragmatic and ultimately driven by the bottom line. If something did work, they would come back to Chris and Alan. Sure enough, Roslin had a lot of repeat business.

During this time I started to realise that a lot of quantative genetics methods developed at the end of 20th Cent.  were aimed to circumvent the facts that genomes are big and determining markers in many individuals is expensive. As next generation sequencing came online, I could see we were going to have a completely inverted situation: genotyping was going to get stupid cheap and dense, and accordingly any set of phenotypes (things you can measure) would become ammeanable to these techniques (again, with the caveats of appropriate sample size). And I really mean any phenotypes.

What to measure?

At some level, this is ridiculous. There are so many things you want to know about, on many scales - molecular, cellular, organism - and to have one technique (measure and associate) seems just a bit too conceptually simple. Of course, it's not so simple - there must be some genetic variance to get a handle on the problem (sometimes there are just no variants that have an effect in a particular gene), and the genetic architecture might be very complex, but these assumptions are reasonable for many, many interesting things - in particular the more "basic" things about living organisms. The big problem is choosing what to do. So, about five years ago I started to think more about what might be the most interesting things to measure, and in which systems.

For measurements it seems crazy to have a single variable captured each time. Why measure just one thing? You want a high-density phenotyping method so you can look at lots of variables at the same time. This leads you to two technologies for readout: Next-gen sequencing (all gene expression, or all binding sites) for molecular events or imaging cellular and organismal things. 

I knew I wanted to stay away from human disease as a system. This is a hotly contested field with a few big beasts controlling the drive towards different endpoints. I like working with these big beasts (they are usually very friendly and clever) but know it would be foolish to try and break into their area (far better to collaborate!). Talking to Chris and Alan (and later with epidemiologists George Davey-Smith and John Danesh), I came to realise that disease brings in all sorts of complications -- such as confounders of diagnosis and treatment. 

I also knew I wanted to work on things where you could close the loop from initial discovery to specific mechanism/result. and I wanted to work with experimental systems we could manipulate many times over.... 

Back to the paper...

And this, finally, leads us back to the new paper. We worked on a molecular phenotype: CTCF binding. CTCF is an interesting transcription factor that has a great chip antibody, which was great for the feasibility of our experiments. CTCF can be measured in a high-content manner (i.e. Chip-seq) when this is done using an experimentally reproducible system (i.e. human LCL cells; 50 of the 1000Genomes cell lines, so we had near-complete genotypes). It fits two criteria - it's a high dimensional phenotype (there are around 50,000 good CTCF sites in the genome) and we did tin an experimental system that we could go back to - LCLs. It is, arguably, my lab's first "proper" quantitative genetics paper. As ever when you switch fields, the project brought with it a multitude of details to sort out and ponder. 

For example, we had to learn to love qq-plots more than box-plots. QQ (quantile-quantile) plots make a big difference in whether your associations make sense, given the multiple testing problem. As the genome is a big place, you are going to test a lot of things, you are guaranteed to find "visually interesting" associations (boxplot looks good, Pvalue looks good) and you will not have any idea whether they are interesting enough given the number of tests. And - rather more subtly, you need to know whether your test is well behaved. A QQ plot summarises both of these neatly.  

There are deeper gotchas - notably population stratification - and getting a good feel for linkage disequilibrium does take a while. There are, of course, all sorts of measurement/technical issues, and in this case my experience with ENCODE gave me a good grounding in the practicalities of Chip-seq.

Much of what we discovered was to be expected. Indeed, variants do effect CTCF binding, particularly when they are in the CTCF binding motif (I can see a number of CTCF/transcription factor molecular biologists rolling their eyes at the obviousness of this). But there are a sizeable number of variants that are not in the motif and, presumably, affect binding via some other indirect mechanism (LD presents some complications here, but we're in a good position to assess this). We also saw a large number of allele-specific signals. Interestingly, we can show that when there is a difference between genetic variants for binding between individuals, it is related linearly to allele-specific levels in hetreozygous individuals - but not the other way around. But - there are allele-specific sites that do not show between-individual differences. This is not perhaps that surprising but good to see, and good to quantify the level of this.


And then biology just does something unexpected. As my student Sander Timmer was looking to find individual CTCF-to-CTCF "clean" correlations, he came upon a stubborn set of sites that formed a big hairball of correlations. This is not uncommon for this sort of data (there are all sorts of reasons why you get correlations: antibody efficiency, growth conditions, weird unplaced bits of genome present in one set but not another...). We were digging through all these options (blacklisting, weird samples, weird bits of genome) and this hairball just was not going away. I was almost at the stage of just acknowledging and accepting the hairball, assuming it was some artefact and moving on to the more individual site correlations, when Sander came in showing that nearly all the sites were on the X chromosome.


It's easier to describe this from the perspective of what we believe is going on than to talk about feeling around the data for three months or so, trying to work it all out. The X chromosome is a sex chromosome, and males have one copy whereas females have two copies. This causes a headache for mammals in that if the X chromosome behaved "as normal", females would consistently show twice the expression of all X chromosome genes than males. 

In mammals this is "solved" by X chromosome inactivation, where one X chromosome in females is "inactivated" by quite an extreme molecular process called (unsurprisingly) X-inactivation. (Biology, in its weird and wonderful way, does this completely differently in other animal clades, eg, C.elegans or Drosophila. Go figure). This leads to a visibly compressed X chromosome in females (called the Barr body after the first person to characterise it), and this random inactivation underlies some classic phenotypes, for example tortoise shell cat (or calico cat, in US-speak) coat colouring is due to this. Multi-coloured eyes (sometimes in the same iris) in women is due the random choice of which X chromosome to inactivate. 

When you look at RNA levels in female cell lines, the vast majority of X chromosome genes have a similar expression level as males. There are exceptions (called "X-chromosome inactivation escape") and one famous, female-specific RNA - Xist - is a key molecule in X inactivation (see my previous post about the wonders of RNA).

What is CTCF doing?

All this is well established, but what did we expect to see for CTCF? CTCF is a very structural chromatin protein involved in chromatin loops. Sometimes these loops are very important for gene regulation, giving rise to CTCF's role in insulators. Sometimes there is some other looping mechanism. And there are CTCF binding sites everywhere - X chromosome included. So we thought there were three basic options for each CTCF site (there are ~1000 CTCF sites on X):

  1. CTCF site is rather like RNA - mainly suppressed, present only on the active X. If so, we'd expect males and females to have similar levels of CTCF.
  2. CTCF site is involved in X-inactivation (perhaps a bit, perhaps a lot) - If so, we'd expect there to be female-specific CTCF sites.
  3. CTCF site is oblivious to X-inactivation In this case we'd expect a ratio of 1:2 male:female.
We found sites in all 3 cases when we processed the CTCF data over the 51 samples (pretty evenly distributed male/female). We saw a lot of case 2 (CTCF is oblivous) and then quite a bit of 1 (CTCF being like RNA), and then a tiny number of 2 (CTCF involved in X-inactivation). 

Before I go any further, it's worth being sure of this result because a lot of things can happen/drive signal in these Chip-seq - or any genomics - experiment. The clincher for me was when we looked at individuals who were hetreozygous for alleles in option 1 vs option 3. For sites classified as "single active" (option 1) we see one predominant allele - consistent with just one chromosome being bound. For sites classified as "both active" (option 3) we see mainly a 50/50 split of alleles - consistent with both chromosomes being bound. 

All well and good, but we've now stumbled further into something that has been known for a long time: CTCF is a multi-functional protein involved in all sorts of things. 

Here, the process of X chromosome inactivation allows us to pull apart at least two classes (probably more things are happening within these classes as well), and thus get a large number of CTCF sites in two classes. And indeed, they look very different. One (single active) is alive with activating histone modifications and the other (both active) is pretty silent (don't know your histone modifications? Here's a cheat's guide). But the "both active" set has just as strong an overlap with conserved elements as the "single active" and, if anything, shows stronger nucleosome phasing around it (so it's definitely there). 

There were also cases with female-specific CTCF sites. These are not (as you might have thought) around the Xist locus, but instead two other non-coding RNA loci. The Chadwick group narrowed down these RNA loci on X as being interesting for different reasons, and we recapitulate their data that these RNAs are only expressed on the active X in females. Perhaps these female-specific CTCF sites are specifically wrapping up the RNA on the inactive X. Perhaps these are functional - and important for X inactivation. 

More questions than answers

We didn't take on the study to find out whether female-specific CTCF sites are wrapping up RNA on the inactive X, but in this respect our work raises more questions than it answers. Can we parse CTCF sites further out with bioinformatics? (There is a great story about CTCF site deposition by repeats - is this linked?) How conserved is the CTCF site status between close mammals and, given that we know many of them are in different locations in mouse, is there something common to these sites we should be looking at? Does anyone want to ... knock some of these sites out? Do we know how to phenotype them? Our work showed that CTCF is not the molecule involved in Barr body compaction, so... what is? 

I am, more than just a bit, tempted to go down these rabbit holes. Someone should... But I also want to stay on the path of using quantitative genetics for basic biology. Oh, the trials of having too many interesting questions.

So - this was a great paper to start my group into quantitative genetics. In many ways, the HipSci project (which I am part of) is the really well powered (many more samples) and better constructed (iPSCs rather than LCLs) - this paper is sort of a training ground for what to expect for genetic effects on chromatin, and joins a long line of RNAseq QTLs (eQTLs) and DNaseI or methylation QTLs by other groups. 

Tuesday, 28 October 2014

RNA is now a first class bioinformatics molecule.

RNA research is expanding very quickly, and a public resource for these extremely valuable datasets has been long overdue.

Some 30 years ago, scientists realised that RNA was not just an intermediary between DNA and protein (with a couple of functions on the side), but a polymer that could fold into complex shapes and catalyse countless reactions. The importance of RNA was cemented when the structure of the ribosome was determined (something that Venki Ramakrishnan, Ada E. Yonath, and Tom Steitz won the Nobel Prize for, eg here is Venki's Nobel lectureand it was confirmed that the core function of ribosomes – making a peptide bond between two amino acids – was catalysed by ribosomal RNA and not by proteins. It’s also likely that RNA – not protein, not DNA – was the first active biomolecule in the primordial soup that gave rise to Life. Indeed, one could easily see DNA as an efficient storage scheme for RNA information, and proteins as an extension of single-stranded RNA’s catalytic capabilities, enabled by the monstrous enzyme, ribosomal RNA. 

Even focusing on RNAs established role as the cell’s information carrier, the textbook mRNA, RNA-based interactions are widely recognised as being important. A real insight was the discovery of microRNA (miRNA): small RNAs whose actions lead to the down-regulation of transcripts by suppressing translation efficiency and cleaving mRNAs. MicroRNA has brought to life a whole new world of other small RNAs, many of which are involved in suppressing “genome parasites” – repeat sequences that every organism needs to manage.

And then there are long RNAs in mammalian genomes that do not encode proteins (i.e. long non-coding RNA - lincRNA) have long been recognised as having some significance – but what do they do? Some are clearly important, like the non-coding RNA poster child Xist, which inactivates one of the X chromosomes in female mammals to ensure the correct dosage of gene products. Others are involved in imprinting/epigenetic processes, for example the curiously named HOTAIR, which influences transcription on a neighbouring chromosome.

RNA: something missing

Discoveries in RNA biology have expanded the molecular biologist’s toolkit considerably in recent years. For instance, the cleavage systems from small RNAs can be used (in siRNA and shRNA ways) to knock down genes at a transcriptional level. The current “wow” technology, CRISPR/Cas9, is a bacterial phage defence system that uses an RNA-based component to adapt to new phages easily. This system has been repurposed for gene editing in (seemingly) all species – every genetics grant written these days probably has a CRISPR/Cas9 component.

And yet in terms of bioinformatics, RNA data was – until this past September – rather uncoordinated. There wasn’t a good way to talk consistently about well-known RNAs across all types, although this was sometimes coordinated in sub-fields such as Sam Griffiths-Jones’ excellent miRBase for miRNAs, or Todd Lowe’s gtRNAdb resource from for tRNAs. But because RNA data was mostly handled in one-off schemes, researchers working in this area were hindered. Computational research couldn’t progress to the next stages, for example capturing molecular function and process terms with GO or collecting protein–RNA interactions in a consistent way.

RNAcentral in the bioinformatics toolkit

So I’m delighted to see the RNAcentral project emerge ( RNAcentral is coordinating the excellent individual developments emerging in different RNA subdisciplines: miRNAs, piRNAs, lincRNAs, rRNAs, tRNAs and many more besides. It provides a common way to talk about RNA, which in turn allows other resources – such as the Gene Ontology or drug interactions databases – to slot in, usually precisely in the same “place” as the protein identifier.
Alex Bateman, who leads the RNAcentral project, has been exploring a more federated approach, quite deliberately gathering the hard-earned, community-driven expertise of member databases in specific, specialised areas of RNA biology.

RNAs were, potentially, the first things on our planet that could be considered “alive”. They are critical components in biology, not just volatile intermediaries. In terms of bioinformatics, giving RNA the same love, care and attention as proteins is long overdue, and I look forward to seeing RNAcentral provide the cohesion and stability this area of science so richly deserves. 

Monday, 29 September 2014

A cheat's guide to histone modifications

I was recently having lunch with Sandro, a charming Neapolitan computer science graduate doing a postdoc in my research group, who has a passion for great food and clean C code. We were discussing some recent aggregation results of histone modifications, and Sandro was bemoaning (verbally and non-verbally) the fact that all the histone modifications sounded "just the same". I could relate to the sentiment, recalling my own journey into this world some seven years ago during the start of the ENCODE project when I first faced this bamboozling list of modifications.

Here's my take of histone modifications. It's probably a reasonably accurate snapshot of what we knew by the end of 2013. (There is a lot more to cover, and this view will surely go out of date fairly quickly. If you are reading this post in 2016, you might want to look for your cheat sheet somewhere else!)

So - this summary is mainly for Sandro, but I am pretty sure there are others who might like to use it.


Histones are proteins that package up DNA. The combination of histones and DNA is called "chromatin", and this is the natural way one finds DNA in eukaryotic cells. Histones come as two groups of four proteins in a unit, called a nucleosome.

There are different types of histone protein. And just to be extra confusing, the same type of histone protein is sometimes made by more than one gene. (A lot of histone protein needs to be made during each cell cycle.)

Histone proteins are mainly compact, globular structures, but each one has a floppy peptide region at the start of the protein, which is described as the histone "tail" (somewhat confusingly in my view, as it's at the start of the protein! Isn't it more like a "trunk"?). These histone tails have many lysines (single amino acid code, K) which can often be modified chemically with the addition of a methyl group (CH3) or an acetyl group (CH3CO; alcohol, basically).

Modification themes

There are some strong themes that emerge in the sets of modifications, and the most important rule for recognising them is that a particular lysine can have either 1, 2 or 3 methyl groups, or one acetyl group. There are other combinations and modifications, but this rule (lysine in one of four states) is the main one.

There are many different histone proteins, and each protein has many different lysines. Because the histone modifications were first described biochemically, they had a standard naming scheme, for instance H3K4me3, or H3K27ac. The naming  scheme is:
  1. H is for histone (H3K4me3)
  2. The next number is the type of histone protein. Much of the action happens on histone 3 (H3K4me3)
  3. The next letter is amino acid that is modified. It is very often lysine: K (H3K4me3)
  4. The next number is the residue of that amino acid. Remember, the histone tail is at the N-terminus, so it is often a small number (H3K4me3)
  5. The next group is the modification. It might be me1 (mono-methylation), me2 (di-methylation), me3 (tri-methylation, as in the example) or ac (acetylation). (N.B. 'me' by itself is ambiguous, but 'ac' by itself is not.)
This is great if you are interested in the precise chemical makeup of the modifications. However, most people are interested in the implications for the cell. To be honest, I think we would be better off giving each one of these modifications a distinctive name, as they do in Drosophila gene naming schemes, but... this is what we've got. 

Histone modifications and chromatin behaviour

Histone modifications are observed across the genome, but are very different in different parts of the genome and in different cell types. There is a raging debate about whether histone modifications can be considered to drive chromatin behaviour, or whether chromatin behaviour is simply a consequence of things which are  happening nearby on the chromatin (in particular, transcription factor binding) (I am mainly a "consequence" person here). Either way, these modifications are extremely informative about what is going on in any given cell type. 

Another thing to keep in mind is that "pairs" of histone modifications can be found on the same residue. Very often, one of the pair is more about activation and the other about repression, or put another way, one features promoters and the other enhancers. Having a pair forces a sort of duality in which any particular histone has to be in one state or the other.

Well-known modifications: some notes

The H3K4me3 vs H3K4me1 pair: promoters vs intergenic

H3K4me3 is the classic histone modification, and an active mark. It was discovered early on, and is usually present on active promoters in a tight, localised area (and elsewhere, though perhaps these are mainly unannotated promoters?). H3K4me3 is your go-to histone modification to help define a promoter. 

H3K4me1 is kind of its opposite, as it is present far more in intergenic regions, including many "enhancers", through both active and inactive enhancers. H3K4me1 is more enigmatic than H3K4me3, and is a slightly less localised mark. 

H3K4me2 is (normally) tightly correlated with H3K4me3, and I think of it as really on its way of getting its third methylation to become H3K4me3.

Although H3K4me3 is activating, just to make life confusing, the other me3 modifications are, for the most part, repressive. (Biology is shameless about lack of consistency sometimes!)

The H3K27me3 vs H3K27ac pair: repressed vs active

H3K27me3 is the classic repressive histone modification. You find it in regions of the genome that are deliberately switched off. One of its most famous associations is with the polycomb complex, which is a chromatin-organising system that provides cellular memory, particularly during development.

H3K27me3 is also present on the inactive X chromosome. In common with most other repressive marks, H3K27me3 is far more diffuse, and there are mechanisms that take an initial H3K37me3 region and expand it "automatically"

In contrast, H3K27ac is a strong, "active" mark, and shows activity over both active promoters and active enhancers. As this is happening in the same residue, you can see that this system is being set up nicely to be either active or repressed. 

The H3K36me3 and H3K79me2 pair: transcriptional repression

These two marks are not opposing pairs (notice they are not on the same residue), but they do similar jobs: they essentially provide extra repression of transcription initiation in gene bodies. 

When a gene is transcribed, there is this huge, hulking protein complex (RNA polymerase) that is marching through the chromatin, making it far easier for cryptic promoters in the DNA sequence to be activated. This would lead to a mess (in particular if they were going anti-sense), except that the RNA polymerase deliberately comes to the rescue with a histone modification scheme that puts down a "don't start transcription here, because I am transcribing through this region" message: H3K36me3. This means this mark is indicative of polymerase activity, and in theory should be relatively flat across a gene body. 

H3K79me2 is the same thing, but only at the start of the gene, whereas H3K36me3 picks up after the first bit.

The H3K9me3 vs H3K9ac pair: repeat repression vs active

The H3K9me3 vs H3K9ac pairing is another repressed vs active pair, though you mostly hear about the repressive form because it is very often used on repeats. Repeats in the genome are genomic parasites which, when active, are happily copying themselves around the genome. The host genomes have some pretty ingenious ways to detect and repress these repeats. The final "effector" of repression is very often H3K9me3 deposition. 

There is also something weird going on with H3K9me3 and zinc finger genes. 

... and many more

There are many other modifications, many of which people don't actually know much about - for example the common but mysterious H4K20me1. But I'll keep this post to the classics - you can delve in to the 10s - 100s or so mysterious ones through many papers. 

Sandro - I hope this was useful :)

Tuesday, 26 August 2014

CRAM goes mainline

Two weeks ago there was the announcement from John Marshall from Sanger for SAMtools 1.0 - one of the two most widely used Next Generation Sequencing (NGS) variant-calling tools embedded in hundreds if not thousands of bioinformatics pipelines worldwide. (The majority of germline variant calling happens either through SAMtools or the Broad's GATK toolkit.) SAMtools was started at the Sanger Institute by Li Heng when he was in Richard Durbin's group, and stayed at Sanger now under the watchful eye of Thomas Keene.  

The 1.0 release of SAMtools has all sorts of good features, including more robust variant calling. But a real delight for me is the inclusion of CRAM read/write, which is a first-class format similar to SAM or BAM. SAM started this all off (hence the name samtools), and BAM was a binary and "standard" compressed form of SAM. CRAM was written to be compatible with the data model from SAM/BAM, and its main purpose is to leverage data-specific compression routines. These include reference-based compression of base sequences, based on work done by Markus Hsi-Yang Fritz and myself (blogged here, 3 years ago).

CRAM also brings in a whole set of compression tricks from James Bonfield (Sanger) and Vadim Zalunin (EMBL-EBI) that go beyond reference-based compression. James was the winner of the Pistoia Alliance's "Sequence Squeeze" competition, but sensibly said that his techniques would be best used as part of CRAM. James was instrumental in splitting out his C code into a reusable library (htslib), which is part of what SAMtools is built on. There is a whole mini-ecosystem of tools and hooks that enable reference-based compression to work, including a lightweight reference sequence server developed by Rasko Lenionen at EMBL-EBI. 

Elegant engineering

SAMtools is basically about a series of good engineers (John, James, Vadim and others) each working on different components for NGS processing, under the bonnet - with the Sanger and EMBL-EBI investing considerable effort into making the machinery come together. Just as an internal combustion engine requires more complex engineering than mixing fuel, igniting it and making a car go, good engineering takes more than a proof of principle. Really elegant engineering is invisible, and that is what SAMtools offers: it just works. It has been great to see John and James work CRAM into this indispensable piece of software, and for Sanger coordinate the project so well.

Space saving and flexibility

With this release CRAM becomes part of the natural SAMtools upgrade cycle, so when people upgrade their installation they will immediately see at least a 10% saving on disk space - if not better. If they allow the new Illumina machines to output the lower entropy quality values (this is the default), the savings will be more like 30-40%.

Another practical benefit of CRAM is its future proofing: CRAM comes "out of the box" with a variety of lossy compression techniques, and the format is quite flexible about potential new compression routines. We've seen that the main source of entropy is not the bases themselves, rather quality values on DNA sequence bases. CRAM provides options for controlled loss of precision on these qualities (something Markus and I explored in the original 2011 paper). It's important to stress that the right decision for the right level of lossy compression is best done by the scientist using and submitting the data. It might be that community standards grow up about lossy compression levels - its important to realise that in effect Illumina is already make a huge host of appropriate "data loss" decisions in their processing pipeline, most recently the shift to a reduced entropy quality score scheme. The first upgrade cycle will allow groups to mainline this and give them the option to explore appropriate levels of compression.

The Sanger Institute has been submitting CRAM since February 2014. And in preparation for the widespread use of CRAM - with the option of lossy compression - the European Nucleotide Archive has already announced that submitters can choose the level of compression and the service will archive data accordingly. Guy Cochrane, Chuck Cook and I also explored the potential community usage of compression for DNA sequencing in this paperSo we have solid engineering, a flexible technical ecosystem and have prepared for the social systems in which it will all work.

R&D ...&D ...&D

When the first CRAM submission from Sanger to EBI happened about a year ago, I blogged about how well this illustrates the difference between research and development. Our research an proof of principle implementation on data-specific compression of DNA took perhaps 1.5 FTE years of work, but to get from there to the release of SAMtools 1.0 has taken at least 8 FTE years of development. In modern biology, I think we still underestimate the amount of effort needed for effective development of a research idea for infrastructure deployment, but I sincerely hope this is changing - this wont be the last time we will need to roll in a more complex piece of engineering.

SAM, CRAM and the Global Alliance for Genomics and Health

CRAM development has come under the Global Alliance for Genomics and Health (GA4GH), a world-wide consortium of 200 organisations (including Sanger and the EBI) that aims to develop technologies for the secure sharing of genomic data. CRAM is part of the rather antiquated world in which specific file formats are needed for everything (all the cool kids focus on APIs these days). But also represents an efficient storage scheme for the more modern API vision of GA4GH. APIs need to have implementations, and the large scale "CRAM store" at the EBI provides one of the largest single instances of DNA sequence worldwide. Interestingly the container based format of CRAM has strong similarities with row-grouped, column orientated data store implementations, common in a variety of modern web technology. We will be exploring this more in the coming months.

Upgrade now, please.

The main thing for the genomics community is to now upgrade to SAMtools 1.0. This will give you many benefits in the whole short read calling and management, and one of those will be being able to read/write CRAM directly. If you do this it will have a considerable saving in disk for your institute and for us at the EBI. It will also help ensure we're all prepared for whatever volume of data you might be producing in the future.

Thursday, 10 July 2014

Scaling up bioinformatics training online

Bioinformatics has grown very quickly since the EBI opened 20 years ago, and I think it’s fair to say that it will grow even faster over the next 20 years. Biology is being transformed to a fundamentally information-centric science, and a key part of this has been the aggregation of knowledge in large-scale databases. When you put all the hard-won information about living systems together – their genome sequences, variation, proteins, interactions with small molecules – they are, potentially, incredibly useful. I say “potentially” because even the most pristine, large, interconnected data collection in the world isn’t worth much if people don’t know how to use it.
So at the EBI we have this challenge of making sure researchers (a) know we have all this amazing data for them, and (b) are able to use it. One aspect of this is making easy to use, intuitive websites, which is something I’ve blogged about before. But training, in all its forms, is really important.

A moving target

Not very surprisingly, face-to-face interactions really make the biggest difference. Nothing is better than having a person to guide you through using a resource (and making sure you’re using the right one), which is why some seven years ago we increased our training efforts substantially. We now run a huge number of courses on the Genome Campus in Cambridge and deliver an even larger number around the globe.
All this training is coordinated in one team, but of course training is embedded in all the different resource teams so the people actually leading the courses really know their stuff. I know first-hand from days as co-lead of Ensembl how effective this can be.
These courses have been taken out to over 200 sites in close to 30 countries, and they’ve reached more than 7000 people. But as I said, bioinformatics is growing fast and face-to-face training is just really hard to scale up. One way we are dealing with that is through a train-the-trainer programme, which we run in lots of different places, and while it’s effective, it’s just not enough.

Training online

So about three years ago we launched Train online, an e-learning platform set up to help molecular biologists figure out how to make the best use of our resources, dipping in as time permits. We now have 43 online courses, and over 60,000 people from close to 200 countries have visited Train online this year alone (effectively doubling the number we had this time last year).
Location of Visitors to Train Online

Good content first

These training materials are put together by teams of people who put a lot of effort into making them engaging and motivating. The way people learn individually (specifically, using online tools) is rather different from the way they learn through interacting with another person, so we try to accommodate this in a number of ways, for example making more ‘bite-sized’ courses.
Face-to-face training happens in real time and is fairly fluid, and there is a lot of preparation that goes on right up to the moment a course starts. The workflow for e-learning, on the other hand, needs constant reviewing and refreshing to stay current, so someone with adequate expertise needs to stay on it.
(As a point of interest, we assign DOIs to our courses so that the authors get recognition for their work and so we can track citations of them.)

Production value matters

As bioinformatics and computational techniques become increasingly important to more and more applied fields – healthcare, agriculture, environmental research and others – we will need to continue to innovate around how we train people. That means anything from new, effective methods for training educators to making e-learning platforms like Train online as interactive as possible.
High-quality online courses need be inviting to explore, so that you remember what you learn and are inspired to learn more. That requires significant infrastructure behind it. You need much more than just the technical capacity to set things up properly on the web – you need video equipment, editing and production workflows, video hosting and a great interface… but more than anything, you need solid in-house UX and multimedia expertise and you have to be ready to use it.

Who needs it?

A rough, back-of-the-envelope calculation estimates that there are up to 2 million life science researchers worldwide, and depending on how you count all healthcare-related research that number could go up to 4 million. If 100,000 people use Train online this year, our online learning resource alone will reach between 2% and 10% of the scientists who probably need bioinformatics training. That’s a pretty good start, but there’s a long way to go.

So if you haven’t had a look at Train online yet, please do – you might be surprised. It’s one innovation we’re particularly proud of, and we’re looking forward to seeing more in the future.