Monday, 19 January 2015

Untangling Big Data

"Big Data" is a trendy, catch-all phrase for handling large datasets in all sorts of domains: finance, advertising, food distribution, physics, astronomy and molecular biology - notably genomics. It means different things to different people, and has inspired any number of conferences, meetings and new companies. Amidst the general hype, some outstanding examples shine forth and today sees an exceptional Big Data analysis paper by a trio of EMBL-EBI research labs - Oliver Stegle, John Marioni and Sarah Teichmann - that shows why all this attention is more than just hype.

The paper is about analysing single-cell transcriptomics. The ability to measure all the RNA levels in a single cell simultaneously - and to do so in many cells at the same time - is one of the most powerful new technologies of this decade. Looking at gene regulation cell by cell brings genomics and transcriptomics closer to the world of cellular imaging. Unsurprisingly, many of the things we've had to treat as homogeneous samples in the past - just because of the limitations of biochemical assays - break apart into different components at the single-cell level. The most obvious examples are tissues, but even quite "homogeneous" samples separate into different constituents.

These data pose analysis challenges, the most immediate of which are technical. Single-cell transcriptomics requires quite aggressive PCR, which can easily be variable (for all sorts of reasons). The Marioni group created a model that both measures and accounts for this technical noise. But in addition to technical noise there are other large sources of variability, first and foremost of which is the cell cycle. 
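The details of the noise model are in the paper; what follows is only a minimal sketch of the general idea behind this class of methods - using spike-in controls, whose variability is purely technical, to fit how technical noise scales with expression level. The CV²-versus-mean fit and the function names here are my own simplification, not the published model:

```python
import numpy as np

def fit_technical_noise(spike_means, spike_cv2):
    """Fit CV^2 ~ a1/mean + a0 on spike-in controls by least squares.

    Spike-ins carry no biological variation, so their variability
    estimates the technical component alone.
    """
    X = np.column_stack([1.0 / spike_means, np.ones_like(spike_means)])
    (a1, a0), *_ = np.linalg.lstsq(X, spike_cv2, rcond=None)
    return a1, a0

def is_highly_variable(gene_mean, gene_cv2, a1, a0):
    """Flag a gene whose CV^2 exceeds the fitted technical expectation."""
    return gene_cv2 > a1 / gene_mean + a0
```

Genes whose variability sits above the fitted technical curve are the ones worth interpreting biologically.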

Cell cycle redux

For the non-biologists reading this, cells are nearly always dividing, and when they're not they are usually paused in a specific state. Cell division is a complex dance: not only does the genome have to be duplicated, but much of the internal structure also has to be split - the nucleus has to disassemble and reassemble each time (that's just for eukaryotic cells, not bacteria). This dance has been pieced together thanks to elegant research conducted over the past 30 years in yeast (two different types), frog cells, human cells and many others. But much remains to be understood. Because cells divide multiple times, the fundamental cycle (the cell cycle) has very tightly defined stages when specific processes must happen. Much of the cell cycle is controlled by both protein regulation and gene regulation. Indeed, the whole process of the nucleus "dissolving", sister chromatids being pulled to either side, and the nucleus reassembling has a big impact on RNA levels.

When you are measuring cells in bulk (i.e. 10,000 or more at the same time), the results will be weighted by the different 'lengths of stay' in different stages of the cell cycle. (You can sometimes synchronise the cell cycle, which is useful for research into the cell cycle, but it's hard to do routinely on any sample of interest). Now that we have single-cell measurements, which presumably tell us something about cell-by-cell variation, we also have an elephant in the room: namely, massive variation due to the cells being at different stages of the cell cycle. Bugger.
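To see that weighting concretely, here is a toy calculation for a single gene, with made-up stage durations and per-stage expression levels: because a random cell is caught in a stage with probability proportional to how long the stage lasts, the bulk measurement is the duration-weighted average.

```python
# Hypothetical stage durations (hours) and per-cell expression of one gene
durations = {"G1": 10.0, "S": 7.0, "G2M": 4.0}
expression = {"G1": 5.0, "S": 20.0, "G2M": 8.0}

total = sum(durations.values())
# A cell is observed in a stage with probability duration/total, so the
# bulk readout is the duration-weighted mean of the per-stage levels.
bulk = sum(durations[s] / total * expression[s] for s in durations)
```

The long G1 stage dominates the average even though S-phase expression is far higher - exactly the distortion single-cell measurements let us escape.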

One approach is to focus on cell populations that have paused (presumably in a consistent manner) in the cell cycle, like dendritic cells. But this is limiting, and many of the more interesting processes happen during cell proliferation; for example, Sarah Teichmann's favourite process of T-cell differentiation nearly always occurs in the context of proliferating cells. If we want to see things clearly, we need to somehow factor out the cell-cycle variation so we can look at other features.

Latent variables to the rescue

Taking a step back, our task is to untangle many different sources of variation - technical noise, the cell cycle and other factors - understand them, and set them aside. Once we do that, the interesting biology will begin to come out. This is generally how Oliver Stegle approaches most problems, in particular using Bayesian techniques to coax unknown, often complex factors (also called 'latent variables') from the data. For these techniques to work you need a lot of data (i.e. Big Data) to allow for variance decomposition, which can show how much each factor contributes to the overall variation.
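As a crude illustration of what variance decomposition means - nothing like the full Bayesian machinery, just sequential least-squares on toy data - one can ask what fraction of the total variance each known factor explains:

```python
import numpy as np

def variance_components(y, known_factors):
    """Toy variance decomposition by sequential regression.

    Real methods use random-effects / Bayesian models; this simply
    regresses out each factor in turn and reports the fraction of the
    original variance it explained.
    """
    residual = y - y.mean()
    total = residual.var()
    explained = {}
    for name, f in known_factors.items():
        X = np.column_stack([f, np.ones_like(f)])
        beta, *_ = np.linalg.lstsq(X, residual, rcond=None)
        fit = X @ beta
        explained[name] = fit.var() / total
        residual = residual - fit
    explained["residual"] = residual.var() / total
    return explained
```

With many genes and many cells, this kind of bookkeeping - done properly - tells you how much of what you see is cell cycle, how much is technical, and how much is left to explain.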

But even the best algorithm needs good targeting. Rather than trying to learn everything at once, Oli, John and Sarah set up the method to learn the identity of cell-cycling genes from a synchronised dataset - capturing both well-established cell-cycle genes and some previously unannotated ones. They brought that gene list into the context of single-cell experiments to learn the behaviour of these genes in a particular cell population, paying careful attention to technical noise. Et voilà: one can split the variation between cells into 'cell-cycle components' (in effect, assigning each cell to its cell-cycle stage), 'technical noise' and 'other variation'.
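A heavily simplified stand-in for this kind of analysis - using the leading principal component of the known cell-cycle genes as a per-cell "cell-cycle score" and regressing it out of every gene - might look like the sketch below. The real method is a full latent-variable model; this function is only an illustration of the factor-out step:

```python
import numpy as np

def factor_out_cell_cycle(expr, cycle_gene_idx):
    """Regress a PCA-derived cell-cycle score out of a cells x genes matrix.

    expr           : array of shape (n_cells, n_genes)
    cycle_gene_idx : column indices of known cell-cycle genes
    Returns the corrected matrix and the per-cell score.
    """
    sub = expr[:, cycle_gene_idx]
    sub = sub - sub.mean(axis=0)
    # Leading left singular vector = per-cell cell-cycle score
    u, s, vt = np.linalg.svd(sub, full_matrices=False)
    score = u[:, 0]
    # Regress the score (plus an intercept) out of every gene
    X = np.column_stack([score, np.ones(expr.shape[0])])
    beta, *_ = np.linalg.lstsq(X, expr, rcond=None)
    return expr - X @ beta, score
```

On the corrected matrix, whatever structure remains - clusters, gradients - is no longer driven by where each cell happened to be in the cycle.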

This really changes the result. Before applying the method, the cells looked like a large, variable population. After factoring out the cell cycle, two subpopulations emerged that had been hidden by the overlay of the variable cell cycle position, cell by cell, and those two subpopulations correlated to aspects of T-cell biology. Taking it from there, they could start to model other aspects as specific latent variables, such as T-cell differentiation.

You say confounder, I say signal

We are going to see variations on this method again and again (in my research group, we are heavy users of Oliver's latent-variable work). This variance decomposition is about splitting different components apart and showing them more clearly. If you are interested in the cell cycle, cell-cycle decomposition, or how certain details of factor changes differ between cell populations, it will be incredibly useful. If you are interested in differentiation, you can now "factor out" the cell cycle. In contrast, you might only be interested in the cell cycle and prefer to drop out other biological sources of variation. Even the technical variation is interesting if you are looking at optimising the PCR or machine conditions. "Noise" is a pejorative term here - it's all variation, with different sources and explanations. 

These techniques are not just about the cell cycle or single-cell genomics. Taken together, they represent a general mindset of isolating, understanding and ultimately modelling sources of variation in all datasets, whether they are cells, tissues, organs, whole organisms or populations. It is perhaps counter-intuitive to consider that if you have enough samples with enough homogeneous dimensions (e.g. gene expression, metabolites, or other features), you can cope with data that is otherwise quite variable by splitting out the different components.

This will be a mainstay for biological studies over this century. In many ways, we are just walking down the same road that the founders of statistics (Fisher, Pearson and others) laid down a century ago in their discussions on variance. But we are carrying on with far, far more data points and previously unimaginable abilities to compute. Big Data is allowing us to really get a grip on these complex datasets, using statistical tools, and thus to see the processes of life more clearly. 


stone said...

Sorry this is off topic but I was cheering at the radio this morning when you championed serendipitous basic research on the BBC R4 Today Program (about how CRISPR technology came about). It was brilliant how you got that crucial message into a sound bite sized slot. Everyone who loves science needs champions like you!
Cheers, Stone.

Valentine said...

Ewan, as you write, "Big Data" is certainly a trendy, catch-all phrase which might not mean much. But out of the three data sets used in the publication the largest one (the one with 3 * 96 mES cells in different states) is slightly less than 35 megabytes when loaded in to memory! It is frankly ridiculous to call 288 observations "Big Data".

Isabelle said...

Valentine, "Big Data" does not refer to the size of the data files but to the number of signals you measure in the experiment. In that case, it means 288 * ~30,000 gene expression measurements!

Valentine said...

Isabelle, the post says "you need a lot of data (i.e. Big Data) to allow for variance decomposition", so it definitely refers to the number of observations. And for sure, 288 is at least an order of magnitude larger than a nice, normal RNA-seq study (and two orders of magnitude larger than a typical one).

The problem of how to deal with far more variables than observations is an interesting one. But there are many technologies which measure a lot of variables: a decent digital camera, for example, measures on the order of millions of variables (pixels).

When analysts at advertising companies and the like talk about "big data", it usually refers to millions of observations of thousands of variables. The difficulty lies in scalability.

The fact that the GPLVM described in the linked paper works well with _only_ 288 observations is what makes it impressive! It's not due to some massive collection of data. The entire point of "big data" is that simple models with a lot of data work better than complicated models with little data.

