Monday, 17 September 2012
Human genetics; a tale of allele frequencies, effect sizes and phenotypes
A long time ago, in the late 1990s, I was on the edge of the debate about SNPs: whether there should be an investment in first discovering, then characterising and then using many, many biallelic markers to track down diseases. As is now obvious, this decision was taken (first the SNP Consortium, then the HapMap project and its successor, 1000 Genomes, and then many genome-wide association studies). I was quite young at the time (in my mid to late twenties; I even had an earring at the start of this as I was a young, rebellious man) and came from a background of sequence analysis, so I remember it being quite confusing to get my head around all the different terminology and subtleties of the argument. I think it was Lon Cardon who patiently explained the concepts to me yet again, and he finished by saying that the real headache was that there were just so many low frequency alleles that were going to be hidden, and that this was going to be a missed opportunity. I nodded at the time, adding yet one more confused concept to the discussions about r2 behaviours, population structure, effect size and recombination hotspots, none of which sat totally comfortably in my head at the time.
That debate is worth someone trying to reconstruct and write up (I wonder if those meetings are recorded somewhere) as in fact, as in many scientific debates, everyone was right at some level. For the proponents of the SNP approach, it definitely "worked" - statistically strong, reproducible loci were found for many diseases. Although these days people complain about the lack of predictions from GWAS, at the time the concern was not whether there would be some missing heritability issue, but (as I remember) whether it would work at all. It did, and in spades - just open an issue of Nature Genetics. However, the people who were cautioning that there would be a lot more complexity to disease - allelic heterogeneity, complex relationships between SNPs (both locally and globally) and then this curse of allele frequencies, let alone anything more complex, such as gene/environment, parent-of-origin or even epigenetic trans-generational inheritance (I list these in the rough order of my own assessment of impact; feel free to order to taste) - have also definitely been proved right by our current scenario.
Remembering that young man in his mid twenties, confused by all the terms spinning around, each of these pieces of complexity deserves unpicking. Allelic heterogeneity is when a locus is involved in a disease (for example, the LDL receptor, the LDLR gene, in familial hypercholesterolaemia), but there are many different alleles (each a different mutation), often with different effects, involved in the disease. This means the disease is definitely genetic; that a particular gene is definitely involved; but that no particular SNP is found at a high level associated with the disease, as there are hundreds or so of different (probably) causative alleles. The complex relationship between SNPs, epistasis, operates both at a local ("haplotype") level, where there might be a particular combination that's critical, and globally. A good example of this local complexity is the study by Daniel MacArthur and colleagues where they found that a number of apparent frameshift mutations, predicted to be null alleles, were "corrected" back in frame, making (in effect) a series of protein substitution changes, presumably with a far milder effect, if any. If you try to model each variant alone here, you make very different inferences from modelling the haplotype; in theory one should try to model the whole, global genotype.
And only recently have I really come to appreciate the headaches that Lon was trying to explain to me around allele frequency. One of the early and robust predictions of population genetics, which is pretty obvious when you think about it, is that one expects an exponential decay in the number of alleles as frequency in the population increases - i.e., lots, lots more rare alleles than common alleles. This is because when a mutation happens, it must start at a ratio of 1 to "the whole size of the population" and can only grow bigger generation by generation. If the allele doesn't affect anything you can model this process very elegantly as a random walk. For starters this random walk tends to stay at pretty low frequencies just because it is random, and in fact the most likely thing is that it randomly disappears from the population. Now if this allele has a deleterious effect - which is basically what we expect for disease associated alleles - then it is even more likely to stay at a low frequency. I visualise this as the genome having a sort of series of little bubbles (variants) coming off it, and these bubbles nearly always popping straight away (variant going to zero); only rarely does a bubble get big (grow in frequency in the population). A disease effect is always pushing those bubbles associated with disease to be smaller. And - often - you can't even see the small bubbles. At the limit, every locus will have complex allelic heterogeneity; the only question is how big - in both frequencies and in effect - some of the alleles are.
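The random walk described above can be sketched in a few lines. This is a minimal illustration, assuming a simple neutral Wright-Fisher style model with a fixed number of gene copies and no selection; the function names, population size and number of replicates are all invented for the example and are not taken from any particular analysis.

```python
import random


def drift_trajectory(n_copies, rng, start_count=1, max_generations=5000):
    """Follow one neutral allele until it is lost, fixed, or time runs out.

    n_copies is the total number of gene copies in the population (2N for
    diploids). Each generation, every copy in the next generation carries the
    allele with probability equal to the allele's current frequency.
    """
    count = start_count
    history = [count]
    for _ in range(max_generations):
        if count == 0 or count == n_copies:
            break  # the "bubble" has popped (lost) or taken over (fixed)
        freq = count / n_copies
        count = sum(1 for _ in range(n_copies) if rng.random() < freq)
        history.append(count)
    return history


def summarise_fates(n_copies=100, replicates=500, seed=1):
    """Count how many new mutations end up lost, fixed or still segregating."""
    rng = random.Random(seed)
    fates = {"lost": 0, "fixed": 0, "segregating": 0}
    for _ in range(replicates):
        final = drift_trajectory(n_copies, rng)[-1]
        if final == 0:
            fates["lost"] += 1
        elif final == n_copies:
            fates["fixed"] += 1
        else:
            fates["segregating"] += 1
    return fates


print(summarise_fates())
```

In a model like this the large majority of new mutations should be lost, which is the "bubbles nearly always popping straight away" picture. Adding a deleterious effect would need a selection term, which this sketch deliberately leaves out.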
Having appreciated this at a far deeper level now (partly from looking at a lot more data myself) I am now even more impressed that GWAS works. For GWAS one not only needs to have a variant tagging your disease variant, but that's got to be at a reasonable enough frequency to detect something statistically - one or two individuals will not cut it. This is one of the big drivers for the large sample sizes in genome wide association studies - large sample sizes are needed just to capture enough of the minor allele of rare variants - and remember that the majority of variation is in this "rare" scenario.
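A bit of arithmetic makes the sample size point concrete. The sketch below assumes diploid individuals and independent sampling of allele copies; the function names and the example frequencies are chosen purely for illustration, and this is a count of expected allele copies, not a statistical power calculation.

```python
def expected_minor_allele_copies(n_individuals, minor_allele_frequency):
    """Expected number of minor-allele copies among diploid individuals."""
    return 2 * n_individuals * minor_allele_frequency


def prob_at_least_one_copy(n_individuals, minor_allele_frequency):
    """Chance that the minor allele is seen at all in the sample."""
    return 1.0 - (1.0 - minor_allele_frequency) ** (2 * n_individuals)


# Illustrative values only: a common variant (5%) and a rare one (0.1%).
for n in (1_000, 10_000, 100_000):
    for maf in (0.05, 0.001):
        copies = expected_minor_allele_copies(n, maf)
        seen = prob_at_least_one_copy(n, maf)
        print(f"n={n:>7}  MAF={maf:<6} expected copies={copies:>8.1f}  P(seen)={seen:.3f}")
```

With a thousand individuals, a variant at 0.1% frequency is expected to turn up in only a couple of copies, which is the "one or two individuals will not cut it" situation.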
But the other place where we can improve our ability to understand things was illustrated by a talk by Samuli Ripatti, working with other colleagues worldwide (including my new collaborator, Nicole Soranzo) on lipid measurements. They took a far larger set of lipid measurements than is normally done in a clinical setting, with an alphabet soup of HDL and LDL subtypes, along with all sorts of amino acids. From this not only did they recapitulate all the existing HDL and LDL associations, but very often the specific subtypes of LDL or HDL showed far stronger effects than the composite measurements. At some level this is no surprise - the closer one gets to measuring a biological end point of genes, the bigger the effect you will see from variants, whereas more composite measurements must have more sources of variation by their very nature. And this is where all the molecular measurement techniques of ChIP-seq, RNA-seq, etc. (exploited and explored in projects like ENCODE and others) are going to be very interesting, though we won't be able to do everything on every cell type.
So - the moral of this story is twofold. Firstly, we will need large sample sizes to understand the full set of genetic effects - despite many people telling me this over the last three or four years, it only really "clicked" in my head in the last 6 months. Secondly, we need to raise our (collective) game in phenotyping, and not just molecular phenotyping, or cellular, or endo, or disease - but all types of phenotyping, as the closer we can get to the genotype from the phenotype end, the better powered we are.
And many, many groups worldwide are getting stuck into this, telling me that we have at least another decade's worth of discovery coming from relatively "straightforward" (in concepts, though not in practice, logistics, sequencing or analysis!) human genetics.
Sunday, 9 September 2012
Response on ENCODE reaction
The publication of ENCODE data raised substantial discussions. Clear, open, rational debate with access to data is the
cornerstone of science. For the scientific details the ENCODE papers are
totally open, and we have aimed for a high level of transparency e.g. a virtual
machine to provide complete access to data and code.
There is an important discussion
– which no doubt will continue throughout this decade – about the correspondence
between reproducible biochemical events on the genomes, their downstream cellular
and organismal functions, their selection patterns in evolution and their roles
in disease. ENCODE provides a substantial new dataset for this discussion, not
some definitive answer, and is part of a longer arc of science in this general
area. I touch on this on my blog.
There are also
"meta" questions concerning the balance of "big" and
"small" science, and how "big" science projects should be conducted.
The Nature commentary I wrote focuses
on this.
ENCODE also had the
chance of making our results comprehensible to the general public: those who
fund the work (the taxpayers) and those who may benefit from these discoveries
in the future. To do this we needed to reach out to journalists and help them
create engaging stories for their
readers and viewers, not for the readers of Nature
or Science. For me, the driving
concern was to avoid over-hyping the medical applications, and to emphasize that
ENCODE is providing a foundational resource akin to the human genome.
With hindsight, we
could have used different terminology to convey the concepts, consequence and
massive extent of genomic events we observed. (Note to self: one can be precise
about definitions in a paper or a scientific talk to scientists, but it’s far
harder via the medium of everyday press, even to the same audience). I do think
we got our point to the general public: that there is a staggering amount of
activity in the genome, and that this opens up a lot of sophisticated and highly
relevant scientific questions. There was a considerable amount of positive mainstream
press, sometimes quite nuanced. Hindsight is a cruel and wonderful thing, and
probably we could have achieved the same thing without generating this unneeded,
confusing discussion on what we meant and how we said it.
I am tremendously
proud of the way that the consortium worked together and created the resources
that it did. The real measure of a foundational resource such as ENCODE is not
the press reaction, nor the papers, but the use of its data by many scientists
in the future.
Wednesday, 5 September 2012
ENCODE: My own thoughts
5 September 2012 - Today sees the embargo lift on the
second phase of the ENCODE project and the simultaneous publication of 30
coordinated, open-access papers in Nature,
Genome Research and Genome Biology as well as publications
in Science, Cell, JBC and others. The
Nature publication has a number of
firsts: cross-publication topic threads, a dedicated iPad/eBook App and web site and a
virtual machine.
This ENCODE event represents five years of
dedicated work from over 400 scientists, one of whom is myself, Ewan Birney. I
was the lead analysis coordinator for ENCODE for the past five years (and
before that had effectively the same role in the pilot project) and for the past
11 months have spent a lot of time working up to this moment. There were
countless details to see to for the scientific publications and, later, to
explain it all in editorials, commentary, general press features and other exotic
things.
But in telling the story over and over,
only parts of it get picked up here and there – the shiny bits that make a neat
story for one audience or another. Here I’d like to add my own voice, and to tell
at least one person’s perspective of the ENCODE story uncut, from beginning to
end.
This blog post is primarily for scientists,
but I hope it is of interest to other people as well. Inspired by some of my
more sceptical friends (you know who you are!), I’ve arranged this as a kind of
Q&A.
Q. Isn’t
this a lot of noise about publications when it should be about the data?
A. You are absolutely right it’s about the
data – ENCODE is all about the
data being used widely. This is what we say in the conclusions
of the main paper: “The
unprecedented number of functional elements identified in this study provides a
valuable resource to the scientific community…” We focused on providing not
only raw data but many ways to get to it and make sense of it using a variety
of intermediate products: a virtual machine (see below), browse-able resources that
can be accessed from www.encodeproject.org and the UCSC and Ensembl browsers (and soon NCBI browsers), and a new
transcription-factor-centric resource, Factorbook. As I say in a Nature commentary, “The overall
importance of consortia science can not be assessed until years after the data
are assembled. But reference data sets are repeatedly used by numerous
scientists worldwide, often long after the consortium disbands. We already know
of more than 100 publications that make use of ENCODE data, and I expect many more
in the forthcoming years.”
Q. Whatever
– you love having this high-profile publication.
A. Of course I like
the publications! Publications are the best way for scientists to communicate with
each other, to explain key aspects of the data and draw some conclusions from
them. But the impact of the project goes well beyond the publications
themselves. While it is nice to see so much focus on the project, publishing is
simply part of disseminating information and making the data more accessible.
Q. And 442 authors! Did they all really
contribute to this?
A. Yes. I know a large proportion of them personally, and for the ones I
don’t know, I know and trust the lead principal investigators who have
indicated who was involved in this. To achieve systematic data generation on
this scale – in particular to achieve the consistency – is a large, detailed
task. Many of the other 30 papers – and many others to be published – go into
specific areas in increasing levels of detail.
One group which I
believe gets less credit than it deserves is the lead data production
scientists: usually individuals with a PhD who head up, motivate and
troubleshoot the work of a dedicated group of technicians. There is a simple
sentence in the paper: “For consistency, data were generated and processed
using standardized guidelines, and for some assays, new quality-control
measures were designed”. This hides a world of detailed, dedicated work.
There is no way to
truly weigh the contribution of one group of scientists compared to another in
a paper such as this; many individuals would satisfy the deletion test of “if
this person’s work was excluded, would the paper have substantially changed”.
However, two individuals stood out for their overall coordination and analysis,
as did 21 individuals in this data production area, including those in the key
role of the Data Coordination Center.
Q. Hmmm. Let’s move onto the science. I
don’t buy that 80% of the genome is functional.
A. It’s clear that 80%
of the genome has a specific biochemical activity – whatever that might be. This
question hinges on the word “functional” so let’s try to tackle this first.
Like many English language words, “functional” is a very useful but context-dependent
word. Does a “functional element” in the genome mean something that changes a
biochemical property of the cell (i.e.,
if the sequence was not here, the biochemistry would be different) or is it
something that changes a phenotypically observable trait that affects the whole
organism? At their limits (considering all the biochemical activities being a
phenotype), these two definitions merge. Having spent a long time thinking
about and discussing this, I have found that no single definition of
“functional” works for
all conversations. We have to be precise about the context. Pragmatically, in
ENCODE we define our criteria as “specific biochemical activity” – for example,
an assay that identifies a series of bases. This is not the entire genome (so,
for example, things like “having a phosphodiester bond” would not qualify). We
then subset this into different classes of assay; in decreasing order of
coverage these are: RNA, “broad” histone modifications, “narrow” histone
modifications, DNaseI hypersensitive sites, Transcription Factor ChIP-seq
peaks, DNaseI Footprints, Transcription Factor bound motifs, and finally Exons.
Q. So remind me which one do you think is
“functional”?
A. Back to that word “functional”: There is
no easy answer to this. In ENCODE we present this hierarchy of assays with
cumulative coverage percentages, ending up with 80%. As I’ve pointed out in
presentations, you shouldn’t be surprised by the 80% figure. After all, 60% of
the genome with the new detailed manually reviewed (GenCode) annotation is
either exonic or intronic, and a number of our assays (such as PolyA- RNA, and
H3K36me3/H3K79me2) are expected to mark all active transcription. So seeing an
additional 20% over this expected 60% is not so surprising.
However, on the other end of the scale – using
very strict, classical definitions of “functional” like bound motifs and DNaseI
footprints; places where we are very confident that there is a specific DNA:protein
contact, such as a transcription factor binding site to the actual bases – we
see a cumulative occupation of 8% of the genome. With the exons (which most
people would always classify as “functional” by intuition) that number goes up to
9%. Given what most people thought earlier this decade, that the regulatory
elements might account for perhaps a similar amount of bases as exons, this is
surprisingly high for many people – certainly it was to me!
In addition, in this phase of ENCODE we did
sample broadly but nowhere near completely in terms of cell types or
transcription factors. We estimated how well we have sampled, and our most
generous view of our sampling is that we’ve seen around 50% of the elements. There
are lots of reasons to think we have sampled less than this (e.g., the inability to sample
developmental cell types; classes of transcription factors which we have not
seen). A conservative estimate of our expected coverage of exons + specific
DNA:protein contacts gives us 18%, easily further justified (given our
sampling) to 20%.
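The idea of a hierarchy of assays with cumulative coverage can be sketched as a union of intervals that grows as each class is added. This is a toy illustration only: the interval coordinates, the 100-base "genome" and the function names are all invented, the classes are added from strictest to broadest for readability, and the real ENCODE analysis is described in the main paper and its supplement.

```python
def merge_intervals(intervals):
    """Merge half-open [start, end) intervals into a sorted, non-overlapping list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def covered_bases(intervals):
    """Number of distinct bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


def cumulative_coverage(assay_classes, genome_length):
    """Return (class name, cumulative covered fraction) as each class is added."""
    seen = []
    result = []
    for name, intervals in assay_classes:
        seen.extend(intervals)
        result.append((name, covered_bases(seen) / genome_length))
    return result


# Hypothetical toy data: a 100-base genome and made-up elements per class.
TOY_GENOME_LENGTH = 100
TOY_CLASSES = [
    ("exons", [(10, 15)]),
    ("bound motifs and footprints", [(12, 18), (40, 44)]),
    ("histone modifications", [(5, 30), (38, 50)]),
    ("RNA", [(0, 60), (70, 90)]),
]

for name, fraction in cumulative_coverage(TOY_CLASSES, TOY_GENOME_LENGTH):
    print(f"{name:<28} cumulative coverage = {fraction:.0%}")
```

Because each class is unioned with everything before it, overlapping elements are only counted once, so the final figure is the fraction of bases touched by at least one assay rather than a sum of per-assay percentages.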
Q. [For
the more statistically minded readers]: What about the whole headache of
thresholding your enrichments? Surely this is a statistical nightmare across
multiple assays and even worse with sampling estimates.
A. It is a bit of a
nightmare, but thankfully we had a really first class non-parametric
statistical group (the Bickel group) who developed a robust, non-parametric (so
it makes minimal assumption about distribution), conservative statistic based
on reproducibility (IDR). This is not perfect: if one
replicate has far better signal-to-noise than the other, it stops calling at
the onset of noise in the noisier replicate, but this is generally a
conservative bias. And for the sampling issues, we explored different
thresholds and looked at saturation when we were relaxed on thresholds and then
shifted to being conservative. Read the supplementary information and have a
ball.
Q. [For
50% of the readers]: Ok, I buy the 20% of the genome is really doing
something specific. In fact, haven’t a lot of other people suggested this?
A. Yes. There have
been famous discussions about how regulatory changes – not protein changes –
must be responsible for recent evolution, and about other locus assays
(including about 10 years of RNA surveys). But ENCODE has delivered the most
comprehensive view of this to date.
Q. [For
the other 50% of readers]: I still don’t buy this. I think the majority of
this is “biological noise”, for instance binding that doesn’t do anything.
A. I really hate the
phrase “biological noise” in this context. I would argue that “biologically
neutral” is the better term, expressing that there are totally reproducible,
cell-type-specific biochemical events that natural selection does not care
about. This is similar to the neutral theory of amino acid evolution, which
suggests that most amino acid changes are not selected either for or against. I
think the phrase “biological noise” is best used in the context of stochastic
variation inside a cell or system, which is sometimes exploited by the organism
in aspects of biology, e.g. signal
processing.
It’s useful to keep
these ideas separate. Both are due to stochastic processes (and at some level
everything is stochastic), but these biologically neutral elements are as
reproducible as the world’s most standard, developmentally regulated gene. Whichever
term you use, we can agree that some of these events are “neutral” and are not
relevant for evolution. This is consistent with what we’ve seen in the ENCODE
pilot and what colleagues such as Paul Flicek and Duncan Odom have seen in
elegant experiments directly tracking transcription factor binding across species.
Q. Ok, so why don’t we
use evolutionary arguments to define “functional”, regardless of what evolution
‘cares about’? Isn’t this 5% of the human genome?
A. Anything under
negative selection in the human population (i.e. recent human evolution) is
definitely functional. However, even with this stated criterion, it is very hard
to work out how many bases this is. The often-quoted “5%”, which comes from the
mouse genome paper, is actually the fitting of two Gaussians that look at the
distribution of conservation between human and mouse in 50bp windows. We’ve
been referring to 5% of those 50bp windows.
When you consider the
number of bases being conserved, the figure must be lower than 5%, as we don’t expect
100% of the bases in these 50bp windows to be conserved. However, this is only
about pan-mammalian constraint, and we are interested in all constraint in the
human genome, including the lineage specific elements, so this estimate just
provides a floor to the numbers. The end result is that we don’t actually have
a tremendously good handle on the number of bases under selection in humans.
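To make "the fitting of two Gaussians" concrete, here is a minimal sketch of fitting a two-component Gaussian mixture to per-window scores with a plain EM loop. Everything in it is an assumption made for illustration: the scores are synthetic, the 5% fraction, means and spreads are invented, and this is not the procedure used in the mouse genome paper.

```python
import math
import random


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(values, iterations=60):
    """Fit a two-component 1-D Gaussian mixture by EM.

    Returns [(weight, mean, sd), (weight, mean, sd)] for the lower and
    upper components respectively.
    """
    ordered = sorted(values)
    n = len(ordered)
    overall_mean = sum(ordered) / n
    overall_sd = math.sqrt(sum((v - overall_mean) ** 2 for v in ordered) / n)
    # Crude starting guess: one component at the median, one in the upper tail.
    params = [
        [0.5, ordered[n // 2], overall_sd],
        [0.5, ordered[int(0.99 * n)], overall_sd],
    ]
    for _ in range(iterations):
        # E-step: probability that each value belongs to the upper component.
        resp_high = []
        for v in ordered:
            low = params[0][0] * normal_pdf(v, params[0][1], params[0][2])
            high = params[1][0] * normal_pdf(v, params[1][1], params[1][2])
            resp_high.append(high / (low + high))
        # M-step: re-estimate weight, mean and sd of each component.
        for k in (0, 1):
            resp = resp_high if k == 1 else [1.0 - r for r in resp_high]
            total = sum(resp)
            mean = sum(r * v for r, v in zip(resp, ordered)) / total
            var = sum(r * (v - mean) ** 2 for r, v in zip(resp, ordered)) / total
            params[k] = [total / n, mean, max(math.sqrt(var), 1e-6)]
    return [tuple(p) for p in params]


def simulate_window_scores(n_windows=4000, constrained_fraction=0.05, seed=0):
    """Synthetic per-window conservation scores (hypothetical, not real data)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_windows):
        if rng.random() < constrained_fraction:
            scores.append(rng.gauss(5.0, 1.0))  # "constrained" windows
        else:
            scores.append(rng.gauss(0.0, 1.0))  # "neutral" windows
    return scores


neutral, constrained = fit_two_gaussians(simulate_window_scores())
print(f"estimated fraction of windows in the upper component: {constrained[0]:.3f}")
```

Note that the estimated weight is a fraction of windows, not of bases, which is exactly the distinction drawn in the text above.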
Some have tried other estimates
of negative selection, trying to get a handle on the more recent evolution. I
particularly like Gerton Lunter’s and Chris Ponting’s estimates (published in Genome Research), which give a window of
between 10% and 15% of the bases in the human genome being under selection –
though I note some people dispute their methodology.
By identifying those regions
likely to be under selection (because they have specific biochemical activity) in
an orthogonal, experimental manner, ENCODE substantially adds to this debate.
By identifying isolated, primate-specific insertions (where we can say with
confidence that the sequence is unique to primates), we could contrast the bases
inside ENCODE-identified regions with those outside. As ENCODE data covers the
genome, we now have enough statistical power to look at the derived allele
frequency (DAF) spectrum of SNPs in the human population. The SNPs inside
ENCODE regions show more very low frequency alleles than the SNPs outside (accurate
genome-wide frequencies are available thanks to the 1000 Genomes Project), which is a
characteristic sign of negative selection and is not influenced by confounders
such as mutation rate of the sequence (see Figure 1 of the main ENCODE paper).
We can do that across
all of ENCODE, or break it down by broad sub-classification. Across all sub-classifications
we see evidence of negative selection. Sadly, it is not trivial to estimate the
proportion of bases from derived allele frequency spectra that are under
selection, and the numbers are far more slippery than one might think. Over the
next decade there will, I think, be much important reconciliation work, looking
at both experimental and evolutionary/population aspects (bring on the million-person
sequencing dataset!).
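The derived allele frequency comparison described above boils down to asking whether SNPs inside annotated regions are skewed towards rare derived alleles relative to SNPs outside. A minimal sketch, assuming a single frequency cut-off as the summary statistic; the frequencies below are invented for illustration, and a real analysis would compare the full spectra and assess significance.

```python
def low_frequency_fraction(derived_allele_frequencies, threshold=0.05):
    """Fraction of SNPs whose derived allele frequency is below the threshold."""
    if not derived_allele_frequencies:
        raise ValueError("need at least one SNP")
    rare = sum(1 for f in derived_allele_frequencies if f < threshold)
    return rare / len(derived_allele_frequencies)


# Hypothetical derived allele frequencies, made up purely for this example.
inside_regions = [0.01, 0.02, 0.01, 0.03, 0.20, 0.04, 0.01, 0.60, 0.02, 0.01]
outside_regions = [0.01, 0.10, 0.30, 0.02, 0.45, 0.04, 0.70, 0.15, 0.01, 0.25]

print("inside :", low_frequency_fraction(inside_regions))
print("outside:", low_frequency_fraction(outside_regions))
```

A higher rare fraction inside the regions is the direction of skew that negative selection would produce. As noted above, turning such a skew into a proportion of bases under selection is a much harder problem that this sketch does not attempt.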
Q. So
– we’re really talking about things under negative selection in human – is that
our final definition of “functional”?
A. If it is under negative selection in the human population, for me it
is definitely functional.
I, and other
people, do think we need to be open to the possibility of bases that definitely
affect phenotypes but are not under negative selection – both disease-related
phenotypes and other normal phenotypes. My colleague Paul Flicek uses the shape
of the nose as an example; quite possibly the different nose shapes are not
under selection – does that mean we’re not interested in this phenotype?
Regardless of
all that, we really do need a full, cast-iron set of bases under selection in humans
– this is a baseline set.
Q. Do
you really need ENCODE for this?
A. Yes. Imagine that THE
set of bases under selection in the human genome were dropped in your lap by
some passing deity. Wonderful! But you would still want to know the how and
why. ENCODE is the starting place to answer the biochemical “how”. And given that
passing deities are somewhat thin on the ground, we should probably go ahead
and figure out models of how things work so that we can establish this set of
bases. I am particularly excited about the effectiveness of using position–weight
matrices in the ENCODE analyses (my postdoc Mikhail Spivakov did a nice piece
of work here).
Q. Ok,
fair enough. But are
you most comfortable with the 10% to 20% figure for the hard-core functional
bases? Why emphasize the 80% figure in the abstract and press release?
A. (Sigh.) Indeed. Originally I pushed for using an “80% overall” figure and
a “20% conservative floor” figure, since the 20% was extrapolated from the
sampling. But putting two percentage-based numbers in the same breath/paragraph
is asking a lot of your listener/reader – they need to understand why there is
such a big difference between the two numbers, and that takes perhaps more
explaining than most people have the patience for. We had to decide on a
percentage, because that is easier to visualize, and we chose 80% because (a)
it is inclusive of all the ENCODE experiments (and we did not want to leave any
of the sub-projects out) and (b) 80% best conveys the difference between a
genome made mostly of dead wood and one that is alive with activity. We refer also
to “4 million switches”, and that represents the bound motifs and footprints.
We use the bigger number
because it brings home the impact of this work to a much wider audience. But we
are in fact using an accurate, well-defined figure when we say that 80% of the
genome has specific biological activity.
Q. I get really annoyed with papers like ENCODE
because it is all correlative. Why don’t people own up to that?
A. It is mainly
correlative, and we do own up to it. (We did do a number of specific experiments
in a more “testing” manner – in particular I like our mouse and fish
transgenics, but not for everything.) For example, from the main paper: “This
is an inherently observational study of correlation patterns, and is consistent
with a variety of mechanistic models with different causal links between the
chromatin, transcription factor and RNA assays. However, it does indicate that
there is enough information present at the promoter regions of genes to explain
most of the variation in RNA expression.”
Interestingly enough,
we had quite long debates about language/vocabulary. For example, when we built
quantitative models, to what extent were we allowed to use the word “predict”?
Both the model framework and the precise language used to describe the model
imply a sort of causality. Similarly, we describe our segmentation-based
results as finding states “enriched in enhancers”, rather than saying that we
are providing a definition of an enhancer. Words are powerful things.
Q. I am still skeptical. What new insights does
ENCODE offer, and are they really novel? Most of the time I think someone has
already seen something similar before.
A. I think that the
scale of ENCODE – in particular the diversity of factors and assays – is
impressive, and although correlative, this scale places some serious
constraints on models. For example, the high, quantitative correlation between
CAGE tags and histone marks at promoters limits the extent to which RNA
processing changes RNA levels. (This is measured by 5’ ends – n.b. if there is a
considerable amount of aborted transcription generating 5’ends, this need not
mean full transcripts, though this correlation is high both for nuclear
isolated 5’ends and cytoplasmic isolated 5’ ends.)
As for “someone has
discovered it already,” I agree that the vast majority of our insights and
models are consistent with at least one published study – often on a specific
locus, sometimes not in human. Indeed, given the 30 years of study into
transcription, I am very wary of
putting forward concepts that don’t have support from at least some individual
loci studies.
ENCODE has been
selecting/confirming hypotheses that are broadly genome-wide, or multi-cell
line true. ENCODE is a different beast from focused, mechanistic studies, which
often (and rightly) involve precise perturbation experiments. Both the broader
studies and the more focused studies help define phenomena such as
transcription and chromatin dynamics.
This is all in the
main paper, but then the network paper (led by Mike Snyder and Mark Gerstein)
on transcription factor co-binding, the open chromatin distribution paper (led
by Greg Crawford, Jason Lieb, John Stamatoyannopoulos), the DNaseI distribution
paper (led by John Stamatoyannopoulos), the RNA distribution and processing
paper (led by Roderic Guigo and Tom Gingeras) and chromatin conformation paper
(led by Job Dekker) all provide non-obvious insights into how different
components interact. And that’s just the Nature papers – there are
another 30-odd papers to read. (We hope our new publishing innovation –
“threads” – will help you navigate easily to the parts of all these papers you
are most interested in reading.)
Q. You talk about how this will help medicine,
but I don’t see this being directly relevant?
A. ENCODE is a foundational data set – a layer on top of the human genome –
and its impact will be to make basic and applied research work faster and more
cheaply. Because of our systematic, genome-wide approach, we’ve been able to
deliver essential, high-quality reference material for smaller groups working
on all manner of diseases. And in particular the overlap to genome-wide
association studies (GWAS) has been a very informative analysis.
Q. Moving to the disease genetics, were you surprised
at this correlation with GWAS, as the current GWAS catalog is about lead
association study SNPs, and we don’t expect this to overlap with functional
data.
A. This was definitely
a surprise to us. When I first saw this result I thought there was something
wrong with some aspect of the analysis! The raw enrichment of GWAS-lead SNPs
compared to baseline SNPs (e.g. those from the 1000 Genomes Project) is very
striking, and yet if the GWAS-lead SNPs are expected to be tagging (but not
coincident with) a functional variant, you would expect little or no
enrichment.
We ended up with four
groups implementing different approaches here, and all of them found the same
two results. First, that the early SNP genotyping chips are quite biased
towards functional regions. By talking to some of the people involved in those
early designs (ca. 2003), I learned some of this is deliberate, for instance
favouring SNPs near promoters. But even if you model this in, the enrichment of
GWAS SNPs over a null set of matched SNPs is still there. This is similar to
that card in Monopoly: “Bank Error in your favour; please collect 10
Euros/Dollars/Pounds”. In this case, it is: “Design bias in your favour; you
will have more functional variants identified in the first screen than you
think”.
We think that around 10% to 15% of GWAS
“lead” loci are either the actual functional SNP in the condition studied or
within 200bp of the functional variant. This is all great, but we can now do
something really brilliant: break down
this overall enrichment by phenotypes (from GWAS) and by functional type, in
particular cell type (DNaseI) or transcription factor (TF). This matrix has a
number of significant enrichments of particular phenotypes compared with
factors or cell types. Some of these we understand well (e.g., Crohn’s disease and T-Helper cells); some of these
enrichments are perfectly credible (e.g., Crohn’s disease and GATA-factor
transcription factors); and some are a bit of a head-scratcher.
But the great thing about our data is that
we didn’t have to choose a specific cell type to test or a particular disease.
By virtue of being able to map both diseases and cell-specific (or
transcription-factor-specific) elements to the genome, we can look across all
possibilities. This will improve as we get more transcription factors and as we
get better “fine mapping” of variants. For me, this result alone is totally
exciting: it’s very disease-relevant,
and it leverages the unbiased, open, genome-wide nature of both ENCODE and GWAS
studies to point to new insights for disease.
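To make the phenotype-by-cell-type breakdown a little more concrete, here is a minimal Python sketch of the kind of calculation involved. It is not the ENCODE analysis itself (four groups used different approaches, with carefully matched null SNPs); the counts are invented and the function names are hypothetical. For each phenotype and functional element type, it compares how often GWAS lead SNPs overlap the elements against how often matched null SNPs do.

```python
from math import comb


def fold_enrichment(lead_hits, lead_total, null_hits, null_total):
    """Overlap fraction of lead SNPs divided by that of matched null SNPs."""
    if null_hits == 0:
        raise ValueError("no null overlaps; fold enrichment is undefined")
    # Same as (lead_hits / lead_total) / (null_hits / null_total),
    # rearranged to avoid intermediate rounding.
    return (lead_hits * null_total) / (lead_total * null_hits)


def upper_tail_p(lead_hits, lead_total, null_hits, null_total):
    """One-sided Fisher exact p-value: the chance of seeing at least
    lead_hits overlaps among the lead SNPs if lead and null SNPs
    overlapped functional elements at the same rate."""
    total = lead_total + null_total
    hits = lead_hits + null_hits
    max_hits = min(lead_total, hits)
    favourable = sum(
        comb(hits, k) * comb(total - hits, lead_total - k)
        for k in range(lead_hits, max_hits + 1)
    )
    return favourable / comb(total, lead_total)


# Invented counts, for illustration only:
# (lead SNPs overlapping, lead SNPs tested, null SNPs overlapping, null SNPs tested)
toy_counts = {
    ("Crohn's disease", "T-helper DNaseI"): (30, 100, 10, 100),
    ("Crohn's disease", "GATA-factor TF"): (18, 100, 9, 100),
    ("Some other phenotype", "T-helper DNaseI"): (11, 100, 10, 100),
}

for (phenotype, element), counts in toy_counts.items():
    fold = fold_enrichment(*counts)
    p_value = upper_tail_p(*counts)
    print(f"{phenotype} x {element}: fold={fold:.2f}, p={p_value:.3g}")
```

A real analysis would also need to correct for testing many phenotype and element combinations at once, and to choose null SNPs matched on properties such as allele frequency and chip design bias.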
Q. You make a fuss about these new publishing
aspects, such as “threads”. Should I be excited?
A. I hope so! The idea of threads is a novel attempt by us to help readers
get the most out of this body of coordinated scientific work. Suppose you are only
interested in a particular topic – say, enhancers – but you know that different
groups in ENCODE are likely to have mentioned this (in particular the technical
papers in Genome Research and Genome Biology). Previously you
would have had to skim the abstract or text of all 30 papers to try and work
out which ones were really relevant. Threads offer an alternative, lighting up
a path through the assembled papers, pointing out the figures and paragraphs
most relevant to any of 13 topics and taking you all the way through to the
original data. The threads are there to help you discover more about the
science we’ve done, and about the ENCODE data. Interestingly, this is something
that’s only achievable in the digital form, and for the first time I found
myself being far more interested in how the digital components work than in the
print components.
The idea of threads
came from the consortium, but the journal editors, in particular Magdalena
Skipper from Nature, made it a reality – remember that in these threads we are
crossing publishing house boundaries. I think the resulting web site and iPad
app work very well. I will be interested to see how other scientists
react to this.
Q. And what about this Virtual Machine? Why is
it interesting?
A. An innovation in computing over the last decade has been the use of
virtualization, where the whole state of a computer can be saved to a file, and
transported to another “host” computer and then restarted. This has given us a
new opportunity to increase transparency for data intensive science.
Many people have noted that complex computational methods are very hard
to track in all their detail. We currently place a lot of trust in each other
as scientists that phrases such as “we then removed outliers” or “we normalised
using standard methods” are executed appropriately. The ENCODE virtual machine
provides all these complex details explicitly in a package that
will last at least as long as the open virtualization format we use (OVF,
VirtualBox). So if you are a computational biologist in three years’ time, and
you want to see the precise details of how we achieved something, you can run
the analysis code yourself. The only caveat is that for the large,
compute-scale pipelines we provide an exemplar processing step, together with
the results of the parallelised run (i.e. we
do not have a virtualised pipeline). Think of this as a bit like the ultimate
materials and methods section of the paper. I believe this virtual
machine substantially increases the transparency of this data-intensive
science, and that we should produce virtual machines in the future for all
data-intensive papers.
Q. I’ve read
your Nature commentary about large projects, and admit that
I'm uneasy about how these large projects throw their weight around. Isn’t there more friction and angst than
you admit to?
A. There is indeed friction and angst, in
particular with the smaller groups (“hypothesis testing” or “R01” groups)
close to the scientific areas of ENCODE. I regret every instance of this and
have tried my best to make things work out. After a lot of experience,
I’ve realised a couple of things: Like any large beast, projects like ENCODE
can inadvertently cause headaches for smaller groups. Part of this is actually
due to third parties, for example reviewers of papers or grants who mistakenly
think that the large datasets in ENCODE somehow replace or make redundant more
focused studies. This is rarely the case – what the large project provides is a
baseline dataset that is useful mainly for people who don’t have the time or
inclination to do such a study and, importantly, who would not find it
practical to do this work systematically (i.e. rather than
cutting straight to established, promising focus areas). ENCODE’s target audience
is someone who needs this systematic approach, for example clinical researchers
who might scan their (putatively) causative alleles or somatic variants against
such a catalog. ENCODE does not replace the targeted perturbation experiment,
which illuminates some aspect of chromatin or transcriptional mechanism
(sometimes in a particular disease context). However, people less involved in
this work can make the mistake of lumping together the mechanistic study and
the catalog building as “doing ChIP-seq”, and assume they are redundant. As
scientists in this area, both large and small groups need to regularly point
out their explicit and non-overlapping complementarity.
Also, compared to some
other scientific fields, genomics has a remarkably positive track record in
data sharing and communication. We can do far better (more below), but everyone
should be mindful that for all our faults, we do share datasets completely and
openly, we nearly always share resources and techniques, and we do communicate. Non-genomicists
would sometimes be surprised at the depth of distrust in other
fields. That said, there is always room for improvement. Although we did
use pre-publication raw data sharing in ENCODE, we should have spent more time
and effort sharing intermediate datasets (in addition to raw datasets). The
1000 Genomes Project provides an excellent example to follow.
Finally, I believe that
the etiquette-based system of how to handle pre-publication data release (and I
was a prominent participant in this discussion) is clumsy and outmoded: it was designed for a world where data generation,
not analysis, is the bottleneck. I believe we need to have a new scheme. I'm
not rushing to state my own opinion here - we need to have a deliberative
process that balances getting broad buy-in and ideas with a timely and
practical result.
Q. So ENCODE is all
done now, right?
A. Nope! ENCODE “only” did 147 cell types and 119
transcription factors, and we need to have a baseline understanding of every
cell type and transcription factor. Thankfully, NHGRI has approved the idea of
pushing for this – not an unambitious task – over the next 5 years. I see there
being three phases of ENCODE: the ENCODE Pilot (1% of the genome); the ENCODE
scale-up (or production), where we showed that we can work at this scale and
analyse the data sensibly; and next the ENCODE “build-out” phase, extending to all cell
types and factors.
Q. So you get to do
this for another five years?
A. Someone does. I have hung up my ENCODE
“cat-herder-in-chief” hat, and moved on to new things, like the equally
challenging world of delivering a pan-European bioinformatics infrastructure
(ELIXIR). But that’s for another blog post!
Q. Be honest. Will
you miss it?
A. Looking back on my
ten years with ENCODE, you know, I really am going to miss this. (Okay, maybe I
won't miss three-hour teleconferences running to 2am...). It has been hard work
and excellent science – I’ve met and interacted with so many great scientists
and have honestly had a lot of fun.
