Wednesday, 5 September 2012

ENCODE: My own thoughts


5 September  2012 - Today sees the embargo lift on the second phase of the ENCODE project and the simultaneous publication of 30 coordinated, open-access papers in Nature, Genome Research and Genome Biology as well as publications in Science, Cell, JBC and others. The Nature publication has a number of firsts: cross-publication topic threads, a dedicated iPad/eBook App and web site and a virtual machine.

This ENCODE event represents five years of dedicated work from over 400 scientists, one of whom is myself, Ewan Birney. I was the lead analysis coordinator for ENCODE for the past five years (and before that had effectively the same role in the pilot project) and for the past 11 months have spent a lot of time working up to this moment. There were countless details to see to for the scientific publications and, later, to explain it all in editorials, commentary, general press features and other exotic things.

But in telling the story over and over, only parts of it get picked up here and there – the shiny bits that make a neat story for one audience or another. Here I’d like to add my own voice, and to tell at least one person’s perspective of the ENCODE story uncut, from beginning to end.

This blog post is primarily for scientists, but I hope it is of interest to other people as well. Inspired by some of my more sceptical friends (you know who you are!), I’ve arranged this as a kind of Q&A.

Q. Isn’t this a lot of noise about publications when it should be about the data?
A. You are absolutely right it’s about the data – ENCODE is all about the data being used widely. This is what we say in the conclusions of the main paper: “The unprecedented number of functional elements identified in this study provides a valuable resource to the scientific community…” We focused on providing not only raw data but many ways to get to it and make sense of it using a variety of intermediate products: a virtual machine (see below), browse-able resources that can be accessed from www.encodeproject.org and the UCSC and Ensembl browsers (and soon NCBI browsers), and a new transcription-factor-centric resource, Factorbook. As I say in a Nature commentary, “The overall importance of consortia science can not be assessed until years after the data are assembled. But reference data sets are repeatedly used by numerous scientists worldwide, often long after the consortium disbands. We already know of more than 100 publications that make use of ENCODE data, and I expect many more in the forthcoming years.”

Q. Whatever – you love having this high-profile publication.
A. Of course I like the publications! Publications are the best way for scientists to communicate with each other, to explain key aspects of the data and draw some conclusions from them. But the impact of the project goes well beyond the publications themselves. While it is nice to see so much focus on the project, publishing is simply part of disseminating information and making the data more accessible.

Q. And 442 authors! Did they all really contribute to this?
A. Yes. I know a large proportion of them personally, and for the ones I don’t know, I know and trust the lead principal investigators who have indicated who was involved in this. To achieve systematic data generation on this scale – in particular to achieve the consistency – is a large, detailed task. Many of the other 30 papers – and many others to be published – go into specific areas in increasing levels of detail.

One group which I believe gets less credit than it deserves is the lead data production scientists: usually individuals with a PhD who head up, motivate and troubleshoot the work of a dedicated group of technicians. There is a simple sentence in the paper: “For consistency, data were generated and processed using standardized guidelines, and for some assays, new quality-control measures were designed”. This hides a world of detailed, dedicated work.

There is no way to truly weigh the contribution of one group of scientists compared to another in a paper such as this; many individuals would satisfy the deletion test of “if this person’s work was excluded, would the paper have substantially changed”. However, two individuals stood out for their overall coordination and analysis, as did 21 individuals in the data production area, including the Data Coordination Center, which played a key role.

Q. Hmmm. Let’s move on to the science. I don’t buy that 80% of the genome is functional.
A. It’s clear that 80% of the genome has a specific biochemical activity – whatever that might be. This question hinges on the word “functional” so let’s try to tackle this first. Like many English language words, “functional” is a very useful but context-dependent word. Does a “functional element” in the genome mean something that changes a biochemical property of the cell (i.e., if the sequence was not here, the biochemistry would be different) or is it something that changes a phenotypically observable trait that affects the whole organism? At their limits (considering all biochemical activities to be a phenotype), these two definitions merge. Having spent a long time thinking about and discussing this, I have found that no single definition of “functional” works for all conversations. We have to be precise about the context. Pragmatically, in ENCODE we define our criteria as “specific biochemical activity” – for example, an assay that identifies a series of bases. This is not the entire genome (so, for example, things like “having a phosphodiester bond” would not qualify). We then subset this into different classes of assay; in decreasing order of coverage these are: RNA, “broad” histone modifications, “narrow” histone modifications, DNaseI hypersensitive sites, Transcription Factor ChIP-seq peaks, DNaseI Footprints, Transcription Factor bound motifs, and finally Exons.

Q. So remind me: which one do you think is “functional”?
A. Back to that word “functional”: There is no easy answer to this. In ENCODE we present this hierarchy of assays with cumulative coverage percentages, ending up with 80%. As I’ve pointed out in presentations, you shouldn’t be surprised by the 80% figure. After all, 60% of the genome with the new detailed manually reviewed (GenCode) annotation is either exonic or intronic, and a number of our assays (such as PolyA- RNA, and H3K36me3/H3K79me2) are expected to mark all active transcription. So seeing an additional 20% over this expected 60% is not so surprising.

However, on the other end of the scale – using very strict, classical definitions of “functional” like bound motifs and DNaseI footprints; places where we are very confident that there is a specific DNA:protein contact, such as a transcription factor binding site to the actual bases – we see a cumulative occupation of 8% of the genome. With the exons (which most people would always classify as “functional” by intuition) that number goes up to 9%. Given what most people thought earlier this decade, that the regulatory elements might account for perhaps a similar amount of bases as exons, this is surprisingly high for many people – certainly it was to me!

In addition, in this phase of ENCODE we did sample broadly but nowhere near completely in terms of cell types or transcription factors. We estimated how well we have sampled, and our most generous view of our sampling is that we’ve seen around 50% of the elements. There are lots of reasons to think we have sampled less than this (e.g., the inability to sample developmental cell types; classes of transcription factors which we have not seen). A conservative estimate of our expected coverage of exons + specific DNA:protein contacts gives us 18%, easily further justified (given our sampling) to 20%.
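To make the extrapolation above concrete, here is a minimal Python sketch. It assumes a simple linear scaling (observed coverage divided by the fraction of elements sampled), which is one plausible reading of the numbers in this post and not necessarily the procedure used in the paper. The function name `extrapolated_coverage` is hypothetical.

```python
def extrapolated_coverage(observed_fraction, sampling_fraction):
    """Scale an observed genome coverage by the assumed fraction of elements sampled.

    Assumption (not stated in the post): coverage grows linearly with sampling.
    """
    if not 0 < sampling_fraction <= 1:
        raise ValueError("sampling_fraction must be in (0, 1]")
    # Coverage can never exceed the whole genome, so cap at 1.0.
    return min(1.0, observed_fraction / sampling_fraction)


# Figures quoted in the post: 9% observed (exons + specific DNA:protein
# contacts) and a "most generous" sampling estimate of about 50%.
observed = 0.09
estimate = extrapolated_coverage(observed, 0.5)
print(f"Estimated coverage at full sampling: {estimate:.0%}")
```

If the true sampling fraction is below 50%, as the post suggests it may be, the same scaling gives a larger estimate, which is why 18% is described as conservative.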

Q. [For the more statistically minded readers]: What about the whole headache of thresholding your enrichments? Surely this is a statistical nightmare across multiple assays and even worse with sampling estimates.
A. It is a bit of a nightmare, but thankfully we had a really first-class non-parametric statistical group (the Bickel group) who developed a robust, non-parametric (so it makes minimal assumptions about the distribution), conservative statistic based on reproducibility (the irreproducible discovery rate, IDR). This is not perfect: if one replicate has far better signal-to-noise than the other, it stops calling at the onset of noise in the noisier replicate, but this is generally a conservative bias. And for the sampling issues, we explored different thresholds and looked at saturation when we were relaxed on thresholds and then shifted to being conservative. Read the supplementary information and have a ball.

Q. [For 50% of the readers]: Ok, I buy the 20% of the genome is really doing something specific. In fact, haven’t a lot of other people suggested this?
A. Yes. There have been famous discussions about how regulatory changes – not protein changes – must be responsible for recent evolution, and about other locus assays (including about 10 years of RNA surveys). But ENCODE has delivered the most comprehensive view of this to date.

Q. [For the other 50% of readers]: I still don’t buy this. I think the majority of this is “biological noise”, for instance binding that doesn’t do anything.
A. I really hate the phrase “biological noise” in this context. I would argue that “biologically neutral” is the better term, expressing that there are totally reproducible, cell-type-specific biochemical events that natural selection does not care about. This is similar to the neutral theory of amino acid evolution, which suggests that most amino acid changes are not selected either for or against. I think the phrase “biological noise” is best used in the context of stochastic variation inside a cell or system, which is sometimes exploited by the organism in aspects of biology, e.g. signal processing.

It’s useful to keep these ideas separate. Both are due to stochastic processes (and at some level everything is stochastic), but these biologically neutral elements are as reproducible as the world’s most standard, developmentally regulated gene. Whichever term you use, we can agree that some of these events are “neutral” and are not relevant for evolution. This is consistent with what we’ve seen in the ENCODE pilot and what colleagues such as Paul Flicek and Duncan Odom have seen in elegant experiments directly tracking transcription factor binding across species.

Q. Ok, so why don’t we use evolutionary arguments to define “functional”, regardless of what evolution ‘cares about’? Isn’t this 5% of the human genome?
A. Anything under negative selection in the human population (i.e. recent human evolution) is definitely functional. However, even with this stated criterion, it is very hard to work out how many bases this is. The often-quoted “5%”, which comes from the mouse genome paper, is actually the fitting of two Gaussians that look at the distribution of conservation between human and mouse in 50bp windows. We’ve been referring to 5% of those 50bp windows.

The number of bases actually conserved must be lower than this, as we don’t expect 100% of the bases in these 50bp windows to be conserved. However, this is only about pan-mammalian constraint, and we are interested in all constraint in the human genome, including the lineage-specific elements, so this estimate just provides a floor to the numbers. The end result is that we don’t actually have a tremendously good handle on the number of bases under selection in humans.

Some have tried other estimates of negative selection, trying to get a handle on the more recent evolution. I particularly like Gerton Lunter’s and Chris Ponting’s estimates (published in Genome Research), which give a window of between 10% and 15% of the bases in the human genome being under selection – though I note some people dispute their methodology.

By identifying those regions likely to be under selection (because they have specific biochemical activity) in an orthogonal, experimental manner, ENCODE substantially adds to this debate. By identifying isolated, primate-specific insertions (where we can say with confidence that the sequence is unique to primates), we could contrast the bases inside ENCODE-identified regions with those outside. As ENCODE data covers the genome, we now have enough statistical power to look at the derived allele frequency (DAF) spectrum of SNPs in the human population. The SNPs inside ENCODE regions show more very low frequency alleles than the SNPs outside (accurate genome-wide frequencies are available thanks to the 1000 Genomes Project), which is a characteristic sign of negative selection and is not influenced by confounders such as mutation rate of the sequence (see Figure 1 of the main ENCODE paper).
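The comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the analysis from the paper: the 5% cut-off for "very low frequency", the function name `low_frequency_fraction`, and the toy allele frequencies are all made up for the example.

```python
def low_frequency_fraction(derived_allele_freqs, threshold=0.05):
    """Fraction of SNPs whose derived allele frequency is below `threshold`.

    The 0.05 default is an illustrative assumption, not a value from the paper.
    """
    freqs = list(derived_allele_freqs)
    if not freqs:
        raise ValueError("need at least one SNP")
    return sum(1 for f in freqs if f < threshold) / len(freqs)


# Invented toy data: derived allele frequencies for SNPs that fall
# inside annotated regions versus SNPs that fall outside them.
inside = [0.01, 0.02, 0.01, 0.30, 0.03, 0.04, 0.60, 0.02]
outside = [0.01, 0.20, 0.45, 0.30, 0.03, 0.70, 0.60, 0.15]

inside_frac = low_frequency_fraction(inside)
outside_frac = low_frequency_fraction(outside)
print(f"low-frequency fraction inside: {inside_frac:.2f}, outside: {outside_frac:.2f}")
```

An excess of rare derived alleles inside the regions (a higher fraction than outside) is the pattern the post describes as a sign of negative selection. A real analysis would use the full frequency spectrum and matched control regions instead of a single cut-off.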

We can do that across all of ENCODE, or break it down by broad sub-classification. Across all sub-classifications we see evidence of negative selection. Sadly, it is not trivial to estimate the proportion of bases from derived allele frequency spectra that are under selection, and the numbers are far more slippery than one might think. Over the next decade there will, I think, be much important reconciliation work, looking at both experimental and evolutionary/population aspects (bring on the million-person sequencing dataset!).

Q. So – we’re really talking about things under negative selection in human – is that our final definition of “functional”?
A. If it is under negative selection in the human population, for me it is definitely functional.
I, and other people, do think we need to be open to the possibility of bases that definitely affect phenotypes but are not under negative selection – both disease-related phenotypes and other normal phenotypes. My colleague Paul Flicek uses the shape of the nose as an example; quite possibly the different nose shapes are not under selection – does that mean we’re not interested in this phenotype?
Regardless of all that, we really do need a full, cast-iron set of bases under selection in humans – this is a baseline set.

Q. Do you really need ENCODE for this?
A. Yes. Imagine that THE set of bases under selection in the human genome were dropped in your lap by some passing deity. Wonderful! But you would still want to know the how and why. ENCODE is the starting place to answer the biochemical “how”. And given that passing deities are somewhat thin on the ground, we should probably go ahead and figure out models of how things work so that we can establish this set of bases. I am particularly excited about the effectiveness of using position–weight matrices in the ENCODE analyses (my postdoc Mikhail Spivakov did a nice piece of work here).

Q. Ok, fair enough. But are you most comfortable with the 10% to 20% figure for the hard-core functional bases? Why emphasize the 80% figure in the abstract and press release?
A. (Sigh.) Indeed. Originally I pushed for using an “80% overall” figure and a “20% conservative floor” figure, since the 20% was extrapolated from the sampling. But putting two percentage-based numbers in the same breath/paragraph is asking a lot of your listener/reader – they need to understand why there is such a big difference between the two numbers, and that takes perhaps more explaining than most people have the patience for. We had to decide on a percentage, because that is easier to visualize, and we chose 80% because (a) it is inclusive of all the ENCODE experiments (and we did not want to leave any of the sub-projects out) and (b) 80% best conveys the difference between a genome made mostly of dead wood and one that is alive with activity. We refer also to “4 million switches”, and that represents the bound motifs and footprints.

We use the bigger number because it brings home the impact of this work to a much wider audience. But we are in fact using an accurate, well-defined figure when we say that 80% of the genome has specific biological activity.

Q. I get really annoyed with papers like ENCODE because it is all correlative. Why don’t people own up to that?
A. It is mainly correlative, and we do own up to it. (We did do a number of specific experiments in a more “testing” manner – in particular I like our mouse and fish transgenics, but not for everything.) For example, from the main paper: “This is an inherently observational study of correlation patterns, and is consistent with a variety of mechanistic models with different causal links between the chromatin, transcription factor and RNA assays. However, it does indicate that there is enough information present at the promoter regions of genes to explain most of the variation in RNA expression.”

Interestingly enough, we had quite long debates about language/vocabulary. For example, when we built quantitative models, to what extent were we allowed to use the word “predict”? Both the model framework and the precise language used to describe the model imply a sort of causality. Similarly, we describe our segmentation-based results as finding states “enriched in enhancers”, rather than saying that we are providing a definition of an enhancer. Words are powerful things.

Q. I am still skeptical. What new insights does ENCODE offer, and are they really novel? Most of the time I think someone has already seen something similar before.  
A. I think that the scale of ENCODE – in particular the diversity of factors and assays – is impressive, and although correlative, this scale places some serious constraints on models. For example, the high, quantitative correlation between CAGE tags and histone marks at promoters limits the extent to which RNA processing changes RNA levels. (This is measured by 5’ ends – n.b. if there is a considerable amount of aborted transcription generating 5’ ends, this need not mean full transcripts, though this correlation is high both for nuclear isolated 5’ ends and cytoplasmic isolated 5’ ends.)

As for “someone has discovered it already,” I agree that the vast majority of our insights and models are consistent with at least one published study – often on a specific locus, sometimes not in human. Indeed, given the 30 years of study into transcription, I am very wary of putting forward concepts that don’t have support from at least some individual loci studies.

ENCODE has been selecting/confirming hypotheses that are broadly genome-wide, or multi-cell line true. ENCODE is a different beast from focused, mechanistic studies, which often (and rightly) involve precise perturbation experiments. Both the broader studies and the more focused studies help define phenomena such as transcription and chromatin dynamics.

This is all in the main paper, but then the network paper (led by Mike Snyder and Mark Gerstein) on transcription factor co-binding, the open chromatin distribution paper (led by Greg Crawford, Jason Lieb, John Stamatoyannopoulos), the DNaseI distribution paper (led by John Stamatoyannopoulos), the RNA distribution and processing paper (led by Roderic Guigo and Tom Gingeras) and chromatin conformation paper (led by Job Dekker) all provide non-obvious insights into how different components interact. And that’s just the Nature papers – there are another 30-odd papers to read. (We hope our new publishing innovation – “threads” – will help you navigate easily to the parts of all these papers you are most interested in reading.)

Q. You talk about how this will help medicine, but I don’t see this being directly relevant?
A. ENCODE is a foundational data set – a layer on top of the human genome – and its impact will be to make basic and applied research work faster and more cheaply. Because of our systematic, genome-wide approach, we’ve been able to deliver essential, high-quality reference material for smaller groups working on all manner of diseases. And in particular the overlap with genome-wide association studies (GWAS) has been a very informative analysis.

Q. Moving to the disease genetics, were you surprised at this correlation with GWAS, given that the current GWAS catalog is about lead association study SNPs, and we don’t expect this to overlap with functional data?

A. This was definitely a surprise to us. When I first saw this result I thought there was something wrong with some aspect of the analysis! The raw enrichment of GWAS-lead SNPs compared to baseline SNPs (e.g. those from the 1000 Genomes Project) is very striking, and yet if the GWAS-lead SNPs are expected to be tagging (but not coincident with) a functional variant, you would expect little or no enrichment.

We ended up with four groups implementing different approaches here, and all of them found the same two results. First, the early SNP genotyping chips are quite biased towards functional regions. By talking to some of the people involved in those early designs (ca. 2003), I learned that some of this is deliberate, for instance favouring SNPs near promoters. Second, even if you model this in, the enrichment of GWAS SNPs over a null set of matched SNPs is still there. The design bias is similar to that card in Monopoly: “Bank Error in your favour; please collect 10 Euros/Dollars/Pounds”. In this case, it is: “Design bias in your favour; you will have more functional variants identified in the first screen than you think”.

We think that around 10% to 15% of GWAS “lead” loci are either the actual functional SNP in the condition studied or within 200bp of the functional variant. This is all great, but we can now do something really brilliant:  break down this overall enrichment by phenotypes (from GWAS) and by functional type, in particular cell type (DNaseI) or transcription factor (TF). This matrix has a number of significant enrichments of particular phenotypes compared with factors or cell types. Some of these we understand well (e.g., Crohn’s disease and T-Helper cells); some of these enrichments are perfectly credible (e.g., Crohn’s disease and GATA-factor transcription factors); and some are a bit of a head-scratcher.

But the great thing about our data is that we didn’t have to choose a specific cell type to test or a particular disease. By virtue of being able to map both diseases and cell-specific (or transcription-factor-specific) elements to the genome, we can look across all possibilities. This will improve as we get more transcription factors and as we get better “fine mapping” of variants. For me, this result alone is totally exciting: it’s very disease-relevant, and it leverages the unbiased, open, genome-wide nature of both ENCODE and GWAS studies to point to new insights for disease.
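One cell of the phenotype-by-cell-type matrix described above could be scored along the following lines. This is a hedged sketch and not the method the four groups used: it assumes a plain hypergeometric test against a pool of matched SNPs, the function name `hypergeom_upper_tail` is hypothetical, and all counts are invented.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n SNPs are drawn from a pool of N, of which K overlap
    the element set (for example, DNaseI sites in one cell type)."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Invented counts for a single (phenotype, cell type) cell of the matrix.
N = 1000  # matched SNPs in the pool (lead SNPs plus null set)
K = 100   # pool SNPs overlapping the cell type's elements
n = 20    # lead SNPs for the phenotype
k = 8     # lead SNPs overlapping the elements

fold_enrichment = (k / n) / (K / N)
p_value = hypergeom_upper_tail(k, K, n, N)
print(f"fold enrichment: {fold_enrichment:.1f}, p-value: {p_value:.2e}")
```

Repeating this for every phenotype and every cell type or transcription factor fills in the matrix; with that many cells, a real analysis would also need a multiple-testing correction and a null set matched for the chip design bias mentioned earlier.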


Q. You make a fuss about these new publishing aspects, such as “threads”. Should I be excited?
A. I hope so! The idea of threads is a novel attempt by us to help readers get the most out of this body of coordinated scientific work. Say you are only interested in a particular topic – say, enhancers – but you know that different groups in ENCODE are likely to have mentioned this (in particular the technical papers in Genome Research and Genome Biology). Previously you would have had to skim the abstract or text of all 30 papers to try and work out which ones were really relevant. Threads offer an alternative, lighting up a path through the assembled papers, pointing out the figures and paragraphs most relevant to any of 13 topics and taking you all the way through to the original data. The threads are there to help you discover more about the science we’ve done, and about the ENCODE data. Interestingly, this is something that’s only achievable in the digital form, and for the first time I found myself being far more interested in how the digital components work than in the print components.

The idea of threads came from the consortium, but the journal editors, in particular Magdalena Skipper from Nature, made it a reality – remember that in these threads we are crossing publishing house boundaries. The resulting web site and iPad App, I think, work very well. I am going to be interested to see how other scientists react to this.

Q. And what about this Virtual Machine? Why is this interesting?
A. An innovation in computing over the last decade has been the use of virtualization, where the whole state of a computer can be saved to a file, and transported to another “host” computer and then restarted. This has given us a new opportunity to increase transparency for data intensive science.
Many people have noted that complex computational methods are very hard to track in all their detail. We currently place a lot of trust in each other as scientists that phrases such as “we then removed outliers” or “we normalised using standard methods” are executed appropriately. The ENCODE virtual machine provides all these complex details explicitly in a package that will last at least as long as the open virtualization format we use (OVF, VirtualBox). So if you are a computational biologist in three years’ time, and you want to see the precise details of how we achieved something, you can run the analysis codes yourself. The only caveat to this is that for the large, compute-scale pipelines we have an exemplar processing step, and then have the results of this parallelised (i.e. we do not have a virtualised pipeline). Think of this as a bit like the ultimate materials and methods section of the paper. I believe this virtual machine substantially increases the transparency of this data-intensive science, and that we should produce virtual machines in the future for all data-intensive papers.

Q. I’ve read your Nature commentary about large projects, and admit that I'm uneasy about how these large projects throw their weight around. Isn’t there more friction and angst than you admit to?
A. There is indeed friction and angst, in particular with the smaller groups (“hypothesis testing groups, or R01 groups”) close to the scientific areas of ENCODE. I regret every instance of this and have tried my best to make things work out. After a lot of experience, I’ve realised a couple of things: Like any large beast, projects like ENCODE can inadvertently cause headaches for smaller groups. Part of this is actually due to third parties, for example reviewers of papers or grants who mistakenly think that the large datasets in ENCODE somehow replace or make redundant more focused studies. This is rarely the case – what the large project provides is a baseline dataset that is useful mainly for people who don’t have the time or inclination to do such a study and, importantly, who would not find it practical to do this work systematically (i.e. cutting to established, promising focus areas). ENCODE’s target audience is someone who needs this systematic approach, for example clinical researchers who might scan their (putatively) causative alleles or somatic variants against such a catalog. ENCODE does not replace the targeted perturbation experiment, which illuminates some aspect of chromatin or transcriptional mechanism (sometimes in a particular disease context). However, people less involved in this work can make the mistake of lumping together the mechanistic study and the catalog building as “doing ChIP-seq”, and assume they are redundant. As scientists in this area, both large and small groups need to regularly point out their explicit and non-overlapping complementarity.

Also, compared to some other scientific fields, genomics has a remarkably positive track record in data sharing and communication. We can do far better (more below), but everyone should be mindful that for all our faults, we do share datasets completely and openly, we nearly always share resources and techniques and we do communicate. Non-genomicists would be surprised sometimes at the depths of distrust in other fields. That said, there is always room for improvement. Although we did use pre-publication raw data sharing in ENCODE, we should have spent more time and effort sharing intermediate datasets (in addition to raw datasets). The 1000 Genomes Project provides an excellent example to follow.

Finally, I believe that the etiquette-based system of how to handle pre-publication data release (and I was a prominent participant in this discussion) is clumsy and outmoded: designed for a world where data generation - not analysis - is the bottleneck. I believe we need to have a new scheme. I'm not rushing to state my own opinion here - we need to have a deliberative process that balances getting broad buy-in and ideas with a timely and practical result.

Q. So ENCODE is all done now, right?
A. Nope! ENCODE “only” did 147 cell types and 119 transcription factors, and we need to have a baseline understanding of every cell type and transcription factor. Thankfully, NHGRI has approved the idea of pushing for this – not an unambitious task – over the next 5 years. I see there being three phases of ENCODE: the ENCODE Pilot (1% of the genome); the ENCODE scale-up (or production), where we showed that we can work at this scale and analyse the data sensibly; and next the ENCODE “build-out” phase, extending to all cell types and factors.

Q. So you get to do this for another five years?
A. Someone does. I have hung up my ENCODE “cat-herder-in-chief” hat, and moved on to new things, like the equally challenging world of delivering a pan-European bioinformatics infrastructure (ELIXIR). But that’s for another blog post!

Q. Be honest. Will you miss it?
A. Looking back on my ten years with ENCODE, you know, I really am going to miss this. (Okay, maybe I won't miss three-hour teleconferences running to 2am...). It has been hard work and excellent science – I’ve met and interacted with so many great scientists and have honestly had a lot of fun.

43 comments:

Pedro Beltrao said...

I also dislike the term "biological noise" to mean elements (or interactions) that are biochemically true and reproducible but unlikely to have an impact on fitness if removed. Biologically neutral might be a good way to frame it. Unfortunately, not even lack of conservation can tell you if an element is important for fitness since you can also have compensatory changes. I don't think it will be easy to get estimates of the fraction of elements (or interactions) that are biologically neutral.

NickM said...

Hi -- thanks for the post. But, basically, you seem to be saying that you chose the 80% number for the hype, even though the best value is more like 10-20%?

What bugs me, I guess, is the constant bogus narrative -- which I just heard repeated on NPR 5 minutes ago -- along the lines of "scientists of yore were dumb and thought most of the genome was junk, but these new scientists are smart and now know that most of it isn't." I think that's just wrong, don't you? There were good reasons decades ago to think most of the genome didn't/couldn't have much of what people would reasonably call function, and those reasons still exist today. Basically, a ton of it looks like byproducts of viruses and the like, a ton of it isn't under selection, and we know, and have known for decades, that genome sizes vary widely among eukaryotes, without any clear pattern regarding organism complexity. Critters can and do dispense with much of their genomes without much impact at all. Sister species can vary in genome size by 50% or more. And many species have genomes many times larger than the human genome. Like onions. There's no particular reason to think the human genome is any different.

Have you heard of the onion test?

http://www.genomicron.evolverzone.com/2007/04/onion-test/

What's your answer? Why should anyone who knows about the above facts be happy that the public is being told that 80% of the genome is known to be functional, and the scientists who "thought" otherwise were dumb and old-fashioned?

Also, cue creationist mining of these press releases in 3, 2,...

NickM said...

Wow, I didn't even look it up when I wrote about cue-creationist-quote-mining, but:

Junk No More: ENCODE Project Nature Paper Finds "Biochemical Functions for 80% of the Genome"
http://www.evolutionnews.org/2012/09/junk_no_more_en_1064001.html

Gravelpits said...

Dear Ewan,

I cross-posted a link to this post and the main Nature article on reddit yesterday. The post made it as high as the second spot on the r/science subreddit, although it was unable to overcome another post on the supposed increased incidence of glioma in mobile phone users. :( Still, I feel happy to have at least helped to spread the news and make the science available to a larger audience. If only you guys had included some videos with cats...

Kass Schmitt said...

Wow. Congratulations on this amazing achievement! There are so many interesting and important aspects to the ENCODE project that I hardly know where to start. Actually, given that I left the field of bioinformatics 10 years ago I am inclined to focus on the bits that have implications for the practice of science in general, namely the use of 'threads' for navigating the literature and the use of virtualisation to distribute analyses (questions 17 & 18 if you'd numbered them). Lots to think about, and interesting times ahead! Thanks!

Adam Siepel said...

This is quite nice, Ewan. Thanks for taking the time.

Roderic D. M. Page said...

I like the idea of "threads" as a way of navigating through a set of papers, and unpacked the iPad app to see how they'd been put together; see http://iphylo.blogspot.co.uk/2012/09/decoding-nature-encode-ipad-app-omg-it.html. It helps that the papers were either published by Nature or open access; intellectual property issues will be a stumbling block if we want to extend the concept across a broader range of journals. But there's a lot of potential for people to create their own threads and share them (rather than bookmark a collection of papers we could bookmark a collection of fragments).

Alate_One said...

I'll have you know I'm currently arguing with an anti-evolutionist who has taken the 80% to mean evolution is impossible. Because any mutations in that 80% would affect the function of that organism, meaning there are almost no neutral mutations.

While I think the work will be very useful in the years to come, the 80% comment is going to be a nightmare.

*bangs head against wall*

JVKohl said...

The 80% comment is a nightmare for those who think in terms of "random" models, or "random mutations." It's less problematic for those who think in terms of the epigenetic effects on intracellular signaling and stochastic gene expression required for adaptive evolution via ecological, social, neurogenic, and socio-cognitive niche construction, which obviously required nutrient chemicals (food) for individual survival and their metabolism to pheromones that control reproduction and thus species survival (in microbes to man).

If, for example, the epigenetic effect of a nutrient on stochastic gene expression is controlled by the epigenetic effect of a species specific pheromone, you have a controlled network of gene interactions which could include the epigenetic tweaking of a complex system-wide 80% due to the amount of "code" required for organisms to adaptively evolve.

Adaptive evolution, in this case, does not randomly occur. Can we not therefore expect that the amount of code would reflect the epigenetic effects of nutrient chemicals and pheromones in other species that led to the advent of man?

Or is there a random model for that?

NickM said...

"Adaptive evolution, in this case, does not randomly occur. Can we not therefore expect that the amount of code would reflect the epigenetic effects of nutrient chemicals and pheromones in other species that led to the advent of man?

Or is there a random model for that?"

I'm glad you're on the 80% functional side. Please explain why various onions, ferns, salamanders, etc., need 5-80 times as much of this epigenetic stuff as humans do (their genomes are 5-80 times bigger than the human genome, depending on the species), while cheetahs, hummingbirds, pufferfish, Drosophila, etc., can get by with much less genome (and much less "epigenetics" or whatever) than humans have, sometimes 10% as much.

Andy Narain said...

NickM is a perfect example of what one scientist told me the scientific field is like: "the arrogance of ignorance."

I think most of these scientists need to read "Science, Order, and Creativity" and "Science is God", again, and again, and again...
