Thursday, 30 August 2007

Orthologs and Paralogs

I am sitting in a talk (Interactome meeting) and the speaker is using InParanoid orthologs. At Ensembl we've adopted the TreeFam scheme for ortholog definition, and after a lot of sweat to create statistics that assess the difference between ortholog sets, there is not a huge difference between InParanoid and TreeFam/Ensembl ortholog calls. (TreeFam/Ensembl is a little better, of course, but it always amazes me how good "simple" approaches can be).

But the real benefit in the TreeFam scheme is the use of genuine phylogenetic trees rather than just ortholog lists. The tree is the best way to represent the evolution of the gene family. At Ensembl we annotate internal nodes of the tree as either speciation or duplication nodes. From this one can ask far more sophisticated questions than just "which human gene is the ortholog of this gene in drosophila". One can ask for example "what are the ancient paralogs of this human gene due to the presumed whole genome duplication in vertebrates" or "for this expanded gene family, which genes would be present in the putative eutherian ancestor".
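
To make the idea concrete, here is a toy sketch (not Ensembl or TreeFam code) of how annotated internal nodes turn into ortholog and paralog calls: two genes are treated as orthologs if their last common ancestor in the tree is a speciation node, and as paralogs if it is a duplication node. The `Node` class, the gene names and the `classify_pairs` function are all made up for illustration.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Optional


@dataclass
class Node:
    # Leaves carry a species; internal nodes carry an event type.
    name: str
    species: Optional[str] = None
    event: Optional[str] = None  # "speciation" or "duplication" (internal nodes only)
    children: list = field(default_factory=list)


def leaves(node):
    """Return all leaf nodes below (or equal to) this node."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def classify_pairs(node, relations=None):
    """Label each pair of leaves by the event at their last common ancestor."""
    if relations is None:
        relations = {}
    # Leaves in different child subtrees have this node as their last common ancestor.
    for left, right in combinations(node.children, 2):
        for a in leaves(left):
            for b in leaves(right):
                label = "ortholog" if node.event == "speciation" else "paralog"
                relations[frozenset((a.name, b.name))] = label
    for child in node.children:
        classify_pairs(child, relations)
    return relations


# Hypothetical family: one fly gene, two human genes from a duplication
# that happened after the human/fly split.
human_a = Node("humanA", species="human")
human_b = Node("humanB", species="human")
fly_x = Node("flyX", species="fly")
dup = Node("dup1", event="duplication", children=[human_a, human_b])
root = Node("root", event="speciation", children=[dup, fly_x])

relations = classify_pairs(root)
for pair, label in sorted(relations.items(), key=lambda item: sorted(item[0])):
    print(sorted(pair), label)
```

In this toy family both human genes come out as orthologs of the fly gene and as paralogs of each other, which is exactly the kind of many-to-one relationship a flat ortholog list struggles to express.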

We visualise trees using GeneTreeView:

http://www.ensembl.org/Homo_sapiens/genetreeview?db=core;gene=ENSG00000120306


These trees are nice to see, but are now a bit unwieldy due to the number of species in Ensembl - we need to have options to show "just these species".

Saturday, 18 August 2007

Blogging from Northumberland

Here I am, working away (finished up a strategy paper, now working on a grant) in Northumberland.

Northumberland? Where's Northumberland, I hear you ask. It's the most northern county in England, further north than quite a bit of Scotland due to the sharp south-west to north-east border line. It's England's least populous area, wild and untouched.

And, yes, I've got broadband. BT will ensure that anyone can get broadband, and when we applied for it they had to run a new line, complete with DSL repeaters (I think), down the valley. Add a bit of wireless (just got a wireless extender that works via the powerline) and... one wired house.

The connection is fine for email and web; a little jumpy for X sessions, but given the location, nothing to complain about.

So. Here I am, glass of wine in hand, blogging from the most remote place in England.

Friday, 17 August 2007

Sequence align view

A recent addition to Ensembl has been sequence alignview, to handle resequencing information. An example link is:

http://www.ensembl.org/Homo_sapiens/sequencealignview?gene=ENSG00000139618;individuals=HuAA;individuals=HuBB;individuals=HuCC

The framework for this data has been in place for a while. Now we have probably the most obvious display of this - a multiple alignment of individuals or strains. For human individuals, as well as the 4 "Celera" humans, we will have Craig Venter's genome and Jim Watson's genome in soon. (There has been a persistent rumour that one of the 4 Celera individuals was Craig, so that probably gives us 5 individuals overall, and only two, Craig and Jim, with high enough coverage to call heterozygote positions).


This differs from SNP data in one crucial way. One knows the difference between a base which is the same as the reference and a base which is not ascertained. This is critical for a bunch of applications. There are a whole bunch of headaches - aligning this many reads is an engineering challenge first off, then dealing with issues about structural variants and heterozygote calling is non-trivial. But it is definitely the way the world is going, and this framework allows us to handle resequencing data in humans - and other species - elegantly.
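
A tiny sketch of why that distinction matters, under made-up conventions (this is not how Ensembl stores the data): an individual's sequence is given as a string aligned to the reference, with "-" standing for a position that was not covered by any read. The function name and the three state labels are invented for the example.

```python
def compare_to_reference(reference, individual, no_coverage="-"):
    """Classify each aligned position as 'same', 'variant' or 'unknown'.

    `no_coverage` is an assumed marker for positions with no read coverage.
    """
    if len(reference) != len(individual):
        raise ValueError("sequences must be aligned to the same length")
    states = []
    for ref_base, base in zip(reference, individual):
        if base == no_coverage:
            # Not ascertained: we cannot say whether it matches the reference.
            states.append("unknown")
        elif base == ref_base:
            states.append("same")
        else:
            states.append("variant")
    return states


reference = "ACGTAC"
individual = "AC-TGC"  # hypothetical resequenced individual
states = compare_to_reference(reference, individual)
print(states)
```

With a plain SNP list, the third position here would be indistinguishable from a position that matches the reference; with resequencing data it is explicitly "unknown".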

This is a bit of a distraction for me - I should be doing other things, but I am between two quite complex documents (reviewing a long article and writing yet another strategy document) so I decided to dip my foot into blogging again. There are two real motivations for this. Firstly, there are quite a few "general" things that I muse about which I would like other people to easily get access to - currently the people who get to give me feedback on some of these ideas are those I happen to have coffee with. Secondly, blogs are clearly a way to keep the brands one knows and loves high in the Google ranks and possibly also high in people's online surfing habits, and that's something I want to do - especially for Ensembl, but also for my other projects: Reactome and the projects I've grandfathered from - Pfam, Bioperl and the rest. So - expect numerous musings for both reasons.

Right. Now back to "real" work.