This is the second of three blog posts about planning,
managing and delivering a ‘big biodata’ project. Here, I share some of my
experience and lessons learned in management and analysis – because you can’t
have one without the other.
Management
1. Monitor progress – actively!
You need a good structure to monitor progress, particularly
if you’re going to have more than 10 samples or experiments. If this is a
one-off, use a program that’s well supported at your institute, like FileMakerPro or... Excel (more on this below). If you’re going to do this a lot, think about investing in a LIMS
system, as this is better suited to handling things at a high level of detail routinely.
Whatever you use, make sure your structure invites active updating – you’ll
need to stay on top of things and you don’t want to struggle to find what you need.
2. Excel (with apologies to the bioinformaticians)
Most bioinformaticians would prefer the experimental group not to use Excel for tracking, for very good reasons: Excel provides too much freedom, has extremely annoying (sometimes dangerous) "autocorrect" schemes, fails in weird ways and is often hard to integrate into other data flows. However, it is a pragmatic choice for data entry and project management due to its ubiquity and familiar interface.
Experimental group: before you set up the project tracking in Excel, discuss it with your computational colleagues, perhaps offering a bribe to soften the blow. It will help if the Excel form template comes out of discussions with both groups, and bioinformaticians can set up drop-downs with fixed text where possible and use Excel’s data entry restrictions to (kind of) bullet proof it.
One important thing with Excel: NEVER use formatting or colour as the primary store
of meaning. It is extremely hard to extract this information from Excel into other schemes. Also, two things might look the same visually (say, subtly different shades of red), but are computationally as different as red and blue. When presentation matters (often to show progress against targets), you or your colleagues can (pretty easily) knock
up a pivot table/Excel formula/visual basic solution to turn basic information
(one type in each column) into a visually appealing set of summaries.
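As a minimal sketch of that kind of summary (the file name tracking.xlsx and the column names are hypothetical), something like this turns a flat tracking sheet into a per-batch progress table:

```python
# Sketch only: summarise a flat tracking sheet (hypothetical "tracking.xlsx")
# that stores one plain value per column and no meaning in colour/formatting.
import pandas as pd

tracking = pd.read_excel("tracking.xlsx")   # needs openpyxl installed for .xlsx

# How many samples sit at each status ('received', 'sequenced', 'failed', ...)
# within each batch - the kind of summary that shows progress against targets.
summary = tracking.pivot_table(
    index="batch",
    columns="status",
    values="sample_id",
    aggfunc="count",
    fill_value=0,
)
print(summary)
```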
3. Remember planning?
When you planned the project (you did plan, right?),
you decided on which key confounders and metadata to track. So here’s where you
set things up to track them, and anything else that’s easy and potentially
useful. What’s potentially useful? It’s hard to say. Even if things look
trivial, they (a) might not be and (b) could be related to something complex that
you can’t track. You will thank yourself later, when you come to regress these things out.
4. Protect your key datasets
Have a data ‘write only’ area for storing the key datasets
as they come out of your sequencing core/proteomics core/microscopes. There are
going to be sample swaps (have you detected them? They will certainly be there in any experimental scheme with more than 20 samples), so don’t edit the received files directly! Make sure you have
a mapping file, kept elsewhere, showing the relationships between the original
data and the new fixed terms.
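A minimal sketch of what this looks like in practice, with hypothetical file and column names: the raw file is read but never rewritten, and the corrections from the separately kept mapping file are applied on the way into the analysis.

```python
# Sketch only: raw files stay untouched; corrections live in a separate
# mapping file (hypothetical names and columns) and are applied at read time.
import pandas as pd

raw = pd.read_csv("raw_counts.csv")          # as delivered by the core - never edited
mapping = pd.read_csv("sample_mapping.csv")  # columns: original_id, corrected_id

fixed = raw.merge(mapping, how="left", left_on="sample_id", right_on="original_id")
# Keep the original ID wherever no correction has been recorded.
fixed["sample_id"] = fixed["corrected_id"].fillna(fixed["sample_id"])
fixed = fixed.drop(columns=["original_id", "corrected_id"])
fixed.to_csv("analysis/counts_remapped.csv", index=False)
```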
5. Be meticulous about workflow
Keep track of all the individual steps and processes in your
analysis. At any point, it should be possible to trace individual steps back to
the original starting data and repeat the entire analysis from start to finish.
My approach is to make a new directory with soft-links for each ‘analysis prototyping’, then lock down components for a final run. Others make heavy use of iPython notebooks – you might well have your own tried-and-tested approach. Just make sure it’s tight.
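For what it’s worth, the soft-link approach can be as simple as this sketch (the paths are illustrative): a fresh prototyping directory that points back at the untouched raw data.

```python
# Sketch only: a new prototyping directory of soft-links back to the
# write-only raw area (paths are illustrative).
import os

RAW = "/data/project/raw"           # original files, never modified
PROTO = "/data/project/proto_run1"  # one directory per analysis prototype

os.makedirs(PROTO, exist_ok=True)
for fname in os.listdir(RAW):
    link = os.path.join(PROTO, fname)
    if not os.path.exists(link):
        os.symlink(os.path.join(RAW, fname), link)
```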
6. “Measure twice, cut once”
If you are really, really careful in the beginning, the
computational team will thank you, and may even forgive you for using Excel. Try
to get a test dataset out and put it all the way through as soon as possible.
This will give you time to understand the major confounders to the data, and to tidy things up before the full analysis.
You may be tempted to churn out a partial (but in a more limited sense ‘complete’) dataset early, perhaps even for a part-way publication. After some experience playing this
game, my lesson learned is to go for full randomisation every time, and not to have a partial, early dataset that breaks the randomisation of the samples against time or key reagents. The alternative is to commit to a separate, early pilot experiment, which explicitly will not be combined with the main analysis. It is fine
for this preliminary dataset to be mostly about understanding confounders and
looking at normalisation procedures.
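As a tiny illustration of what “full randomisation” means in practice (the sample names and batch sizes are made up): shuffle first, then assign to batches, and record the seed alongside the design.

```python
# Sketch only: shuffle samples before assigning them to processing batches so
# that batch/time never lines up with the biology (names/sizes are made up).
import numpy as np

rng = np.random.default_rng(seed=20240101)   # record the seed with the design
samples = [f"S{i:03d}" for i in range(96)]
shuffled = rng.permutation(samples)

batch_size = 24
for b, start in enumerate(range(0, len(shuffled), batch_size), start=1):
    print(f"Batch {b}:", ", ".join(shuffled[start:start + batch_size]))
```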
7. Communicate
It is all too easy to gloss over the social aspect of this
kind of project, but believe me, it is absolutely essential to get this right.
Schedule several in-person meetings with ‘people time’ baked in (shared dinner,
etc.) so people can get to know each other. Have regular phone calls involving
the whole team, so people have a good understanding of where things stand at any given time. Keep a Slack channel or an email list open for all of those little exchanges that
help people clarify details and build trust.
Of course there will be glitches – sometimes quite serious –
in both the experimental work and the analysis. You will have to respond to
these issues as a team, rather than resorting to finger-pointing. Building
relationships on regular, open communication raises the empathy level and helps
you weather technical storms, big and small.
Analysis
1. You know what they say about ‘assume’
Computational team: Don’t
assume your data is correct as received – everything that can go wrong,
will go wrong. Start with unbiased clustering (heat-maps are a great
visualisation) and let the data point to sample swaps or large issues. If you
collect data over a longer period of time, plot key metrics v. time to see if
there are unwanted batch/time effects. For sample swaps, check things like
genotypes (e.g. RNAseq-implied to sample-DNA genotypes). If you have mixed
genders, a chromosome check will catch many sample swaps. Backtrack any
suspected swaps with the experimental team and fail suspect samples by default. Sample swaps are the same as bugs in analysis code - be nice to the experimental team so they will be nice when you have a bug in your code.
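As a rough sketch of the kind of cheap checks I mean (hypothetical expression matrix and metadata file, illustrative gene names, and a deliberately crude sex call):

```python
# Sketch only: two cheap checks on a hypothetical expression matrix
# (genes x samples) and a metadata table with a reported 'sex' column.
import pandas as pd
import seaborn as sns

expr = pd.read_csv("expr.csv", index_col=0)       # rows: genes, columns: samples
meta = pd.read_csv("metadata.csv", index_col=0)   # indexed by sample ID

# 1. Crude sex check: XIST high in females, Y-linked RPS4Y1 high in males.
inferred = pd.Series(
    ["F" if x > y else "M" for x, y in zip(expr.loc["XIST"], expr.loc["RPS4Y1"])],
    index=expr.columns,
)
mismatches = inferred[inferred != meta.loc[expr.columns, "sex"]]
print("Possible sample swaps (sex mismatch):", list(mismatches.index))

# 2. Unbiased clustering: sample-sample correlation heat-map; outliers and
# unexpected groupings often point at swaps or batch effects.
sns.clustermap(expr.corr(), cmap="vlag")
```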
Experimental team: Don’t
assume the data is correct at the end of an analytical process. Even with
the best will in the world, careful analysis and detailed method-testing, mistakes are inevitable, so flag results that don’t feel right to you. Repeat appropriate sample-identity checks at key time
points. At absolute minimum, you should perform checks after initial data receipt
and before data release.
2. One thing you can assume
You can safely assume that there are many confounders to
your data. But thanks to careful planning, the analysis team will have all
the metadata the experimental team has lovingly stored to work with.
Work with unsupervised methods (e.g. straight PCA; we’re also very fond of PEER in my group), and correlate the components with the known covariates. Put the big ones in
the analysis, or even regress them out completely (it’s usually best to put
them in as terms in the downstream analysis). Don’t be surprised by strongly
structured covariates that you didn’t capture as metadata. Once you have
convinced yourself that you are really not interested, move on.
(Side note on PCA and PEER: take out the means first, and
scale. Otherwise, your first PCA component will be the means, and everything else will have to be orthogonal to that. PEER, in theory, can handle that non-orthogonality, but it’s a big ask, and the means in particular are best removed. This means it is all wrapped up with normalisation, below.)
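A minimal sketch of the PCA-and-covariates step, with hypothetical file names (PEER is a separate tool and not shown here): centre and scale, take the top components, then correlate them with the known covariates.

```python
# Sketch only: centre/scale, take top principal components, then correlate
# them with the known covariates (file and column names are hypothetical).
import pandas as pd
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("expression_matrix.csv", index_col=0)   # samples x features
covariates = pd.read_csv("covariates.csv", index_col=0)    # samples x numeric covariates

scaled = StandardScaler().fit_transform(data)              # remove means, unit variance
pcs = PCA(n_components=10).fit_transform(scaled)

# Flag covariates that line up strongly with any of the top components.
for i in range(pcs.shape[1]):
    for cov in covariates.columns:
        rho, p = spearmanr(pcs[:, i], covariates.loc[data.index, cov])
        if p < 0.01:
            print(f"PC{i + 1} ~ {cov}: rho = {rho:.2f}, p = {p:.1e}")
```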
3. Pay attention to your reagents
Pay careful attention to key reagents, such as antibody or
oligo batches, on which your analysis will rely. If they are skewed, all sorts
of bad things can happen. If you notice your reagent data is skewed, you’ll have
to make some important decisions. Your carefully prepared randomisation
procedure will help you here.
4. The new normal
It is highly unlikely that the raw data can just be put into downstream analysis schemes - you will need to normalise. But what is your normalisation procedure? Lately, my mantra is,
“If in doubt, inverse normalise.” Rank the data, then project those ranks back
onto a normal distribution. You’ll probably lose only a bit of power – the
trade-off is that you can use all your normal parametric modelling without
worrying (too much) about outliers.
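For concreteness, this is roughly what “inverse normalise” means for a single feature (a small sketch, not a library call):

```python
# Sketch only: rank-based inverse normal transform for a single feature.
import numpy as np
from scipy.stats import norm, rankdata

def inverse_normalise(values):
    """Rank the values, then map the ranks back onto a standard normal."""
    values = np.asarray(values, dtype=float)
    ranks = rankdata(values)                 # 1..n, ties get averaged ranks
    quantiles = (ranks - 0.5) / len(values)  # squeeze ranks into (0, 1)
    return norm.ppf(quantiles)

# A skewed vector with an outlier becomes well-behaved after the transform.
x = np.array([0.1, 0.2, 0.2, 0.4, 0.9, 12.0])
print(inverse_normalise(x))
```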
You need to decide on a host of things: how to correct for
lane biases, GC, library complexity, cell numbers, plate effects in imaging. Even
using inverse normalisation, you can do this in all sorts of ways (e.g. in a genome direction or a
per-feature direction – sometimes both) so there are lots of options, and no automatic flow chart about how to select the right option.
Use an obvious technical normalisation to start with (e.g. read
depth, GC, plate effects), then progress to a more aggressive normalisation (i.e.
inverse normalisation). When you get to interpretation, you may want to present
things in the lighter, more intuitive normalisation space, even if the
underlying statistics are more aggressive.
You’ll likely end up with three or four solid choices through this flow chart. Choose
the one you like on the basis of first-round analysis (see below). Don’t get
hung up on regrets! But if you don’t discover anything interesting, come back to this point and choose again. If you take a more paranoid approach, carrying two normalisation schemes through the analysis will give you a bit of extra security - strong results should not change much between different "reasonable" normalisation approaches.
5. Is the Data good?
Do a number of ‘data is good’ analyses.
- Can you replicate the overall gene-expression results?
- Is the SNP Tv/Ts rate good?
- Is the number of rare variants per sample as expected?
- Do you see the right combination of mitotic-to-nonmitotic cells in your images?
- Where does your dataset sit, when compared with other previously published datasets?
These answers can guide you to the ‘right’ normalisation strategy - so flipping between normalisation procedures and these sorts of "validation" analyses helps make the choice of the normalisation.
6. Entering the discovery phase
‘Discovery’ is a good way to describe the next phase of the
analysis, whether it’s differential-expression or time-course or GWAS. This is where one needs to have quite a bit more discipline in how to handle the statistics.
First, use a small (but not too small) subset of the data to test your pipelines (in Human, I am fond of the small, distinctly un-weird chromosome 20). If you can make a QQ plot, check that the QQ plot looks good (i.e. calibrated). Then, do the whole pipeline.
7. False-discovery check
Now you’re ready to apply your carefully thought-through ‘false
discovery rate’ approach, ideally without fiddling around. Hopefully your QQ plot looks good (calibrated with a kick at the end), and you can roll out false discovery control now. Aim to do this just once (and when that happens, be very proud).
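A minimal sketch of the QQ-plot-then-FDR step, using made-up p-values and standard Python tooling (your own pipeline will differ):

```python
# Sketch only: QQ plot of observed vs. expected -log10 p-values, then a single
# round of Benjamini-Hochberg FDR control (p-values here are made up).
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.multitest import multipletests

pvals = np.random.uniform(size=10_000)          # stand-in for your test p-values

observed = -np.log10(np.sort(pvals))
expected = -np.log10((np.arange(1, len(pvals) + 1) - 0.5) / len(pvals))

plt.scatter(expected, observed, s=4)
plt.plot([0, expected.max()], [0, expected.max()], color="red")  # calibration line
plt.xlabel("Expected -log10(p)")
plt.ylabel("Observed -log10(p)")
plt.show()

# A calibrated test hugs the red line, with a 'kick' at the end if there is
# real signal. Then apply false discovery control once:
reject, qvals, _, _ = multipletests(pvals, alpha=0.1, method="fdr_bh")
print("Hits at 10% FDR:", reject.sum())
```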
8. There is no spoon
At this point you will either have some statistically
interesting findings above your false discovery rate threshold, or you won’t have anything above threshold. In neither
case should you assume you are successful or unsuccessful. You are not there
yet.
9. Interesting findings
You may have nominally interesting results, but don’t trust
the first full analysis. Interesting results are often enriched for errors and artefacts from earlier in your process. Be paranoid about the underlying variant
calls, imputation quality or sample issues.
Do a QQ plot (quantile-quantile plot of the P values,
expected v. observed). Is the test well calibrated (i.e. the QQ plot starts on the expected == observed line, with a kick at the end)? If you can’t do a straight-up
QQ plot, carry out some close alternative so you can get a frequentist P value out. In my experience, a
bad QQ plot is the easiest way to spot dodgy whole-genome statistics.
Spot-check that things make sense up to here. Take one or
two results all the way through a ‘manual’ analysis. Plot the final results so
you can eyeball outliers and interesting cases. Plot in both normalisation spaces (i.e. ‘light’ and
aggressive/inverse).
For genome-wide datasets, have an ‘old hand’ at genomes/imaging/proteomics eyeball either all results or a random subset on a browser. When weird things pop up ("oh, look, it’s always in a zinc finger!"), they might offer an alternative (and hopefully still interesting, though often not) explanation. Talk with colleagues who have done similar things, and listen to the war stories of nasty, subtle artefacts that mislead us all.
10. ‘Meh’ findings
If your results look uninteresting:
- Double check that things have been set up right in your pipeline (you wouldn’t be the first to have a forehead-smacking moment at this point).
- Put dummy data that you know should be right into the discovery pipeline to test whether it works.
- Triple-check all the joining mechanisms (e.g. the genotype sample labels with the phenotype).
- Make sure incredibly stupid things have not happened – like the compute farm died silently, and with spite, making the data files look valid when they are in fact… not.
11. When good data goes bad
So you’ve checked everything, and confirmed that nothing obvious
went wrong. At this point, I would allow myself some alternative normalisation approaches, FDR thresholding or confounder manipulation. But stay disciplined here.
Start a mental audit of your monkeying around (penalising yourself appropriately in your FDR). I normally allow four or five trips on the normalisation merry-go-round or on the “confounders-in-or-out” wheel. What I really want out of these rides is to see a P value / FDR rate that’s around five-fold better than a default threshold (of, say 0.1 FDR, so hits at 0.02 FDR or better).
Often you are struggling here with the multiple testing burden if there is a genome-wide scan. If you are not quite there with your FDRs, here are some tricks: examine whether just using protein-coding genes will help the denominator, and look at whether restricting by expression level/quantification helps (i.e. removing lowly expressed genes which you couldn't find a signal in anyway).
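A small sketch of shrinking the denominator before FDR control, with an illustrative results table; the key point is that the filter is decided on expression level and biotype, never on the p-values themselves.

```python
# Sketch only: reduce the testing denominator before FDR control by dropping
# lowly expressed genes and restricting to protein-coding (illustrative table
# with columns gene, mean_expression, biotype, pvalue).
import pandas as pd
from statsmodels.stats.multitest import multipletests

results = pd.read_csv("discovery_results.csv")

# The filter uses expression and biotype alone, not the p-values.
keep = (results["mean_expression"] > 1.0) & (results["biotype"] == "protein_coding")
filtered = results[keep].copy()

# Fewer tests means a less harsh Benjamini-Hochberg correction on what remains.
filtered["qvalue"] = multipletests(filtered["pvalue"], method="fdr_bh")[1]
print(f"Tested {len(filtered)} of {len(results)} genes; "
      f"{(filtered['qvalue'] < 0.1).sum()} hits at 10% FDR")
```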
You may still have nothing over threshold. So, after a drink/ice cream, open up your plan to the “Found Nothing Interesting” chapter (you did that
part, right?) and follow the instructions.
Do stop monkeying around if you can’t get to that 0.02 FDR. You
could spend your career chasing will-o-the-wisps if you keep doing this. You
have to be honest with yourself: be prepared to say “There’s nothing here.” If
you end up here, shift to salvage mode (it’s in the plan, right?).
12. But is it a result?
Hopefully you have something above threshold, and are pretty
happy as a team. But is it a good biological result? Has your FDR
merry-go-round actually been a house of mirrors? You don’t want to be in any
doubt when you go to pull that final trigger on a replication / validation experiment.
It may seem a bit shallow, but look at the top genes/genomic
regions, and see if there is other interesting, already-published data to
support what you see. I don't, at this point, trust GO analysis (which is often "non-random"), but the Ensembl phenotype-per-gene feature is really freakily useful (in particular its ‘phenotypes in orthologous genes’ section), as is the UniProt comments section. Sometimes you stumble across a completely amazing confirmation at this point, from a previously published paper.
But be warned:
humans can find patterns in nearly everything – clouds, leaf patterns,
histology, and Ensembl/UniProt function pages. Keep yourself honest by inverting the list of genes, and
look at the least interesting genes
from the discovery process. If the story is just as consistent at the bottom as at the top, I’d be sceptical that this list actually provides confirmation. Cancer
poses a particular problem: half the genome has been implicated in one type of cancer
or another by some study.
Sometimes, though, you just have a really clean discovery dataset with no previous literature support, and you need to go ahead with the replication without any extra confidence that your statistics are confirming something valuable.
13. Replication/Validation
Put your replication/validation strategy into effect. You might have
baked it into your original discovery. Once you are happy with a clean (or as
clean as you can get) discovery and biological context, it’s time to pull the
trigger on the strategy. This can be surprisingly time consuming.
If you have specific follow-up experiments, sort some of
them out now and get them going. You may also want to pick out some of the juiciest
results to get additional experimental data to show them off. It’s hard to
predict what the best experiment or analysis will be; you can only start
thinking about these things when you get the list.
My goal is for the replication / validation experiments to be as unmanipulated as possible, and you should be confident that they will work. It's a world of pain when they don't!
14. Feed the GO
With the replication/validation strategy underway, your analysis can
now move onto broader things, like the dreaded GO enrichments. Biology is very
non-random, so biological datasets will nearly always give some sort of
enriched GO terms. There are weird confounders both in terms of genome
structure (e.g. developmental genes are often longer on the genome) and
confounders in GO annotation schemes.
Controlling for all this is almost impossible, so this is
more about gathering hints to chase up in a more targeted analysis. Or to
satisfy the “did you do GO enrichment?” question that a reviewer might ask. Look at other things, like
related datasets, or orthologous genes. If you are in a model organism, Human
is a likely target. If you are in Human, go to mouse, as the genome-wide
phenotyping in mouse is pretty good now. Look at other external datasets you can bring in, for example chromatin states on the genome, or lethality data in model organisms.
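For reference, the arithmetic behind a basic GO-style enrichment test is just a hypergeometric overlap, sketched here with made-up numbers; real tools layer multiple-testing control (and sometimes length corrections) on top, and this sketch ignores the gene-length and annotation confounders above entirely.

```python
# Sketch only: the overlap arithmetic behind a basic GO-style enrichment test,
# with made-up numbers. No multiple-testing control, no bias corrections.
from scipy.stats import hypergeom

M = 20000   # genes in the background set
n = 300     # background genes annotated with the term of interest
N = 150     # genes in your hit list
k = 12      # hits carrying the annotation

# P(X >= k): chance of at least this much overlap under random draws.
p_enrich = hypergeom.sf(k - 1, M, n, N)
print(f"Hypergeometric enrichment p = {p_enrich:.2e}")
```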
15. Work up examples
Work up your one or two examples, as these will help people
understand the whole thing. Explore data visualisations that are both appealing
and informative. Consider working up examples of interesting, expected and even
boring cases.
16. Serendipity strikes!
Throughout this process, always be ready for serendipity to strike. What might look
like a confounder could turn out to be a really cool piece of biology – this
was certainly the case for us, when we were looking at CTCF across human individuals and found a really interesting CTCF behaviour involved in X inactivation.
My guess is that serendipity has graced our group in one out of every ten or
so projects – enough to keep us poised for it. But serendipity must be
approached with caution, as it could just be an artefact in your data that simply
lent itself to an interesting narrative. If you’ve made an observation and
worked out what you think is going on, you almost have to create a new
discovery process, as if this was your main driver. It can be frustrating,
because you might now not have the ideal dataset for this biological question. In the worst case, you might have to set
up an entirely new discovery experiment.
But often you are looking at a truly interesting phenomenon
(not an artefact). In our CTCF paper, the very allele-specific behaviour of two
types of CTCF sites we found was the clincher: this was real (Figure 5C). That was a
glorious moment.
17. Confirmation
When you get the replication dataset in, slot it into place.
It should confirm your discovery. Ideally, the replication experiments fit in
nicely. Your cherry-on-the-cake experiment or analysis will show off the top
result.
18. Pat on the back if it is boring
The most important thing to know is that sometimes, nothing
too interesting will come out of the data. Nobody can get a cool result out of every large-scale experiment. These will be sad moments for you and the team, but be
proud of yourself when you don’t push a dataset too far - and for students and postdocs, this is why having two projects is often good. You can still publish an
interesting approach, or a call for larger datasets. It might be less exciting,
but it’s better than forcing a result.