Biology has changed a lot
over the past decade, driven by ever-cheaper data-gathering technologies:
genomics, transcriptomics, proteomics, metabolomics and imaging of all sorts. After
a few years of gleeful abandon in the data generation department, analysis has
come to the fore, demanding a whole new outlook and ongoing collaboration
between scientists, statisticians, engineers and others who bring a very broad range of
skills and experience to the table.
Finding meaning in these beautiful datasets and connecting
them up, particularly when they are extremely different from one another, is a detail-riddled
journey fraught with perils. Innovation is happening so quickly that trusty
guides are rather thin on the ground, so I’ve tried to put down some of my hard-won
experience, mistakes and all, to help you plan, manage, analyse and deliver these
projects successfully.
Without up-front planning, you won’t really have much of a
‘project’. Throwing yourself into data gathering just because it’s ‘cheap’ or ‘possible’
is a poor way to start (I’ve seen this happen a number of times and, embarrassingly, I’ve done it myself). ‘Wrong’ experiments are time vampires: they
will slurp up a massive amount of your time and energy, and may expose
you to reputational risk if you are tempted to force a result out of
a dataset.
This post, the first of three, is about having the strongest
possible start for your project via good planning.
1. Buddy up
In the olden days, experimental biologists would generate a
bunch of data and then ask a bioinformatician how to deal with it. Well,
that didn’t work too well. We have all since learned that, at the very outset of a project, you
need to ensure there are two PIs: one to focus on the
experimental/sample-gathering side, and one to keep the analysis in their sights at all
times. These two PIs must have a healthy mutual respect and be motivated by the same overall goal. There are a few rare individuals who can honestly be described as being
both experimental and computational, but in most cases you’ll need two people
to make sure both perspectives are represented in the study’s design.
Now, I’m not saying that experimentalists are strangers to
analysis, or that bioinformaticians are strangers to data generation. It’s just
important to acknowledge that being able to ‘talk the talk’ of another
discipline does not, on its own, qualify you to manage that end of the project,
with all its complexities, gotchas and signature fails.
As with anything you set out to do for a couple of years,
you’ll need to make sure you are working with someone you get on with. There
will be tense moments, and you’ll get past them if your co-PI shares your
motivation and goal. Provided you get on and share information as you go,
buddying up will save you resources in the long run.
Note to Experimental PIs: never assume you’ll be able to tack on an
analytic collaboration at the end, after you’ve gathered the data. You don’t
want to be caught out by not having considered some important analysis aspect.
Note to Computational PIs: never assume you can delegate sample
management and experimental details to a third party through facility
technicians. You know there is a huge difference between experimental data and
good experimental data. For a successful project, you will need a trusted experimental partner who
understands all the relevant confounders and lab processes, and who can spot a serendipitous result if one pops out.
2. Outlining
The idea that you can generate datasets first and then watch
your results emerge from the depths is simply misguided. It is really quite
painful (and wasteful) when a dataset doesn’t have what it needs to support an
analysis – it is a set-up for forcing results. Before you do anything, have a brief
discussion with your co-PI about the main questions you are looking to answer
and make a high-level sketch of the project.
I’m not talking about a laboured series of chapter outlines
– the main thing is to determine the central question. Large-scale
data-gathering projects often focus on basic, descriptive things, like, “How much
of phenomenon X do we see under Y or Z conditions?” Sometimes the questions are
more directed, for example, “How does mitosis coordinate with chromosomal
condensation?”
Outlining your hypotheses need only be as simple as, “At the end, we will have a list of proteins in the Q process.” If you’re hoping to test a hypothesis, aim for something straightforward, like, “I believe the B process is downstream of the Ras process.”
Consider your possible hypothesis-testing modes, but avoid trying too hard to imagine where the analysis might take you; your data and analysis might not agree with your preconceptions in the end.
Also, do not commit to a specific follow-up strategy too early! Your follow-up strategy should be determined after you have explored your initial analysis or performed your pilot study.
3. Back-of-the-envelope ‘power calculations’
Take some of the anxiety out of the process by doing a rough
calculation before getting into things too deeply. If you (or someone else) have
done a similar analysis well in the past, simply use their analysis as a basis
for your rough estimate. If you are on completely new ground, make sure you
factor in false positives (e.g. mutation calls, miscalled allele-specific
events, general messiness) and pay careful attention to frequencies (e.g. alleles,
rare cell types).
Many a bad project could have been stopped in its tracks by a
half hour’s worth of power analysis. Unless you really need to impress
reviewers, you probably don’t need to go overboard – just make a quick sketch.
But be honest with yourself! It is all too easy to fudge the numbers in a power
analysis to get an answer you want. Use it as a tool for looking honestly at what sort of results you could expect.
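As a rough illustration of how little effort this takes, here is a minimal sketch of a back-of-the-envelope sample-size estimate for a simple two-group comparison of means. It is not a substitute for a proper power analysis: it uses a normal approximation, assumes a standardised effect size (difference in means divided by the standard deviation), and applies a blunt Bonferroni-style correction for the number of tests. The function name and the example numbers are my own assumptions, chosen purely for illustration.

```python
import math
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.8, n_tests=1):
    """Rough per-group sample size for a two-group comparison of means.

    Normal approximation: n = 2 * ((z_alpha + z_power) / d) ** 2,
    where d is the standardised effect size. The significance level is
    divided by n_tests (Bonferroni-style) to allow for multiple testing.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    standard_normal = NormalDist()
    # Two-sided test at the (corrected) significance level.
    z_alpha = standard_normal.inv_cdf(1 - (alpha / n_tests) / 2)
    z_power = standard_normal.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Hypothetical scenario: one test versus a genome-wide scan of 20,000 genes.
for d in (0.5, 1.0, 2.0):
    single = samples_per_group(d)
    genome_wide = samples_per_group(d, n_tests=20_000)
    print(f"effect size {d}: {single} per group (1 test), "
          f"{genome_wide} per group (20,000 tests)")
```

Running this with a few honest guesses at the effect size shows quickly whether the planned sample numbers are in the right ballpark, and how sharply the requirement grows once you test genome-wide.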
4. Get logistical
Plan the logistics according to Sod’s Law. Assume everything
that can go wrong will go wrong at least once. This is particularly important if
you are scaling up, for example moving an assay from single-well/Eppendorf to an
array.
For assays, give yourself at least a year for scale-up in
the lab (better still, do a pilot scale-up with publication before moving on to the real
thing). Pad out all sample acquisition with at least three months for general
monkeying around.
5. Have a healthy respect for confounders
Think about the major confounders you will encounter
downstream, and randomise your experimental flow accordingly. That is, do not just do all of state X first,
then progress to state Y, then Z.
Make sure you store all the known confounders (e.g. antibody
batch number, day of growth). Try to work off single antibody batches/oligo
batches for key reagents. If you know you will need more than one batch, remember
the randomisation! You absolutely do not want the key reagent batch being
confounded with your key experimental question, i.e. normal with batch 1,
disease with batch 2. Disaster!
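To make the randomisation concrete, here is a minimal sketch of one way to lay out a run sheet so that reagent batch is balanced across experimental states and the run order is shuffled. The sample names, the two-state design and the function name are hypothetical; a real design may also need to balance other known confounders (day of growth, operator, plate position).

```python
import random
from collections import Counter


def assign_batches(samples, n_batches, seed=0):
    """Spread each condition evenly over reagent batches, then shuffle run order.

    samples: list of (sample_id, condition) pairs.
    Returns a list of dicts with keys: sample, condition, batch, run_order.
    """
    rng = random.Random(seed)  # fixed seed so the run sheet is reproducible

    # Group sample ids by condition.
    by_condition = {}
    for sample_id, condition in samples:
        by_condition.setdefault(condition, []).append(sample_id)

    plan = []
    offset = 0  # carries the round-robin on so batch sizes stay even overall
    for condition in sorted(by_condition):
        ids = by_condition[condition]
        rng.shuffle(ids)
        # Deal shuffled samples round-robin so every batch sees every condition.
        for i, sample_id in enumerate(ids):
            batch = (i + offset) % n_batches + 1
            plan.append({"sample": sample_id, "condition": condition, "batch": batch})
        offset += len(ids)

    # Randomise processing order so state X is not simply run before state Y.
    rng.shuffle(plan)
    for order, row in enumerate(plan, start=1):
        row["run_order"] = order
    return plan


# Hypothetical example: six normal and six disease samples, two antibody batches.
samples = [(f"N{i}", "normal") for i in range(1, 7)]
samples += [(f"D{i}", "disease") for i in range(1, 7)]
plan = assign_batches(samples, n_batches=2)

for row in plan:
    print(row["run_order"], row["sample"], row["condition"], "batch", row["batch"])

# Quick check that batch is not confounded with condition.
print(Counter((row["batch"], row["condition"]) for row in plan))
```

The final count is the important part: if any batch contains only one condition, the design is confounded and should be redone before anything goes near the bench.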
6. Plan the analysis
If possible, stagger the experimental and analysis work. See if you can have your analysis postdocs come into the project
later, ideally with some prior knowledge of the work. The best case is that they are around early but on another project, and then switch into this one about a year in. Unfortunately, because
funding agencies like to have neat and tidy three-year projects, this is often
quite difficult to arrange.
Determine when an initial dataset will be available, and time
the data coordination accordingly. Budget at least six months (more likely
12–18 months) of pure computational work. Use early data to ‘kick the
tyres’ and test different analysis schemes, but plan to have a single run of analysis that takes at least 12–18 months.
7. Replication/validation strategy
You know you’re not going to cook up the data and analysis,
but will you convince the sceptical reader? Make sure you have a strategy in
place.
I find it helps to think of this as two separate phases:
discovery, and validation/replication. In discovery, you have plenty of freedom to try out
different methods and normalisation before settling. The validation/replication phase, for
a project of any size, features ‘single-shot’ experiments, which offer a minimal amount of flexibility.
Generally speaking, you should not be doing single-sample-per-state
experiments; rather, you should be carrying out at least two biological
replicates, which is enough to show up any problem. With five or more biological
replicates, you can make good mean estimates. The one exception to the ‘no single sample’ rule is QTL/GWAS, where it
is nearly always better to sample new genotypes each time, rather than
replicate data from the same genotype (i.e. maximise the number of genetically distinct samples first, and only then improve the per-individual variance estimates).
8. Confront multiple testing
How many tests are you going to do? If it is a genome-wide project, you will do a lot, so you need to control for multiple testing. This is partly about the power calculation, but requires some up-front thinking. Will you do permutations,
or trust to the magic of p.adjust() (a wonderful R function that offers several multiple-testing corrections, including False Discovery Rate approaches)? What will you do if you find nothing? Is
finding nothing interesting in itself?
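For readers who want to see what a False Discovery Rate correction actually does, here is a minimal sketch of the Benjamini-Hochberg procedure in plain Python. It is meant to mirror the 'BH' method of R's p.adjust(), not to replace it; the function name and the example p-values are assumptions made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the original order.

    Each p-value is scaled by m / rank (rank 1 = smallest p-value), and the
    results are made monotone by taking a running minimum from the largest
    p-value downwards.
    """
    m = len(p_values)
    # Indices of the p-values, from largest to smallest.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)

    adjusted = [0.0] * m
    running_min = 1.0
    for position, index in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical p-values from four tests.
raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"raw p = {p:.3f}  adjusted = {q:.3f}")
```

With tens of thousands of tests the same logic applies, which is why it pays to decide up front which correction you will use and what adjusted threshold you will treat as a finding.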
You’ll all have agreed to try and discover something
excellent, but make sure you have a serious conversation up front with your co-PI
about what you’ll do if you don’t find anything interesting. Is there a fall-back plan?
Traditional, outright replication of an entire discovery
cohort needs as much logistical planning – if not more – as the discovery itself. You
might decide to use prior data to show that yours is at least solid.
Organise this beforehand.
9. Publishing parameters
What would you consider to be the first publishable output
from this project? Could you put it into a technical publication (e.g. assay
scale up, bespoke analytical methods)?
At the beginning of the project, you and your co-PI should agree on the broad
parameters of authorship on papers, and how multiple papers might be
coordinated. For example, will you credit two first authors and two last
authors, swapping in priority if there is more than one paper?
If you are a more senior partner in a collaboration, be generous with your “last last” position. Your junior PI partners need it more than you do!
Next up
This is the first of three posts. Next up: Managing your Big Data project.