Patients who contribute their data to research are primarily
motivated by a desire to help others with the same plight, through the
development of better treatments or even a cure. Out of respect for these
individuals, and to uphold the fundamental tenets of the scientific process, I’d
like the clinical trials community to shift its default position on data
sharing and reuse to one of data availability on publication, as is already the norm in the life sciences. This
will enable more robust, rigorous research, create new opportunities for discovery
and build trust between patients and scientists.
This aspiration is widely shared in the basic research
community, and has been well articulated in considered and public discussions such as a series
led by the National Academy of Sciences in 2003. Nevertheless,
recent articles in the New England
Journal of Medicine have pushed back against data sharing, calling those who reuse data
“research parasites” (followed by a bit of clarification)
and proposing a lengthy and complex embargo procedure for clinical trial data sharing, potentially including payment.
A tradition of sharing
Sharing tools has been the norm (mostly) in genetics and molecular biology since the field’s early days, mainly because you couldn’t get anything done unless people let you use their reagents. This has persisted for over a century, from the first fly lines to cDNA clones, enzymes, antibodies and, now, ‘omics datasets.
The Protein Data Bank in the 1970s, the EMBL Bank (now ENA)
and GenBank nucleotide collections in the 1980s and the Human Genome Project in
the 1990s all thrived thanks to the norms of reagent sharing and data
deposition, and the returns to science were - and are still - huge. Such practices are
pragmatic in terms of both data quality and author credit, each of which provides
incentives for researchers.
I am perhaps painting a beautiful picture of an imperfect
world – there is still much to be done to ensure all this data sharing can work. Compliance, agreeing on things
like adaptable standards, and keeping the infrastructure humming are all
challenges we grapple with on a daily basis in molecular biology. But we have
much to be proud of, and embracing the ethos of sharing has brought us a long
way in a short time.
Data release: why?
Releasing data when you publish a paper isn’t about giving things
up – although I can see that for some, the lack of instant reward might make one
feel that way. Data release is not about rewarding a single PI; it’s about
benefitting the clinical research community as a whole, and making the most of
the data entrusted to you by patients. So - why release data?
We are custodians, not owners, of patient data.
Patients participate in trials to further medical research, benefit from new medicines (potentially) and gain from focused care and advice. But numerous surveys have shown that participants are primarily motivated to share their data – the most valuable aspect of a clinical trial – by the altruistic desire to help others in the future.
So it is very strange that some researchers feel justified in assuming the data produced in a clinical trial is somehow their own scientific property. From the
perspective of patient care, this position is particularly questionable when it
impedes the ability of other scientists to re-examine the data for additional
studies, which would contribute to the progress so eagerly desired by the
participants.
If we’re not doing all this research to improve patient
care, then we should probably change the consent process.
Challenging the interpretation of observations is fundamental to the scientific process.
Evidence is a wonderful thing. Our freedom to base our arguments on reproducible experiments dates back to the 17th century, when people in Europe were finally permitted to openly discuss and debate science based on direct observation. Evidence is the backbone of scientific discourse, so it follows that papers without data can be easily dismissed as well-articulated speculation.
When a dataset is published, readers are then able to drill down
to raw observations, and can verify methods or explore alternative explanations.
Yes, this means they can potentially expose errors in your work and your thinking. But it’s far more
common for readers to double-check the work against other published datasets,
which can answer lots of different questions. Ultimately, this is a good thing
for science.
Sharing data sharpens the mind.
The very real anxieties that come with data sharing are both individual and collective, because we are building knowledge together. Professional pride dictates that if your data will be open for inspection, you will be much more careful about the details. (After all that data cleaning and fixing, confounder/covariate discovery and adjustment, you do not want to be the one who left a howler for others to discover.)
Everyone knows there are skeletons in the data closet,
mostly down to the complications of running real-life experiments, so current
analyses make use of several approaches to boost confidence in the results. But
generally speaking, just knowing your peers could be wandering through your data
sharpens your mind and makes you focus on handling and presenting your analysis
properly.
When an entire community does this, it benefits from a
deeper consensus on what a “good study” looks like. That matters a lot.
Meta-analysis and Serendipity
When it can be done, meta-analysis (the combining of datasets) is a win–win–win (funders, scientists, patients). It’s about building on studies, combining them to gain new insights, asking different questions and finding new leads. Meta-analysis isn’t always possible – clinical trials often look at entirely different things, and even when they do study the same thing, they can’t always be aligned very well. But meta-analysis is only possible when people share their datasets.
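To make the idea of combining datasets concrete, here is a minimal sketch of inverse-variance, fixed-effect pooling – one of the simplest meta-analysis techniques. The function and the trial numbers below are illustrative assumptions, not data from any real study; the only requirement is that each trial reports an effect estimate and its standard error on a comparable scale.

```python
import math

def pool_fixed_effect(estimates, std_errors):
    """Combine per-trial effect estimates using inverse-variance weights.

    Trials with smaller standard errors (more precise results) get
    proportionally larger weights in the pooled estimate.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Three hypothetical trials measuring the same treatment effect
estimates = [0.42, 0.35, 0.50]
std_errors = [0.10, 0.15, 0.12]

effect, se = pool_fixed_effect(estimates, std_errors)
print(f"pooled effect = {effect:.3f}, standard error = {se:.3f}")
```

The pooled standard error is always smaller than that of any single trial, which is the statistical payoff of sharing: no one study needs to be definitive on its own.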
Serendipity is another benefit of data sharing – I am always
amazed at how important it is for science. Serendipity has guided us to some
seriously profound insights, for example the relationship between the Malaria
parasite and plants, or how metabolic enzymes can be used as lens crystallins. It’s
been behind many of the completely weird discoveries that make biology so
wonderful, and many practical discoveries, such as CRISPR, that push the
frontiers of possibility.
I’ve stumbled happily upon Serendipity many times, and very
often others have made serendipitous discoveries based on data or methods I
have published. You’d have to be pretty cynical to begrudge your fellow
scientists such pleasure, and, frankly, a bit petty to fret over whether
they’ll remember to credit you (nearly all scientists carefully reference their
sources, if only to reassure reviewers of the credibility of the data they use).
For funders, both meta-analysis and serendipitous
discoveries compound their return on investment and make them look good. For scientists, being able to make use
of comparable data to verify or cross-validate their work, or to make unplanned
discoveries, is invaluable. For patients, knowing their contribution is being
used in lots of different and useful ways can give a sense of pride.
Sceptical about whether this really applies to clinical
research? Well, without having access to a large number of trials, I doubt
anyone could say.
Having more large datasets on hand for meta-analysis can
only benefit those planning and analysing the results of clinical trials. And as
clinical trials begin to incorporate more high-dimensional, data-rich datasets
(e.g. imaging, metabolomics, multi-omics) – and to share them – there will be plenty
of opportunities to carry out sophisticated meta-analysis.
As for Serendipity, well, it can strike at any time.
The scoop
It is hardly possible for anyone to “scoop” you simply
because you released your data on publication – particularly if that dataset
represents only what is needed to support your paper. If someone else looks at
that data and comes up with an interesting observation you missed, they can potentially
make that corner of science a little bit better. Dwelling on the negatives will
get you nowhere, but looking on the bright side may land you a new
collaborator.
If the only datasets you share at publication time are those
that relate specifically to that paper, there is no need for complicated
embargo rules that provide authors enough time to perform a full analysis on
all the data collected (as proposed in the most recent NEJM editorial). Tracking and versioning might become more complicated
with later papers, but this approach does the important job of tying the
datasets to the publication in a reasonable timeframe, opening up that piece of
science for proper verification and discourse.
If you really believe you are going to be scooped for some
missing analysis on a dataset, the solution is to delay publication. If you’re
worried that making your data public will expose you to undue criticism, make
your analysis bulletproof. That will be good for you and for the system as a
whole, as understanding the strengths and weaknesses of different analyses only
makes the community stronger.
When data sharing is not straightforward
Human subjects
No matter what, we have to honour patient consent. As scientists we may wish such agreements were more future-proof, but when those consents preclude data sharing beyond the study group, we have to accept it and move on.
Exactly how to future-proof consents for clinical trials is
no simple matter. One solution would be for funding agencies or regulators to begin
insisting that consent forms provide a reasonable level of research access,
which would facilitate research but respect the privacy of individuals.
Currently, for genetic studies, there is a lightweight
vetting process, involving both individual and institutional sign off, which
assures patients that the researchers will perform appropriate research on the
dataset. This is a clunky approach and it certainly needs improvement, but it
is functional.
Standards and infrastructure
Data sharing is only feasible if the parties involved are able to do it, without worrying that they’ll run into trouble transferring files from one site to another, or that their data will disappear into some kind of black hole.
A robust, global archive for this kind of information would
be one important piece of a larger infrastructure that would make biomedical
data sharing straightforward. The EMBL-EBI model – biomolecular archives
supported by international collaborations – is a solid example. Funding for infrastructure
like this is huge value for money, and costs little in the context of global clinical
research funding.
CDISC standards are functional, and well used by the
clinical trials community. But there is a constant need to review standards and
establish new ones for emerging technologies. This work never ends, but the end
goal of harmonisation (i.e. to support meta-analysis) is a good one, and the
whole process helps us along on our eternal quest for a shared language.
Regulatory and commercial concerns
I do not have a lot of experience in this area, but it’s clear that regulation of clinical trials is a huge deal for the pharmaceutical industry. Any data release policy needs to work well for regulators and for commercial interests, which can have different concerns from academia. For both, the science performed in clinical trials must be very sound, so the mind-sharpening step of data release is certainly of value; in any case, most companies I know are delighted when other science emerges from the data they release.
Evidence is beautiful
In this on-going debate about data, let us base our arguments on… data. We are all likely to change our views when presented with compelling data and well-reasoned analysis, which is one of the nice things about being a scientist.
Refreshingly, for the most part I do not think this debate is
one of those boring political ones where everyone chooses a side, closes their
ears and steels themselves for uncomfortable dinner-table discussions. Scientists
already working in an open-data environment understandably campaign for
everyone to join them – though they are fully aware of the downsides. Scientists
working in clinical trials can see there are advantages to sharing data, but
have neither the time nor the inclination to sort out the myriad details that would
make it workable.
As a starting point, we can focus on the simplest, tried-and-tested approach of publishing your data alongside your narrative – a practice that has served science well for over 300 years. But more importantly, we can keep the discussion going, and work with one another to overcome the barriers to realising the full potential of biomedical research. That would be a win for scientists, their funders and, most importantly, patients themselves.