Envisioning a Biomedical Data Reuse Registry

Heather A. Piwowar and Wendy W. Chapman
University of Pittsburgh, Pittsburgh, PA

To appear as a poster at AMIA 2008.
For more information on our data sharing research please email or visit!

Abstract

Repurposing research data holds many benefits for the advancement of biomedicine, yet is very difficult to measure and evaluate. We propose a data reuse registry to maintain links between primary research datasets and studies that reuse this data. Such a resource could help recognize investigators whose work is reused, illuminate aspects of reusability, and evaluate policies designed to encourage data sharing and reuse.

Motivation

The full benefits of data sharing will be realized only when we can give investigators incentives to share their data [1] and quantify the value created by data reuse [2]. Current practices for recognizing the provenance of reused data include an acknowledgment, a listing of accession numbers, a description of a database search strategy, and sometimes a citation within the article. These mechanisms make it very difficult to identify and tabulate reuse, and thus to reward and encourage data sharing. We propose a solution: a Data Reuse Registry.

What is a data reuse registry?

We define a Data Reuse Registry (DRR) as a database with links between biomedical research studies and the datasets used within the studies. The reuse articles may be represented as PubMed IDs, and the datasets as accession numbers within established databases or the PubMed IDs of the studies that originated the data.
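As an illustration only, one such link could be represented as in the sketch below. The class names, field names, and all identifiers are hypothetical placeholders chosen for this example; the DRR itself does not prescribe a schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class DatasetRef:
    """A reused dataset, identified either by an accession number in an
    established database or by the PubMed ID of the originating study."""
    kind: str            # "accession" or "pmid" (hypothetical labels)
    identifier: str      # the accession number or PubMed ID itself
    database: str = ""   # hosting database name, if kind == "accession"


@dataclass
class ReuseLink:
    """One registry link: a reuse article and the datasets it used."""
    reuse_pmid: str
    datasets: List[DatasetRef] = field(default_factory=list)


# Placeholder identifiers, made up for illustration only.
link = ReuseLink(
    reuse_pmid="00000001",
    datasets=[
        DatasetRef(kind="accession", identifier="ACC0001", database="ExampleDB"),
        DatasetRef(kind="pmid", identifier="00000002"),
    ],
)
print(link)
```

Allowing either identifier type mirrors the two ways a dataset may be referenced: directly by accession number, or indirectly through the study that produced it.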

How would the DRR be populated?

We anticipate several mechanisms for populating the DRR:
  • Voluntary submissions
  • Automatic detection from the literature [3]
  • Prospective submission of reuse plans, followed by automatic tracking

We envision collecting prospective citations in two steps. First, prior to publication, investigators visit a web page and list the datasets and accession numbers reused in their research, thereby creating an entry in the DRR. In return, the reusing investigators receive best-practice free-text language to insert into their acknowledgments section, a list of references to the papers that originated the data, value-added information such as links to other studies that previously reused the data, and a reference to the new DRR entry. Second, when authors cite this DRR entry within their reuse study as part of their data use acknowledgment, data input can proceed automatically: the published literature will be mined periodically to discover citations to DRR entries. These citations will be combined with the information provided when the entry was created to explicitly link published papers with the datasets they reused. The result will be searchable by anyone wishing to understand the reuse impact of an investigator, institution, or database.
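The two-step flow described above might be sketched as follows. This is a minimal illustration under stated assumptions: the `Registry` class, its method names, the `DRR:000001` identifier format, and the sample identifiers are all invented for this example and are not part of the proposal.

```python
import re
from typing import Dict, List

# Hypothetical identifier format for DRR entries, e.g. "DRR:000001".
DRR_ID_PATTERN = re.compile(r"DRR:\d{6}")


class Registry:
    def __init__(self) -> None:
        self.entries: Dict[str, List[str]] = {}  # entry ID -> datasets listed
        self.links: Dict[str, List[str]] = {}    # reuse PMID -> datasets reused

    def register_plan(self, datasets: List[str]) -> str:
        """Step 1: before publication, record the datasets to be reused
        and return a new entry ID for the authors to cite."""
        entry_id = f"DRR:{len(self.entries) + 1:06d}"
        self.entries[entry_id] = list(datasets)
        return entry_id

    def mine_article(self, pmid: str, text: str) -> None:
        """Step 2: scan a published article for citations to known entries
        and link the article to the datasets listed in those entries."""
        for entry_id in sorted(set(DRR_ID_PATTERN.findall(text))):
            if entry_id in self.entries:
                self.links.setdefault(pmid, []).extend(self.entries[entry_id])


# Placeholder accession numbers and PubMed ID, made up for illustration.
registry = Registry()
entry_id = registry.register_plan(["ACC0001", "ACC0002"])
article_text = f"We thank the original data producers (see {entry_id})."
registry.mine_article("00000001", article_text)
print(registry.links)
```

In this sketch the cited entry ID is the only thing the mining step needs to find in the text; the dataset list comes from the record created in step 1.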

How would the DRR be used?

Information from the DRR could be used to recognize investigators whose work is reused, illuminate aspects of reusability, examine the variety of purposes for which a given dataset is reused, and evaluate policies designed to encourage data sharing and reuse.
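For instance, a simple reuse tally per dataset could be computed from registry links along the following lines. The `links` mapping and every identifier in it are made-up placeholders, and the mapping's shape (reuse article to datasets) is an assumption for this sketch.

```python
from collections import Counter

# Made-up links: reuse article PubMed ID -> datasets reused in that article.
links = {
    "00000001": ["ACC0001", "ACC0002"],
    "00000003": ["ACC0001"],
}

# Count the number of distinct reuse articles for each dataset.
reuse_counts = Counter(
    dataset for datasets in links.values() for dataset in set(datasets)
)

for dataset, count in reuse_counts.most_common():
    print(dataset, count)
```

Grouping the same counts by originating investigator, institution, or database would require additional metadata about each dataset, which this sketch does not model.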

Conclusion

While the DRR may not be a comprehensive solution, we believe it represents a starting place for finding solutions to the important problem of evaluating, encouraging, and rewarding data sharing and reuse.


Acknowledgments

HP is supported by NLM training grant 5T15-LM007059-19 and WC is funded through NLM grant 1 R01LM009427-01.

References

1. Compete, collaborate, compel. Nat Genet. 2007;39(8).
2. Ball CA, Sherlock G, Brazma A. Funding high-throughput data sharing. Nat Biotechnol. 2004 Sep;22(9):1179-83.
3. Piwowar HA, Chapman WW. Identifying data sharing in the biomedical literature. Submitted to the AMIA Annual Symposium 2008.