Open Access Month – Open the Data

October is Open Access Month. Throughout the month, guest contributors will present their perspectives on the value of open access to research, scholarship and innovation at The University of Texas at Austin.

This installment provided by Spencer J. Fox (ORCID ID: 0000-0003-1969-3778), PhD candidate focusing on computational epidemiology.

Spencer J. Fox.

Three years ago, I was choosing the next research direction for my PhD. I was interested in two subjects and had found a journal article in each to build upon. I decided to follow the computational biologist’s path of least resistance: pursue the paper whose results I could reproduce first, since reproduction is an important first step. One of the papers had published a repository with all of its data alongside working code for analyzing it, while the other had simply stated: “Data available upon request” with no reference to the code used for the analyses.

Being a naive graduate student, I politely reached out to the authors of the second study to obtain their data and inquire about their code. In return, I received a scathing email filled with broken links to old websites, excuses about proprietary data, and admonishment for having asked for “their” code: “any competent researcher in the field could replicate our analysis from the information within the manuscript.” I was stunned.

While expressing my frustration to my peers, I found that their requests had been met with equal hostility and condescension from scientists in their respective fields. When data or code were provided – usually after months of negotiation – cooperation came with heavy stipulations on article authorship, time-stamped embargoes, or permissible analyses. Clearly, it’s not enough to rely on researchers to act in good faith.

The unfortunate truth is that the onus falls on journals to enact real change. Many major journals now require that raw data be deposited in permanent online repositories like Dryad [1]. This has improved data sharing, but it is only half the battle and provides only the appearance of reproducible research. I have spent weeks reproducing an analysis even with the authors’ data and code in hand; without both, it would have been impossible. Simply put, freely available code – even if messy and difficult to follow – provides an invaluable foundation for future researchers to build upon, and all journals should require that both analysis code and data accompany a manuscript.

So many conscious and subconscious coding decisions are made over the course of a project that even minor choices early on can become serious stumbling blocks for researchers trying to reproduce results. Differences in mundane behaviors across programming languages, versions, library functions, and hand-written pipelines can have drastic effects on end results. A striking example is the inadvertent errors, attributed to Microsoft Excel, found in roughly one fifth of genomics papers [2].
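To make that Excel example concrete, here is a minimal sketch in Python (the file name and column name are hypothetical) of the kind of sanity check a reader can only run when both the data and the code are shared: it scans a table of gene symbols for values that Excel has silently converted to dates, such as SEPT2 becoming “2-Sep”.

    # Sketch: flag gene symbols that Excel may have converted to dates,
    # the failure mode behind the genomics errors cited above.
    # The file name and "gene_symbol" column are hypothetical.
    import csv
    import re

    # Matches date-like strings such as "2-Sep" or "Sep-02" that Excel
    # produces from gene symbols like SEPT2 or MARCH1.
    DATE_LIKE = re.compile(
        r"^(\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
        r"|(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{1,2})$",
        re.IGNORECASE,
    )

    def find_mangled_symbols(path, column="gene_symbol"):
        """Return (row_number, value) pairs whose gene symbol looks date-mangled."""
        suspicious = []
        with open(path, newline="") as handle:
            # Data starts on file line 2; line 1 is the header.
            for line_no, row in enumerate(csv.DictReader(handle), start=2):
                value = (row.get(column) or "").strip()
                if DATE_LIKE.match(value):
                    suspicious.append((line_no, value))
        return suspicious

    if __name__ == "__main__":
        for line_no, value in find_mangled_symbols("supplementary_table.csv"):
            print(f"row {line_no}: {value!r} may be an Excel-converted gene name")

A check this simple takes minutes to run against a shared repository, and is effectively impossible against a manuscript whose data and code live behind “available upon request.”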

Finally, while it is ultimately the researcher’s responsibility to provide code alongside a manuscript, there are tangible incentives for doing so: citations. Open access manuscripts and those that provide their data receive more citations [3,4], and the same likely applies to providing analysis code. After debating between those two articles three years ago, I alone have since cited the reproducible paper in two separate publications. How many other potential citations are lost “upon request”?


Citations

  1. http://datadryad.org/
  2. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7
  3. https://elifesciences.org/articles/16800
  4. https://peerj.com/articles/175/