In an open ocean, shouldn't open data be the norm?

By Kate Crosby, Data Manager, Canadian Healthy Oceans Network

Photo credit: Wordle

“Open access” and “open data” are two terms that have been tossed around more than casually of late in all science disciplines, and ocean science is no exception. In this post I’m going to address the problem of open data rather than open access (a broader topic that is closely linked to open data).

Data wranglers contact investigators and get the data branded and put in the digital repository corral. Nature 461:160-163

What does open data mean to the primary generators and “getters” of data - the academic/government/industry researchers who pour their blood, sweat, time, and tears into collecting them - and what does it mean to the general public, policy makers, and government officials? My own answer to and interpretation of this problem come largely from an outsider’s viewpoint.

CAVEAT: I am not a marine scientist (I work in a terrestrial system & focus on evolutionary biology). I am also a data manager tasked with chasing down, filtering, archiving, and uploading the data produced by the Canadian Healthy Oceans Network (CHONe), and we have until midway through 2014 to do it.

I view the ocean as an open system - unlike terrestrial systems, where political and geographic boundaries are easily delineated. Most of the ocean does not belong to any country or corporation; indeed its invisible geographic features are still being “sounded.” If the ocean is an open system, should the data collected be eventually open to the public and research community alike? 

 

Challenges to Open Data

The biggest challenge facing a data manager is not computational or even organizational; it’s getting researchers to release raw or processed data that they hold as “proprietary”. Perhaps a better way of putting it is that researchers are “stewards” of data. Researchers are attached to and invested in their data, and it’s no wonder.

Digital repositories designed to preserve and share just about any kind of digital data researchers could produce for the most part lie empty, even though researchers were supportive of the idea. When the time came for researchers to deposit their data, they couldn't find it, didn't understand how to use the archive, or said they didn't have any time to do it. Nature 461: 160-163

An apt way of summing up the viewpoint of some present-day researchers is an anecdote from the 16th century. Before the founding of the Royal Society in the UK in 1660, the prevailing attitude toward sharing data between researchers could be summed up by a quote from the Italian mathematician Girolamo Cardano, who wrote in a letter asking to see another mathematician’s formula:

"I swear to you by God’s Holy Gospels and as a true man of honor, not only to never publish your discoveries, if you teach me them, but I also promise you, and I pledge my faith as a true Christian, to note them down in code, so that after my death no one will be able to understand them.”

All joking about writing in code aside, both academic and government positions require their scientists to publish peer-reviewed scientific papers, but data are still largely left out of the equation. Papers remain the main metric of productivity - and the main metric for promotion (or tenure) and the successful awarding of grants. Money for these grants largely originates from the public taxpayers’ purse (the tri-council of NSERC, SSHRC, and CIHR). Papers are often published in journals that remain behind paywalls (charging sometimes exorbitant subscription fees for public or institutional access), and data affiliated with these papers may not be visible EVER. But if data and papers are the product of researchers using public funds, shouldn't both the knowledge in the papers and the data generated go back to the public in the end?

Data allow a researcher to generate a heavily edited summary or story (publication) of a process. However, when it comes to obtaining grants to continue to do science, the data are not weighted the same way papers are. Ideally, the two forms should be submitted as a package (see Scenario 2 below). For example, a single researcher, Dr. Jane Doe, has three routes for publishing her contribution to research:

Scenario 1: Collect empirical data, but don’t publish the data + Publish paper based on the data = 1 productivity point

Scenario 2: Collect and publish empirical data + Publish paper with data = 1 productivity point

Scenario 3: Collect and publish one or five empirical data sets = 0 productivity points (at present)

There are several things wrong with these scenarios. First, there’s no reward for collecting and publishing data, whether alone or together with a paper - it simply does not “count” toward a scientist’s “productivity” when it comes time to calculate grant money and funding. Collected data do not always lead to a publication, either because of low statistical power (a sample size too small to draw well-supported conclusions) or because the data at the present time yield only a “negative” result. The “negative” result I refer to here is tied to the well-known publication bias towards “success stories” when writing up inferences about data. The second problem is that funding agencies reward the summary and thought process (not to denigrate that thought process) without requiring access to the entity (the data) that allowed the publication to proceed in the first place. If there are no open data, there is simply no way of re-evaluating inferences drawn from these data, or potentially even the peer-reviewed inferences a paper makes. Reproducibility/repeatability is one of the first tenets of good science.

Data miners search and retrieve data from many sources to reuse in larger meta-analyses. Nature 461: 160-163

Allowing a researcher to ‘hide’ data, perhaps by misplacing or inadequately archiving them (e.g. a forgotten hard drive stored in an attic), is a misuse (albeit often unintentional) and, if the data are intentionally concealed, to some extent an abuse of public funds.

There is actually another “problem” with the above scenarios. I used the word “empirical” to describe data, when what I meant was empirical primary data generated from mensurative and manipulative experiments in the field or lab. I did not talk about simulated data, or about empirical data (as I have described them) being re-used and combined into larger secondary BIG DATAsets to carry out a larger meta-analysis. I will not address this problem, as it has been summed up quite well by Brian McGill on Jeremy Fox’s blog “Dynamic Ecology”.

 

Permanent digital archiving of data: an important first step

Data are (A) acquired, checked for quality, documented, and then both the raw and derived data products are versioned and deposited in a public digital repository. Researchers can discover and access data from the repository and then (B) integrate and process the data, which results in derived data products, visualizations, and scholarly papers that are in turn archived. Science 331: 703-705

I would argue that it’s extremely important to change researchers’ attitudes about publishing and sharing data. To get researchers onside (onboard?), we essentially need to factor published data into the funding equation as a “metric of productivity”, and this is hopefully coming for the Canadian funding agencies in the very near future. Ideally, researchers would publish data along with a peer-reviewed paper for context. This is the approach chosen by certain non-profit, DOI-granting archives, such as the NESCent-based Dryad digital repository. Other types of digital repositories, such as FigShare, are free and do not require a published paper to accompany datasets. Public universities in Canada are also working on establishing free, permanent, DOI-granting archives for their researchers.

Archiving the researcher’s original dataset in one of these repositories is crucial to preserving the data: it ensures they cannot be manipulated or changed after the fact, and that a copy of the data can be checked against the original if need be.
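
To make the idea of checking a copy against the archived original concrete, here is a minimal sketch in Python. It assumes that a SHA-256 checksum of the file was recorded when the dataset was deposited; the function names and the workflow are my own illustration, not a description of how Dryad or any other repository actually verifies files.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks.

    Reading in chunks keeps memory use low for large data files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_archived_original(copy_path, archived_checksum):
    """Check a local copy against a checksum recorded at deposit time.

    `archived_checksum` is assumed (hypothetically) to be a SHA-256 hex
    string stored alongside the archived dataset.
    """
    return sha256_of_file(copy_path) == archived_checksum.strip().lower()
```

A researcher holding a working copy could call `matches_archived_original` with the path to that copy and the recorded checksum; any edit to the file, however small, would make the comparison fail.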

 

Where is CHONe now and where are we going?

Convincing network researchers to safely and publicly archive data is the point I am at right now with CHONe. The goal is that we will eventually build a database with all collected data. Fortunately, CHONe scientists have been quite helpful and productive along the way. We are currently up to six data packages in Dryad and more are coming. Nevertheless, there is still some resistance, as researchers are busy people with busy schedules. So inevitably, a number of questions come to mind when I ask researchers to return publicly funded data to an open, permanent archive:


So how do you view data collected by you or a researcher? Is it yours? Are you the steward or the proprietor?

Does the public have a right to know what you accomplished with the money they provided you?

If published datasets were counted towards a measure of productivity similar to the way publications are, would that enable you to share more frequently? 

What are your actual concerns with permanent public archival?