This text is taken from a translation of the guide published on the French site www.ouvrirlascience.fr

On the sharing of data associated with scientific publications (www.ouvrir_la_science.fr)

In science, publications are the traditional vectors for disseminating knowledge. The results they present are increasingly based on underlying data and analyses. Sharing data together with publications therefore plays an important role in the quality of research. The purpose of this guide is to familiarize you with the steps needed to share data linked to publications.

Why share data with publications?

Open data promotes the transparency and reproducibility of the scientific process. It helps make the results presented in a scientific publication trustworthy. Making data available to everyone promotes their reuse by the scientific community and their appropriation by citizens, and allows them to be drawn on in public debate.

Opening data means exposing them to criticism. Preparing for this means documenting the data and also putting in place, throughout their life cycle, management actions that preserve their quality. Data made available in this way benefit from increased traceability and a greater potential for reuse, including by their producer.

Openly disseminating data ensures the recognition and visibility of their producers and of their institution, in the same way that a scientific publication ensures the visibility of its authors. An open dataset is more visible and therefore more likely to be used in another research project, and then cited in the same way as a publication. The people associated with it see their involvement valued. It should also be noted that publications accompanied by their data are cited more often [1].

Data produced during a research project may have value and interest beyond the project, and sometimes beyond the initial discipline. Making them available makes it possible to fully exploit their potential, thus promoting interdisciplinarity and academic collaboration.

How to share data with publications?

When writing a scientific article, authors naturally adopt a pedagogical approach: they clearly define all the notations and conventions used in the article and describe the assumptions, the framework, and the state of the art, in order to make the article easier to read and understand. Data are part of this approach. If we want the shared data to be useful to the scientific community, the same attention must be paid to their publication.

Data preparation and documentation

Describing the data so that they are intelligible to anyone who did not take part in their production is a preliminary step to their dissemination. Information on the origin of the data, the assumptions or constraints related to their production, and the associated experimental protocols must be part of the descriptive information supplied with the data: the metadata. There are generic metadata standards as well as domain-specific ones [2]. To support this process of continuous data management, we can also rely on a Data Management Plan [3], a document defining the procedures for monitoring and describing the data.
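As an illustration, the descriptive information mentioned above can be gathered in a small structured record and checked before deposit. The sketch below is only one possible shape: the field names (title, creators, origin, protocol, license) and the missing_fields helper are hypothetical and do not follow any particular metadata standard; a real repository will define its own required fields.

```python
import json

# Hypothetical minimal set of descriptive fields (an assumption for this
# sketch); a real repository or metadata standard defines its own list.
REQUIRED_FIELDS = ("title", "creators", "origin", "protocol", "license")


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in the record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Illustrative record: every value here is a placeholder, not real data.
example_record = {
    "title": "Example dataset title",
    "creators": ["Surname, Given name"],
    "origin": "Field measurements collected during the project",
    "protocol": "",  # experimental protocol not yet documented
    "license": "CC-BY-4.0",
}

if __name__ == "__main__":
    print(json.dumps(example_record, indent=2))
    print("Missing:", missing_fields(example_record))
```

Running such a check throughout the project, rather than only at deposit time, is in keeping with the idea of continuous data management.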

When the time comes to share data, several elements must be taken into account. Some data are subject to legal constraints that prevent their sharing or that require anonymization or authorization requests. Each research institution has its own data openness policy, constrained by legislation, which is an important prerequisite for choosing the means of sharing data.

It is recommended not to entrust data sharing to publishers, who offer to publish the data in the form of “supplementary data” or “supplementary materials”. Such publication often takes place in a format and an environment that do not allow the data to be documented correctly, which makes them difficult for others to reuse. It may also be accompanied by a request for an exclusive transfer of rights, which contradicts national law and the spirit of open science. Finally, in some cases, it makes scientists captive to environments controlled by major scientific publishing companies.

It is therefore preferable to share data in institutional repositories, either general or discipline specific, which avoid such pitfalls and offer a documentation-oriented environment allowing consultation and reuse of open research data. Correctly linking the published datasets and the article then becomes a necessity, and a step to be anticipated.

Repository choice

  • In disciplines where data sharing is already well structured (astronomy, genomics, etc.), data producers have at their disposal repositories specific to their discipline. They will then naturally use the standards and good practices already in place to document and format their data. The practice of their community is the best guide, but directories of these repositories also exist [4].

  • Alternatively, data producers can turn to the repository of the institution with which they are affiliated, if any, or use the multidisciplinary Research Data Gouv repository. In both cases, the repositories impose only minimum requirements, and responsibility for the quality of the data documentation rests more heavily on the depositor.

The national Research Data Gouv repository

The national platform Research Data Gouv offers a multidisciplinary data repository, operational from 2022. It ensures French sovereignty over the data, complies with French and European law, and guarantees the preservation and indexing of the stored data in accordance with the FAIR principles. It is the repository of choice when no disciplinary repository exists.

Whichever repository is chosen to share data, it must offer in particular the following features:

  • The assignment of a persistent identifier (PID) of the DOI type, which makes it possible to cite the data (for example http://dx.doi.org/10.15497/RDA00027) and is the basic building block for linking to other research outputs such as publications.

  • A description of the data at a level sufficient to facilitate discovery, understanding, and reuse (standardized metadata descriptions, controlled disciplinary vocabularies).

  • The use of licenses [5] and the definition of access rules, so that reuse takes place within a well-defined legal framework compatible with French and European law.

  • A minimum retention period of several years, consistent with the institution’s data retention policy.
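Since the DOI is what links a dataset to a publication, it can help to store it in one canonical form. The sketch below is a simplified illustration, not a full implementation of the DOI syntax: the helper names normalize_doi and doi_url are hypothetical, and the pattern only covers the common "10.<registrant>/<suffix>" shape. It makes no network request and does not check that the DOI is actually registered.

```python
import re

# Simplified pattern for the common "10.<registrant>/<suffix>" DOI shape.
# Assumption: this accepts usual modern DOIs, not every legal DOI.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes commonly written in front of a DOI.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(identifier: str) -> str:
    """Strip a known prefix and return the bare DOI, or raise ValueError."""
    value = identifier.strip()
    lowered = value.lower()
    for prefix in KNOWN_PREFIXES:
        if lowered.startswith(prefix):
            value = value[len(prefix):]
            break
    if not DOI_PATTERN.match(value):
        raise ValueError(f"Not a recognizable DOI: {identifier!r}")
    return value


def doi_url(identifier: str) -> str:
    """Build an https://doi.org/ link suitable for use in a citation."""
    return "https://doi.org/" + normalize_doi(identifier)


if __name__ == "__main__":
    # The example identifier given in the list above.
    print(doi_url("http://dx.doi.org/10.15497/RDA00027"))
```

With the example identifier from the list above, the script prints https://doi.org/10.15497/RDA00027.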

Citing a dataset

How to cite a dataset linked to a scientific publication depends on the circumstances in which the data were produced:

  • If the data were produced and shared while the article was being written, it is recommended to add a specific “Data availability” section before the bibliographic references. For example:

    Data availability: Datasets related to this article can be found at https://doi.org/10.23708/PQTQDA, an open-source online data repository hosted by DataSuds IRD (Granjon and Fossati, 2020).

  • If the data had already been produced and shared in a context other than that of the publication, they are cited in the references in a form equivalent to that of bibliographic references, for example:

    Van Halder, Inge; Sacristan, Alberto; Martín-García, Jorge; Pajares, Juan Alberto; Jactel, Herve, 2022, “Monochamus galloprovoncialis catches and pine tree composition in different landscape buffers in Spain”, https://doi.org/10.15454/JXFGPI, INRAE Data Portal, V1

Proper citation of data allows better indexing, and therefore easier discovery when searching, and gives lasting credit to the data producer.
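A reference of the kind shown above can be assembled from a handful of metadata fields. The following sketch assumes the element order of that example (authors, year, title, DOI link, repository, version); the function name format_dataset_citation and all the values passed to it are hypothetical placeholders, and a given journal or repository may require a different style.

```python
def format_dataset_citation(authors, year, title, doi, repository, version):
    """Assemble a dataset citation string.

    The element order (authors, year, title, DOI link, repository, version)
    is an assumption modeled on the example reference in this guide.
    """
    author_part = "; ".join(authors)
    return (
        f'{author_part}, {year}, "{title}", '
        f"https://doi.org/{doi}, {repository}, {version}"
    )


if __name__ == "__main__":
    # Placeholder values only: this is not a real dataset or a real DOI.
    print(
        format_dataset_citation(
            authors=["Surname, Given A.", "Surname, Given B."],
            year=2022,
            title="Example dataset title",
            doi="10.1234/EXAMPLE",
            repository="Example Data Portal",
            version="V1",
        )
    )
```

Generating the reference from the same metadata that was deposited with the dataset helps keep the citation and the repository record consistent.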

In short: sharing data linked to scientific publications

To favor

  • Deposit your data before publishing your article, so that you can link your data to the article by mentioning the persistent identifier of the data.

  • Deposit your data in an independent, dedicated data repository (disciplinary or institutional).

To avoid

  • Share data linked to a publication

  • Deposit your data in a warehouse after publishing your article.

  • Entrust the data to the journal's publisher for distribution in the form of supplementary materials.

Glossary

  • Research data: factual records (numerical scores, textual records, images and sounds, etc.) used as primary sources for scientific research, and commonly accepted in the scientific community as necessary to validate research findings. Further reading: https://legalinstruments.oecd.org/en/instruments/OECDLEGAL-034

  • Data repositories (data warehouses): platforms on which research datasets are deposited, described, and stored. Repositories can be generalist or disciplinary.

  • FAIR: a set of principles aimed at supporting research by facilitating the reuse of data: Findable, Accessible, Interoperable, Reusable. Further reading: https://www.ouverturelascience.fr/fair-principles/

  • Metadata: a set of structured information that describes, explains, and locates an information resource, with the aim of facilitating its retrieval, use, and management. Further reading: https://www.niso.org/publications/understanding-metadata-2017

  • PID: persistent unique identifier

  • License: statement defining the conditions for reusing the data

Cited references

  1. Colavizza G, Hrynaszkiewicz I, Staden I, Whitaker K, McGillivray B (2020). The citation advantage of linking publications to research data. PLOS ONE 15(4): e0230416. https://doi.org/10.1371/journal.pone.0230416

  2. https://doranum.fr/metadonnees-standards-formats/fichesynthetique/

  3. https://doranum.fr/plan-gestion-donnees-dmp/minute/

  4. https://repositoryfinder.datacite.org/

  5. https://doranum.fr/aspects-juridiques-ethiques/leslicences-de-reutilisation-dans-le-cadre-de-lopen-data-2/

To go further