In science, publications are the traditional vehicle for
disseminating knowledge. The results they present increasingly rest
on underlying data and analyses. Sharing data together with
publications therefore plays an important role in the quality of
research. The purpose of this guide is to familiarize you with the
steps needed to share data linked to publications.
Why share data with publications?
Open data promotes the transparency and reproducibility of the
scientific process. It helps make the results presented in a
scientific publication trustworthy. Making data available to everyone
promotes their reuse by the scientific community and their uptake by
citizens, and allows them to be drawn on in public debate. Opening
data means exposing them to criticism. To prepare for this, the
strategy consists in documenting them but also in putting in place,
throughout their life cycle, management actions that preserve their
quality. Data made available in this way benefit from increased
traceability and a greater potential for reuse, including by their
producer. Disseminating data openly ensures the recognition and
visibility of their producers, as well as of their institution, in
the same way that a scientific publication ensures the visibility of
its authors. An open dataset is more visible and therefore more
likely to be used in another research project, and then cited in the
same way as a publication. The people associated with it see their
contribution recognized. It should also be noted that a publication
accompanied by its data is cited more often 1.
Data produced during a research project may have value and interest
beyond the project, and sometimes beyond the original discipline.
Making them available makes it possible to fully exploit their
potential, thus promoting interdisciplinarity and academic
collaboration.
How to share data with publications?
When writing a scientific article, authors naturally adopt a
pedagogical approach: they clearly define all notations and
conventions used in the article and describe the assumptions, the
framework and the state of the art, in order to make the article
easier to read and understand. Data are part of this approach. If we
want shared data to be useful to the scientific community, the same
attention must be paid to their publication.
Data preparation and documentation
Describing the data so that they are intelligible to anyone who did
not take part in producing them is a preliminary step to their
dissemination. Information on the origin of the data, the assumptions
or constraints related to their production, and the associated
experimental protocols must be part of the descriptive information
provided with the data: the metadata. Metadata standards exist, some
generic and some domain specific. To support this process of
continuous data management, one can also rely on a Data Management
Plan, a document that defines the procedures for monitoring and
describing the data.
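As an illustration of the descriptive information listed above, the
following minimal sketch represents a metadata record as a Python
dictionary and reports which expected fields are still empty. The
field names (title, creators, origin, production_constraints,
protocol) and the helper missing_fields are assumptions chosen for
this example; they do not come from any particular metadata standard,
and a real repository will impose its own schema.

```python
import json

# Hypothetical minimal field set; real repositories define their own schema.
REQUIRED_FIELDS = ("title", "creators", "origin",
                   "production_constraints", "protocol")


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in a record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Illustrative record; every value below is a placeholder, not real data.
example_record = {
    "title": "Example dataset title",
    "creators": ["Family name, Given name"],
    "origin": "Field survey (placeholder)",
    "production_constraints": "",  # left empty on purpose
    "protocol": "Sampling protocol described in README (placeholder)",
}

if __name__ == "__main__":
    print(json.dumps(example_record, indent=2))
    print("Missing:", missing_fields(example_record))
```

Running such a check before deposit is one simple way to notice gaps
in the documentation while the people who produced the data are still
available to fill them.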
When the time comes to share data, several elements must be taken
into account. Some data are subject to legal constraints that prevent
their sharing or that require anonymization or authorization
requests. Each research institution has its own data openness policy,
constrained by legislation, which is an important prerequisite for
choosing the means of sharing data.
It is recommended not to entrust data sharing to publishers, who
offer to publish data in the form of “supplementary data” or
“supplementary materials”. Such publication often takes place in a
format and an environment that do not allow the data to be documented
correctly, which makes them difficult for others to reuse. It may
also be accompanied by a request for an exclusive transfer of rights,
which contradicts national law and the spirit of open science.
Finally, in some cases, it makes scientists captive to environments
controlled by major scientific publishing companies.
It is therefore recommended instead to share data in institutional
repositories, either general or discipline specific, which avoid such
pitfalls and offer a documentation-oriented environment that allows
open research data to be consulted and reused. Correctly linking the
published datasets and the article then becomes a necessity, and a
step to be anticipated.
Repository choice
In disciplines where data sharing is already well organized
(astronomy, genomics, etc.), data producers have repositories
specific to their discipline at their disposal. They will then
naturally use the standards and good practices already in place to
document and format their data. The practice of one's community is
the best guide, but directories of these repositories also exist 4.
Alternatively, data producers can turn to the institutional
repository of the institution with which they are affiliated, if any,
or use the multidisciplinary Recherche Data Gouv repository. In both
cases, only minimum requirements will be imposed by the repository,
and responsibility for ensuring the quality of data documentation
will rest more heavily on the depositor.
The national Recherche Data Gouv repository
The national platform Recherche Data Gouv offers a multidisciplinary
data repository which will be operational from 2022: it ensures
French sovereignty over the data, complies with French and European
law, and guarantees the durability and indexing of the stored data,
in accordance with the FAIR principles. It is the repository of
choice when no disciplinary repository exists.
Regardless of the repository chosen to share data, it must in
particular offer the following features:
The assignment of a persistent identifier (PID) of the DOI type,
which makes it possible to cite the data (for example
http://dx.doi.org/10.15497/RDA00027) and is the basic building block
for linking to other research outputs such as publications.
A description of the data at a level sufficient to facilitate
discovery, understanding and reuse (standardized metadata
descriptions, controlled disciplinary vocabularies).
The use of licenses and the definition of access rules, so that
reuse takes place within a well-defined legal framework compatible
with French and European law.
A minimum retention period of several years, consistent with the
institution’s data retention policy.
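Because the DOI-type persistent identifier is what links data to
other research outputs, it helps to record it in one consistent form.
The sketch below normalizes a DOI, written either bare or as a
resolver URL, to an https://doi.org/ link. The function name
doi_to_url and the loose shape check are assumptions made for this
example; the check only tests that the string looks like a DOI and
does not verify that the identifier is registered.

```python
import re

# Common prefixes under which a DOI may be written (resolver URLs or "doi:").
_PREFIX = re.compile(r"^(https?://(dx\.)?doi\.org/|doi:)", re.IGNORECASE)
# Loose shape check: "10.", a registrant code of 4 to 9 digits, "/", a suffix.
_DOI = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(value: str) -> str:
    """Normalize a DOI, given bare or as a URL, to an https://doi.org/ link."""
    doi = _PREFIX.sub("", value.strip())
    if not _DOI.match(doi):
        raise ValueError(f"not a DOI-shaped string: {value!r}")
    return "https://doi.org/" + doi


if __name__ == "__main__":
    # The example identifier quoted in the list above.
    print(doi_to_url("http://dx.doi.org/10.15497/RDA00027"))
```

Storing the normalized link in both the data record and the
manuscript reduces the risk that two spellings of the same identifier
are treated as different datasets.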
Link data to publications
Several options are available to establish the link between an
article and its associated data before the article is published. It
is then easy to create the link between the article and the
associated data, using the methods described in the diagram on the
following pages. Likewise, referencing data-related publications
(including data papers) is generally possible in all data
repositories, even after the initial deposit. Conversely, adding an
explicit link to the data after an article has been published is most
often impossible at present. A workaround is to refer to the data in
the version of the article deposited in an open archive (HAL for
example), which allows persistent identifiers linked to the
publication to be entered at any time in the dedicated “Associated
Data” fields of the record. This scheme therefore allows a reciprocal
link between publications and data, but only for the version
deposited in the open archive.
Data papers
A data paper is a publication whose purpose is the description of a
set of scientific data. Unlike a classic research article, a data
paper consists of a detailed description of the scientific data,
their metadata, and the circumstances and methods of their
collection, but without analysis or interpretation of these data. The
data described must be accessible (as far as possible), deposited in
an appropriate repository, and provided with a persistent DOI-type
identifier. A data paper is published as a peer-reviewed article,
which guarantees its quality, and can be cited in the same way as a
“classic” article. The author of a data paper must therefore be
convincing as to the quality and scientific scope of the data
(including their potential for reuse). A data paper can be published
in dedicated journals (data journals) or in traditional scientific
journals that accept this format.
Cite a dataset
How to cite a dataset linked to a scientific publication depends on
the circumstances in which the data were produced:
If the data were produced and shared while the article was being
written, it is recommended to add a dedicated “Data availability”
section before the bibliographic references. For example:
Data availability: Datasets related to this article can be found at
https://doi.org/10.23708/PQTQDA, an open-source online data
repository hosted at DataSuds IRD (Granjon and Fossati, 2020).
If the data were already produced and shared in a context other than
that of the publication, the citation is given in the reference list,
in a form equivalent to that of bibliographic references, for
example:
Van Halder, Inge; Sacristan, Alberto;
Martín-García, Jorge; Pajares, Juan Alberto; Jactel, Herve, 2022,
“Monochamus galloprovoncialis catches and pine tree composition in
different landscape buffers in Spain”,
https://doi.org/10.15454/JXFGPI, INRAE Data Portal, V1
Proper citation of data allows better indexing, and therefore better
discoverability in searches, and gives lasting credit to the data
producer.
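The reference form used in the example above (authors, year, title in
quotation marks, DOI link, repository, version) can be produced
consistently from a structured record. The following is a minimal
sketch, not a prescribed citation style: the DatasetRecord class and
the format_citation function are names invented for this
illustration, and the field order simply mirrors the example given in
this guide.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    authors: list      # "Family name, Given name" strings, in citation order
    year: int
    title: str
    doi: str           # bare DOI, without the resolver prefix
    repository: str
    version: str


def format_citation(rec: DatasetRecord) -> str:
    """Build: authors, year, “title”, DOI link, repository, version."""
    return ", ".join([
        "; ".join(rec.authors),
        str(rec.year),
        f"“{rec.title}”",
        f"https://doi.org/{rec.doi}",
        rec.repository,
        rec.version,
    ])


# Values copied from the dataset reference quoted in this guide.
record = DatasetRecord(
    authors=["Van Halder, Inge", "Sacristan, Alberto",
             "Martín-García, Jorge", "Pajares, Juan Alberto",
             "Jactel, Herve"],
    year=2022,
    title=("Monochamus galloprovoncialis catches and pine tree "
           "composition in different landscape buffers in Spain"),
    doi="10.15454/JXFGPI",
    repository="INRAE Data Portal",
    version="V1",
)

if __name__ == "__main__":
    print(format_citation(record))
```

Many repositories display a ready-made citation on the dataset page;
when one is provided, it is generally preferable to copy it rather
than to rebuild it.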