How should data be described before publication?
Choosing to open data is a good step, but it is not enough. Data need to be identified by specific descriptors so that they can be found on the Internet and reused where appropriate. It is strongly advisable to adopt a description methodology from the beginning of a project: this avoids having to rework data sets a few months – or even a few years – after their generation (see also the articles on data management plans). Retrospective description is an obvious cause of data loss.
Data repositories generally provide user guides for data description. These vary between domains, but they can serve as a model if no standards exist for your field, or if you are unaware of them (1). Where your discipline has established metadata standards (as in crystallography and astronomy, for example), using them is recommended. Refer to the main standards in Physics and Chemistry.
Most of the time, repositories provide a base of metadata that at least allows a data set to be found. This usually corresponds to the Dublin Core standard, developed in 1995, which consists of 15 relatively basic elements (title, creator, subject, date, etc.) (2). Combining basic, Dublin Core-style metadata with a few discipline-specific elements can increase the visibility of your research: the richer the metadata, the more easily the associated research will be found.
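To make this concrete, here is a minimal sketch of a Dublin Core-style record, expressed as a Python dictionary. The field names follow the 15-element standard; the values (title, DOI, licence, etc.) are invented placeholders, and the completeness check is just an illustration of how basic metadata can be validated before deposit.

```python
# A minimal Dublin Core-style metadata record (illustrative values only).
dublin_core_record = {
    "title": "Raman spectra of doped graphene samples",
    "creator": "Doe, Jane",
    "subject": "condensed matter physics; Raman spectroscopy",
    "description": "Raw and processed Raman spectra collected in 2023.",
    "publisher": "Example University",
    "date": "2023-06-15",
    "type": "Dataset",
    "format": "text/csv",
    "identifier": "doi:10.0000/example.12345",  # placeholder DOI
    "language": "en",
    "rights": "CC BY 4.0",
}

def missing_elements(record, required=("title", "creator", "date", "identifier")):
    """Return the required Dublin Core elements absent from a record."""
    return [field for field in required if not record.get(field)]

print(missing_elements(dublin_core_record))  # -> []
```

A repository deposit form typically enforces a check like `missing_elements` before accepting a data set, then maps the fields onto its own Dublin Core implementation.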
CERN has developed a specific framework suited to the huge volumes of data it produces. “Reproducibility requires more than just open access,” explain CERN researchers in an article published in Nature Physics in 2018 (3). Their challenge is to describe particle physics data, which are unique to the instruments that produced them. Moreover, the volumes involved are so large that the data are difficult to replicate. Good data description is therefore essential if other communities are to be able to use them. To that end, CERN has developed several services and tools for standardizing file description and analysis in particle physics.
CERN Analysis Preservation is one of these services: an open-source web platform where researchers can deposit all the files and documents that led to an analysis. The deposit involves several steps, each requiring validation: the purpose of the analysis, the people involved, the data sources (instruments, databases, etc.), the data themselves, the analysis software used, and the final results (graphs, text, etc.). Additional information can also be added, such as bibliographic references or internal discussions between researchers. The detailed steps are presented in the documentation here.
These prerequisites provide a common metadata base and the information needed to fully understand the issues underlying the analysis. In addition, the information is modeled in JSON, an open standard data format, and is accessible via an API (Application Programming Interface), which makes it possible to retrieve it programmatically from the platform. It is also possible to restrict access to an analysis to a few collaborators, or to set an embargo delaying its open publication. The platform is, of course, intended for physics researchers collaborating with CERN in any way and using the associated large-scale research facilities. It nevertheless demonstrates a process that can be implemented elsewhere and suggests criteria to consider for useful data description.
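The benefit of modeling analysis records in JSON is that any client can parse them with standard tooling. The sketch below shows this with Python's standard `json` module; the record structure (fields such as `people`, `data_source`, `software`) is invented for illustration, since the real platform defines its own JSON schemas.

```python
import json

# An invented analysis record, illustrating JSON-modeled metadata of the
# kind a platform like CERN Analysis Preservation exposes via its API.
record_json = """
{
  "analysis": {
    "title": "Search for a rare decay channel",
    "people": ["A. Researcher", "B. Collaborator"],
    "data_source": {"instrument": "Example detector", "run": "2018-B"},
    "software": [{"name": "analysis-toolkit", "version": "2.1"}],
    "access": {"embargo_until": "2026-01-01"}
  }
}
"""

record = json.loads(record_json)

# Because the format is an open standard, fields can be extracted
# programmatically, e.g. to list the software used in the analysis:
software_used = [f"{s['name']} {s['version']}"
                 for s in record["analysis"]["software"]]
print(software_used)  # -> ['analysis-toolkit 2.1']
```

In practice the JSON would be fetched from the platform's API over HTTP rather than embedded as a string, but the parsing step is the same.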
The French very large-scale research facilities (Très Grandes Infrastructures de Recherche, known as T.G.I.R)
French initiatives in physics and chemistry have also emerged from the large-scale research facilities backed by the CNRS, including the Institut Laue-Langevin (ILL) in Grenoble, which has its own data portal and a rather advanced data policy. The metadata to be entered and compiled in its database include the configuration of the instruments and a description of the sample studied. Even without a well-defined institutional framework, some best practices are relatively easy to establish, such as enclosing a “readme” file with the data that describes in detail how the data were produced and for what purpose. For example, see here the readme construction guide for the 4TU data repository associated with the University of Twente and the Delft University of Technology in the Netherlands.
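As an illustration, a readme accompanying a data set might be organized along the following lines. This is a generic sketch drawing on common readme guidance, not the 4TU template itself:

```
README for [data set title]

1. General information
   - Title of the data set
   - Author(s), affiliation(s) and contact details
   - Date and place of data collection

2. Methodological information
   - Instruments used and their configuration
   - Description of the sample(s)
   - Processing steps applied to the raw data

3. Data and file overview
   - List of files and naming conventions
   - File formats and the software needed to read them

4. Sharing and access information
   - Licence and reuse conditions
   - How the data set should be cited
```

Even this minimal structure records how and why the data were produced, which is exactly the information that is hardest to reconstruct retrospectively.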
The University of Cambridge example
Researchers at the University of Cambridge conducted a project as far back as the mid-2000s, called Spectra (4), aimed at improving the reuse of data in organic chemistry, crystallography and computational chemistry (5). A web application was developed to facilitate metadata collection before data sets were deposited in the institution’s DSpace repository. Some of the chosen metadata are very basic (data creator, date), while others are more specific (chemical formula, InChI identifier, etc.). Thought was then given to continuing the Spectra project to improve chemistry data deposits (6). One of the goals was to automate part of the metadata entry, so that it was filled in automatically for every data set in the same collection. For example, algorithms are used to generate the InChI identifier of a molecule; the information is then added to the “subject” part of the DataCite Metadata Schema. Using this metadata, DataCite can assign a DOI to every collection and data set. See an example here.
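The sketch below shows what recording a machine-generated InChI in the “subject” part of a DataCite-style record could look like. The record layout is a simplified illustration of the DataCite Metadata Schema, the InChI value is that of ethanol, and the field values are invented; the actual Spectra workflow computed the InChI with its own algorithms.

```python
# Standard InChI for ethanol, standing in for a machine-generated identifier.
inchi = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

# A simplified, DataCite-style metadata record (illustrative values only).
datacite_record = {
    "titles": [{"title": "NMR spectra of ethanol samples"}],
    "creators": [{"name": "Doe, Jane"}],
    "publicationYear": "2010",
    "resourceType": {"resourceTypeGeneral": "Dataset"},
    # Discipline-specific identifiers go into the generic "subjects" element:
    "subjects": [
        {"subject": "organic chemistry"},
        {"subject": inchi, "subjectScheme": "InChI"},
    ],
}

# A DOI can then be assigned to the record, and clients can later filter
# data sets by their chemical identifier:
has_inchi = any(s.get("subjectScheme") == "InChI"
                for s in datacite_record["subjects"])
print(has_inchi)  # -> True
```

The design choice here is the one described in the text: rather than extending the schema, discipline-specific information is slotted into the existing generic “subject” element, so any DataCite-aware tool can index it.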