Describing your data

Example from the Carnegie Mellon University Department of Chemical Engineering in the US

To share research data more effectively, some groups have improved the publication process by integrating data directly into their articles.

This is the case for the Department of Chemical Engineering at Carnegie Mellon University in the US, which has developed its own process for integrating data into an article so that the data are both human- and machine-readable. (1)

First, they suggest embedding the data files directly in the PDF that contains the article’s additional information (the supporting information file), so that the data can be extracted and reused. This requires a PDF reader that supports file attachments, such as Adobe Acrobat or Foxit.

Image taken from the PDF supporting information file of the ACS (American Chemical Society) article 

An example of reusing data:

New calculations are to be run with specialized software, using the parameters stored in a JSON file embedded in the article’s supporting information PDF (supplementary materials). To do so, the software must import several data files (the parameters) in a very specific format.

Python code included in the explanatory PDF reads the JSON file and retrieves the required parameters, along with the structures, in the formats required by the software. In this specific example the data are reused automatically, but it is of course also possible to open the data file and process it manually.
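The reuse step described above can be sketched in a few lines of Python. This is only an illustration, not the code from the article: the JSON content, the key names ("parameters", "structures"), the "key = value" parameter layout and the helper functions are all hypothetical, and the XYZ text format is used merely as a stand-in for whatever format the target software requires.

```python
import json

# Hypothetical content of a JSON file extracted from a supporting information PDF.
# The key names and values are assumptions made up for this illustration.
EXAMPLE_JSON = """
{
  "parameters": {"cutoff_energy": 350, "kpoints": [4, 4, 1]},
  "structures": [
    {"name": "slab", "atoms": [["Pt", 0.0, 0.0, 0.0], ["O", 0.0, 0.0, 1.2]]}
  ]
}
"""


def format_parameters(params):
    """Render parameters as 'key = value' lines (a hypothetical input format)."""
    lines = []
    for key, value in sorted(params.items()):
        if isinstance(value, list):
            # Flatten list values into a space-separated string.
            value = " ".join(str(v) for v in value)
        lines.append(f"{key} = {value}")
    return "\n".join(lines)


def format_structure_xyz(structure):
    """Render one structure as XYZ text: atom count, comment line, then atoms."""
    atoms = structure["atoms"]
    lines = [str(len(atoms)), structure["name"]]
    for symbol, x, y, z in atoms:
        lines.append(f"{symbol} {x:.4f} {y:.4f} {z:.4f}")
    return "\n".join(lines)


# Read the JSON data, then produce the two text blocks the software would import.
data = json.loads(EXAMPLE_JSON)
params_text = format_parameters(data["parameters"])
xyz_text = format_structure_xyz(data["structures"][0])

print(params_text)
print(xyz_text)
```

In practice the JSON text would be read from the file saved out of the PDF rather than from a string, and the two text blocks would be written to the files the software expects.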

This method is quick and easy to use. However, data files that are too big cannot be attached, so the usefulness of the method depends on the discipline (astrophysics data files, for example, are very large). In addition, the data must be attached again to every new version of the document, which can be laborious.

The second solution combines the Emacs text editor with Org-mode, an advanced document markup syntax (similar to LaTeX). The aim is to mark up text blocks, code blocks and data tables using the syntax defined by Org-mode. With these two tools, interactive data can be integrated into a document, which can then be exported to PDF, HTML, LaTeX or other formats. What’s more, the .org file is a plain text file.

This is an advantage because the solution produces both a PDF in which the data tables are easy for a person to read and a structured .org file that is easy for a machine to read if the data are to be reused. The structured file is embedded in the explanatory PDF using the first method.

The advantage of such a version is that the data are almost always both human- and machine-readable. Moreover, since the data are marked up in a plain text file, the volume will be smaller than a raw data file.
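To illustrate why a table in an Org-mode file is easy for a machine to read, here is a minimal Python sketch that extracts a named table from .org text. The file content, the table name and the values are invented for the example, and the read_org_table helper is a hypothetical function written for this sketch, not part of Org-mode or of the article. It only handles simple tables labelled with a "#+NAME:" line.

```python
# Hypothetical content of a .org file; the table and its values are made up.
ORG_TEXT = """\
#+TITLE: Supporting information (hypothetical example)

Some explanatory text for the human reader.

#+NAME: adsorption-energies
| metal | energy_eV |
|-------+-----------|
| Pt    |     -1.25 |
| Cu    |     -0.80 |
"""


def read_org_table(text, name):
    """Return the rows of the table labelled '#+NAME: <name>' as lists of strings."""
    rows = []
    in_table = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.lower() == f"#+name: {name}".lower():
            in_table = True  # the table starts on the next line
            continue
        if in_table:
            if not stripped.startswith("|"):
                break  # first non-table line ends the table
            if stripped.startswith("|-"):
                continue  # skip horizontal rule lines
            rows.append([cell.strip() for cell in stripped.strip("|").split("|")])
    return rows


rows = read_org_table(ORG_TEXT, "adsorption-energies")
header, records = rows[0], rows[1:]

# Convert the second column to numbers so the data can be reused in calculations.
energies = {metal: float(energy) for metal, energy in records}

print(header)
print(energies)
```

The same text that Emacs renders and exports as a readable table can therefore be parsed with nothing more than string operations.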

On the other hand, this solution takes time to learn if the user is not familiar with this type of publishing. The author of the article argues that the learning time is justified because, once the method is mastered, less time is needed to prepare a publication.

Note that certain metadata standards are already available for documenting data. However, not every discipline has its own specific standard. Where no better alternative exists, this solution allows data to be shared quickly and correctly.

  1. Kitchin, John R. “Examples of Effective Data Sharing in Scientific Publishing.” ACS Catalysis, vol. 5, no. 6, June 2015, pp. 3894–99. doi:10.1021/acscatal.5b00538.