Describing your data

Making your research more reproducible

Contents

Data documentation is now a major issue for data management plans and the debate on scientific reproducibility. It is important to consider it beforehand. Which tools should you use? Below you will find an inexhaustive list of options with an example of a course given at the Lyon 1 University to illustrate the growing importance of the issue.

All research refers to existing work, yet judging the reliability of data without information about the method or any comments or documentation… Several issues are involved:

  • Clarifying the variables used
  • For large and complex files, provide a method for standardizing and automating documentation creation.

Various tools covering many languages and different software can help. They generally have similar processes: add comment blocks in the script used to produce your data using a structure predefined by the tool. The tool can then automatically generate a document by executing the file to “read” the comment blocks.

Clear distinction must be made between those read by the machine to generate documentation, and individual comments produced for explaining code (1). Generated content is accessible in different formats, mostly in pdf or html in a web browser.

Markdown for data reproducibility

Markdown is a markup language developed in 2004. It can be used to divide and structure a document according to predefined syntax. Read with a plain text editor, it is easy to use and the syntax is easy to understand. Despite very few changes since its development, Markdown is still frequently used. For example, all the “readme” files on GitHub are written in Markdown, and in other software such as R with the R-Markdown package. R Markdown is used to create rich documents which contain code, text, data tables and graphs. Document formats can be standard (PDF, HTML, etc.). R Markdown also exports into other widely used formats such as Word, OpenOffice, or PowerPoint.

For optimal handling and data entry, we recommend using the free open-source development environment RStudio which includes R Markdown. This gives direct access via the software to the script window, enabling you to write and organize code structure and preview the final R Markdown document. You can change the document as you progress and it’s easy to make corrections.
In this example, the script where the content of the document is written (filled boxes on white background) and the R commands (gray background) are displayed in the left-hand window. The document generated after execution of the script is shown on the right. The R Markdown package supports other languages such as Python and SQL.

Courses on data reproducibility for master students at Lyon 1

Since 2019, NanoScale Engineering Masters students at the Lyon 1 University have been learning to code to improve the reproducibility of their work. This course was initiated by Colin Bousige, CNRS researcher in Physics at the Laboratory of Multimaterials and Interfaces (LMI). It teaches Masters 1 students the basics of R for processing reproducible experimental data and how to use R Markdown notebooks.
The course explains the basics of R Markdown and the first elements of syntax, while encouraging students to go further using the multiple references provided. The website itself is written in R Markdown and the code can be downloaded from the top banner. There is also a list of other languages supported by R Markdown.
Another on-line course is available on FunMooc organized by three researchers: Arnaud Legrand, Christophe Pouzat, and Konrad Hinsen. The course about research reproducibility and transparent science, entitled “Recherche reproductible: principes méthodologiques pour une science transparente“, presents tools including Markdown, Jupyter, Rstudio and Org-mode.

Jupyter notebook

Developed in 2015, Jupyter Notebook (2) is a web application for creating notebooks which contain both code and text in Markdown. It is of course possible to enclose graphs, tables, mathematical formulae, and widgets for interacting with the document. The main languages supported by Jupyter are Python, R, and Julia. The community has made other languages available, but they are not as well integrated. There are two options for installing Jupyter Notebook:

  • The easiest option is to install the Anaconda extension which can be downloaded from their website. This will give you Jupyter and Python directly, and other tools as well, like RStudio or Spyder (Python development environment).
  • If you already have Python, install Jupyter Notebook via the pip command You will find more information on this page.
  • If you find it difficult to get used to the basic Jupyter Notebook interface, try nteract.io, a tool for easy creation of Jupyter notebooks with an extension. Simply install it on your computer.

As for R Markdown, you can export your notebook in different formats (html, PDF, LaTeX, slides, etc).

See the demo version for a preview of the interface before installation. It is available on the Jupyter website for the main languages: Python, R, Julia, C++, and Ruby. You will find the demo here. The interface is similar to the most basic text editors. You can add blocks (cells) to the extension and choose the type and content they will display: a title, text, code or Markdown for rich content. Then simply execute your blocks to display the final result.

The Jupyter project is also openly developed on GitHub and supported by the European Horizon 2020 program. It is also mentioned in the annual report on digital infrastrutures in higher education and research, emphasizing that students need to have “Jupyter notebooks hosted in French academic clouds” in order to maintain data “sovereignty”.

Org-mode for a proper data documentation

In some cases, tools can be used to enrich a text page by including code and tables to develop structured data documentation. The Carnegie-Mellon Chemical Engineering Department in the US has adopted this approach based on the association between Emacs, a text editor and Org-mode, advanced syntax for document editing. The combination can be used to produce structured final documents in different formats (PDF, html or LaTeX) and to document their data. This documentation has the same role as a readme file often found data sets. However, readme files are usually simple text files with various degrees of readability. Being easy to create, they are practical, providing at least a description of the data origin, the goal set and the method of generation (3), making it more readable for people outside the group.

R, Pydoc, and Sphinx for packages

When handling electronic data on a daily basis, it may be necessary to develop packages for automating and facilitating access to certain processes. Packages are a set of functions (related to each other or not) which improve the basic language. They are specific to every language, e.g. R and Python have their own packages, which do exactly the same thing, but their syntax is different. A large number of packages are available on the web for every language (e.g. on CRAN for R). Regardless of the language or purpose of a package, documentation is an essential step in development. Firstly, it forces a review of the package, helping to detect errors made during design. It also helps to provide a clear description for distribution purposes, which is often the case, whether it is published on the internet or just used in your group.

To make it clear, here’s an example of the roxygen2 package for R. Import it directly from your R script with the command install.packages(“roxygen2”).

Comments in R are preceded by a hash tag, like this:

# Individual comment

roxygen2 comments are preceded by a hash tag and a single quotation mark, as follows:

#’ This is a roxygen2 comment

This is what enables the machine to interpret them.

Let’s take the example on the package website:

The grey lines correspond to the roxygen2 comments. The blue lines are basic R controls, namely for declaring a function. In this example, the first line is the function description, the @param lines are the function parameters, the @return line is the returned result, and the @examples line correspond to examples of using the function. When the script is executed, roxygen2 will convert all the comments it recognizes into the Rd format, then R will convert it into a format readable by humans.

Python users can consult the documentation for Sphinx and Pydoc.

Pydoc is easy to install. Simply import the module from your Python script with the command “import Pydoc”.

To use Sphinx, install it and add it to Python. See this page for the procedure depending on your operating system. For more information about how Sphinx works, see the documentation available on the web, which was generated using Sphinx. The syntax used for creating this documentation is available here.

An article published in May 2018 in Metabolomics (4) mentions the use of Sphinx for generating the documentation coming from the Python package they designed. The purpose of the package is to improve deposits in the data repository Metabolomics Workbench where mass spectrometry and nuclear magnetic resonance data in the mwTab format can be deposited for more interoperability and ease of use.

  1. DESQUILBET, Loic, GRANGER Sabrina, HEJBLUM Boris, LEGRAND Arnaud, PERNOT Pascal et al. Towards reproducible research: Upgrade your practices [Vers une recherche reproductible : Faire évoluer ses pratiques]. Unité régionale de formation à l’information scientifique et technique de Bordeaux, 2019, p. 112. 112.
  2. What Is the Jupyter Notebook? — Jupyter Notebook 6.0.3 Documentation. Available for consultation at: https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/What%20is%20the%20Jupyter%20Notebook.html.
  3. ZANETTA, Pierre-marie. Data for: Modal Abundance, Density and Chemistry of Micrometer-Sized Assemblages by Advanced Electron Microscopy: Application to Chondrites. Vol. 1, Apr. 2019, doi:10.17632/kdr8wk63h6.1
  4. SMELTER, Andrey. MOSELEY, Hunter N. B.. A Python library for FAIRer access and deposition to the Metabolomics Workbench Data Repository. Vol. 15, May 2018, doi:10.1007/s11306-018-1356-6