A data management plan: What is it for?
“It may seem boring, but it is essential.” (1)
That is how a slightly sardonic editorial in Nature summarised the challenge facing researchers asked to produce a data management plan (DMP), which is now a required deliverable for national and European research funding bodies.
Although data management plans are often seen as tedious administrative documents, they are useful for formalising and resolving the issues that arise from managing data and code throughout their lifecycle within a research project. This is particularly true for doctoral students.
Developing a Data Management Plan: Requirements and Challenges
As part of the Open Research Data pilot, the ERC introduced, starting in 2017, the requirement to draft a data management plan within six months of funding being awarded. Since 2021, this requirement has applied to all recipients of European funding: the first version of the DMP must be submitted within six months of the project’s launch, and at least two updates are required, one at the midpoint of the project and one at its conclusion. The DMP template recommended by Horizon Europe is geared entirely towards making data FAIR. Note that data management from a FAIR perspective should be considered as early as the drafting of the research proposal submitted under Horizon Europe. In the technical description of the proposal, one section addresses the procedures for managing data and other research outputs: “Applicants generating/collecting data and/or other research outputs (except for publications) during the project must provide maximum 1 page on how the data/ research outputs will be managed in line with the FAIR principles (Findable, Accessible, Interoperable, Reusable), addressing the following (the description should be specific to your project).”
Since 2019, principal investigators of projects funded by the French National Research Agency (ANR) have been required to submit a data management plan to the agency within six months of the project’s start date. This is intended to be a ‘living document’: for projects lasting more than 30 months, it should be revised at the midpoint of the project, before the final version is submitted. As specified in the DMP FAQ (in French), “the final grant payment is contingent upon receipt of a data management plan and its updates by the end date of the scientific work”.
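The submission schedule described above (a first version within six months of the start date, an update at the midpoint and a final version at the end) can be turned into calendar dates. The following is a minimal sketch only: the function names and the `midpoint_only_if_longer_than` parameter are hypothetical, and the binding deadlines are those stated in your grant agreement.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` months after `start`.

    The day is clamped to the length of the target month
    (e.g. 31 August + 6 months gives 28 or 29 February).
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def dmp_milestones(start: date, duration_months: int,
                   midpoint_only_if_longer_than: int = 0) -> dict:
    """Indicative latest dates for the successive DMP versions.

    `midpoint_only_if_longer_than` is an assumption used to model the
    rule that a midpoint revision applies only to projects lasting
    more than a given number of months (30 in the ANR case above).
    """
    milestones = {"initial": add_months(start, 6)}
    if duration_months > midpoint_only_if_longer_than:
        milestones["midpoint_update"] = add_months(start, duration_months // 2)
    milestones["final"] = add_months(start, duration_months)
    return milestones


if __name__ == "__main__":
    # Hypothetical 36-month project starting on 1 January 2024.
    for name, deadline in dmp_milestones(date(2024, 1, 1), 36).items():
        print(f"{name}: {deadline.isoformat()}")
```

For a shorter project under the ANR rule, calling `dmp_milestones(start, 24, midpoint_only_if_longer_than=30)` would return only the initial and final dates.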
The document is organised into various topics and is designed to address each stage of the data lifecycle in relation to data management: collection or production; organisation and documentation; storage; security; processing; sharing; retention; and legal considerations.
- What data will be collected: what types of data, how it is collected, where it is stored, how the storage is secured, data volume, file formats, organisation, documentation, etc.
- How will it be used: how will it be shared during the project, how and where will it be processed?
- How will it be stored and backed up?
- How will it be shared and reused: in what format, under what licence, which data, how to associate code with it, etc.
- How will it be preserved: for how long, which data, where and how?
- How will the necessary resources be funded?
Beyond the deliverables and administrative requirements of funding agencies, the DMP provides an incentive to plan the management of the research data associated with a project. Starting with the funding application, thinking about the data that will be generated and how to manage it evolves alongside the project and is refined throughout its duration.
Challenges and Benefits for Researchers
While many funding agencies require a Data Management Plan, drafting one also has practical benefits for researchers, project leaders, lab directors, equipment managers and doctoral students.
It facilitates the implementation of best practices at every stage of the research process, from collection and analysis to dissemination, in accordance with the FAIR principles (findable, accessible, interoperable and reusable).
It provides an overview of all the data generated and used throughout your project.
It outlines your data management strategies and practices.
It helps to ensure that data of scientific value to your team and community is processed and preserved effectively and securely in the medium to long term, while also helping you to anticipate potential storage issues or legal concerns that your project might raise.
The Data Management Plan documents data traceability, responsibilities, rights of use and reuse, and how data can be shared among partners. It clarifies all these points and helps to align your project with an open science approach.
It is therefore extremely useful for both you and your partners.
- For yourself: Since research and doctoral programmes span several years, the DMP enables you to maintain a personal record of the project’s progress and data management decisions.
- For project partners and the team: The DMP facilitates information sharing among project team members. It is particularly useful when integrated into the teams’ work processes. Considering the DMP in the project planning phase enables you to incorporate associated costs into the project budget, allocate responsibilities, and plan for the documentation and reuse of existing information.
In short, the Data Management Plan (DMP) enables you to:
– save time and resources;
– anticipate and manage risks associated with data loss and the malicious disclosure of sensitive or confidential data;
– estimate the costs associated with storing, documenting, cleaning and selecting research data at every stage of the data lifecycle;
– identify legal procedures and issues to be discussed with partners in order to fulfil legal and ethical obligations prior to the project;
– facilitate access to, and understanding of, data, as well as its adoption by, potential users (e.g. new project members);
– assign responsibility for data management, at each stage of the lifecycle or for specific tasks, to identified team members.
Different scopes of application for the DMP
Although the DMP for a research project is the most common type of DMP, it is possible (and recommended) to adapt this type of document for use in various contexts.
- For a PhD: While not all issues necessarily fall under the PhD student’s remit, matters related to the production or reuse of data, including its organisation, description, storage, backup and dissemination, are central. It is therefore worth starting a Data Management Plan (DMP) as early as possible, as it can provide a framework for discussions with advisors, team leaders and others.
- For an organization (such as a laboratory): In this case, the DMP helps define and describe an entity’s policy regarding research data management: how roles are distributed within the team, what resources are allocated to this management, what shared practices exist within the organization regarding the management and sharing of generated data, and what storage or backup systems are in place. These elements can then be incorporated into the data management plans of projects under the laboratory’s purview.
- For research code and software: It is now possible to create a plan specifically tailored to the development and use of code and software, and to list code or software as a research output. The questionnaire has been designed to take into account the specific characteristics of the lifecycle of source code and software (e.g. development and runtime environments, build systems, version control systems, programming languages, dependencies, etc.).
DMP template
Various institutions and funding bodies have developed their own data management plan templates. However, the Open Science Committee, all French funding agencies (ADEME, ANR, ANRS-MIE, ANSES, FRM and INCa) and numerous institutions now recommend the structured Science Europe template as a common DMP template.
This template is available by default on the DMP OPIDoR website. It enables the automatic retrieval of information (e.g. on funded projects), the use of persistent identifiers (e.g. ORCID, ROR, DOI) and the integration of community-specific reference sources (e.g. metadata standards), all of which are in accordance with FAIR principles.
It facilitates the exchange of information with the ANR and various data management services (e.g. mesocentres). It is machine-readable.
It also integrates code and software management within a single DMP. The common template enables the production of DMP formats that are both human-readable and machine-processable, as well as an exchange format that complies with RDA recommendations.
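To illustrate what "human-readable and machine-processable" can mean in practice, here is a minimal sketch in which one structured plan is rendered both as JSON and as plain text. All field names and values are hypothetical placeholders chosen for this example; they do not reproduce the Science Europe template, the DMP OPIDoR export or the RDA exchange format.

```python
import json

# Hypothetical, simplified plan: every name and value is a placeholder.
dmp = {
    "title": "DMP for an example doctoral project",
    "created": "2024-01-15",
    "contact": {
        "name": "Jane Doe",
        "orcid": "0000-0000-0000-0000",  # placeholder, not a real ORCID iD
    },
    "datasets": [
        {
            "title": "Raw spectroscopy measurements",
            "format": "CSV",
            "license": "CC-BY-4.0",
            "contains_personal_data": False,
            "preservation_years": 10,
        }
    ],
    "software": [
        {
            "title": "Analysis scripts",
            "language": "Python",
            "version_control": "git",
        }
    ],
}


def to_machine_readable(plan: dict) -> str:
    """Serialise the plan as JSON so that other services can parse it."""
    return json.dumps(plan, indent=2, ensure_ascii=False)


def to_human_readable(plan: dict) -> str:
    """Render the same plan as a short plain-text summary."""
    lines = [plan["title"], f"Contact: {plan['contact']['name']}"]
    for ds in plan["datasets"]:
        lines.append(f"Dataset: {ds['title']} ({ds['format']}, {ds['license']})")
    for sw in plan["software"]:
        lines.append(f"Software: {sw['title']} ({sw['language']})")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_machine_readable(dmp))
    print(to_human_readable(dmp))
```

The point of the sketch is the design choice rather than the schema: keeping a single structured source and generating each representation from it avoids the two versions drifting apart.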
DMP OPIDoR tool
Launched in 2016, DMP OPIDoR is a tool designed to assist with the online creation of Data Management Plans (DMPs). It is hosted and managed by Inist-CNRS.
It offers several useful features:
- Collaborative editing, proofreading, and commenting
- Ability to export and share
- Access to recommendations drafted by the ANR and various institutions, as well as public DMPs
- Ability to connect with a local support team (e.g., data workshops)
- Ability to automatically share with the French National Research Agency (ANR).
Inist regularly hosts webinars to help users draft DMPs using the platform. A tutorial (in French) on the DMP OPIDoR drafting tool is also available on DoraNum.
There are other tools available at the international level:
- Argos (OpenAIRE): a tool for creating a Data Management Plan or a Software Management Plan.
- DMP Tool: a free, open-source tool that helps researchers create data management plans (DMPs) using templates tailored to various funding agencies, such as the National Science Foundation and the National Institutes of Health. It was developed by institutions including the University of California Curation Center (UC3), DataONE and the Digital Curation Centre (DCC-UK).
- DSWizard: a tool for creating a Data Management Plan.
The development of generative AI has led to the creation of tools to assist with drafting data management plans. These tools are often still experimental (such as this tool or that one) and should be approached with caution. Is there a risk that data relating to your project could be disclosed or used to train the model? Is the information provided accurate or fabricated? What is the environmental cost of using the tool?
Other resources
There are several guides that can help with writing.
The SOS PGD website lists the services offered by French universities to assist researchers in drafting a data management plan.
Finally, a few websites can guide researchers on specific issues:
- Tools for estimating data management costs. One tool, developed by TU Delft, the largest public university in the Netherlands, estimates the number of full-time equivalents required, depending on whether the volume of data produced is less than or greater than 5 TB, whether the data being processed is confidential, the number of partners and any personal data concerns. Another tool, developed by EPFL in Switzerland, takes infrastructure costs into account (e.g. servers, electronic lab notebooks and data repositories). More general estimates suggest that, on average, 5% of the total project budget is allocated to cover data management costs.
- A tool for assessing whether your management plan complies with the FAIR principles governing data (Findable, Accessible, Interoperable, Reusable), developed by the ARDC (part of the Australian National Research Infrastructure).
- An overview of metadata standards applicable to chemistry and physics.
- A list of file formats to prioritize or avoid, depending on the desired level of long-term preservation.
- Tools to help you choose the distribution license you want to apply to your datasets. Here you will find a license selector tool hosted on GitHub. You can also visit the choosealicense platform.
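As a worked example of the rule of thumb mentioned above (on average 5% of the total project budget for data management), the following sketch computes an indicative figure. The function name and the example budget are hypothetical, and the 5% share is only the general estimate quoted in this article; the TU Delft and EPFL tools use their own, more detailed criteria.

```python
def data_management_budget(total_budget: float, share: float = 0.05) -> float:
    """Indicative data management budget as a share of the total budget.

    The default share of 0.05 reflects the 5% average quoted above;
    it is a rough planning figure, not a funder requirement.
    """
    if total_budget < 0:
        raise ValueError("total_budget must be non-negative")
    if not 0 <= share <= 1:
        raise ValueError("share must be between 0 and 1")
    return round(total_budget * share, 2)


if __name__ == "__main__":
    # Hypothetical project with a total budget of 200,000 (any currency).
    print(data_management_budget(200_000))
```

Passing a different `share` lets you test how sensitive the budget line is to the assumption, for instance when the project handles confidential data or large volumes.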
A few figures illustrating the challenge:
- According to a study published in 2015, only 8% of physics articles and 5% of chemistry articles publish data associated with their publications. (2)
- 50% of experiments are considered to be non-reproducible. (3)
- 80% of the data produced in the last 20 years may well be lost. (4)
- 93% of higher education institutions have no procedure for research data management plans. (5)
- 90% of researchers questioned in a European survey (6) say they individually store, archive or transmit their data.
- 33% of the same researchers have never heard of data management plans or consider they do not need them. (7)
- Over 80% of the data produced are stored elsewhere than in repositories. (8)
Several tools are made available to help you in your approach.
- A guide to preparing data management plans, DMP OPIDoR (set up by the Institut national de l’information scientifique et technique, the French scientific documentation centre). Once you have created an account, you are guided step by step to draft your DMP, save versions, share it and submit it to your partners for comments. Access to the DMP is restricted by default, but the settings can be changed to make it public.
- Some (fictitious) DMP examples published online after national open science events.
- A comparison of repositories which can receive your data.
- An overview of metadata standards applicable to physics and chemistry.
- Tools for estimating costs. This one was developed by TU Delft, the largest public university in the Netherlands. The result is the FTE required, depending on the volume of data generated (less than or greater than 5 TB), whether or not the processed data is confidential, the number of partners and possible personal privacy issues. This other tool, developed by EPFL in Switzerland, can also display structural costs (servers, electronic laboratory notebooks, repositories…). More global estimates tend to assign an average of 5% of the total budget of the project to cover the expenses related to data management.
- A tool for assessing the conformity of your management plan with the FAIR principles (Findable, Accessible, Interoperable, Reusable) governing the data, developed by the ARDC (which reports to the Australian national research body).
- Helpful tools for selecting the distribution licence for your datasets. You will find this licence selection tool available on GitHub. You can also refer to the choosealicense platform.
- A listing of best practices for making data sets available on-line on Figshare. The points discussed are rather exhaustive and can be consulted even if you do not intend to deposit data in the repository.
- If your project includes personal information, it is important to respect the General Data Protection Regulation (GDPR), with an impact assessment on data protection.
For this purpose, the CNIL makes the open-source PIA tool freely available for download.
Contacts are also listed on the Openaccess Couperin site and in the SOS DMP section, which lists the French university departments available to help researchers prepare their data management plan.
A few key dates for DMPs
1966: outlines of data management plans emerge in aeronautics.
1973: NASA publishes a technical report which resembles a DMP.
2006: the Medical Research Council (United Kingdom) requires the implementation of DMPs as a condition of funding.
2007: the Wellcome Trust (United Kingdom), today a member of Plan S, requires DMPs as a condition of funding.
2007: the OECD publishes guidelines, calling upon the scientific communities to document and archive research data.
2011: the National Science Foundation (United States) requires DMPs as a condition of funding.
2014: the EU requires DMPs for H2020 projects as a condition of funding.
2019: the ANR requires DMPs as a condition of funding.
Chronology adapted from: Smale, Nicholas, et al. “The History, Advocacy and Efficacy of Data Management Plans”. bioRxiv, October 2018. doi:10.1101/443499.
(1) “Everyone Needs a Data-Management Plan”. Nature, vol. 555, March 2018, p. 286. doi:10.1038/d41586-018-03065-z.
(2) Womack, Ryan P. “Research Data in Core Journals in Biology, Chemistry, Mathematics, and Physics”. PLOS ONE, vol. 10, no. 12, Dec. 2015. doi:10.1371/journal.pone.0143460.
(3) European Commission report: “Realising the European Open Science Cloud”, 2016.
(4) Explanatory memorandum for the digital republic bill, accessed on Legifrance.
(5) French Court of Audit report on electronic infrastructures and ESR, 2020.
(6) European Commission report: “Providing researchers with the skills and competencies they need to practise Open Science”, 2017.
(7) Ibid.
(8) European Commission, “Realising the European Open Science Cloud”, op. cit.