Establishing a data management plan

Standard formats in chemistry

Reuse of research data is often hindered by the file formats in which the data are stored, so format choices deserve careful consideration when preparing a data management plan. Choose standards that are open or widely recognized by the community. The main examples for chemistry are listed below. (1)

Format | Description
JCAMP-DX | Extension .jdx or .dx. Open, universal standard specific to spectrometry. In use since 1988, it is one of the oldest formats. It is managed by IUPAC and compatible with most spectrum viewers.
mzML | An open, XML-based format created in 2006 and dedicated to mass spectrometry. Most proprietary formats can be converted to mzML using a converter (e.g. CompassXport for Bruker; MSConvert for Agilent, ThermoFisher, Shimadzu, etc.).
mol | Proprietary molecule format created by MDL and one of the most common formats for encoding molecules precisely. Most software reads MOL files or exports to this format.
sdf | Structure-data file. An extension of the MOL format, also developed by MDL. Encodes several molecules, each in MOL format, and can also include metadata ("tags").
rxn | Format also developed by MDL, during the 1990s. The most popular format for storing information about reactions. Contains the reactants and products of a reaction.
rdf | Reaction data file. Stores reactions and molecules, and includes tags at the end of the file.
cml | Chemical Markup Language. Open, XML-based format for chemistry, developed at the end of the 1990s. Encodes molecules, reactions and spectra without losing associated information. Compatible with tools such as JChemPaint, Jmol, XDrawChem and MarvinView.
SMILES | Universal format that encodes a molecule as a line of text. Useful for describing substructures. It can also be used to encode reactions.
Isomeric SMILES | Extension of SMILES that can also encode stereochemistry.
InChI | Another universal format that encodes a molecule as a line of text. Provides more detail than SMILES.
InChIKey | Compact, fixed-length identifier derived from the InChI, found in many software programs and databases.
xyz | A more specific format that defines molecular geometry by listing each atom with its Cartesian coordinates.
FID | Proprietary format developed by Bruker for encoding NMR data.
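
To make the structure of the sdf format more concrete, here is a minimal Python sketch that splits an SDF text into records and collects the tags of each one. It relies only on the usual layout of such files (a MOL block ending in "M  END", tag headers of the form "> <NAME>", and a "$$$$" line closing each record). The function names and the two-molecule sample are hypothetical and written for illustration; the MOL blocks are deliberately reduced to a single heavy atom. For real data, an established chemistry toolkit is a safer choice than a hand-written parser.

```python
def parse_sdf(text):
    """Split SDF text into records, each with a MOL block and a dict of tags."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":  # record delimiter
            records.append(_split_record(current))
            current = []
        else:
            current.append(line)
    return records


def _split_record(lines):
    # The MOL block runs up to and including the "M  END" line.
    end = next(i for i, line in enumerate(lines) if line.startswith("M  END"))
    tags, name = {}, None
    for line in lines[end + 1:]:
        if line.startswith(">"):
            # Tag header, e.g. "> <NAME>": keep the text between < and >.
            name = line[line.find("<") + 1:line.rfind(">")]
            tags[name] = []
        elif not line.strip():
            name = None  # a blank line closes the current tag
        elif name is not None:
            tags[name].append(line.strip())
    return {
        "molblock": "\n".join(lines[:end + 1]),
        "tags": {key: "\n".join(value) for key, value in tags.items()},
    }


# Hypothetical sample: two simplified records (one heavy atom each).
COUNTS = "  1  0  0  0  0  0  0  0  0  0999 V2000"
SAMPLE = "\n".join([
    "water",
    "  sketch",
    "",
    COUNTS,
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <NAME>",
    "water",
    "",
    "> <FORMULA>",
    "H2O",
    "",
    "$$$$",
    "methane",
    "  sketch",
    "",
    COUNTS,
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <NAME>",
    "methane",
    "",
    "$$$$",
])

if __name__ == "__main__":
    for record in parse_sdf(SAMPLE):
        print(record["tags"])
```

Keeping the MOL block as an untouched string means each record can still be written back out as a standalone .mol file, while the tags remain available as ordinary key-value metadata.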

Other resources:
Open Babel: online converter specialized in chemistry formats
ChemAxon: converter for command-line conversion

  • 1. This typology has been prepared with the assistance of Thierry Billard, CNRS Director of Research at ICBMS (UMR CNRS 5246).