Monitoring the physiological state of a mixed microbial community as a whole, and exploring the inter-microbial interactions within it, requires a data management system that tightly integrates wet-lab and dry-lab approaches. The basic functionality of such a system includes data (i) collection, (ii) integration and (iii) delivery. Obtaining a high degree of data interoperability is key and requires automatic integration of laboratory process execution (LIMS) data, collected (Omics) assay data and associated experimental metadata in a Findable, Accessible, Interoperable and Reusable (FAIR) format. Applying these four foundational principles will allow researchers to extract maximum benefit from the research investments made.
Platform technical details
The UNLOCK FAIR data platform will be equipped with an up-to-date ecosystem of robust, state-of-the-art open-source tools for data handling, information retrieval, statistical analysis and visualization. The Investigation, Study, Assay (ISA) data model, in which an ‘Investigation’ is the project context, a ‘Study’ is a unit of research and an ‘Assay’ is an analytical measurement, will be applied to collect, organize, store and handle all online and offline digital data. Metadata is added to each data stream using Minimal Information Models.
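The nesting of the ISA model can be illustrated with a minimal Python sketch. This is not the UNLOCK implementation: the class layout, the identifiers, the metadata keys and the path scheme are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Assay:
    """An analytical measurement, with its attached metadata."""
    identifier: str
    metadata: dict[str, str] = field(default_factory=dict)


@dataclass
class Study:
    """A unit of research that groups related assays."""
    identifier: str
    assays: list[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """The project context that groups related studies."""
    identifier: str
    studies: list[Study] = field(default_factory=list)

    def assay_paths(self) -> list[str]:
        """Return one hierarchical path per assay (hypothetical layout)."""
        return [
            f"{self.identifier}/{study.identifier}/{assay.identifier}"
            for study in self.studies
            for assay in study.assays
        ]


# Hypothetical identifiers and metadata keys, for illustration only.
assay = Assay("assay_16S_001", {"target_gene": "16S rRNA"})
study = Study("study_A", [assay])
investigation = Investigation("investigation_X", [study])

print(investigation.assay_paths())
```

Keeping the metadata on the assay object, next to its identifier, mirrors the idea that every data stream carries its own description wherever it is stored.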
For digital data handling we have, in close collaboration with SURFsara, implemented the Integrated Rule-Oriented Data System (iRODS), which will serve as the backbone for managing all types of UNLOCK digital data. Maintenance of the UNLOCK iRODS infrastructure and long-term preservation of the data generated within UNLOCK will be outsourced to SURFsara.
A working example using 16S rRNA gene data
As a working example, we used existing raw 16S rRNA gene data from the DIABIMMUNE project and NG-Tax 2.0 for the data analysis.
Raw amplicon data from over 1800 samples was downloaded from the project, automatically ingested by the UNLOCK infrastructure and stored according to the ISA standard. The metadata of these samples was also captured.
The amplicon data was automatically analysed with NG-Tax 2.0 to generate Amplicon Sequence Variants (ASVs). Using the RDF data model, these ASVs are automatically stored in a graph database as objects that link ASV sequences with the full data-wise and element-wise provenance. The graph database can be directly queried, allowing for comparative analyses over thousands of samples. Examples are given below.
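To illustrate the kind of comparative query that a graph of subject-predicate-object statements makes possible, here is a minimal sketch in plain Python. It is a stand-in for a real RDF store, which would normally be queried with SPARQL, and it is not the UNLOCK schema: the prefixes, predicate names (`ex:hasASV`, `ex:sequence`, `ex:generatedBy`), identifiers and sequences are all hypothetical.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) statements; none of these
# names or sequences come from the actual UNLOCK graph.
TRIPLES = [
    ("sample:S1", "ex:hasASV", "asv:1"),
    ("sample:S1", "ex:hasASV", "asv:2"),
    ("sample:S2", "ex:hasASV", "asv:1"),
    ("asv:1", "ex:sequence", "ACGTACGT"),
    ("asv:2", "ex:sequence", "TTGACCAA"),
    ("asv:1", "ex:generatedBy", "run:ngtax_1"),
    ("asv:2", "ex:generatedBy", "run:ngtax_1"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def samples_per_asv(triples):
    """Map each ASV to the sorted list of samples in which it was observed."""
    result = defaultdict(list)
    for sample, _, asv in match(triples, predicate="ex:hasASV"):
        result[asv].append(sample)
    return {asv: sorted(samples) for asv, samples in result.items()}


# Comparative question: which ASVs occur in more than one sample?
shared = {asv: s for asv, s in samples_per_asv(TRIPLES).items() if len(s) > 1}
print(shared)

# Provenance question: which run produced a given ASV?
print(match(TRIPLES, subject="asv:1", predicate="ex:generatedBy"))
```

Because sequences, sample links and provenance are all statements in the same graph, one pattern-matching mechanism answers both the comparative and the provenance question.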
The queries used to generate the examples shown can be embedded in or be part of standard operating procedures (SOPs). For instance, further post-processing can be done using structured data analysis processes integrated in Jupyter Notebooks.
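A post-processing step of this kind could look like the following notebook-style cell. It is a minimal sketch under stated assumptions: the row layout `(sample, asv, read_count)`, the identifiers and the counts are hypothetical and are not taken from the DIABIMMUNE data or from the UNLOCK query output.

```python
from collections import defaultdict

# Hypothetical query result rows: (sample, asv, read_count).
rows = [
    ("S1", "asv_1", 60),
    ("S1", "asv_2", 40),
    ("S2", "asv_1", 25),
    ("S2", "asv_3", 75),
]


def relative_abundance(rows):
    """Convert read counts to per-sample fractions.

    Assumes every sample has at least one read, so that the
    per-sample total is never zero.
    """
    totals = defaultdict(int)
    for sample, _, count in rows:
        totals[sample] += count
    return {
        (sample, asv): count / totals[sample]
        for sample, asv, count in rows
    }


abundance = relative_abundance(rows)
for (sample, asv), fraction in sorted(abundance.items()):
    print(sample, asv, fraction)
```

Wrapping the step in a function that takes the query rows as its only input makes it straightforward to rerun the same cell on a different query result, which is what embedding it in an SOP would require.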