The FAIR data platform

Monitoring the physiological state of a mixed microbial community as a whole, and exploring the inter-microbial interactions within it, requires a data management system that tightly integrates wet- and dry-lab approaches. The basic functionality of such a system comprises data (i) collection, (ii) integration and (iii) delivery. A high degree of data interoperability is key and requires automatic integration of laboratory process execution (LIMS) data, collected (omics) assay data and the associated experimental metadata in a Findable, Accessible, Interoperable and Reusable (FAIR) format. Applying these four foundational principles allows researchers to extract maximum benefit from the research investments made.

Platform technical details

The UNLOCK FAIR data platform will be equipped with an up-to-date ecosystem of robust, state-of-the-art open-source tools for data handling, information retrieval, statistical analysis and visualization. The ‘Investigation’ (the project context), ‘Study’ (a unit of research) and ‘Assay’ (analytical measurement) data model (ISA) will be applied to collect, organize, store and handle all on- and offline digital data. Metadata is added to each data stream using Minimal Information Models.
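The nesting of the ISA model can be sketched in a few lines of Python. The class and field names below are illustrative assumptions for this sketch, not the actual UNLOCK data structures.

```python
from dataclasses import dataclass, field

# Minimal sketch of the ISA hierarchy. Class and field names are
# illustrative only, not the actual UNLOCK implementation.
@dataclass
class Assay:            # an analytical measurement
    name: str
    metadata: dict = field(default_factory=dict)  # Minimal Information Model terms

@dataclass
class Study:            # a unit of research
    name: str
    assays: list = field(default_factory=list)

@dataclass
class Investigation:    # the project context
    name: str
    studies: list = field(default_factory=list)

# Build a tiny example hierarchy with metadata attached at the assay level.
assay = Assay("amplicon_run_1", metadata={"target_gene": "16S rRNA"})
study = Study("pilot_study", assays=[assay])
investigation = Investigation("example_project", studies=[study])

print(investigation.studies[0].assays[0].metadata["target_gene"])  # 16S rRNA
```

Because every assay hangs off a study and every study off an investigation, metadata captured at any level can be traced back to the project context.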

For digital data handling we have, in close collaboration with SURFsara, implemented the Integrated Rule-Oriented Data System (iRODS), which will serve as the backbone to manage all types of UNLOCK digital data. Maintenance of the UNLOCK iRODS infrastructure and long-term preservation of the data generated within UNLOCK will be outsourced to SURFsara.

Figure 1. A schematic representation of the data infrastructure used within UNLOCK. The iRODS data management system captures the experimental data streams. To enable the FAIR by Design principles, element- and data-wise experimental metadata generated by the lab equipment, together with other required experimental metadata, is automatically linked with the data streams and permanently stored within the iRODS infrastructure. High-throughput analysis of the data is performed on a scalable cloud-based infrastructure using dockerized open-source applications. Compute results and corresponding metadata are stored in the iRODS platform using the ISA data model. Further post-processing can be done using structured data analysis processes integrated in Jupyter Notebooks.

A working example using 16S rRNA gene data

As a working example, we have used existing raw 16S rRNA gene data from the DIABIMMUNE project and NG-Tax 2.0 for data analysis.

Raw amplicon data of over 1800 samples was downloaded from the project, automatically ingested by the UNLOCK infrastructure and stored according to the ISA standard. The metadata of these samples was also captured.

The amplicon data was automatically analysed with NG-Tax 2.0 to generate Amplicon Sequence Variants (ASVs). Using the RDF data model, these ASVs are automatically stored in a graph database as objects that link ASV sequences with their full data-wise and element-wise provenance. The graph database can be queried directly, allowing comparative analyses over thousands of samples. Examples are given below.
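The idea of ASVs as linked objects can be illustrated with a toy in-memory triple store. The identifiers, predicate names and sequences below are invented for illustration and do not reflect the actual UNLOCK ontology; the `query` function mimics a single basic graph pattern as it would appear in a SPARQL query.

```python
# Toy triple store. Identifiers, predicates and sequences are invented
# for illustration; they are not the actual UNLOCK/NG-Tax ontology.
triples = [
    ("asv:001", "hasSequence", "TACGGAGGATGCGAGCGTTA"),
    ("asv:001", "observedIn",  "sample:A"),
    ("asv:001", "observedIn",  "sample:B"),
    ("asv:001", "hasStatus",   "accepted"),
    ("asv:002", "hasSequence", "TACGTAGGTGGCAAGCGTTG"),
    ("asv:002", "observedIn",  "sample:A"),
    ("asv:002", "hasStatus",   "rejected"),
]

def query(subject=None, predicate=None, obj=None):
    """Match triples against a pattern; None acts as a wildcard,
    analogous to a variable in a SPARQL basic graph pattern."""
    return [
        (s, p, o) for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# Which ASV objects were observed in sample:A?
asvs_in_a = {s for (s, _, _) in query(predicate="observedIn", obj="sample:A")}
print(asvs_in_a)
```

Because each ASV is a first-class object linked to its sequence, samples and status, the same pattern-matching mechanism supports cross-sample comparisons at scale.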

Metadata can be directly sorted and accessed through the semantic framework. This enables users to easily select samples of interest for further downstream analysis.
Amplicon data is noisy. NG-Tax 2.0 suppresses this noise by provisionally rejecting ASVs of low abundance. Because metadata and sequence information are extensively incorporated, each ASV, including provisionally rejected ones, can be individually accessed and studied.
Unlike OTUs, which are obtained by clustering nearly identical sequences, Amplicon Sequence Variants are believed to better represent a single species. Since NG-Tax 2.0 extracts ASVs and stores them in a semantic database, cross-mapping of ASVs between thousands of samples is possible. In the example we have plotted the number of ASVs, including the provisionally rejected ones, shared between samples.
Within each sample, ASVs are analysed and potentially flagged as rejected because of very low read abundance. Such ASVs might be falsely flagged; through cross-mapping within and between samples they can be promoted to accepted ASVs and used for further downstream analysis.
For each individual sample, NG-Tax 2.0 provisionally rejects ASVs of low abundance. While most of these ASVs are likely random noise signals, a provisionally rejected ASV can also point to a species with a particularly low abundance in that specific sample. We can study this by asking two related questions: “Do we find a particular provisionally rejected ASV in more samples?” and “Is this ASV an accepted ASV in other samples?” Because the ASVs are stored in a graph database, this, like the results presented in the examples above, can be done with standard SPARQL queries. In the example we have plotted the number of provisionally rejected ASVs shared between the 1800 samples. By far the majority occurs in fewer than three samples, so most of them are most probably noise ASVs. The example SPARQL query in the right panel explores this further and answers the question “How many provisionally rejected ASVs are in fact accepted ASVs in other samples?” The Venn diagram provides the answer, from which we can conclude that many of the accepted species can be present in extremely low abundances in other samples.
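The logic behind these two questions can be sketched in a few lines of Python. The sample names, ASV labels and statuses below are invented for illustration; in UNLOCK the equivalent lookups run as SPARQL queries against the graph database.

```python
# Per-sample ASV status tables. All names and statuses are invented
# for illustration; in practice these live in the graph database.
samples = {
    "s1": {"ASV_a": "accepted", "ASV_b": "rejected", "ASV_c": "rejected"},
    "s2": {"ASV_a": "accepted", "ASV_b": "accepted"},
    "s3": {"ASV_c": "rejected", "ASV_d": "rejected"},
}

rejected = {asv for table in samples.values()
            for asv, status in table.items() if status == "rejected"}
accepted = {asv for table in samples.values()
            for asv, status in table.items() if status == "accepted"}

# Q1: in how many samples does each provisionally rejected ASV occur?
occurrence = {asv: sum(asv in table for table in samples.values())
              for asv in rejected}

# Q2: which provisionally rejected ASVs are accepted in another sample?
rescued = rejected & accepted

print(occurrence)
print(rescued)
```

In this toy dataset, `ASV_b` is rejected in one sample but accepted in another, so it would be rescued for downstream analysis, while `ASV_d` occurs only once and would remain a likely noise signal.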

The queries used to generate the examples shown can be embedded in, or form part of, standard operating procedures (SOPs). For instance, further post-processing can be done using structured data analysis processes integrated in Jupyter Notebooks.