DIABIMMUNE Amplicon analysis

As a working example, we analysed existing raw 16S rRNA gene data from the DIABIMMUNE project with NG-Tax 2.0.

Raw amplicon data from over 1800 samples was downloaded from the project, automatically ingested by the UNLOCK infrastructure, and stored according to the ISA standard. The metadata of these samples was captured as well.

The amplicon data was automatically analysed with NG-Tax 2.0 to generate Amplicon Sequence Variants (ASVs). Using the RDF data model, these ASVs are automatically stored in a graph database as objects that link ASV sequences with the full data-wise and element-wise provenance. The graph database can be queried directly, allowing comparative analyses over thousands of samples. Examples are given below.

Metadata can be sorted and accessed directly through the semantic framework. This enables users to easily select samples of interest for further downstream analysis.
Amplicon data is noisy. NG-Tax 2.0 suppresses this noise by provisionally rejecting ASVs of low abundance. Due to the extensive incorporation of metadata and sequence information, each ASV, including the provisionally rejected ones, can be accessed and studied individually.
Unlike OTUs, which are obtained by clustering nearly identical sequences, Amplicon Sequence Variants are believed to better represent a single species. Since NG-Tax 2.0 extracts ASVs and stores them in a semantic database, cross-mapping of ASVs between thousands of samples is possible. In the example we have plotted the number of ASVs, including the provisionally rejected ASVs, shared between samples.
Within each sample, ASVs are analysed and may be flagged as rejected because they are supported by very few reads. Some of these ASVs might be falsely flagged; through cross-mapping within and between samples, they can become accepted ASVs and be used for further downstream analysis.
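The cross-mapping idea can be illustrated with a minimal Python sketch. It does not reproduce the NG-Tax 2.0 logic or the graph database schema: the sample IDs, sequences, status labels and the function name `rescue_rejected` are all hypothetical, and the rescue rule used here (a rejected ASV is rescued when the identical sequence is accepted in at least one sample) is an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical per-sample ASV calls: (sample_id, asv_sequence, status).
# In practice these would come from a query against the graph database.
calls = [
    ("S1", "ACGT", "accepted"),
    ("S1", "TTGA", "rejected"),
    ("S2", "TTGA", "accepted"),
    ("S2", "GGCC", "rejected"),
    ("S3", "GGCC", "rejected"),
]


def rescue_rejected(calls):
    """Map each rescuable ASV sequence to the samples where it was rejected.

    Assumed rule: a provisionally rejected ASV is rescued if the identical
    sequence is an accepted ASV in at least one sample.
    """
    accepted_anywhere = {seq for _, seq, status in calls if status == "accepted"}
    rescued = defaultdict(list)
    for sample, seq, status in calls:
        if status == "rejected" and seq in accepted_anywhere:
            rescued[seq].append(sample)
    return dict(rescued)


rescued = rescue_rejected(calls)
print(rescued)
```

In this toy data, TTGA is rejected in S1 but accepted in S2, so it is a candidate for rescue in S1. GGCC is rejected in every sample where it occurs and therefore stays rejected.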
For each individual sample, NG-Tax 2.0 provisionally rejects ASVs of low abundance. While most of these ASVs are likely random noise, a provisionally rejected ASV can also point to a species with particularly low abundance in that sample. We can study this by asking two related questions: “Do we find a particular provisionally rejected ASV in more samples?” and “Is this ASV an accepted ASV in other samples?” Because the ASVs are stored in a graph database, these questions, like the results presented in the examples above, can be answered with standard SPARQL queries. In the example we have plotted the number of provisionally rejected ASVs that are shared between the 1800 samples. By far the majority occur in fewer than three samples, which suggests that most of them are noise ASVs. The example SPARQL query in the right panel explores this further and answers the question “How many provisionally rejected ASVs are in fact accepted ASVs in other samples?” The Venn diagram provides the answer: many of the accepted species can be present in extremely low abundances in other samples.
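The two questions above amount to simple counting once the ASV observations have been retrieved. The following Python sketch shows that counting on a small in-memory table rather than through SPARQL. The data, the status labels and the function names `rejected_sample_counts` and `venn_partition` are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical observations: (sample_id, asv_id, status).
observations = [
    ("S1", "asv_a", "accepted"),
    ("S1", "asv_b", "rejected"),
    ("S2", "asv_b", "rejected"),
    ("S2", "asv_c", "rejected"),
    ("S3", "asv_c", "accepted"),
    ("S3", "asv_d", "rejected"),
    ("S4", "asv_b", "rejected"),
]


def rejected_sample_counts(observations):
    """Question 1: in how many samples does each rejected ASV occur?"""
    samples_per_asv = {}
    for sample, asv, status in observations:
        if status == "rejected":
            samples_per_asv.setdefault(asv, set()).add(sample)
    return {asv: len(samples) for asv, samples in samples_per_asv.items()}


def venn_partition(observations):
    """Question 2: which rejected ASVs are accepted in some sample?"""
    rejected = {asv for _, asv, status in observations if status == "rejected"}
    accepted = {asv for _, asv, status in observations if status == "accepted"}
    return {
        "rejected_only": rejected - accepted,
        "both": rejected & accepted,
        "accepted_only": accepted - rejected,
    }


counts = rejected_sample_counts(observations)
# Histogram: number of samples -> number of rejected ASVs seen in that many samples.
histogram = Counter(counts.values())
venn = venn_partition(observations)

print(sorted(histogram.items()))
print({region: sorted(asvs) for region, asvs in venn.items()})
```

The histogram corresponds to the kind of distribution shown in the plot, and the three sets correspond to the regions of the Venn diagram. On the real data the same aggregation would be expressed in the SPARQL query itself.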

The queries used to generate the examples shown can be embedded in, or form part of, standard operating procedures (SOPs). For instance, further post-processing can be done using structured data analysis processes integrated in Jupyter Notebooks.
