Technical details amplicon analysis

NG-Tax 2.0 allows FAIR high-throughput analysis and classification of marker gene amplicon sequences. In this article, we describe the performance of NG-Tax 2.0 and demonstrate its use with exemplary data from the DIABIMMUNE project.

By Jasper Koehorst / July 7, 2021

KEY MESSAGES

Technical difficulty
5/5
NG-Tax 2.0 is a semantic framework for FAIR high-throughput analysis and classification of marker gene amplicon sequences, including bacterial and archaeal 16S ribosomal RNA (rRNA), eukaryotic 18S rRNA and ribosomal intergenic transcribed spacer sequences. It can directly use single or merged reads, paired-end reads and unmerged paired-end reads from long-range fragments as input to generate de novo Amplicon Sequence Variants (ASVs). Using the RDF data model, ASVs are automatically stored in a graph database as objects that link ASV sequences with the full data-wise and element-wise provenance, thereby achieving the level of interoperability required to utilize such data to its full potential. The graph database can be queried directly, allowing for comparative analyses over thousands of samples, and is connected to an interactive Rshiny toolbox for analysis and visualization of (meta)data. Additionally, NG-Tax 2.0 exports an extended BIOM 1.0 (JSON) file as a starting point for further analyses by other means. The extended BIOM file is backwards compatible and contains new attribute types that record the command arguments used, the sequences of the ASVs formed and the classification confidence scores.

NG-Tax 2.0 workflow. The workflow consists of four main steps: (A) barcode and primer filtering; (B) de novo OTU-picking of ASV sequences, artefact filtering, correction for the impact of error reads on ASV relative abundance estimates and taxonomic inference; (C) ASV object serialization and storage: ASV sequences, taxonomic inferences and data provenance, including library and sample names and the settings used, are stored as ASV objects in an RDF triple store graph database and optionally exported in the BIOM 1.0 file format; (D) downstream analysis toolbox: ASV data and metadata can be queried and analysed directly through the SPARQL endpoint, and the Rshiny toolbox provides standard statistics and visualizations using predefined SPARQL queries.
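For readers who want to start from the BIOM 1.0 (JSON) export rather than the graph database, the following is a minimal sketch of how such a file could be read with the Python standard library. It relies only on the core BIOM 1.0 fields (rows, columns, matrix_type, data); the NG-Tax 2.0 specific extended attributes are not touched because their names are not given here. The function name and all identifiers in the toy table are made up for illustration.

```python
import json
from collections import defaultdict


def read_biom_counts(biom_text):
    """Return {sample_id: {observation_id: count}} from BIOM 1.0 JSON text."""
    table = json.loads(biom_text)
    row_ids = [row["id"] for row in table["rows"]]      # observations (ASVs)
    col_ids = [col["id"] for col in table["columns"]]   # samples
    counts = defaultdict(dict)
    if table["matrix_type"] == "sparse":
        # Sparse tables list only non-zero cells as [row, column, value].
        for row_idx, col_idx, value in table["data"]:
            counts[col_ids[col_idx]][row_ids[row_idx]] = value
    else:
        # Dense tables hold one list of values per row; zeros are skipped.
        for row_idx, row_values in enumerate(table["data"]):
            for col_idx, value in enumerate(row_values):
                if value:
                    counts[col_ids[col_idx]][row_ids[row_idx]] = value
    return dict(counts)


# Toy table with made-up identifiers; this is not NG-Tax 2.0 output.
EXAMPLE_BIOM = json.dumps({
    "id": None,
    "format": "Biological Observation Matrix 1.0.0",
    "format_url": "http://biom-format.org",
    "type": "OTU table",
    "generated_by": "example",
    "date": "2021-07-07T00:00:00",
    "matrix_type": "sparse",
    "matrix_element_type": "int",
    "shape": [2, 2],
    "rows": [{"id": "asv_1", "metadata": None},
             {"id": "asv_2", "metadata": None}],
    "columns": [{"id": "sample_A", "metadata": None},
                {"id": "sample_B", "metadata": None}],
    "data": [[0, 0, 120], [0, 1, 30], [1, 1, 5]],
})

counts = read_biom_counts(EXAMPLE_BIOM)
print(counts)
```

Because the extended file is described as backwards compatible, a reader that ignores unknown attributes, as this one does, should still be able to recover the count table.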

Performance and availability

The performance of NG-Tax 2.0 was compared with DADA2, using the plugin in the QIIME 2 analysis pipeline. Fourteen 16S rRNA gene amplicon mock community samples were obtained from the literature and evaluated. Precision of NG-Tax 2.0 was significantly higher, with an average of 0.95 vs 0.58 for QIIME2-DADA2, while recall was comparable, with averages of 0.85 and 0.77, respectively. NG-Tax 2.0 is written in Java. The code, the ontology, a Galaxy platform implementation, the analysis toolbox, tutorials and example SPARQL queries are freely available under the MIT License.
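To make the two metrics concrete, here is a small sketch of how precision and recall could be computed for a single mock community, assuming the expected and detected taxa are compared as plain sets of names. The taxonomic level, the matching rule and the example taxa are illustrative assumptions, not the evaluation protocol behind the figures above.

```python
def precision_recall(expected, detected):
    """Compare detected taxa against the known mock community composition."""
    expected, detected = set(expected), set(detected)
    true_positives = len(expected & detected)
    # Precision: fraction of detected taxa that really are in the mock.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: fraction of mock taxa that were detected.
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical mock community and pipeline result.
expected_taxa = {"Bacteroides", "Escherichia", "Lactobacillus", "Staphylococcus"}
detected_taxa = {"Bacteroides", "Escherichia", "Lactobacillus",
                 "Pseudomonas", "Ralstonia"}

precision, recall = precision_recall(expected_taxa, detected_taxa)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case three of the five detected taxa are correct and three of the four expected taxa are found, so spurious detections lower precision while missed taxa lower recall.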

Working example: DIABIMMUNE project

As a working example, we have used existing raw 16S rRNA gene data from the DIABIMMUNE project and NG-Tax 2.0 for data analysis. Raw amplicon data of over 1800 samples was downloaded from the project, automatically ingested by the UNLOCK infrastructure and stored according to the ISA standard. Metadata of these samples was also captured. The amplicon data was automatically analysed with NG-Tax 2.0 to generate Amplicon Sequence Variants (ASVs). Using the RDF data model, these ASVs are automatically stored in a graph database as objects that link ASV sequences with the full data-wise and element-wise provenance. The graph database can be queried directly, allowing for comparative analyses over thousands of samples. Examples are given below. The queries used to generate the examples shown can be embedded in, or be part of, standard operating procedures (SOPs). For instance, further post-processing can be done using structured data analysis processes integrated in Jupyter Notebooks.

 

Metadata can be sorted and accessed directly through the semantic framework. This enables users to easily select samples of interest for further downstream analysis.

 

Amplicon data is noisy. NG-Tax 2.0 suppresses this noise by provisionally rejecting ASVs of low abundance. Because metadata and sequence information are stored extensively, each ASV, including the provisionally rejected ones, can be accessed and studied individually.
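The idea of provisional rejection can be sketched as a simple per-sample relative abundance cut-off. This is a simplified illustration, not the NG-Tax 2.0 filtering procedure: the function name, the status labels and the 0.1% threshold are assumptions made for the example. Low-abundance ASVs are labelled rather than deleted, so they remain available for later inspection.

```python
def flag_low_abundance(read_counts, min_rel_abundance=0.001):
    """Label each ASV in one sample as 'accepted' or 'rejected'.

    read_counts maps ASV identifiers to read counts for a single sample.
    min_rel_abundance is a hypothetical threshold on relative abundance.
    """
    total = sum(read_counts.values())
    flags = {}
    for asv_id, count in read_counts.items():
        rel_abundance = count / total if total else 0.0
        # Nothing is discarded: low-abundance ASVs only get a different label.
        flags[asv_id] = ("accepted" if rel_abundance >= min_rel_abundance
                         else "rejected")
    return flags


# Made-up read counts for one sample (10,000 reads in total).
sample_counts = {"asv_1": 9000, "asv_2": 995, "asv_3": 5}
flags = flag_low_abundance(sample_counts)
print(flags)
```

Here asv_3 accounts for 0.05% of the reads and is therefore provisionally rejected, while its sequence and count stay in the record.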

 

Unlike OTUs, which are obtained by clustering nearly identical sequences, Amplicon Sequence Variants are believed to better represent a single species. Since NG-Tax 2.0 extracts ASVs and stores them in a semantic database, cross-mapping of ASVs between thousands of samples is possible. In the example we have plotted the number of ASVs, including the provisionally rejected ASVs, shared between samples.
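Because ASVs are exact sequences, cross-mapping between samples reduces to comparing sets of sequences. The sketch below counts the ASVs shared by every pair of samples, assuming ASVs are matched by exact sequence identity and that each sample's ASVs are already available as a set. The sample names and sequences are invented and far shorter than real amplicons.

```python
from itertools import combinations


def shared_asv_counts(sample_asvs):
    """Return {(sample_a, sample_b): number of ASV sequences found in both}."""
    shared = {}
    for sample_a, sample_b in combinations(sorted(sample_asvs), 2):
        overlap = set(sample_asvs[sample_a]) & set(sample_asvs[sample_b])
        shared[(sample_a, sample_b)] = len(overlap)
    return shared


# Toy data: accepted and provisionally rejected ASVs pooled per sample.
sample_asvs = {
    "sample_A": {"ACGT", "GGCA", "TTAG"},
    "sample_B": {"ACGT", "TTAG"},
    "sample_C": {"GGCA", "CCCC"},
}

shared = shared_asv_counts(sample_asvs)
for pair, n_shared in shared.items():
    print(pair, n_shared)
```

With clustered OTUs the same comparison would first require agreeing on cluster representatives across samples, which is why exact variants make this kind of cross-mapping straightforward.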

 

Within each sample, ASVs are analysed and may be flagged as rejected because they are supported by very few reads. Some of these ASVs may be falsely flagged; through cross-mapping within and between samples they can become accepted ASVs and be used in further downstream analysis.

 

For each individual sample, NG-Tax 2.0 provisionally rejects ASVs of low abundance. While most of these ASVs are likely random noise, a provisionally rejected ASV can also point to a species with a particularly low abundance in that specific sample. We can study this by asking two related questions: “Do we find a particular provisionally rejected ASV in more samples?” and “Is this ASV an accepted ASV in other samples?” As the ASVs are stored in a graph database, these questions, like the results presented in the examples above, can be answered with standard SPARQL queries. In the example we have plotted the number of provisionally rejected ASVs that are shared between the 1800 samples. By far the majority occur in fewer than three samples, which suggests that most of them are noise ASVs. The example SPARQL query in the right panel explores this a little further and answers the question “How many provisionally rejected ASVs are in fact accepted ASVs in other samples?”
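The logic behind the two questions can also be sketched outside the graph database. The Python example below is not the SPARQL query referred to above; it assumes the query results have already been flattened into (sample, ASV sequence, status) records, with hypothetical status labels "accepted" and "rejected". It builds the histogram of how many samples each provisionally rejected ASV occurs in, and lists the rejected ASVs that are accepted in another sample.

```python
from collections import Counter, defaultdict


def summarize_rejected(records):
    """records: iterable of (sample_id, asv_sequence, status) tuples."""
    rejected_in = defaultdict(set)   # ASV -> samples where it was rejected
    accepted_in = defaultdict(set)   # ASV -> samples where it was accepted
    for sample_id, asv, status in records:
        target = accepted_in if status == "accepted" else rejected_in
        target[asv].add(sample_id)
    # Question 1: in how many samples does each rejected ASV occur?
    # Histogram key: number of samples; value: number of rejected ASVs.
    histogram = Counter(len(samples) for samples in rejected_in.values())
    # Question 2: which rejected ASVs are accepted in some other sample?
    accepted_elsewhere = {asv for asv in rejected_in if accepted_in.get(asv)}
    return histogram, accepted_elsewhere


# Made-up records for three samples.
records = [
    ("s1", "AAA", "accepted"),
    ("s2", "AAA", "rejected"),
    ("s3", "AAA", "rejected"),
    ("s1", "CCC", "rejected"),
    ("s2", "GGG", "rejected"),
    ("s3", "GGG", "accepted"),
    ("s3", "TTT", "accepted"),
]

histogram, accepted_elsewhere = summarize_rejected(records)
print(dict(histogram))
print(sorted(accepted_elsewhere))
```

In this toy data set the ASV "CCC" is rejected in one sample and never accepted, which is the pattern expected for noise, whereas "AAA" and "GGG" are accepted elsewhere and would be candidates for rescue.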

Interesting links

Articles citing NG-Tax 2.0

Cite this article:

Poncheewin, W., Hermes, G.D.A., et al. NG-Tax 2.0: A Semantic Framework for High-throughput Amplicon Analysis. Frontiers in Genetics 10 (2019): 1366.
