NG-Tax 2.0 allows FAIR high-throughput analysis and classification of marker gene amplicon sequences. In this article, we describe the performance of NG-Tax 2.0 and demonstrate its use with exemplary data from the DIABIMMUNE project.
By Jasper Koehorst / July 7, 2021
NG-Tax 2.0 is a semantic framework for FAIR high-throughput analysis and classification of marker gene amplicon sequences. These include bacterial and archaeal 16S ribosomal RNA (rRNA), as well as eukaryotic 18S rRNA and ribosomal intergenic transcribed spacer sequences. As input, the framework can directly use single or merged reads, paired-end reads, and unmerged paired-end reads from long-range fragments. From these, it generates de novo Amplicon Sequence Variants (ASVs). Subsequently, using the RDF data model, ASVs can be automatically stored in a graph database as objects that link ASV sequences with the full data-wise and element-wise provenance. In this way, the framework achieves the level of interoperability required to use such data to its full potential.
Analysis of the data
Subsequently, the graph database can be queried directly, allowing for comparative analyses across thousands of samples. It is also connected to an interactive R Shiny toolbox for analysis and visualization of (meta)data. Additionally, NG-Tax 2.0 exports an extended BIOM 1.0 (JSON) file as a starting point for further analyses by other means. This file contains new attribute types that record the command arguments used, the sequences of the ASVs formed, and the classification confidence scores. The extended file remains backwards compatible with BIOM 1.0. In summary, the figure below describes the whole NG-Tax 2.0 workflow.
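To illustrate how an exported BIOM 1.0 (JSON) file can serve as a starting point for further analysis, here is a minimal Python sketch. It assumes only the standard sparse BIOM 1.0 layout (`rows`, `columns`, and `data` as `[row, column, value]` triples). The embedded document, its ASV and sample identifiers, its counts, and its taxonomy strings are invented for illustration, and the NG-Tax 2.0 specific attributes are deliberately left out because their names are not given here.

```python
import json

# Hand-written BIOM 1.0 document with a sparse matrix.
# All identifiers, counts and taxonomy strings are invented for illustration;
# this is not real NG-Tax 2.0 output.
BIOM_TEXT = """
{
  "id": null,
  "format": "Biological Observation Matrix 1.0.0",
  "format_url": "http://biom-format.org",
  "type": "OTU table",
  "generated_by": "illustrative example",
  "date": "2021-07-07T00:00:00",
  "matrix_type": "sparse",
  "matrix_element_type": "int",
  "shape": [3, 2],
  "rows": [
    {"id": "ASV_1", "metadata": {"taxonomy": ["k__Bacteria", "p__Firmicutes"]}},
    {"id": "ASV_2", "metadata": {"taxonomy": ["k__Bacteria", "p__Bacteroidetes"]}},
    {"id": "ASV_3", "metadata": {"taxonomy": ["k__Bacteria", "p__Actinobacteria"]}}
  ],
  "columns": [
    {"id": "sample_A", "metadata": null},
    {"id": "sample_B", "metadata": null}
  ],
  "data": [[0, 0, 120], [0, 1, 30], [1, 0, 80], [2, 1, 50]]
}
"""


def relative_abundance(biom):
    """Return {sample_id: {observation_id: fraction}} for a sparse BIOM 1.0 dict."""
    if biom["matrix_type"] != "sparse":
        raise ValueError("this sketch only handles sparse BIOM 1.0 matrices")

    row_ids = [row["id"] for row in biom["rows"]]
    col_ids = [col["id"] for col in biom["columns"]]

    # Collect raw counts per sample from the [row, column, value] triples.
    counts = {sample: {} for sample in col_ids}
    for row_idx, col_idx, value in biom["data"]:
        counts[col_ids[col_idx]][row_ids[row_idx]] = value

    # Normalise each sample so that its fractions sum to 1.
    result = {}
    for sample, per_asv in counts.items():
        total = sum(per_asv.values())
        result[sample] = (
            {asv: n / total for asv, n in per_asv.items()} if total else {}
        )
    return result


biom = json.loads(BIOM_TEXT)
rel = relative_abundance(biom)
for sample, fractions in rel.items():
    print(sample, fractions)
```

Because the extended file is backwards compatible, a reader written against plain BIOM 1.0, such as this one, should be able to ignore the additional attributes and still recover the count matrix.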
Performance and availability
We compared the performance of NG-Tax 2.0 with DADA2, run as a plugin in the QIIME 2 analysis pipeline. To this end, we obtained and evaluated fourteen 16S rRNA gene amplicon mock community samples from the literature. The precision of NG-Tax 2.0 was significantly higher, with an average of 0.95 versus 0.58 for QIIME2-DADA2. Recall was comparable, with averages of 0.85 and 0.77, respectively. NG-Tax 2.0 is written in Java. The code, ontology, a Galaxy platform implementation, the analysis toolbox, as well as tutorials and example SPARQL queries are freely available under the MIT License.
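As a reminder of what these two metrics measure on a mock community, the following Python sketch computes precision and recall from sets of taxon names. It assumes a simple set-based definition (a detected taxon counts as a true positive if it is in the known mock composition); the exact evaluation procedure used in the comparison may differ. The function name and the genus lists are hypothetical and chosen only for illustration.

```python
def precision_recall(expected, detected):
    """Set-based precision and recall for one mock community sample.

    expected: taxa known to be present in the mock community
    detected: taxa reported by the pipeline
    """
    expected = set(expected)
    detected = set(detected)
    true_positives = len(expected & detected)

    # Precision: fraction of reported taxa that are really in the mock.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: fraction of mock taxa that the pipeline recovered.
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Invented mock composition and pipeline output (not data from the study).
expected_genera = ["Bacteroides", "Escherichia", "Lactobacillus",
                   "Clostridium", "Bifidobacterium"]
detected_genera = ["Bacteroides", "Escherichia", "Lactobacillus",
                   "Pseudomonas"]

precision, recall = precision_recall(expected_genera, detected_genera)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints "precision=0.75 recall=0.60"
```

In this toy case, one spurious genus lowers precision, while two missed genera lower recall. High precision therefore indicates few false-positive taxa.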
Working example: DIABIMMUNE project
As a working example, we analysed existing raw 16S rRNA gene data from the DIABIMMUNE Microbiome project with NG-Tax 2.0. To this end, we downloaded raw amplicon data of over 1800 microbial samples from the project. Next, the data was automatically ingested by the UNLOCK infrastructure and stored according to the ISA standard. We also captured the metadata of these samples. Then, NG-Tax 2.0 automatically analysed the amplicon data to generate ASVs, as described above and exemplified below. The queries used to generate the examples shown can also be embedded in, or form part of, standard operating procedures (SOPs). Further post-processing can take place using structured data analysis processes integrated in Jupyter Notebooks.