
Technical details amplicon analysis

NG-Tax 2.0 allows FAIR high-throughput analysis and classification of marker gene amplicon sequences. In this article, we describe the performance of NG-Tax 2.0 and demonstrate its use with exemplary data from the DIABIMMUNE project.

By Jasper Koehorst / July 7, 2021

KEY MESSAGES

Technical difficulty
5/5

NG-Tax 2.0 is a semantic framework for FAIR high-throughput analysis and classification of marker gene amplicon sequences. These include bacterial and archaeal 16S ribosomal RNA (rRNA), as well as eukaryotic 18S rRNA and ribosomal intergenic transcribed spacer sequences. The framework can directly use single or merged reads, paired-end reads and unmerged paired-end reads from long range fragments as input. From these, it generates de novo Amplicon Sequence Variants (ASVs). Subsequently, using the RDF data model, ASVs can be automatically stored in a graph database as objects that link ASV sequences with the full data-wise and element-wise provenance. In this way, the framework achieves the level of interoperability required to use such data to its full potential.
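To make the idea of an ASV object with linked provenance more concrete, here is a minimal sketch in plain Python. It represents a graph as a list of subject-predicate-object triples and walks from an ASV to the library it came from. Every identifier with the `ex:` prefix, as well as the helper names `match` and `provenance_of`, is hypothetical and chosen for illustration only; none of them is taken from the NG-Tax 2.0 ontology.

```python
# Each triple is (subject, predicate, object). All identifiers with the
# "ex:" prefix are hypothetical and are not taken from the NG-Tax ontology.
triples = [
    ("ex:asv/1", "rdf:type", "ex:ASV"),
    ("ex:asv/1", "ex:sequence", "ACGTACGT"),
    ("ex:asv/1", "ex:foundInSample", "ex:sample/A"),
    ("ex:asv/1", "ex:readCount", 120),
    ("ex:sample/A", "ex:partOfLibrary", "ex:library/1"),
    ("ex:library/1", "ex:forwardPrimer", "GTGCCAGC"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every argument that is not None."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def provenance_of(triples, asv):
    """Follow ASV -> sample -> library and collect the library's properties."""
    found = {}
    for _, _, sample in match(triples, subject=asv, predicate="ex:foundInSample"):
        for _, _, library in match(triples, subject=sample, predicate="ex:partOfLibrary"):
            for _, predicate, value in match(triples, subject=library):
                found[predicate] = value
    return found


print(provenance_of(triples, "ex:asv/1"))
```

In a real triple store, the same traversal would be expressed as a graph query rather than nested loops; the sketch only shows why linking sequence, sample and library in one graph makes provenance reachable from any ASV.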

Analysis of the data

Once the data is stored, the graph database can be queried directly, allowing for comparative analyses across thousands of samples. It is also connected to an interactive Rshiny toolbox for analysis and visualization of (meta)data. Additionally, NG-Tax 2.0 exports an extended BIOM 1.0 (JSON) file as a starting point for further analyses by other means. This file contains new attribute types that record the command arguments used, the sequences of the ASVs formed and the classification confidence scores. Despite these extensions, the file remains backwards compatible. The figure below describes the whole NG-Tax 2.0 workflow.

Schematic figure describing NG Tax 2.0 pipeline
The NG-Tax 2.0 workflow consists of four main steps. The first step (A) is barcode and primer filtering. It is followed (B) by de novo OTU-picking of ASV sequences, artefact filtering, correction for the impact of error reads on ASV relative abundance estimates, and taxonomic inference. The next step (C) is ASV object serialization and storage, in which ASV sequences, taxonomic inferences and data provenance (including library and sample names and the settings used) are exported and stored as ASV objects in an RDF triple store graph database. Optionally, these are exported in the BIOM 1.0 file format. Finally (D), the downstream analysis toolbox is used to directly query and analyse the ASV data and metadata through the SPARQL endpoint. Additionally, the Rshiny toolbox directly provides standard statistics and visualizations using predefined SPARQL queries.
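For readers who want to pick up the BIOM export with their own tools, the following is a minimal sketch of reading a sparse BIOM 1.0 JSON document with the Python standard library. The top-level keys follow the BIOM 1.0 layout, but the row metadata keys `sequence` and `confidence` are assumptions standing in for the extended attributes; check an actual NG-Tax 2.0 export for the real attribute names. The function name `read_sparse_biom` is also just illustrative.

```python
import json

# A tiny hand-written BIOM 1.0 document. The row metadata keys "sequence"
# and "confidence" are hypothetical stand-ins for the extended attributes.
BIOM_TEXT = """
{
  "id": null,
  "format": "Biological Observation Matrix 1.0.0",
  "format_url": "http://biom-format.org",
  "type": "OTU table",
  "generated_by": "example",
  "date": "2021-07-07T00:00:00",
  "matrix_type": "sparse",
  "matrix_element_type": "int",
  "shape": [2, 2],
  "rows": [
    {"id": "ASV_1", "metadata": {"sequence": "ACGT", "confidence": 0.98}},
    {"id": "ASV_2", "metadata": {"sequence": "TTGA", "confidence": 0.71}}
  ],
  "columns": [
    {"id": "sample_A", "metadata": null},
    {"id": "sample_B", "metadata": null}
  ],
  "data": [[0, 0, 120], [0, 1, 30], [1, 1, 5]]
}
"""


def read_sparse_biom(text):
    """Return ({sample_id: {asv_id: count}}, {asv_id: metadata dict})."""
    doc = json.loads(text)
    if doc["matrix_type"] != "sparse":
        raise ValueError("this sketch only handles sparse matrices")
    row_ids = [row["id"] for row in doc["rows"]]
    col_ids = [col["id"] for col in doc["columns"]]
    counts = {col_id: {} for col_id in col_ids}
    # Sparse entries are [row index, column index, value].
    for row_index, col_index, value in doc["data"]:
        counts[col_ids[col_index]][row_ids[row_index]] = value
    metadata = {row["id"]: row["metadata"] or {} for row in doc["rows"]}
    return counts, metadata


counts, metadata = read_sparse_biom(BIOM_TEXT)
print(counts["sample_B"])
print(metadata["ASV_2"])
```

Because the extra information lives in the per-row metadata, a reader that ignores unknown metadata keys still sees an ordinary BIOM 1.0 table.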

Performance and availability

We compared the performance of NG-Tax 2.0 with DADA2, using the DADA2 plugin in the QIIME 2 analysis pipeline. To this end, we obtained and evaluated fourteen 16S rRNA gene amplicon mock community samples from the literature. The precision of NG-Tax 2.0 turned out to be significantly higher, with an average of 0.95 versus 0.58 for QIIME2-DADA2. Recall was comparable, with an average of 0.85 for NG-Tax 2.0 and 0.77 for QIIME2-DADA2. NG-Tax 2.0 is written in Java. The code, ontology, a Galaxy platform implementation, the analysis toolbox, as well as tutorials and example SPARQL queries are freely available under the MIT License.
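As a reminder of what the precision and recall figures above measure, here is a small sketch using the standard definitions on a mock community. The genus names and the choice to compare at genus level are assumptions made for illustration; the exact matching rules used in the benchmark are described in the research paper, not here.

```python
def precision_recall(expected, observed):
    """Standard precision and recall for two collections of taxon names.

    precision = true positives / everything reported
    recall    = true positives / everything that should have been reported
    """
    expected, observed = set(expected), set(observed)
    true_positives = len(expected & observed)
    precision = true_positives / len(observed) if observed else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical mock community and hypothetical pipeline output.
expected_genera = {"Bacteroides", "Escherichia", "Lactobacillus", "Clostridium"}
observed_genera = {"Bacteroides", "Escherichia", "Lactobacillus", "Pseudomonas"}

precision, recall = precision_recall(expected_genera, observed_genera)
print(precision, recall)  # 0.75 0.75
```

A pipeline that reports many spurious taxa lowers precision without affecting recall, which is the pattern the averages above point to.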

Working example: DIABIMMUNE project

As a working example, we analysed existing raw 16S rRNA gene data from the DIABIMMUNE Microbiome project with NG-Tax 2.0. To this end, we downloaded raw amplicon data of over 1800 microbial samples from the project. Next, the data was automatically ingested by the UNLOCK infrastructure and stored according to the ISA standard. We also captured the metadata of these samples. Then, NG-Tax 2.0 automatically analysed the amplicon data to generate ASVs, as described above and exemplified below. The queries used to generate the examples shown can also be embedded in, or be part of, standard operating procedures (SOPs). Further post-processing can take place using structured data analysis processes integrated in Jupyter Notebooks.

 

As the table above shows, users can directly sort and access metadata through the semantic framework. This enables users to easily select samples of interest for further downstream analysis.

 

As shown in the graph above, amplicon data is noisy. NG-Tax 2.0, however, suppresses this noise by provisionally rejecting ASVs of low abundance. Moreover, due to the extensive incorporation of metadata and sequence information, users can access and study each (potentially rejected) ASV individually.
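The idea of provisional rejection can be sketched as a simple relative abundance cut-off within one sample. This is an illustration under stated assumptions, not the NG-Tax 2.0 implementation: the function name, the status labels and the 0.1% threshold used as a default here are all chosen for the example, and the actual procedure involves further steps such as error correction.

```python
def flag_low_abundance(read_counts, min_fraction=0.001):
    """Label each ASV in one sample by its share of the sample's reads.

    read_counts maps ASV id -> read count. ASVs below min_fraction are
    labelled "provisionally_rejected" but are kept, not discarded.
    The default threshold is an assumption for this sketch.
    """
    total = sum(read_counts.values())
    if total == 0:
        return {}
    return {
        asv: "accepted" if count / total >= min_fraction else "provisionally_rejected"
        for asv, count in read_counts.items()
    }


# Hypothetical read counts for a single sample (10000 reads in total).
sample_counts = {"ASV_1": 9000, "ASV_2": 995, "ASV_3": 5}
status = flag_low_abundance(sample_counts)
print(status)
```

Keeping the rejected ASVs with a label, rather than deleting them, is what makes it possible to revisit them later across samples.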

 

Unlike OTUs obtained by clustering nearly identical sequences, Amplicon Sequence Variants supposedly better represent a single species. Since NG-Tax 2.0 extracts ASVs and stores them in a semantic database, it enables cross-mapping of ASVs between thousands of samples. In the graph above, we have plotted the number of ASVs shared between samples, including the provisionally rejected ASVs.
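Cross-mapping of this kind boils down to counting, for every ASV, the number of samples it occurs in. A minimal sketch with hypothetical sample and ASV identifiers (the function names are also illustrative):

```python
from collections import Counter


def samples_per_asv(sample_to_asvs):
    """Count in how many samples each ASV occurs."""
    occurrences = Counter()
    for asvs in sample_to_asvs.values():
        # set() guards against an ASV being listed twice for one sample.
        occurrences.update(set(asvs))
    return occurrences


def sharing_histogram(sample_to_asvs):
    """Map 'number of samples' -> 'number of ASVs found in that many samples'."""
    return Counter(samples_per_asv(sample_to_asvs).values())


# Hypothetical data: exact sequences would serve as ASV ids in practice.
example = {
    "S1": ["ASV_A", "ASV_B", "ASV_C"],
    "S2": ["ASV_A", "ASV_B"],
    "S3": ["ASV_A", "ASV_D"],
}

print(dict(samples_per_asv(example)))
print(dict(sharing_histogram(example)))
```

Because ASVs are exact sequences, they can be compared between samples by identity, which is what makes this counting meaningful without any re-clustering.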

 

Next, within each sample, NG-Tax 2.0 analyses the ASVs and flags those supported by very few reads as rejected, as seen in the graph above. Some of these ASVs might be falsely flagged, but through cross-mapping within and between samples they can become accepted and be used for further downstream analysis.

 

For each individual sample, NG-Tax 2.0 provisionally rejects ASVs of low abundance. While most of these ASVs are likely random noise, a provisionally rejected ASV can also point to a microbial species with a particularly low abundance in that specific sample. We can study this by asking two related questions.

1. “Do we find a particular provisionally rejected ASV in more samples?”

2. “Is this ASV an accepted ASV in other samples?” 

As the ASVs are stored in a graph database, these questions, like the results presented in the examples above, can be answered with standard SPARQL queries. In the example, we have plotted the number of provisionally rejected ASVs that are shared between the 1800 samples. As can be seen, the vast majority occur in fewer than three samples, which suggests that most of them are noise ASVs. The example SPARQL query above explores this a little further and answers the question “How many provisionally rejected ASVs are in fact accepted ASVs in other samples?”.
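The logic behind the two questions can also be sketched outside the graph database. The following plain-Python stand-in works on a list of (sample, ASV, status) records; the record layout, the status labels and the function name `rescued_asvs` are assumptions for illustration and do not reflect the NG-Tax 2.0 schema or the SPARQL query used for the analysis.

```python
def rescued_asvs(records):
    """Find ASVs that are provisionally rejected somewhere and accepted elsewhere.

    records is an iterable of (sample_id, asv_id, status) tuples, where status
    is "accepted" or "provisionally_rejected" (hypothetical labels).
    """
    accepted_in = {}
    rejected_in = {}
    for sample, asv, status in records:
        target = accepted_in if status == "accepted" else rejected_in
        target.setdefault(asv, set()).add(sample)
    return {
        asv: {
            "rejected_in": sorted(rejected_in[asv]),
            "accepted_in": sorted(accepted_in[asv]),
        }
        for asv in rejected_in
        if asv in accepted_in
    }


# Hypothetical records for three samples.
records = [
    ("S1", "ASV_1", "accepted"),
    ("S2", "ASV_1", "provisionally_rejected"),
    ("S2", "ASV_2", "provisionally_rejected"),
    ("S3", "ASV_2", "provisionally_rejected"),
    ("S3", "ASV_3", "accepted"),
]

rescued = rescued_asvs(records)
print(rescued)
print(len(rescued), "provisionally rejected ASV(s) are accepted in another sample")
```

Here `ASV_1` is a candidate for re-acceptance because it passes the filter in `S1`, while `ASV_2` is rejected in every sample where it appears and is more likely to be noise.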

Interesting links

Do you want to learn about more applications of NG-Tax 2.0? Via this link, you can read articles that have cited NG-Tax 2.0. This blogpost is a short summary of the research paper, which you can access below for more technical details or cite as:

Finally, this blogpost is linked to our FAIR Data Platform. Visit the platform page for more information.
