Annotate Bacterial Genes Amplicon data

Hello, Team Qiime2 forum,

I have used 8 primers to amplify the sequences of bacterial genes and then pooled the data to generate amplicon sequencing data. I wanted to see the OUT/ASV clustering in the dataset.

For the same, I have imported the data in qiime2 followed by denoising using DADA2. Created classifier using fit-classifier-naive-bayes from data using SILVIA full-length sequences. I annotated the information using feature-classifier classify-sklearn. I visualized the information using metadata tabulate followed by taxa barplot. However, most of the features are unannotated. I am attaching the exact commands used for the process.

Please suggest if the approach is correct or some other plugins/parameters need to be changed to get a complete annotated feature table.

Importing data

qiime tools import --type MultiplexedPairedEndBarcodeInSequence --input-path fastq_files/ --output-path ./results/multiplexed-seqs.qza

Demultiplexed and summarise

Removed reverse barcodes

qiime cutadapt demux-paired --i-seqs ./results/multiplexed-seqs.qza --m-forward-barcodes-file metadata.tsv --m-forward-barcodes-column Forward --p-error-rate 0.1 --o-per-sample-sequences ./results/demux.qza --o-untrimmed-sequences ./results/untrimmed.qza --verbose 

qiime demux summarize --i-data ./results/demux.qza --o-visualization ./results/demux.qzv

Selecting Sequence Variants

qiime dada2 denoise-paired
--i-demultiplexed-seqs ./results/demux.qza
--o-table ./results/table-dada2
--o-representative-sequences ./results/rep-seqs-dada2.qza
--p-trim-left-f 0
--p-trim-left-r 0
--p-trunc-len-f 134
--p-trunc-len-r 120
--p-n-threads 40
--p-n-reads-learn 200000
--o-denoising-stats ./results/denoising-stats.qza

Create classifier

qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads ./classifier/silva-138-99-seqs.qza  --i-reference-taxonomy ./classifier/silva-138-99-tax.qza --o-classifier ./classifier/silva_99_341F806R_classifier.qza --verbose

qiime feature-classifier classify-sklearn --i-classifier  ./classifier/silva_99_341F806R_classifier.qza --i-reads ./results/rep-seqs-dada2.qza --o-classification ./results/taxonomy.qza

qiime metadata tabulate --m-input-file ./results/taxonomy.qza --o-visualization ./results/taxonomy.qzv

qiime taxa barplot --i-table ./results/table-dada2 --i-taxonomy ./results/taxonomy.qza --m-metadata-file metadata.tsv --o-visualization taxa-bar-plots.qzv![WhatsApp Image 2021-07-21 at 3.01.27 

As you can see most of the output data is unannotated

  1. Also, when I get .qza file from OTU clustering, how to create the PCA or some kind of clustering plots for that?

Hi @Dhwani_Dholakia, sorry for the slow reply. @Keegan-Evans will be with you soon!


Thanks for providing the details that you have about your analysis! I am wondering about the genes you are targeting in your study, it could just be that SILVA is not the correct database to use with your project.

However given the extremely low rate of annotation that you ended up with, I would suspect that somehow you still have "garbage" data in your samples. This often happens when primer sequences are not removed before any other steps are performed in the analysis. Making sure that you have removed all of your primer sequences is where I would start. You might take a look at this forum post.

You say that you use DADA2 for denoising, but you do not say what sequencing technology was used to generate your data. I bring this up because DADA2 is specifically for use with Illumina-sequenced data.

Also, DADA2 generates ASVs, which are a more precise, modern equivalent to OTUs. While it is possible to cluster again afterwards, take a look at the following resources and consider not clustering after denoising.

Exact sequence variants should replace operational taxonomic units in marker-gene data analysis
ESVs vs OTUs
To Cluster Or Not To Cluster

If you are specifically wanting PCA, you might want to checkout DEICODE. However, there are other PCoA methods that are more commonly used besides PCA(Euclidian PCoA), you might want to check out this forum thread for a more in depth discussion.

Hopefully this will help you move forward with your analysis! If this doesn't get things moving forward or you have more questions let me know.


@Keegan-Evans ,

How do we remove primer sequences, because here we have used 8 different primers. Usually, the cutadapt accepts one primer pair.

Also, why does primer act as garbage? Its a part of natural sequence just like the next amplicon, that is amplified


You can simply pass q2-cutadapt all of the primers that you want trimmed, for example:
--p-adapter-f forward_adapter_1 forward_adapter_2 forward_adapter_3

They have to be removed before denoising with q2-dada2. This is to avoid any issues with DADA2's chimera removal tools. Source.

You might want to checkout this video from one of our workshops on denoising.

Hope this helps!