incomplete taxonomic assignment

Dear Community,

I am processing multi-amplicon sequencing data from an Ion GS S5 sequencing platform. My basic approach is DADA2 denoising followed by feature classification with classify-consensus-blast/vsearch.
After classification, I am getting only partial taxonomic assignments (down to family or class level), e.g. "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;;;__".
I understand this might be due to poor primer trimming, short amplicon length, or some other sequence-quality issue. However, when I take the ASV sequence and run BLAST against NCBI nr, I am able to pick up the genus-level taxonomy correctly.

What I want to know is whether there is any other way to improve the taxonomic assignment, so that I get a genus-level assignment (even with somewhat lower confidence) for most of my ASVs?

I tried running vsearch separately on such ASVs with a 0.8 cutoff, and the results are not very convincing.
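
For picking out "such ASVs" in the first place, here is a minimal Python sketch that reports the deepest rank carrying a real label in a taxonomy string. It assumes SILVA-style semicolon-separated strings like the example above; the names `RANKS`, `deepest_rank`, and `reaches_genus` are hypothetical helpers, not part of QIIME 2.

```python
# Assumed rank order for SILVA-style strings (domain through species).
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def deepest_rank(taxonomy: str) -> str:
    """Return the deepest rank that carries an actual label."""
    if taxonomy.strip().lower() == "unassigned":
        return "unassigned"
    deepest = "unassigned"
    for rank, label in zip(RANKS, taxonomy.split(";")):
        label = label.strip()
        # Treat empty fields and bare prefixes such as "g__" as unassigned.
        name = label.split("__", 1)[1] if "__" in label else label
        if not name:
            break
        deepest = rank
    return deepest


def reaches_genus(taxonomy: str) -> bool:
    """True if the string is resolved to genus level or deeper."""
    return deepest_rank(taxonomy) in ("genus", "species")
```

Applied to the taxonomy column of an exported table, `reaches_genus` gives a quick list of the features worth a second look.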

Thank you for your support!!!

Hi @Saptarathi_Deb,

I am assuming you are running with default settings against either the SILVA or Greengenes reference database?

A few things to note:

  1. You could be observing a limitation of the reference database being used (i.e. SILVA / Greengenes).
  2. Be wary of how BLAST hits are displayed on NCBI. That is, equivalent BLAST hits are arbitrarily sorted, and if you scroll down far enough you may find that there is an identical "hit" to a very different organism.
  3. Given point 2, this is why we have classify-consensus-blast and classify-consensus-vsearch. Any hits that cannot be taxonomically resolved have their taxonomy truncated to the last common ancestor. This applies to classify-sklearn too.
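
To make point 3 concrete, here is a rough Python sketch of consensus truncation. It is a simplified illustration, not the actual q2-feature-classifier code: the function name `consensus_taxonomy` is hypothetical, the 0.51 threshold is an assumed value (check the `min-consensus` setting of your installed version), and the hits in the usage note are made up.

```python
from collections import Counter


def consensus_taxonomy(hits, min_consensus=0.51):
    """Keep each rank only while enough of the top hits agree on it."""
    if not hits:
        return "Unassigned"
    split_hits = [h.split(";") for h in hits]
    total = len(split_hits)
    consensus = []
    remaining = split_hits
    depth = max(len(h) for h in split_hits)
    for level in range(depth):
        # Count labels at this rank among hits that still follow the consensus.
        labels = Counter(h[level] for h in remaining if len(h) > level)
        if not labels:
            break
        label, count = labels.most_common(1)[0]
        # Stop (truncate) as soon as agreement drops below the threshold.
        if count / total < min_consensus:
            break
        consensus.append(label)
        remaining = [h for h in remaining if len(h) > level and h[level] == label]
    return ";".join(consensus) if consensus else "Unassigned"
```

With four equally good hits split evenly between two families of the same order, no family reaches the threshold, so the result stops at the order. That is the same kind of truncated string as in the original question.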

More information can be found here:

You can also try your hand at using RESCRIPt to make your own reference database for classifying your sequences:

This is the limitation of assigning taxonomy using short reads. However, you can use tools like q2-clawback to help improve things:


Thank you @SoilRotifer, I am trying out your suggestions.

Yes, I am using the "Silva 138 SSURef NR99 full-length sequences" (I believe it's the RESCRIPt-processed version of SILVA 138.1 on the QIIME resources page).

After denoising I am using the default settings to classify the representative sequences using the consensus blast plugin.

I think that poor primer trimming or short, incomplete amplicons are leading to the incomplete taxonomic classification.

What is the length of the sequences you've generated from this platform? And what is their quality?

It is quite common for many reads to only be classified to upper-level taxonomy. Have you tried using feature-classifier classify-sklearn?

Can you share your taxonomy barplot qzv file? You can DM me this file if you do not wish to share it publicly. This way I can also look through the provenance and try to piece together what your processing steps are.

I am using data from the Thermo 16S multi-amplicon kit.

  • It consists of 6 amplicons, 200-250 base pairs long, covering 7 variable regions.
  • One issue is that the primers are unknown. We are dealing with this by either 1) trimming 20 base pairs from both ends, or 2) working out the primer sequences as far as possible and using cutadapt to trim them (however, the reads come off the sequencing machine in mixed orientation, so I feel this is sometimes less efficient).
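
Option 1 above (fixed trimming) can be sketched in a few lines of Python. This is only an illustration of the idea, assuming each read is available as a sequence string with a matching quality string; `trim_fixed` is a hypothetical helper, and the default of 20 bases per end is the value mentioned above, not a vendor recommendation.

```python
def trim_fixed(seq: str, qual: str, left: int = 20, right: int = 20):
    """Remove a fixed number of bases from both ends of a read."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must be the same length")
    if left < 0 or right < 0:
        raise ValueError("trim lengths must be non-negative")
    end = len(seq) - right
    # Reads shorter than the combined trim length are emptied, not wrapped.
    if end <= left:
        return "", ""
    return seq[left:end], qual[left:end]
```

With 200-250 bp amplicons, this leaves roughly 160-210 bp per read for classification.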

Since I want the classification to happen without separating the reads based on variable regions, it's not possible to use classify-sklearn.

I am attaching one of the barplots I generated after classify-consensus-blast taxonomy assignment. If you scroll down, you will find many taxa classified only to order or family level. I am certainly not expecting species-level classification, but it would be good to have genus level.

barplot_consensus_blast.qzv (783.0 KB)

The only reason I am a little greedy about classifying them to genus level is that I realized some of the partially classified reads could change the differential abundance statistics if we re-classify them using vsearch or NCBI BLAST and add them to their original taxa.

I am trying to use classify-sklearn now by separating the reads region by region. I hope it works well.

Thank you for the support,

The unknown primers and mixed read orientation are not much of a problem...

  • They are using proprietary primers? Often companies are willing to tell you how many bases to trim. Did you ask them for these details?
  • Are you sure the primers are even in the resulting sequence data? I ask because not all protocols sequence "through" the primer.
  • Mixed orientation:
    • Trimming primers from mixed-orientation reads is not a problem for cutadapt.
    • You can also simply re-orient all of your reads to a reference database (e.g. SILVA), using the orient-seqs command from the RESCRIPt plugin.
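
For intuition about what re-orienting against a reference means, here is a toy Python sketch. It is not how RESCRIPt's orient-seqs is implemented; it simply compares shared k-mers for the read and its reverse complement. The helper names and the choice of k=8 are assumptions made for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int = 8) -> set:
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orient(read: str, reference: str, k: int = 8) -> str:
    """Return the read in whichever strand shares more k-mers with the reference."""
    ref_kmers = kmers(reference.upper(), k)
    fwd = read.upper()
    rev = reverse_complement(fwd)
    fwd_score = len(kmers(fwd, k) & ref_kmers)
    rev_score = len(kmers(rev, k) & ref_kmers)
    # On a tie, keep the read as given (upper-cased).
    return rev if rev_score > fwd_score else fwd
```

A real tool aligns against a whole reference database rather than a single sequence, so treat this only as a picture of the idea.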

It is not true that classify-sklearn cannot be used without separating the reads by variable region. You can simply use the full-length SILVA classifier for all variable regions.

Separating the reads region by region will work too, of course. But you might be better off with my next suggestion. :slight_smile:

Have you looked into the q2-sidle plugin? You'll be able to leverage all of your amplicon regions to obtain a better taxonomy.

