Uncharacterized OTUs

Dear all,
I ran taxonomic classification using the pre-trained SILVA 132 classifier:
https://data.qiime2.org/2019.1/common/silva-132-99-515-806-nb-classifier.qza

In my results, more than 50% of the reads are either unclassified or classified only to kingdom level. Because the percentage is so high, I cannot simply discard those results. How can I reduce the proportion of these OTUs?

I ran into the same problem. I tried different settings for read extraction, but with no success.

Have you tried the consensus BLAST method? It could give better resolution.

Hi @steffi,
Lots of other forum users have posted questions about kingdom-level classification and unclassified sequences. So I will summarize here but I recommend perusing those posts for more troubleshooting tips.

A high number of uncharacterized sequences is usually due to one of the following:

  1. You are using an inappropriately trained classifier (if you are using the classify-sklearn method). Make sure your classifier is trained on the same gene region as your amplicons.
  2. These sequences are non-target DNA, including host DNA. Anything that is unclassified or only classified at kingdom level should probably be removed, and you can use NCBI BLAST to spot check a few of these to confirm that they are non-target.
  3. Your sequence reads are in mixed orientations. The classify-sklearn method assumes that your reads are in a uniform orientation and gets confused when mixed orientations are present. If they are in mixed orientations, use the blast-based classifier that @bsen2018 recommends — it can operate on these.
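Before troubleshooting, it can help to measure how much of the data is affected. The following is only a minimal sketch: it assumes semicolon-delimited taxonomy strings with SILVA 132 style rank prefixes (for example `D_0__Bacteria;D_1__Firmicutes`) and a mapping of taxonomy string to read count. The function names and the example counts are hypothetical, not part of QIIME 2.

```python
from typing import Dict


def classification_depth(taxon: str) -> int:
    """Count the named ranks in a semicolon-delimited taxonomy string."""
    if taxon.strip().lower() in {"", "unassigned", "unclassified"}:
        return 0
    ranks = [rank.strip() for rank in taxon.split(";")]
    # Skip ranks that carry only a prefix and no name, e.g. "D_1__".
    return sum(1 for rank in ranks if rank and not rank.endswith("__"))


def fraction_poorly_classified(taxa_counts: Dict[str, int]) -> float:
    """Fraction of reads that are unassigned or assigned to kingdom level only."""
    total = sum(taxa_counts.values())
    if total == 0:
        return 0.0
    poor = sum(
        count
        for taxon, count in taxa_counts.items()
        if classification_depth(taxon) <= 1
    )
    return poor / total


# Made-up counts, for illustration only.
example_counts = {
    "D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli": 400,
    "D_0__Bacteria": 350,
    "Unassigned": 250,
}
print(f"{fraction_poorly_classified(example_counts):.1%} of reads are poorly classified")
```

Re-running the same calculation after each change (different classifier, re-oriented reads, host filtering) shows whether the change actually helped.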

Good luck!

Dear @Nicholas_Bokulich


As mentioned in my previous post, we amplified the V1-V9 region using three different primers. We tried to separate the reads by primer using cutadapt and split_primers, but failed. As a result, I could not train a classifier on my own sequences. That may be one of the reasons; I am still trying to separate them.

As you suggested, I ran BLAST on the uncharacterized and kingdom-level sequences. Some of the hits gave bacterial species-level information and some matched host DNA. For the bacterial species hits, however, the percent identity and the coverage are below 97%.

Sorry for the basic question: what do you mean by mixed orientation?

Use a classifier trained on full-length 16S and it should work for all domains.

Mixed orientation means that the reads are a mixture of forward and reverse sequences, relative to the orientation of the sequences in the reference database.
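One rough way to check for this is to look for the forward primer at the start of each read and at the start of its reverse complement. The sketch below is only an illustration under several assumptions: the primers are still attached, each read spans the whole amplicon, and the primer shown (the commonly used 515F sequence) is a placeholder to be replaced with your own. The function names are hypothetical.

```python
import re
from collections import Counter

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer: str) -> "re.Pattern":
    """Compile a (possibly degenerate) primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def orientation(read: str, forward_primer: str, window: int = 40) -> str:
    """Classify a read as 'forward', 'reverse', or 'unknown'."""
    pattern = primer_regex(forward_primer)
    read = read.upper()
    if pattern.search(read[:window]):
        return "forward"
    # A reverse-oriented read shows the primer only after reverse-complementing.
    if pattern.search(reverse_complement(read)[:window]):
        return "reverse"
    return "unknown"


# Placeholder primer (assumption); substitute the primer used in your study.
FORWARD_PRIMER = "GTGCCAGCMGCCGCGGTAA"

forward_read = (
    "GTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAG"
)
reads = [forward_read, reverse_complement(forward_read), "ACGTACGTACGT"]
summary = Counter(orientation(read, FORWARD_PRIMER) for read in reads)
print(dict(summary))
```

If a sizeable share of reads comes back as "reverse", the data are likely in mixed orientations. Reads reported as "unknown" may be non-target DNA or may simply have had their primers trimmed already.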

If many of the unclassified sequences match bacteria rather than host DNA in NCBI BLAST, it sounds like you may have mixed-orientation reads. Try the classify-consensus-blast method to see what happens.
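For intuition about what a consensus approach does with several BLAST hits, here is a simplified sketch of the general idea: keep each taxonomic rank only while a sufficient fraction of the hits agree on it. This is not the q2-feature-classifier implementation; the function name, the `min_consensus` parameter, and its value are assumptions for illustration.

```python
from collections import Counter
from typing import List


def consensus_taxonomy(hit_taxonomies: List[str], min_consensus: float = 0.51) -> str:
    """Return the deepest lineage supported by at least min_consensus of the hits.

    Assumes min_consensus > 0.5, so at most one lineage can win at each rank.
    """
    if not hit_taxonomies:
        return "Unassigned"
    split_hits = [[rank.strip() for rank in t.split(";")] for t in hit_taxonomies]
    consensus: List[str] = []
    max_depth = max(len(hit) for hit in split_hits)
    for level in range(max_depth):
        # Compare whole lineage prefixes, so a rank only counts
        # when every rank above it agrees as well.
        prefixes = Counter(
            tuple(hit[: level + 1]) for hit in split_hits if len(hit) > level
        )
        best, count = prefixes.most_common(1)[0]
        if count / len(split_hits) < min_consensus:
            break
        consensus = list(best)
    return ";".join(consensus) if consensus else "Unassigned"


# Made-up hits for a single query sequence.
hits = [
    "D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli",
    "D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia",
    "D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria",
]
print(consensus_taxonomy(hits))
```

Because the hits are aligned rather than matched by k-mer profile, read orientation does not matter at this step, which is why a BLAST-based method can cope with mixed-orientation data.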

Good luck!
