In my classification, more than 50% of the reads are either unclassified or classified only up to kingdom level. Since the percentage is so high, I cannot simply discard those results. How can I minimize the percentage of those OTUs?
A high number of uncharacterized sequences is usually due to one of the following:
You are using an inappropriately trained classifier (if you are using the classify-sklearn method). Make sure your classifier is trained on the same gene region as your amplicons.
These sequences are non-target DNA, including host DNA. Anything that is unclassified or only classified at kingdom level should probably be removed, and you can use NCBI BLAST to spot check a few of these to confirm that they are non-target.
Your sequence reads are in mixed orientations. The classify-sklearn method assumes that your reads are in a uniform orientation and gets confused when mixed orientations are present. If they are in mixed orientations, use the blast-based classifier that @bsen2018 recommends — it can operate on these.
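Before choosing among these explanations, it can help to count how many features actually fall into the "unclassified" or "kingdom only" buckets. Below is a minimal Python sketch of that tally. It assumes taxonomy strings are semicolon-delimited with rank prefixes such as `k__` or `d__` (the exact format depends on your reference database), and the function names and example strings are hypothetical, not part of QIIME 2.

```python
from collections import Counter


def taxonomy_depth(taxon: str) -> int:
    """Count the leading ranks that carry a name in a taxonomy string."""
    if taxon.strip().lower() in {"", "unassigned", "unclassified"}:
        return 0
    depth = 0
    for rank in taxon.split(";"):
        rank = rank.strip()
        # Drop a rank prefix such as "k__"; an empty remainder means no name here.
        name = rank.split("__", 1)[1] if "__" in rank else rank
        if not name:
            break
        depth += 1
    return depth


def summarize(taxa):
    """Tally features as unclassified, kingdom only, or classified deeper."""
    counts = Counter()
    for taxon in taxa:
        depth = taxonomy_depth(taxon)
        if depth == 0:
            counts["unclassified"] += 1
        elif depth == 1:
            counts["kingdom only"] += 1
        else:
            counts["deeper"] += 1
    return counts


# Hypothetical taxonomy assignments, for illustration only.
example = [
    "k__Bacteria; p__Firmicutes; c__Bacilli",
    "k__Bacteria",
    "Unassigned",
    "k__Bacteria; p__",
]
print(summarize(example))
```

The features that land in the first two buckets are the ones worth spot checking with BLAST.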
As mentioned in the previous post, we amplified the V1-V9 region using three different primers. We tried to separate the reads by primer using cutadapt and split_primers, but failed. Hence I could not train the classifier on my own sequences. That may be one of the reasons; I am still trying to separate them.
As you suggested, I ran BLAST on the sequences that were unclassified or classified only at kingdom level. For some of them I got bacterial species-level hits, and for others I got host DNA. However, for those with bacterial species hits, the percent identity and the coverage are below 97%.
Sorry for the basic question: what do you mean by mixed orientation?
Use a classifier trained on full-length 16S and it should work for all domains.
The reads contain a mixture of forward and reverse sequences, relative to their orientation in the reference sequence database.
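To make the idea concrete, here is a rough Python sketch that guesses a read's orientation by checking whether the read or its reverse complement shares more k-mers with a reference sequence. This is only an illustration of the concept, not what classify-sklearn or the BLAST-based classifier does internally. The function names, the k-mer size, and the toy sequences are all assumptions made for this example.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq: str, k: int = 8) -> set:
    """Collect the distinct k-mers of a sequence (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def guess_orientation(read: str, reference: str, k: int = 8) -> str:
    """Compare k-mer overlap of the read and its reverse complement with the reference."""
    ref_kmers = kmers(reference, k)
    forward = len(kmers(read, k) & ref_kmers)
    reverse = len(kmers(reverse_complement(read), k) & ref_kmers)
    if forward > reverse:
        return "forward"
    if reverse > forward:
        return "reverse"
    return "ambiguous"


# Toy sequences, for illustration only.
reference = "ATGGCGTACGTTAGCCTAGGATCCGATTACGGA"
read_fwd = reference[5:25]
read_rev = reverse_complement(read_fwd)

print(guess_orientation(read_fwd, reference))
print(guess_orientation(read_rev, reference))
```

If a sizeable share of your reads come out as "reverse" against a reference from your database, the data set is in mixed orientation.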
If many of the unclassified sequences hit bacteria with NCBI BLAST rather than host DNA, it sounds like you may have mixed orientation reads. Try the classify-consensus-blast method to see what happens.
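If you save your spot-check BLAST results in the standard 12-column tabular format (`-outfmt 6` in BLAST+), a short script can summarize them. The sketch below keeps the first hit per query (assuming hits are listed best first, as BLAST normally reports them), counts plus versus minus strand hits (in blastn tabular output a minus-strand hit has `sstart` greater than `send`), and flags hits that reach the 97% identity and coverage figures mentioned earlier in the thread. The function name, the `query_lengths` dictionary, and the example rows are hypothetical. Query coverage is computed from the query coordinates because the default columns do not include it.

```python
import csv
import io
from collections import Counter

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def summarize_hits(tabular_text, query_lengths, min_identity=97.0, min_coverage=97.0):
    """Tally hit strands and list queries whose top hit passes both thresholds."""
    strands = Counter()
    confident = []
    seen = set()
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(FIELDS, row))
        qid = hit["qseqid"]
        if qid in seen:
            continue  # keep only the first (assumed best) hit per query
        seen.add(qid)
        # A minus-strand hit is reported with subject start > subject end.
        minus = int(hit["sstart"]) > int(hit["send"])
        strands["minus" if minus else "plus"] += 1
        aligned = abs(int(hit["qend"]) - int(hit["qstart"])) + 1
        coverage = 100.0 * aligned / query_lengths[qid]
        if float(hit["pident"]) >= min_identity and coverage >= min_coverage:
            confident.append(qid)
    return strands, confident


# Made-up rows, for illustration only.
example = "\n".join([
    "\t".join(["read1", "refA", "99.2", "250", "2", "0", "1", "250", "101", "350", "1e-120", "450"]),
    "\t".join(["read2", "refB", "98.8", "248", "3", "0", "1", "248", "600", "353", "1e-118", "440"]),
    "\t".join(["read3", "refC", "91.0", "120", "10", "1", "1", "120", "20", "139", "1e-30", "150"]),
])
lengths = {"read1": 250, "read2": 250, "read3": 250}

strands, confident = summarize_hits(example, lengths)
print(dict(strands), confident)
```

A noticeable share of minus-strand hits among reads that BLAST places in bacteria would point toward mixed orientation rather than non-target DNA.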