I had the problem of low biomass, too much host DNA, and taxonomic assignment at the kingdom level rather than the genus level, along with many wrong assignments of host DNA to bacteria, and I was going to re-sequence my samples. However, I tried reanalyzing using 99% identity and got better results. Blasting against NCBI showed that these are actual bacterial DNA. I got one Eukaryota (host DNA), which I removed from my plot. The only problem I have now is too many unassigned reads. I used SILVA 132 and 99% identity. I saw in another forum topic ("Best Feature-Classifier?") that other classification methods might reduce unclassified taxa but increase false assignments. Any suggestions on whether I should try those other methods, or might they give me false taxonomy? Here is my bar plot:
What does NCBI BLAST say these unclassified sequences are? Since the classifier is clearly working (you get good classification for most other sequences), I would recommend removing any sequences that are unclassified. They are probably other non-target DNA, or even contaminants/errors that were not caught upstream.
I would discourage trying to reconfigure the classifier parameters too much. As that other topic warned, that can lead to bad things like overclassification. Your classifier seems to be working well, so the unclassified reads probably are not classified for a good reason.
I tried blasting some of the unclassified reads and some of those classified only as Bacteria, and they are host DNA. I imagine I would need to remove Eukaryota, Bacteria (kingdom level only), and unclassified. The only problem is that when I tried removing Eukaryota, those reads seemed to be moved to unclassified. If I remove all three, some individuals will have no microbes, which is probably not true. I am redoing the analysis using classifiers trained on my specific V4 primers to see what I get. Then I will try majority taxonomy, and Greengenes too, to see whether any of these changes anything. I will take your advice and not play with the sklearn parameters. I think I am still missing a lot of microbial content due to low biomass, and I might end up having to re-sequence with V3V4.
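To check how many samples would be left empty before committing to that three-way removal, the retained fraction of reads per sample can be estimated from an exported feature table and taxonomy. This is only a rough sketch in plain Python, not a QIIME 2 command: the feature IDs, sample names, counts, and the helper names `is_uninformative` and `retained_fraction` are all made up for illustration, and the labels assume SILVA 132 style prefixes (D_0__, D_1__, ...).

```python
def is_uninformative(taxon):
    """True for Eukaryota, unassigned, or kingdom-only labels (illustrative rule)."""
    if "Eukaryota" in taxon:
        return True
    # Drop empty or padding ranks such as "__" before counting levels.
    ranks = [r for r in taxon.split(";") if r.strip("_ ")]
    # "Unassigned" and "D_0__Bacteria" both have a single informative rank.
    return len(ranks) < 2


def retained_fraction(table, taxonomy):
    """table: {sample: {feature_id: count}} -> {sample: fraction of reads kept}."""
    fractions = {}
    for sample, counts in table.items():
        total = sum(counts.values())
        kept = sum(n for fid, n in counts.items()
                   if not is_uninformative(taxonomy[fid]))
        fractions[sample] = kept / total if total else 0.0
    return fractions


# Hypothetical data: S2 contains only host and kingdom-level reads.
taxonomy = {
    "f1": "D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli",
    "f2": "D_0__Bacteria;__;__",
    "f3": "D_0__Eukaryota;D_1__Opisthokonta",
}
table = {
    "S1": {"f1": 90, "f2": 5, "f3": 5},
    "S2": {"f2": 40, "f3": 60},
}

fractions = retained_fraction(table, taxonomy)
for sample, frac in sorted(fractions.items()):
    print(sample, round(frac, 2))
```

Samples whose retained fraction is at or near zero are the ones that would show "no microbes" after filtering, which may help decide whether re-sequencing is worth it.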
I tried Greengenes, and it gave me completely different taxa. The V4 primer-trained classifier was worse, so I just used the pre-trained classifiers from the QIIME 2 website. I was able to remove Eukaryota as well as Unclassified, but I can't seem to get rid of D-0-Bacteria;-;-;-;-;-. I tried different methods; none of them worked. Any hints?
Instead of excluding, use --p-include p__. Then you will only include sequences that receive at least a phylum-level classification (so anything classified only to the kingdom level is excluded).
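To make the include-versus-exclude logic concrete, here is a minimal sketch in plain Python of substring-based taxonomy filtering. It is not QIIME 2 code and only approximates the idea behind include and exclude search terms. The feature IDs and the helper names `keep_feature` and `filter_taxonomy` are hypothetical. One assumption worth flagging: p__ is the Greengenes-style phylum prefix, while SILVA 132 labels use D_1__ for that rank, so the sketch uses D_1__ as the include term.

```python
def keep_feature(taxon, include=(), exclude=()):
    """Keep a label if it contains any include term and no exclude term."""
    if include and not any(term in taxon for term in include):
        return False
    return not any(term in taxon for term in exclude)


def filter_taxonomy(taxonomy, include=(), exclude=()):
    """Return only the features whose labels pass the filter."""
    return {fid: taxon for fid, taxon in taxonomy.items()
            if keep_feature(taxon, include, exclude)}


# Hypothetical feature IDs with SILVA 132 style labels.
taxonomy = {
    "f1": "D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli",
    "f2": "D_0__Bacteria",                      # kingdom only
    "f3": "D_0__Eukaryota;D_1__Opisthokonta",   # host DNA
    "f4": "Unassigned",
}

# Requiring a phylum-level prefix drops kingdom-only and unassigned features;
# Eukaryota still needs an explicit exclude because it has a phylum label too.
kept = filter_taxonomy(taxonomy, include=("D_1__",), exclude=("Eukaryota",))
print(sorted(kept))  # ['f1']
```

The point of the include term is that it removes kingdom-only and unassigned labels in one step, without having to match the exact kingdom-only string.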