I have a large number of unassigned reads in low-abundance samples (e.g., lung/BAL samples). Other sample types such as gut, nasal, skin, TI, and cecum are quite robust, with unassigned reads making up less than 5% of the relative frequency.
I've already done a quick search for related questions, and for the most part it looks like people have dealt with this and that it's generally an annotation issue, with classification parameters being too stringent (99% OTU match)? However, I would add that using UCLUST with QIIME 1.9.1 and open- or closed-reference picking, we never had this many unassigned reads.
Any advice? I have attached a taxa_bar_plot screenshot. The yellow bars are unassigned:
What's even more interesting is that in the PBS blanks we ran for this study we get around 5% unassigned, which is acceptable, but in another study the lung samples are again dominated by unassigned reads. I'm pretty sure this is animal DNA. (Brown is the unassigned fraction in the BAL samples.)
This is probably NOT an annotation issue (unless you mean that the lack of animal DNA in the reference is to blame). Do you have any posts to reference? This post may hold the answer.
Particularly in low-biomass samples, a high proportion of unassigned reads will probably be host DNA and/or other non-target DNA or artifacts. That is better than cross-contamination, which would be much more difficult to eliminate.
I’d recommend doing a preliminary check (e.g., NCBI blast a few unassigned reads) just to see what these reads might be, then filter out all unassigned reads without giving it another thought.
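As a rough illustration of that preliminary check, here is a minimal Python sketch that pulls a handful of unassigned sequences into a small FASTA query you could paste into NCBI BLAST. It assumes you have already exported your representative sequences as FASTA text and your taxonomy as a feature-ID-to-taxon mapping; the function names (`read_fasta`, `sample_unassigned`), the feature IDs, and the toy data are all hypothetical, and the exact label your classifier uses for unassigned features may differ.

```python
import random


def read_fasta(text):
    """Parse FASTA-formatted text into a {feature_id: sequence} dict."""
    seqs, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            current = line[1:].split()[0]
            seqs[current] = ""
        elif current is not None:
            seqs[current] += line
    return seqs


def sample_unassigned(seqs, taxonomy, n=5, seed=0, label="Unassigned"):
    """Return FASTA text for up to n randomly chosen unassigned features."""
    unassigned = sorted(
        fid for fid, taxon in taxonomy.items()
        if taxon.strip() == label and fid in seqs
    )
    rng = random.Random(seed)  # fixed seed so the pick is reproducible
    picked = rng.sample(unassigned, min(n, len(unassigned)))
    return "".join(f">{fid}\n{seqs[fid]}\n" for fid in sorted(picked))


# Toy data standing in for exported sequences and taxonomy (hypothetical).
fasta_text = """>feat1
ACGTACGT
>feat2
GGGGCCCC
TTTT
>feat3
ATATATAT
"""
taxonomy = {
    "feat1": "k__Bacteria; p__Firmicutes",
    "feat2": "Unassigned",
    "feat3": "Unassigned",
}

seqs = read_fasta(fasta_text)
query = sample_unassigned(seqs, taxonomy, n=5)
print(query)
```

If the top hits for these sequences come back as host (e.g., mouse or human) rather than bacterial, that supports the host-DNA explanation.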
A 99% match is very stringent. I would lower that, personally. But that does not actually seem related to the issue you are having, since unassigned is only high in the low-biomass samples, suggesting that it may be some artifact, background noise, or host DNA.
Closed-reference OTU picking will remove these before you ever see them, because they do not match the reference database. Open-reference picking would build novel OTUs, but the different filtering/chimera checking methods between QIIME 1 and QIIME 2 could be leading to this disparity.
Lower your % similarity threshold a bit, filter all unassigned features, and don’t look back.
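To make the filtering step concrete, here is a small Python sketch of the logic: compute the unassigned fraction per sample, then drop every unassigned feature. In QIIME 2 itself this is normally done with the `qiime taxa filter-table` command rather than by hand; the code below only illustrates the idea on a plain nested dict. The function names, sample names, counts, and the "Unassigned" label are assumptions for the example, not real data from this thread.

```python
def unassigned_fraction(table, taxonomy, label="Unassigned"):
    """Return {sample: fraction of reads whose feature is unassigned}."""
    fractions = {}
    for sample, counts in table.items():
        total = sum(counts.values())
        # Features missing from the taxonomy are treated as unassigned.
        unassigned = sum(
            count for fid, count in counts.items()
            if taxonomy.get(fid, label) == label
        )
        fractions[sample] = unassigned / total if total else 0.0
    return fractions


def drop_unassigned(table, taxonomy, label="Unassigned"):
    """Return a copy of the table with all unassigned features removed."""
    return {
        sample: {
            fid: count for fid, count in counts.items()
            if taxonomy.get(fid, label) != label
        }
        for sample, counts in table.items()
    }


# Toy feature table: {sample: {feature_id: read count}} (hypothetical values).
table = {
    "gut": {"f1": 95, "f2": 5},
    "BAL": {"f1": 20, "f2": 80},
}
taxonomy = {"f1": "k__Bacteria; p__Bacteroidetes", "f2": "Unassigned"}

fractions = unassigned_fraction(table, taxonomy)
filtered = drop_unassigned(table, taxonomy)
print(fractions)
print(filtered)
```

Checking the per-sample fraction before filtering is worthwhile: if only the low-biomass samples are dominated by unassigned reads, that pattern points to host or background DNA rather than a classifier problem.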
Edit/update: I just wanted to say that applying the method pointed out in the other thread has removed all of the unassigned K__bacteria reads. Thank you again. I had the opportunity to run it on the HPC. One of the sequence runs took up to 24 hours with a 97% match, but I think it was worth it: the lung samples now look the way we expect them to. We did have to retrain a feature classifier. Thanks @Nicholas_Bokulich