I love a good mystery.
There are about 950 unique ASVs that are assigned only at the Bacteria level. BLASTing a random handful of them returns hits like these:
- Mus musculus targeted non-conditional, lacZ-tagged mutant allele F10:tm1e(EUCOMM)Hmgu; transgenic
- Mus musculus 10 BAC RP23-287L13 (Roswell Park Cancer Institute (C57BL/6J Female) Mouse BAC Library) complete sequence
- Mus musculus chromosome 12, clone RP23-68M8, complete sequence
All with 99-100% coverage, so host (mouse) DNA is likely the source of the problem.
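For anyone wanting to repeat this spot check, here is a minimal sketch of picking out the Bacteria-only ASVs and drawing a few at random to BLAST. It assumes the taxonomy has already been loaded into a plain dict of feature ID to semicolon-delimited taxon string; the function names and the rank-prefix handling (`k__`, `d__`, bare `p__`) are my own illustrative assumptions, not anything from QIIME 2 itself.

```python
import random


def assigned_ranks(taxon):
    """Return the non-empty rank names in a semicolon-delimited taxon string."""
    ranks = []
    for part in taxon.split(";"):
        # Strip whitespace and any rank prefix such as "k__" or "d__".
        name = part.strip().split("__", 1)[-1]
        if name:  # skips empty ranks like "p__" and trailing ";"
            ranks.append(name)
    return ranks


def bacteria_only(taxonomy):
    """Feature IDs whose only assigned rank is Bacteria (hypothetical helper)."""
    return sorted(
        fid for fid, taxon in taxonomy.items()
        if assigned_ranks(taxon) == ["Bacteria"]
    )


def sample_for_blast(feature_ids, n, seed=0):
    """Draw up to n feature IDs at random, reproducibly, for a manual BLAST."""
    rng = random.Random(seed)
    return rng.sample(feature_ids, min(n, len(feature_ids)))


if __name__ == "__main__":
    # Toy taxonomy for illustration only; real IDs would be ASV hashes.
    toy = {
        "asv1": "k__Bacteria; p__; c__",
        "asv2": "k__Bacteria; p__Firmicutes",
        "asv3": "Unassigned",
    }
    print(sample_for_blast(bacteria_only(toy), 5))
```

The sampled IDs can then be looked up in the representative sequences and pasted into BLAST by hand.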
I had considered PhiX as well, since I'm sure our facility spiked some in to increase heterogeneity in the run for better yield, but these look like host genes. Also, I can't really think how PhiX would end up in specific (barcoded) samples when it is added after that step.
Another point I just remembered is that a different subset of samples from this same run (for a different project) had a similar issue when I ran DADA2 on paired-end reads. Back then I didn't dive into it and just used Deblur (forward reads only) instead, and that seemed to eradicate both the unassigned and the Bacteria-only assignment issues. Perhaps this makes sense, since the pre-packaged error model of Deblur is specific to 16S and so would drop the host variants, whereas DADA2 would denoise everything without discriminating by source. I think I'll add that to the list of things to try: use forward reads with Deblur, and compare against forward reads with DADA2 as well. Will keep you posted.
Thanks for letting me think out loud!