I'm analysing some oral microbiome data; quite a simple dataset with 12 samples collected between 8 am and 6 pm.
Previously, some of my students analysed the data in QIIME 1 with UCLUST 97% OTU picking and got some nice results, e.g. an increase in Streptococcus and a decrease in Neisseria throughout the day.
So I wanted to use the data for a brief student workshop using QIIME2. However, when I processed the data through DADA2, the majority of samples ended up with unassigned taxonomy.
Yes, in my experience they come from eukaryotic DNA that, for some reason, has homology to the 16S primers.
There is a quality-control step that uses a VSEARCH filter to remove anything that doesn't hit a 16S reference at 97-99% identity, and it has dramatically cleaned up my samples.
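Concretely, that step looks something like this in QIIME 2 (a sketch; the file names are placeholders, and `ref-seqs.qza` would be an imported 16S reference such as Greengenes):

```shell
# Keep only query sequences that hit the 16S reference at >= 97% identity;
# everything else (e.g. off-target eukaryotic amplicons) lands in "misses"
qiime quality-control exclude-seqs \
  --i-query-sequences rep-seqs.qza \
  --i-reference-sequences ref-seqs.qza \
  --p-method vsearch \
  --p-perc-identity 0.97 \
  --o-sequence-hits hits.qza \
  --o-sequence-misses misses.qza
```

You can then filter your feature table down to the IDs in `hits.qza` before classification.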
edit: The clustering methods in QIIME 1 have a sort-of built-in filtering step. Remember that USEARCH in QIIME 1 requires a 97% match, in either open- or closed-reference clustering, for a sequence to "hit". DADA2 does not do this, so anything that doesn't fit still remains in the sample, but becomes unclassified.
@ben, an interesting thought. However, wouldn't I expect at least a somewhat similar level of unassigned or kingdom-level assignments across all samples?
Here I have 100% kingdom-only/unassigned for samples 1, 11 and 12 and next to no problems for samples 2-10; that is with data generated through Deblur. With the DADA2 data the situation is reversed: samples 2-10 are mostly unassigned/kingdom-only, whilst samples 1, 11 and 12 seem fine. Hence my confusion.
What's the percent identity that you use for Greengenes? edit: Sorry, I don't use SILVA, so I wouldn't know the difference between Greengenes and SILVA, but if it's a classifier problem it seems interesting that SILVA and Greengenes give such different taxonomies.
Is there an easy way to filter that table for the unassigned sequences?
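I'm guessing something like the taxa plugin would do it (file names here are placeholders; not sure if this is the intended approach):

```shell
# Pull the Unassigned features out into their own table
qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-include Unassigned \
  --o-filtered-table unassigned-table.qza

# And the corresponding representative sequences, e.g. for BLASTing
qiime taxa filter-seqs \
  --i-sequences rep-seqs.qza \
  --i-taxonomy taxonomy.qza \
  --p-include Unassigned \
  --o-filtered-sequences unassigned-seqs.qza
```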
Clicking through a couple of sequences I got some uncultured clone hits, but otherwise nothing too much out of the ordinary.
@Nicholas_Bokulich, I shall give it a try and report back. All samples are from the same run, same primers. It was one big HiSeq run and I subsampled a little - still around 50k sequences per sample though.
classify-sklearn cannot currently handle reads in mixed orientations: it guesses the orientation based on the first 100 or so reads and then processes the rest accordingly, but if reads in the opposite orientation crop up it becomes discombobulated. Those samples must, for some reason, have their reads in the opposite orientation.
For your workshop I would recommend just excluding those samples, or using classify-consensus-vsearch, which avoids this issue.
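For the record, the vsearch-based classifier call would look roughly like this (file names are placeholders; the reference reads and taxonomy are from whatever database you are using, e.g. Greengenes):

```shell
# Alignment-based consensus classification; unlike classify-sklearn,
# vsearch searches both strands, so mixed-orientation reads are fine
qiime feature-classifier classify-consensus-vsearch \
  --i-query rep-seqs.qza \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classification taxonomy.qza
```

It is slower than classify-sklearn, but for a 12-sample workshop dataset that shouldn't matter.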
Do you know how this could have happened? All samples were processed and sequenced in one batch on the same run. We used the EMP primers, which pretty much define the orientation of the amplicon, don't they?
EMP does define the orientation, so I do not know how this could happen, but it really looks like that is the case! I think EMP may have changed the constructs a while back to put the primers on the forward rather than the reverse read, but I am not 100% certain. Any chance you are using a mix of old and new primer sets?
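If you want to check, one option outside of QIIME 2 is vsearch's `--orient` command (available in vsearch >= 2.8; file names here are placeholders), which flips reads to match the strand of a reference database:

```shell
# Re-orient query reads against a 16S reference: reads already in the
# correct orientation pass through unchanged, reversed reads are flipped,
# and the tabbed report shows which strand each read matched
vsearch --orient query-seqs.fasta \
  --db 16s-reference.fasta \
  --fastaout oriented-seqs.fasta \
  --tabbedout orientation-report.txt
```

The report would tell you quickly whether samples 1, 11 and 12 really are in the opposite orientation from the rest.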