I'm analysing some oral data; quite a simple dataset with 12 samples collected between 8 am to 6 pm.
Previously some of my students have analysed the data in QIIME1 with UCLUST 97% OTU picking and they got some nice results; e.g. the increase of Streptococcus and the decrease of Neisseria throughout the day.
So I wanted to use the data for a brief student workshop using QIIME2. However, when I processed the data through DADA2, the majority of samples ended up with unassigned taxonomy.
No biggie, I thought, maybe if I run Deblur it would filter out most of the stuff, which can't be assigned. Some sleuthing here on the forum brought this solution to daylight.
Well, here comes the odd stuff. Now the samples with unassigned taxonomy are reversed.
I get the same results whether I use Greengenes or Silva. Even if I train my own classifier (the data was generated with EMP primers) I end up with tons of unassigned (save for domain) sequences.
Has anyone got any clue what's going on here? The sequencing data is of pretty high quality.
Yes, from my experience they are from eukaryotic DNA which for some reason has homology with the 16S primers.
There is a quality control step, which requires a Vsearch filter to remove anything that doesn't hit with the 16S 97-99% homology with 16S which has dramatically cleaned up my samples.
edit: The clustering methods in QIIME1 has a "sort of" built in filtering step. Remember the usearch in QIIME1 requires 97% match either open or closed clustering for it to "hit". DADA2 does not do this, so anything that doesn't fit still remains in the sample, but becomes unclassified.
@ben, an interesting thought. However, wouldn't I expect a at least somewhat similar level of unassigned or kingdom-level assignment across all samples?
Here I have 100% kingdom/unassigned for samples 1, 11 and 12 and next to no problems for samples 2-10 - that is with data generated through Deblur. Then for the DADA2 data the situation is reversed, samples 2-10 are mostly unassigned/kingdom, whilst samples 1, 11 and 12 seem fine. Hence my confusion.
What's the percent homology that you use for Greengenes? edit: Sorry, I don't use SILVA so wouldn't know what the difference is between Greengenes vs. SILVA, but if it's a classifier problem it seems interesting that SILVA and Greengenes give such different taxonomy.
Oh sorry, go to DADA2-> rep_seq.qzv (assuming you built the file rep_seq.qza-> rep_seq.qzv) and click on some sequences - then they'll take you to BLAST. Ben
@Cistron,
The switching behavior is very strange. By any chance were the samples at 8:00, 17:00, and 18:00 sequenced on a different run or different primers?
Please give this a try to help troubleshoot: classify with classify-consensus-vsearch instead of classify-sklearn... that will help confirm or refute a sneaking suspicion I have about the data...
Is there isn't an easy way to filter that table for the unassigned sequences?
Clicking through a couple of sequences I got some uncultured clone hits, but otherwise nothing too much out of the ordinary.
@Nicholas_Bokulich, I shall give it a try and report back. All samples are from the same run, same primers. It was one big HiSeq run and I subsampled a little - still around 50k sequences per sample though.
classify-sklearn cannot currently handle reads in mixed orientations — it guesses the orientation of reads based on the first 100 or so, and then processes them accordingly, but if reads in the opposite orientation crop up it becomes discombobulated. Those samples for some reason must have their reads in opposing orientations.
For your workshop I would recommend just excluding those samples, or use classify-consensus-vsearch to avoid this issue.
Do you know how this could have happen? All samples were processed and sequenced in one batch on the same run. We used the EMP primers, which pretty much define orientation of the amplicon, shouldn't they?
EMP does define the orientation, so I do not know how this could happen but it really looks like that is the case! I think EMP may have changed the constructs a while back to put the primers on the forward instead of reverse read, but I am not 100% certain — any chance you are using a mix of old and new primer sets?
I remember the change. Though we order a new batch of primers every year, so they all are of the same orientation.
Could the sub-sampling have jumbled the orientation? I started off demuxing these 12 samples from a HiSeq run, sampled it down to 5% and eventually extracted the sequences from the artifact.
I wish I had time to test this with the full data, but I'm not looking forward to another 12 hour demux run.
Not possible. I aliquot the primers for the students. All from one plate.
I suppose it will remain a mystery for now.
I still want to thank you both again for all your time in investigating this issue. It's really great to have developers and experienced users here scanning the forums