Unassigned taxa oddly reversed between Deblur and DADA2 pipelines *so confused*

Dear all,

I've run into some very odd data behaviour. :confused:

I'm analysing some oral data; quite a simple dataset with 12 samples collected between 8 am and 6 pm.

Previously some of my students analysed the data in QIIME1 with UCLUST 97% OTU picking and got some nice results, e.g. an increase in Streptococcus and a decrease in Neisseria over the course of the day.

So I wanted to use the data for a brief student workshop using QIIME2. However, when I processed the data through DADA2, the majority of samples ended up with unassigned taxonomy.

No biggie, I thought; maybe if I ran Deblur it would filter out most of the stuff that can't be assigned. Some sleuthing here on the forum brought this solution to light.

Well, here comes the odd stuff. Now the samples with unassigned taxonomy are reversed. :thinking:

I get the same results whether I use Greengenes or Silva. Even if I train my own classifier (the data was generated with EMP primers) I end up with tons of unassigned (save for domain) sequences.

Has anyone got any clue what's going on here? The sequencing data is of pretty high quality.



Yes, in my experience these are from eukaryotic DNA that for some reason has homology with the 16S primers.

There is a quality-control step that uses a VSEARCH filter to remove anything that doesn't hit the 16S reference at 97-99% identity; it has dramatically cleaned up my samples.

edit: The clustering methods in QIIME1 have a "sort of" built-in filtering step. Remember, usearch in QIIME1 requires a 97% match (in either open- or closed-reference clustering) for a sequence to "hit". DADA2 does not do this, so anything that doesn't fit still remains in the sample, but becomes unclassified.
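For reference, a filter like the one described above can be run in QIIME 2 with the quality-control plugin. This is only a sketch: the artifact filenames and the reference-sequence file are placeholders for whatever your own artifacts are called.

```shell
# Keep only query sequences that hit the 16S reference at >= 97% identity;
# everything else (e.g. off-target eukaryotic amplicons) ends up in "misses"
qiime quality-control exclude-seqs \
    --i-query-sequences rep-seqs.qza \
    --i-reference-sequences 16S-ref-seqs.qza \
    --p-method vsearch \
    --p-perc-identity 0.97 \
    --o-sequence-hits rep-seqs-16S.qza \
    --o-sequence-misses rep-seqs-non-16S.qza
```

The feature table can then be filtered down to the surviving sequences (e.g. with `qiime feature-table filter-features`) before taxonomy assignment.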


Here are my examples.

Prior to a quality control step:
[screenshot not shown]

After a quality control step:
[screenshot not shown]



@ben, an interesting thought. However, wouldn't I expect at least a somewhat similar level of unassigned or kingdom-level assignment across all samples?

Here I have 100% kingdom/unassigned for samples 1, 11 and 12 and next to no problems for samples 2-10 - that is with data generated through Deblur. Then for the DADA2 data the situation is reversed, samples 2-10 are mostly unassigned/kingdom, whilst samples 1, 11 and 12 seem fine. Hence my confusion.


That’s a good point.

Did you try to see what those sequences BLAST as?

What’s the percent homology that you use for Greengenes? edit: Sorry, I don’t use SILVA so wouldn’t know what the difference is between Greengenes vs. SILVA, but if it’s a classifier problem it seems interesting that SILVA and Greengenes give such different taxonomy.



I’m currently pondering how I would go about doing this. Do I export my representative sequence file?

99%, just as in the Moving Pictures tutorial.

Oh sorry, go to DADA2 -> rep_seq.qzv (assuming you built rep_seq.qza -> rep_seq.qzv) and click on some sequences; they'll take you to BLAST. Ben

qiime feature-table summarize \
    --i-table table.qza \
    --o-visualization table.qzv \
    --m-sample-metadata-file sample-metadata.tsv
qiime feature-table tabulate-seqs \
    --i-data rep-seqs.qza \
    --o-visualization rep-seqs.qzv


The switching behavior is very strange. By any chance were the samples at 8:00, 17:00, and 18:00 sequenced on a different run or different primers?

Please give this a try to help troubleshoot: classify with classify-consensus-vsearch instead of classify-sklearn… that will help confirm or refute a sneaking suspicion I have about the data…


Thanks, Ben!

Isn't there an easy way to filter that table for the unassigned sequences?

Clicking through a couple of sequences I got some uncultured clone hits, but otherwise nothing too much out of the ordinary.

@Nicholas_Bokulich, I shall give it a try and report back. All samples are from the same run, same primers. It was one big HiSeq run and I subsampled a little - still around 50k sequences per sample though.

See here, but with --p-include unclassified rather than --p-exclude
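A sketch of that approach, assuming a taxonomy.qza produced by your classifier (the filenames are placeholders):

```shell
# Pull out just the sequences whose taxonomy string contains "Unassigned",
# so they can be inspected or BLASTed separately
qiime taxa filter-seqs \
    --i-sequences rep-seqs.qza \
    --i-taxonomy taxonomy.qza \
    --p-include Unassigned \
    --o-filtered-sequences unassigned-seqs.qza
```

Running `qiime feature-table tabulate-seqs` on the filtered artifact then gives a visualization with BLAST links for only the problem sequences.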


@Nicholas_Bokulich, here is the vsearch classifier output. This looks much more consistent, but why?

edit: probably not a bad idea to add my code. :slight_smile: For some reason I couldn't download the Greengenes files, so I used SILVA.

qiime feature-classifier classify-consensus-vsearch \
    --i-query rep-seqs.qza \
    --i-reference-reads silva_132_99_16S.qza \
    --i-reference-taxonomy ref-taxonomy.qza \
    --output-dir vsearch \
    --p-threads 2

classify-sklearn cannot currently handle reads in mixed orientations: it guesses the orientation of the reads based on the first 100 or so and then processes them all accordingly, but if reads in the opposite orientation crop up it becomes discombobulated. Those samples must, for some reason, have their reads in the opposing orientation.

For your workshop I would recommend just excluding those samples, or using classify-consensus-vsearch to avoid this issue.


Do you know how this could have happened? All samples were processed and sequenced in one batch on the same run. We used the EMP primers, which pretty much define the orientation of the amplicon, don't they?

Thanks for all your help!

EMP does define the orientation, so I do not know how this could happen but it really looks like that is the case! I think EMP may have changed the constructs a while back to put the primers on the forward instead of reverse read, but I am not 100% certain — any chance you are using a mix of old and new primer sets?

I remember the change. Though we order a new batch of primers every year, so they all are of the same orientation.

Could the sub-sampling have jumbled the orientation? I started off demuxing these 12 samples from a HiSeq run, sampled it down to 5% and eventually extracted the sequences from the artifact.

I wish I had time to test this with the full data, but I’m not looking forward to another 12 hour demux run. :smiley:
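For what it's worth, the subsampling step itself can be done inside QIIME 2 with the demux plugin; this sketch mirrors the 5% subsample described above (artifact names are placeholders, and use subsample-paired for paired-end data):

```shell
# Randomly keep ~5% of the demultiplexed reads per sample
qiime demux subsample-single \
    --i-sequences demux.qza \
    --p-fraction 0.05 \
    --o-subsampled-sequences demux-subsampled.qza
```

Since this is random read selection, it drops reads but does not reverse-complement or otherwise reorient them, which fits the suggestion below that subsampling alone is unlikely to be the culprit.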

Our lab switched orientations and it threw our pipeline for a loop. Not sure when it happened either, probably within the last year.

Maybe someone who was prepping the amplicons ran out of primer and used the new primers without knowing about the switch?

Doesn't sound like it, unless the forward/reverse seqs somehow got mixed up.

This seems like a plausible theory

Not possible. I aliquot the primers for the students. All from one plate.

I suppose it will remain a mystery for now.

I still want to thank you both again for all your time investigating this issue. It's really great to have developers and experienced users here scanning the forums. :+1:


Apologies, I wasn't trying to imply any issues with the preparation, just trying to figure out how this could happen. Ben


No worries. That was a totally legitimate possibility. :slight_smile:


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.