Losing almost all reads after running deblur on paired-end sequences

Hi, I’m trying to use deblur to analyze paired-end 16S data. I joined my reads with “qiime vsearch join-pairs” and then quality filtered with “qiime quality-filter q-score-joined”. After checking the visualization, I had still retained 86% of my original sequences, with the rest filtered out because of ambiguous bases (so far so good).
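
For context, the joining and filtering steps looked roughly like this (a sketch; the intermediate file names are approximate rather than copied from my exact commands):

qiime vsearch join-pairs \
  --i-demultiplexed-seqs demux.qza \
  --o-joined-sequences demux-joined.qza

qiime quality-filter q-score-joined \
  --i-demux demux-joined.qza \
  --o-filtered-sequences demux-filtered-paired.qza \
  --o-filter-stats demux-filter-stats.qza

qiime demux summarize \
  --i-data demux-filtered-paired.qza \
  --o-visualization demux-filtered-paired.qzv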

I then ran deblur using this code:

qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-filtered-paired.qza \
  --p-trim-length 407 \
  --p-sample-stats \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --o-stats deblur-stats.qza
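
(The table and stats visualizations I refer to below were generated with something along these lines; the output names are illustrative:)

qiime deblur visualize-stats \
  --i-deblur-stats deblur-stats.qza \
  --o-visualization deblur-stats.qzv

qiime feature-table summarize \
  --i-table table.qza \
  --o-visualization table.qzv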

Now when I visualize the table, I’ve lost 98% of my input sequences! When I look at the deblur-stats visualization, it looks like the problem is two-fold:

  1. I lose an average of 60% because of “fraction-artifact-with-minsize” - even though the pre-deblur visualization showed that over 99% of my reads were above 407 bp (which is the value I used for p-trim-length); and

  2. I then lose over 90% of the remaining reads in the column described as “The number of reads following Deblur.”

As an example, in one sample I have 37,000 input reads. I’m left with 11,000 reads following “dereplication”, with 70% lost because of min-size. That drops to 2,500 following deblur, and after that only 982 “hit reference,” despite only about 100 being thrown out as chimeric. So I’ve lost about 97% of my reads!

As a final note, I’ve run deblur on only the forward reads from this same dataset and everything looks fine, with very high quality and good taxonomic resolution, so I don’t think the problem is with the data itself.

Any help is much appreciated!

Hi @mihcir,

Thanks for the detailed explanation and your awesome detective work up to this point!
There's a combination of things that may be driving what you see, so let's take a closer look.

First, I should mention that the loss of sequences with merged reads in deblur, compared to just the forward reads, has been observed and discussed before. Good discussions were had here and here; the latter link has a nice calculation example of how many reads you can expect to lose with deblur, especially with longer reads. Personally, I've also found this to be true when comparing the DADA2 paired-end workflow to the merge --> deblur approach. So the loss of reads is not unheard of. That being said, I think you're still losing a bit more than expected, which may be rooted in the nature of your data.
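
As a very rough back-of-the-envelope version of that calculation (a simplification, not the exact math from those posts): if every base had a constant Q score, the fraction of 407 bp merged reads containing zero errors would be about (1 - p)^407 with p = 10^(-Q/10), for example:

# Approximate fraction of error-free 407 bp reads at a constant per-base Q score
python3 -c "p = 10 ** (-30 / 10); print(round((1 - p) ** 407, 2))"   # Q30 -> ~0.67
python3 -c "p = 10 ** (-25 / 10); print(round((1 - p) ** 407, 2))"   # Q25 -> ~0.28

Deblur doesn't simply throw away every read containing an error, but reads with errors tend to show up as many distinct low-abundance variants, and the longer the read the more of them there are.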

I know this sounds a bit confusing, but "fraction-artifact-with-minsize" doesn't have to do with read length but rather with the --p-min-size parameter, which sets the minimum abundance required for a read to be retained; in other words, most of your reads are singletons and so are discarded. It's hard to know why that is, but a few possibilities are improper merging, primers/barcodes not having been removed from your reads, or there genuinely being that many singletons (though that seems unlikely to me). I would look at those points first before we do further troubleshooting. Can you also give us a bit more info about your run, such as sample type, primer set, and the number of cycles?
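
If you want to test the singleton idea directly (for troubleshooting only, not for your final analysis), you could re-run deblur with the abundance filter effectively disabled and compare the stats, something like:

qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-filtered-paired.qza \
  --p-trim-length 407 \
  --p-min-size 1 \
  --p-sample-stats \
  --o-representative-sequences rep-seqs-minsize1.qza \
  --o-table table-minsize1.qza \
  --o-stats deblur-stats-minsize1.qza

(--p-min-size defaults to 2, which is what drops singletons; setting it to 1 keeps them.) If the fraction-artifact-with-minsize loss disappears but the table fills up with one-off sequences, that points at merging problems or leftover non-biological bases rather than real biology.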


Hi @Mehrbod_Estaki, thank you for those links, that was very informative! I didn't realize deblur was so conservative.

I know this sounds a bit confusing, but “fraction-artifact-with-minsize” doesn’t have to do with read length but rather with the --p-min-size parameter, which sets the minimum abundance required for a read to be retained; in other words, most of your reads are singletons and so are discarded.

I see, this makes sense. I believe the merging was fine; everything is the expected length. I'll attach the post-quality-filtering visualization here: demux-filtered-paired.qzv (301.2 KB)

This is V4-V5 16S sequencing of human fecal samples using the 515F/926R primers, on a MiSeq 300+300 paired-end run.

Sorry for the extremely naive question, but how can I tell whether the primers and adapters have been removed from the sequences? The data I receive is individual demultiplexed files (2 per sample). I had assumed that since the software already demultiplexed the data, it removed the primers and adapters too, but I don't actually know if this is correct. Based on the explanation of --p-min-size, I see how it would be a big problem if they weren't removed.

Do you have any insight into why I still lose a further ~75% of the sequences that pass the p-min-size threshold? And why do another ~50% of the ones that pass min-size and deblur still fail to hit the reference? When I run the forward reads only, >99% of them are assigned to at least the phylum level in k__Bacteria, so I'm very confident that virtually all my sequences really are 16S...

Thanks again!!

Hi @mihcir,

Glad you found them useful! While conservative, deblur does an excellent job of denoising and likely produces fewer false positives; check out this recent paper that compares it to DADA2 and UNOISE3. It does, however, seem to get more conservative with longer reads.

Thanks for posting your summary. Would you happen to also have the non-merged reads' summary? I'm noticing that the overlap region seems to have oddly perfect Q scores. I don't know enough about vsearch to know whether this is cause for concern. Vsearch does use the same posterior Q-score calculation as usearch, which would increase Q scores where there is consensus during alignment, BUT seeing that many perfect scores still makes me wonder. Perhaps @colinbrislawn can confirm this behavior for us.

You can take a peek at your original fastq files using something like the head command and check the beginning of your reads to see if they match your barcode/primer sequences. You can always ask your sequencing facility as well; they usually remove these when they demultiplex, but it's better to be sure.
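
For example, assuming your raw files are gzipped fastq (the file name below is just a placeholder), something like:

# Look at the first two records of the forward reads
zcat sample01_R1.fastq.gz | head -n 8

# Count how many of the first 1,000 records start with the 515F primer
# (GTGYCAGCMGCCGCGGTAA, with the IUPAC codes Y and M expanded)
zcat sample01_R1.fastq.gz | head -n 4000 | awk 'NR % 4 == 2' | grep -Ec "^GTG[CT]CAGC[AC]GCCGCGGTAA"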

It's possible that that is just the actual number of ASVs left after deblur, or perhaps, since there are so few reads at that step, the error model fails to identify them as real sequences because there aren't many other similar reads. The reference hit is based on a positive filter against the 85% Greengenes database, so again, if your reads contained non-biological sequence they might fail to hit the reference. With the forward reads only, they may pass that threshold but still only hit at the phylum level. Overall, I'd say let's make sure the barcodes/primers are gone first before we go further down this rabbit hole. A lot of things could be happening here...


I’m not sure if those high, high q-scores are normal. Let’s see what the vsearch devs think!

GitHub issue crossref


Yep, these very high q scores are possible. See: https://github.com/torognes/vsearch/issues/326#issuecomment-412879614

:man_shrugging:

Colin


Hi @colinbrislawn, thanks for that link! Yes, I always see nearly perfect Q scores in the overlapping regions of reads paired with vsearch, and it hasn’t been a problem in other analyses.

@Mehrbod_Estaki - I looked into the raw sequences and I think you may be right: the primers are still there! However, the adapters and barcodes are not - just the V4/V5 primers themselves. (To be totally clear, each read starts with the V4 515F primer GTGYCAGCMGCCGCGGTAA.) Should these be removed? Or are they OK because they do technically align to 16S?

I completely understand why barcodes would be a huge problem, but I can’t tell whether the primers themselves are the culprit causing all of these difficulties.
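
(If they do need to come off, I’m guessing the way to do it would be something like qiime cutadapt trim-paired on the raw demultiplexed reads before joining; the 926R sequence below is my assumption of the standard primer, so please correct me if I have it wrong:)

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f GTGYCAGCMGCCGCGGTAA \
  --p-front-r CCGYCAATTYMTTTRAGTTT \
  --o-trimmed-sequences demux-trimmed.qza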

Just to clarify, the filter uses 60% sequence identity against the 85% Greengenes OTUs. It is extremely permissive, but sufficient for filtering out things which are very unlikely to be 16S at all.


If the forward reads deblur well, then it's possible the stitching process was imperfect, or that the reverse reads are not great. Is it possible the reads are shuffled (i.e., the forward and reverse files out of sync), so that the stitching creates chimeras?

If you run deblur directly (e.g., deblur workflow ...), you can get an output table that contains all of the sequences which did not pass the positive filter. Do those sequences actually appear to be 16S? Do they have any weird or obvious artifacts, such as a string of N's?
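
A sketch of what that could look like, assuming the joined and filtered reads have been exported to a FASTA file first (the file names and output layout here are from memory, so check deblur workflow --help and the output directory for the exact names):

deblur workflow \
  --seqs-fp joined-filtered-seqs.fa \
  --output-dir deblur-standalone \
  -t 407

# The sequences that failed the positive filter should end up in something like
# deblur-standalone/reference-non-hit.seqs.fa; eyeball them or BLAST a handful.

# As an extra sanity check (my own addition, not something deblur does for you),
# align them against a local copy of the Greengenes 85% OTUs at the 60% identity
# threshold mentioned above:
vsearch --usearch_global deblur-standalone/reference-non-hit.seqs.fa \
  --db 85_otus.fasta \
  --id 0.60 \
  --strand both \
  --uc non-hit-check.uc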


With regard to whether the primers should be removed:

The locus-specific sequences that align to your target are OK to leave in; it's the non-locus-specific overhang that we want to make sure is removed. It sounds like yours have been removed, so that's not the issue.
