Losing reads as chimeras

hsapers · April 24, 2020, 10:26pm

Hello
Thank you for all the help and support in this forum! I seem to be losing a lot of reads during denoising. I've checked a couple of posts that have suggested only using fwd reads, making sure primers are removed, and making sure there's enough overhang for merging. I think I've addressed these potential issues. I have paired end reads (2 x 251 kit, demultiplexed, 515f and 926r primers for v4-v5). I used cutadapt to remove the primers:

cutadapt \
-g ^GTGYCAGCMGCCGCGGTAA \
-G ^CCGYCAATTYMTTTRAGTTT \
-m 209 -M 255 \
--too-short-output untrimmed/${sample}_R1_too-short.fastq.gz \
--too-long-output untrimmed/${sample}_R1_too-long.fastq.gz \
--untrimmed-output untrimmed/${sample}_R1_untrimmed.fastq.gz \
--too-short-paired-output untrimmed/${sample}_R2_too-short.fastq.gz \
--too-long-paired-output untrimmed/${sample}_R2_too-long.fastq.gz \
--untrimmed-paired-output untrimmed/${sample}_R2_untrimmed.fastq.gz \
-o trimmed_fastq/${sample}_R1_001.fastq.gz -p trimmed_fastq/${sample}_R2_001.fastq.gz \
fastqfiles/${sample}_R1_001.fastq.gz fastqfiles/${sample}_R2_001.fastq.gz \
>> cutadapt_primer_trimming_stats.txt 2>&1

Looking at the cutadapt parameters, the expected number of bases are removed. I then used the following parameters for denoising:

qiime dada2 denoise-paired \
--i-demultiplexed-seqs run-1-demux-paired-end.qza \
--p-trim-left-f 0 \
--p-trim-left-r 0 \
--p-trunc-len-f 226 \
--p-trunc-len-r 200 \
--p-trunc-q 2 \
--p-max-ee-f 2 \
--p-max-ee-r 2 \
--p-chimera-method consensus \
--p-hashed-feature-ids TRUE \
--o-representative-sequences run-1-rep-seqs.qza \
--o-table run-1-table.qza \
--o-denoising-stats run-1-stats.qza

My demux data is here - lower quality than some of my other runs, but hopefully not horrible:
run-1-demux-paired-end.qzv (306.4 KB)

The denoising stat summary is here:
run-1-denoising-stats.qzv (1.2 MB). My percentage of non-chimeric input ranges from ~20% - 75% with most in the 35%-55% range. Percent of input merged ranges from 55% - 85%

I also seem to be losing a lot of reads with better quality runs: run-3-demux-paired-end.qzv (304.3 KB)

qiime dada2 denoise-paired \
--i-demultiplexed-seqs run-3-demux-paired-end.qza \
--p-trim-left-f 0 \
--p-trim-left-r 0 \
--p-trunc-len-f 228 \
--p-trunc-len-r 223 \
--p-trunc-q 2 \
--p-max-ee-f 2 \
--p-max-ee-r 2 \
--p-chimera-method consensus \
--p-hashed-feature-ids TRUE \
--o-representative-sequences run-3-rep-seqs.qza \
--o-table run-3-table.qza \
--o-denoising-stats run-3-stats.qza

that didn't seem to have much of an effect: run-3-denoising-stats.qzv (1.2 MB). In this run ~20% - 80% were non chimeric with most in the 30 - 40% range. Here 50% - 80% merged, so I'm not sure it has to do with low quality scores. Merging looks ok as well, so I think overhangs are sufficient.

Is it possible that cutadapt wasn't very effective and I might have a lot of primers hanging around? Maybe I shouldn't have anchored the removal to the 5' end in case there were a few early insertions?
Are my denoising parameters too strict?
Am I miss-reading the report: Each of the summary statistics (percent that pass through filtering, percent that pass through denoising and merging, and percent that are non-chimeric are all given with respect to the total number of input reads, so I don't know really how many reads passed through to the 'next step.' The input for each step doesn't match the output of the previous.) Am I correct for the first row of the run030denoising-stats (sorted by total input) to say that I started with 55871 reads, 49324 of those passed the initial filtering, of those 49324 filtered reads, 47217 passed through denoising, of those 47217 denoised, 35438 were successfully merged, and of those 35438 that merged only 20809 were non-chimeric?
What part of the denoise-paired command constitutes filtering vs denoising? Is the filtering based on the trunc-len and trunc-q parameters and the denoising based on the allowed expected error (EE) with the ad-hoc error model applied?

Thank you - apologies for the long post and all the questions. Definitely feels like as soon as I feel I'm starting to understand, I have so many more questions!

Nicholas_Bokulich · April 28, 2020, 3:16pm

Hi @hsapers,
The chimera filtering does not look too unusual, maybe a bit on the high side, but the % merged looks a little more worrying to me. You may want to truncate a bit less if you can to try to improve merge yields.

For chimera filtering, you may want to look into the --p-min-fold-parent-over-abundance parameter. There are a few topics on this forum about that parameter and the chimera method that are worth reading before deciding if those issues apply to your data.

You could always check, e.g., look at sequence length summaries before/after cutadapt. If you know the primers are at the 5' end you can use the trim parameter in dada2 to trim there, instead of using cutadapt.

No, these look pretty close to the defaults, though as I mentioned above you may want to truncate less.

Since the percentages are all relative to the input, you can subtract to determine how many you are losing at each step (with respect to the input). You can also download this file as a TSV (see the left-hand corner) and make your own percentages in R, excel, etc, if you prefer.

Each column (except input and percentages) is showing you the number that passed that step, but percentages are only given for some of the critical steps.

Correct

tuncation and trimming happens first; then pre-filtering occurs based on the max EE. Then the error model is trained and applied to correct expected errors.

I hope that helps!

hsapers · April 30, 2020, 9:01pm

Thank you @Nicholas_Bokulich - that was really helpful. I'm assuming that successful merging is based on a trade off between having a sufficient overlap and having sufficient confidence in the base calls over the region of overlap? How do you decide which is more important and does read merging take into account error based on the phred scores, if so what error model is used? Before I read your post I tried to cut off more to increase the confidence of the base calls over the overlap, but further shortened the region of overlap - the results were quite similar. I'm working on the opposite now, following your suggestion to lengthen the region of overlap but include more lower quality base calls. The Mohsen et al., 2019 paper suggests that a threshold Q-score of 18 makes a big difference in merging ability. Since even for my lowest quality reads, the median is above this threshold, will lengthening the region of overlap have a greater positive effect on merging than the increased number of worse quality base calls would have a negative effect? In other words, are my low quality reads not low enough to really be concerned about them? I've also read in a number of places that at least 12 nucleotides are required for merging. This seems quite low to me, are you aware of any literature optimizing the length of overlap, or is bigger always better in this respect (as long as the q scores are good enough)? Thank you for all of you help

Nicholas_Bokulich · April 30, 2020, 10:16pm

Indeed! this is a balance between truncating enough so that dada2 does not filter out too many errors, and leaving enough so that the reads overlap enough to merge (min 12 nt overlap in the current release).

As a rule of thumb, I always truncate less: dada2 filters out more reads due to error (presumably a random process), but the reads that remain are merged — so the measure I aim to optimize is the # of reads merged. It is far worse to truncate more, aiming to optimize the number of reads passing filtering, if this means that a larger fraction are removed during merging, because reads lost during merging could bias against taxonomic groups with longer amplicons for the targeted region.

Merging is done after denoising, so does not take phred scores into account at that point. I am not totally sure what alignment algorithm or mismatch threshold is used at that point, you should probably check out the dada2 documentation for details.

dada2 will toss out any reads with > 2 expected errors (adjust this with --p-max-ee if you dare ) so truncating less will just lead to more reads being thrown out by the filtering step prior to denoising, it will not necessarily lead to issues with merging because the tails have more errors, though that is an element. We are talking about a difference of a few nucleotides here, though: you should still truncate as much as you can to merge with as little as possible, but given the length variation in any amplicon (fortunately for you not too much with 16S) you need to provide enough wiggle room so that you do not cause a taxonomic skew. So I think this recommendation is still consistent with Mohsen et al.'s findings: truncate to remove low-quality tails, but not so much that you skew the taxonomy! This is a much bigger issue with non-16S targets (like ITS, which has very high length variation in bacteria and fungi), but something to be aware of nonetheless.

Again, dada2 will throw out those reads if they are too low quality, so don't worry too much about the effect that those errors will have. Truncating leads to more reads passing the initial quality filter, and more being denoised, but you do indeed need to leave enough sequence to permit paired-end merging.

Yes this is just the minimum alignment length that dada2 uses for deciding whether to "trust" a merge. It is not much, though the subsequent chimera filtering should catch anything too wonky. The benchmarks that various groups have done on dada2 (including my own personal benchmarks using some old mock community data with short-ish read lengths so relying on minimal overlap) convince me that the paired-end merging should be working well: if bad read merging were present, you'd expect all sorts of false-positives but not of the published benchmarks seem to indicate this scenario.

I am not really aware of any good literature on this specific topic using dada2, but agree that if quality is high enough then bigger will (in theory) be better. However, since in practice longer sequences inevitably mean more accumulated error, truncating at a reasonable position is usually best practice. Since 16S has very little length variation in general, it is usually possible to arithmetically estimate a reasonable minimum truncation length.

I hope that helps!

hsapers · May 1, 2020, 2:52am

Thanks @Nicholas_Bokulich - I think I'm a little confused now though - you mention that you always truncate less (more reads removed during quality filtering) rather than truncating a lot and risking reads not being long enough to merge and being discarded at the merging step. But it's also mentioned that one should truncate as much as possible to merge with as little as possible - combing these - does this mean truncating as much as possible such that as many reads as possible have a 12 nt overlap? So I want to see ~the same number of reads merging as reads passing denoising which would mean that I'm losing reads to accumulated error-prone bases rather than losing reads because they're too short to merge? Towards the end of your post it seems that truncating more (as long at least 12 nt are left for merging taking into account amplicon length variability) is preferred - but contrary to your original advice to truncate less - even though with these parameters at least 98% of my reads should have 51 bases of overlap - shouldn't this be more than sufficient for merging? Could there be something else preventing reads from merging successfully? Sorry if I've misinterpreted anything.

Nicholas_Bokulich · May 1, 2020, 2:45pm

To put this another way: figure out a reasonable ballpark length (to account for length variation in the target amplicon) and truncate down to that length but no further.

So if you have a mean 450 bp amplicon and 250 nt long amplicons (this is an example, I don't know specifics for your amplicon), and you've read in the literature that the amplicon has length variation of around ± 20 nt***, we'll set 470 nt as the max length. You need to account for 12 nt of overlap. So 470 + 12 = 482 nt minimum combined length is what you will require. So with 250 nt reads you have only a bit of wiggling room... trimming to 240 + 245 would be okay, but 240 + 240 might be just a little too short but in that grey area and it may be worth truncating less on one read of the other even if it means picking up a few errors in the reads and letting dada2 handle it... look at the dada2 stats to determine if you are losing too many reads at merging.

***you may or may not be able to get length variation estimates from the literature, or they may not be accurate. You could assume something like ± 20 nt for most 16S subdomains; you could just try a few truncation lengths and see how it impacts your dada2 merge yields. You could also make your own estimates by using qiime feature-classifier extract-reads on a reference database of choice and look at the length variation.

Right!

Balance is everything

That should be plenty! But again, length variation could account for the reads that fail to merge. It would be worth exploring some other truncation settings to optimize this step.

I hope that clears up the apparent inconsistencies!

hsapers · May 1, 2020, 5:06pm

Thanks so much @Nicholas_Bokulich for the detailed explanations and re-explaining everything. I'm going to to do a bit of a ref search to see if I can find any information on the expected length variability of the amplicon - I missed that in your fist post. I was thinking of length variation in the reads as a result of sequencing variability - not in the target amplicon itself - makes so much more sense why this could lead to taxonomic bias. I'll also plot some ECDFs of denoising and merging stats to see if there are meaningful differences in mean and stdev that I might be missing just scanning through the data

hsapers · May 1, 2020, 10:50pm

In case anyone comes across this thread and was interested in looking at SSU hypervariable region length variation, this was an informative reference for me: Vargas-Albores et al., 2017 Size-variable zone in V3 region of 16S rRNA

Mehrbod_Estaki · May 2, 2020, 4:02am

This is really cool, thanks for sharing this paper @hsapers.
I had always noticed V3-V4 region had this bimodal distribution in my own data (see below), but never dug into it properly. Glad to finally figure out it is a real biological signal and where it comes from (V3).

hsapers · May 18, 2020, 6:36pm

Thanks again - I'm back working to optimize this and put together a couple visualizations (I'll post the code if I get it cleaned up and properly annotated). But I've tried three different truncation positions (224, 200, and 170) all should leave enough overlap (at least 12 nt) even for slightly shorter target regions. Based on the summary plot, it looks like I'm losing very similar proportions through each step with the biggest losses in initial filtering and chimeras. Retaining slightly more reads with more truncation (losing less in filtering) but it doesn't seem to be making much of a difference in the proportion that merge. Looking at my cutadapt output, I have a number of very poor samples - I will try removing these and see what happens. Thanks again

Nicholas_Bokulich · May 18, 2020, 6:56pm

Nice plot! Thanks for following up.

Trimming more results in much less loss at the filtering step, as expected, but the same amount lost at merging (so this may just be inevitable loss, e.g., maybe these seqs are outliers that cannot merge with those read lengths? non-target DNA? )

Looks like you have a winner with 170

The loss at merging also does not look that bad, based on that plot... at this point I would move on, and if you really want to be sure that merging is not a big issue here, compare taxonomic profiles on forward reads only vs. paired-end reads. (of course profiles will look different — but the hope is that at family or genus level if you manage classifications that deep they will not look too different).

those chimeras look like the biggest loss at this point — perhaps validly, perhaps not . You could try adjusting this parameter as I described above to see if it helps, and how it impacts taxonomy classification (e.g., do you start seeing weird clades crop up as you relax chimera filtering? as opposed to just "more of the same". New = probably bad in this case):

Good luck!

system · June 19, 2020, 12:56am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.