dada2 ASV inflation and complete separation in Venn diagram

Hi everyone,

I am not sure whether this is the correct platform to ask, as my query relates to dada2 (which I am using in R), but I found a thread in this forum that seems to come very close to my problem:

Just like the user in that thread, I have complete separation of ASVs between my different treatment groups when plotting a Venn diagram, which seems biologically impossible.

Unlike the user in the thread, I do not have different runs, and I see the same separation when looking at other groupings (Surface vs Deep, North vs South, etc.). I also feel like I end up with far more ASVs than I should (24k).

A quick overview of my data:
Illumina PE 300bp
18S V9 region
Amplicon size ~260 bp

I also got a lot of "Unknown Eukaryotes" in my data, which I thought was normal for V9 data but now I am not so sure. I had basically finished all my analysis already and was about to submit a manuscript draft when I noticed the issue with the Venn diagram...

I have already played around (a lot!) with the trimming parameters but it didn't seem to make any difference. The reads are generally high quality and when I trim a lot I get a high proportion of reads outside my expected amplicon size.

I appreciate any kind of advice, and even if it's just the directions to a more appropriate forum! I have no idea where to go from here as it makes me question my entire data analysis.

Thanks in advance!

Hello Daniela,

Thank you for your detailed post. While there are several things that could cause this, the most common one I've seen is having barcodes end up in your amplicon reads, resulting in ASVs with a region that is unique to each sample.

This is the right approach to removing the barcodes. I'm sorry to hear it has not solved the problem...

Strange! Want to post your read quality graphs and also the various trimming commands you have tried so we can take a look? :mag_right:

1 Like

Hi Colin,

Thanks for your response!
These are (a subset of) my read quality graphs:
Forward


Reverse

I have tried the following trimming parameters (with number of resulting ASVs after generating sequence table) - and other results I noticed:
170:160 (36991) - less overall sequence loss but many reads of the wrong size
180:170 (35590) - less overall sequence loss but many reads of the wrong size
210:190 (27885) - slightly more overall sequence loss (still acceptable) and at the appropriate amplicon size (these are the parameters I was working with when I noticed the Venn diagram issue)
220:200 (64144) - a lot of sequence loss, especially during merging and chimera removal; reads assigned to a slightly larger size than expected, but all evenly assigned.

I hope that's helpful!
Thanks again

Thanks, those graphs are very helpful!

The dada2 function filterAndTrim() supports several ways to truncate and trim these reads. Can you post the full functions you ran? (I'm not sure what 170:160 means...)

Based on those quality graphs, I bet that a short trimLeft combined with truncLen=c(180,190) would work well, but I'll wait to see your full commands before speculating more.

Based on the V9 primers you used, how long do you expect your amplicon to be?

I'm sorry I should have specified!

filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft = 15, truncLen=c(220,200), maxN=0, maxEE=c(2,3), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE)

Since the primers are already removed, I tried with and without the trimLeft argument, which did not seem to make a huge difference in terms of ASV output or overall sequence loss - although I have not yet checked the Venn diagram and taxonomy table for each pipeline edit, as it takes quite a while to run.

The 170:160 etc. numbers I was referring to are the truncLen parameters I have tried.

With the primers I use I expect an amplicon size of around 260 bp.

2 Likes

Ah ok! This is very helpful. There is a lot to discuss here.

I get the numbers now. Does this table look right? (Check my math!)

| trimLeft | truncLen R1 | truncLen R2 | total after trim&trunc | overlap | reads joined |
|----------|-------------|-------------|------------------------|---------|--------------|
| 15       | 170         | 160         | 300                    | 40      | 36991        |
| 15       | 180         | 170         | 320                    | 60      | 35590        |
| 15       | 210         | 190         | 370                    | 110     | 27885        |

It makes sense to me that total reads joined drops as overlap increases, as the dada2 join method does not like errors in the region of overlap and longer overlaps will have more errors.

Yes, that should work

Oh no! :scream_cat:

How did you go about removing the primers? My only guess here is that the program didn't work as intended, and there's still a barcode hiding somewhere in the ASVs...

Here's one way we could find that. If you post, say 50 of your reads and the barcode from one sample, we could do a MSA on them and look for a region that is suspiciously conserved. That could identify the location of a barcode within the existing reads that is not being trimmed out. We could also look for this region using the top ten most common reads from the full study, after dereplicating with vsearch. This can be a bit convoluted so let me know if you need a hand setting this up.
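In case it helps, here is a minimal sketch of that dereplication step with vsearch. The sequences and file names are placeholders, and `--minseqlength 1` is only there because the toy reads are shorter than vsearch's default length cutoff; real reads won't need it:

```shell
# Toy FASTA stand-in for the real reads (placeholder sequences)
printf '>r1\nACGTACGTACGT\n>r2\nACGTACGTACGT\n>r3\nTTTTCCCCGGGG\n' > reads.fasta

# Collapse identical reads, record abundances in the headers (--sizeout),
# and keep only the ten most abundant uniques (--topn 10)
if command -v vsearch >/dev/null; then
  vsearch --derep_fulllength reads.fasta \
          --output top10.fasta --sizeout --topn 10 --minseqlength 1
  grep '^>' top10.fasta
fi
```

Aligning those top sequences against each other (and against the barcode, once you have it) should make any suspiciously conserved non-biological region jump out.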

2 Likes

I think I am confusing myself a bit at the moment trying to calculate the overlap. How did you calculate total after trim&trunc and overlap?
I just noticed I accidentally told you further up that I used 2x300 bp chemistry but it's actually 2x250. I've been using two different sequencing facilities of late with different methods so I mixed them up.
But yes, I agree on the read drop with increasing overlap, that makes sense! But then what about the run where I barely cut anything off (truncLen=c(220,200)) and I get the most reads of all (64144)? :thinking:

In this case the primers were apparently removed by the sequencing facility and I did have a look in Geneious when I first received the data and did not see any. I haven't used R much for sequence exploration (always worked in a GUI interface) so I hope this is still informative. This is a Geneious alignment of some of my raw reads from one sample with just the primer (F):


And the primer plus pads, linkers, and adapters:

I can't see any significant overlap here which is why I was content with everything having been removed.

BUT I just noticed this - this is an alignment of the primer with an extract of my processed ASVs that are unknown below Kingdom level (unknown Eukaryote):


Now I am very confused!! The primers seem to be there again, how is that possible?

I don't actually have the barcodes for the individual samples, I guess I would have to contact the sequencing facility for that (and they can take a very long time to respond). Presumably we can solve the issue with what I just noticed about the primers, I just don't quite understand yet what is happening here.
I am happy to still provide the reads though if you would like to have a look - would those be the raw or processed reads?

Thanks a lot again for all your time so far!

1 Like

Small addition to my post:
I just noticed a similar thing happening with the reverse primers. The primers again align perfectly to the extracted ASVs, but not to the raw reverse reads:

Hello Daniela,

Great work. Let's dive back in!

Here's my mental model of the reads:

```
R1     |-----|-------------------->
R2               <--------------------|-----|
trimmed ^^^^^                          ^^^^^
overlap           ^^^^^^^^^^^^^^^^
joined       |------------------------| same as amplicon
```

So total after trim and trunc is
(truncLen R1 + truncLen R2) - trimLeft R1 - trimLeft R2
(170 + 160) - 15 - 15 = 300

Does that match your model of how read joining works? These assumptions can change with different primers and sequencing methods, so correct me if I'm wrong.

With 300 total bp of coverage of a 260 bp amplicon, I would expect 40 bp of overlap.
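To double-check my math, the same numbers can be reproduced in the shell (the 260 bp amplicon length and trimLeft of 15 on both reads are taken from your posts):

```shell
# total after trim&trunc = truncLen_F + truncLen_R - 2*trimLeft
# overlap                = total - amplicon length
amplicon=260
trimleft=15
for pair in "170 160" "180 170" "210 190"; do
  set -- $pair
  total=$(( $1 + $2 - 2 * trimleft ))
  echo "truncLen $1:$2 -> total $total bp, overlap $(( total - amplicon )) bp"
done
```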

Oh, I missed that one. I'll add it to my table. I'm not sure why that would happen...


On to the MSAs! These are very helpful.

Yes, that forward read and forward primer look good to me.

I'm not sure. In your combined binding region, do you know the order of Illumina adapter, barcode, primer etc? I wonder if those bp left of the short binding region (GTTCG) could be part of the barcode still ending up in the read.

Is that second plot the most common ASVs from your full study, or from just a single sample?

That looks like it could be a barcode. Hard to know for sure since the sequencing center did not provide that information, but it's got the high variability in the middle of a conserved region, which is exactly what I would expect.

I really like your idea of reaching out to the sequencing company requesting more info about the primers. Did they remove the primers? I wonder what program and settings they used...

Short of that, I guess you could try increasing your trimming and truncating settings a lot to remove those regions entirely. You still have plenty of overlap in your reads, after all.

Let me know what you try next!

1 Like

If I may chime in ... based on the screenshots, it looks like the primers used were those for Eukaryotic V9. For example, these two primers produce an amplicon of only 170 bp in Peronospora effusa (a BLAST hit). Also, you see the reverse complement of the reverse primer towards the end of your ASVs.

I think the problem is that you're dealing with amplicons shorter than what the Illumina sequencing was set to do (250 nt one way), so you get readthrough - the amplicons get completely read, including the opposite primer and Illumina adapters and some "gibberish" is added due to random noise(?) when Illumina is still running the cycles but the whole DNA molecule is sequenced. This would also be consistent with a sudden drop in your read quality profiles toward the end of the reads. If the forward primer is already removed, the solution would be to run cutadapt to remove the "3' adapters" (the reverse primer on the forward read and forward primer on the reverse read). Hopefully, that would lead to "clean" amplicon reads and better performance when merging.
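For anyone hitting this later, a readthrough cleanup along those lines might look like the sketch below. The primer sequences shown are the commonly used Eukaryotic V9 1389F/1510R pair, which is an assumption here - substitute your own - and the file names are placeholders:

```shell
# Hypothetical V9 primers -- verify against your own protocol!
FWD="TTGTACACACCGCCC"          # 1389F (assumed)
REV="CCTTCYGCAGGTTCACCTAC"     # 1510R (assumed)

# Reverse-complement them (IUPAC-aware): with readthrough, the RC of the
# reverse primer shows up at the 3' end of R1, and vice versa for R2.
rc() { echo "$1" | rev | tr 'ACGTMRWSYKVHDBN' 'TGCAKYWSRMBDHVN'; }
REV_RC=$(rc "$REV")
FWD_RC=$(rc "$FWD")
echo "3' adapter for R1: $REV_RC"
echo "3' adapter for R2: $FWD_RC"

# -a/-A trim from the adapter match through the end of each read,
# removing the opposite primer plus any trailing Illumina adapter/noise
if command -v cutadapt >/dev/null; then
  cutadapt -a "$REV_RC" -A "$FWD_RC" \
           -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \
           sample_R1.fastq.gz sample_R2.fastq.gz
fi
```

Since the resulting reads will then have variable lengths, it may make sense to skip a fixed truncLen in filterAndTrim afterwards.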

3 Likes

This readthrough could also contain the barcode from the opposite end of the read!

Great suggestion Marko!

1 Like

What a fantastic suggestion, I think that perfectly describes the problem! Thank you very much!

The first thing I will try now is to only use the forward reads and see what results I get.

Thank you very much again to both of you.

2 Likes

EDIT for future reference:
I have now tried only the forward reads, which already gave me a much better result with fewer ASVs and shared ASVs between samples. I did, however, seem to get a lot more Unknowns below Kingdom level than before.

I tried to use cutadapt as suggested but struggled a bit with the proper installation on a Windows machine, so I contacted my sequencing facility, who ended up removing the readthrough adapters, primers, etc. for me. I now get substantially fewer ASVs (around 4500), but everything else looks much better and I get a somewhat reasonable percentage of NAs (around 30%) at Supergroup level.

It is definitely worth noting that after removal of the readthrough 'rubbish', most sequences are only about 150 bp in length, with few remaining up to 250 bp. So the dada2 approach of truncating all sequences to a fixed length without prior alignment is slightly flawed in this case! I ended up not using the truncLen parameter at all on my trimmed (i.e. 'rubbish removed') sequences.

I hope this is helpful for others encountering the same issue with short amplicon sizes!

2 Likes

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.