Extremely high number of ASVs in bacterial data

I guess this is a weird question, but when I compare my results with those of other researchers, it seems like something is wrong.

I have processed 284 soil samples, sequenced by MiSeq 2x275 paired-end (v3-v4 amplicon), whose quality plots looked like this:

[images: quality plots]

After removing the adapters and assuming that the amplicon length is 428 nt, I used DADA2 with many different parameters to choose the best option:

--p-trunc-len-f 240
--p-trunc-len-r 240
--p-trim-left-f 0
--p-trim-left-r 0
Number of ASVs: 39,292

--p-trunc-len-f 246
--p-trunc-len-r 244
--p-trim-left-f 5
--p-trim-left-r 5
Number of ASVs: 35,981

--p-trunc-len-f 240
--p-trunc-len-r 240
--p-trim-left-f 15
--p-trim-left-r 15
Number of ASVs: 38,695

--p-trunc-len-f 230
--p-trunc-len-r 220
--p-trim-left-f 15
--p-trim-left-r 15
Number of ASVs: 54,497

--p-trunc-len-f 230
--p-trunc-len-r 220
--p-trim-left-f 0
--p-trim-left-r 0
Number of ASVs: 55,242

--p-trunc-len-f 246
--p-trunc-len-r 244
--p-trim-left-f 0
--p-trim-left-r 0
Number of ASVs: 55,901

The number of ASVs increased as I truncated the reads more and more, which is the goal I want to achieve. Here are the denoising statistics from the last set of denoising parameters:
NGS100-21-RUN-1_denoising-stats.qzv (1.2 MB)

However, I have observed that other researchers, even with very similar data, obtain significantly fewer ASVs, on the order of 1,000-2,000.
I have been able to assign taxonomy to these ASVs, but I am still confused.
Is this normal?
How can I check that I have not made any mistakes?

There is a mistake in my post. The last set of truncation parameters was:

--p-trunc-len-f 230
--p-trunc-len-r 210
--p-trim-left-f 0
--p-trim-left-r 0
Number of ASVs: 55,901

My apologies.

Hello Cristina,

Welcome to the forums! :qiime2:

This is a great question!

I've summarized your results in this table so we can review.
I've added a column for the expected overlap.

trunc-len-f  trunc-len-r  trim-left-f  trim-left-r  ASVs   overlap (bp)
240          240          0            0            39292  52
246          244          5            5            35981  62
240          240          15           15           38695  52
230          220          15           15           54497  22
230          220          0            0            55242  22
230          210          0            0            55901  12

Some observations:

  • Trimming 15 from the start of both reads does not appear to change the total number of ASVs much compared to trimming zero.
  • Decreasing the expected overlap DOES appear to have an effect on number of ASVs!

240 + 240 - 428 = 52 bp expected overlap
230 + 210 - 428 = 12 bp expected overlap

While DADA2 is able to join reads with only 12 bp of overlap, I wonder if merging is not working as well with so little overlap.
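
To make the overlap arithmetic easy to repeat for other truncation settings, here is a minimal Python sketch. It assumes the 428 nt amplicon length from the original post and a 12 bp minimum overlap, which I believe matches the default DADA2 uses when merging pairs. The names `expected_overlap`, `AMPLICON_LEN`, and `MIN_OVERLAP` are made up for this example and are not part of any tool.

```python
# Sketch: expected read overlap after truncation for the runs in this thread.
# Assumptions: 428 nt amplicon (from the original post) and a 12 bp minimum
# overlap for merging (believed to be the DADA2 default).

AMPLICON_LEN = 428  # nt, as assumed in the original post
MIN_OVERLAP = 12    # bp, assumed merge threshold


def expected_overlap(trunc_len_f, trunc_len_r, amplicon_len=AMPLICON_LEN):
    """Bases shared by the forward and reverse reads once both are truncated."""
    return trunc_len_f + trunc_len_r - amplicon_len


# (trunc-len-f, trunc-len-r, ASVs) for each run reported in the thread
runs = [
    (240, 240, 39292),
    (246, 244, 35981),
    (240, 240, 38695),
    (230, 220, 54497),
    (230, 220, 55242),
    (230, 210, 55901),
]

for trunc_f, trunc_r, n_asvs in runs:
    overlap = expected_overlap(trunc_f, trunc_r)
    margin = overlap - MIN_OVERLAP  # how far above the assumed minimum we are
    print(f"{trunc_f}/{trunc_r}: overlap {overlap} bp, "
          f"margin {margin} bp, {n_asvs} ASVs")
```

Note that the calculation uses a single amplicon length. If some real amplicons are longer than 428 nt, their overlap would be smaller than the value shown, so a margin of 0 bp leaves no room for length variation.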

Maybe your samples are much more diverse!

You could compare the known composition of any positive controls included on the run with their observed composition after processing. (Unfortunately, lots of people don't run positive controls. Do you have any?)

One cause of unintended feature inflation is barcodes getting into the ASVs. When you cluster these in a PCoA plot, do samples overlap at all, or is each ASV coming from one sample?
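
Besides the PCoA plot, one rough way to look at the "is each ASV coming from one sample?" question is to count how many samples each ASV appears in. The sketch below assumes the feature table has already been exported and loaded into a nested dict of sample -> ASV -> count. That layout, the function names, and the toy table are all hypothetical and only serve to illustrate the idea.

```python
from collections import Counter


def samples_per_asv(table):
    """Map each ASV to the number of samples in which its count is above zero."""
    prevalence = Counter()
    for sample_counts in table.values():
        for asv, count in sample_counts.items():
            if count > 0:
                prevalence[asv] += 1
    return prevalence


def fraction_single_sample(table):
    """Fraction of observed ASVs that occur in exactly one sample."""
    prevalence = samples_per_asv(table)
    if not prevalence:
        return 0.0
    singles = sum(1 for n in prevalence.values() if n == 1)
    return singles / len(prevalence)


# Hypothetical toy table, NOT data from this thread
toy_table = {
    "sampleA": {"asv1": 10, "asv2": 3, "asv3": 0},
    "sampleB": {"asv1": 7, "asv3": 5},
    "sampleC": {"asv1": 2, "asv4": 1},
}

print(samples_per_asv(toy_table))
print(fraction_single_sample(toy_table))
```

A very high fraction of single-sample ASVs would be compatible with sample-specific sequence (such as leftover barcodes) inflating the feature count, but it can also simply reflect rare taxa, so it is a hint rather than a diagnosis.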

Dear Colin J Brislawn
Thank you for the reply!
I have generated a PCoA plot and the result is the following:


I still don't understand what's going on. Can you think of anything else I can check?

Do you have any positive controls? We should investigate those, if you have any.

We could also pull out some of your ASVs, align them to the database, and see if we can see any barcodes inside of them that have not been removed.
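
As a quick first pass before aligning anything, a plain substring search can flag ASVs that contain a known adapter or barcode sequence in either orientation. This is only a sketch: it finds exact matches, so it would miss partial or mismatched adapters. The two motifs are commonly published Illumina adapter prefixes and are included here as assumptions; swap in the sequences that match your own library prep. The function names and the toy ASVs are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def find_motif_hits(sequences, motifs):
    """Return (seq_id, motif_name, orientation, position) for every exact match."""
    hits = []
    for seq_id, seq in sequences.items():
        upper = seq.upper()
        for name, motif in motifs.items():
            queries = (
                ("forward", motif.upper()),
                ("reverse_complement", reverse_complement(motif.upper())),
            )
            for orientation, query in queries:
                pos = upper.find(query)  # -1 when the motif is absent
                if pos != -1:
                    hits.append((seq_id, name, orientation, pos))
    return hits


# Assumed adapter prefixes; verify against the adapter list for your kit
motifs = {
    "truseq": "AGATCGGAAGAGC",
    "nextera": "CTGTCTCTTATACACATCT",
}

# Hypothetical toy sequences, NOT real ASVs from this thread
toy_asvs = {
    "asv_clean": "TGGGGAATATTGCACAATGGGCGAAAGCCTGATGCAG",
    "asv_with_adapter": "ACGTACGT" + "AGATCGGAAGAGC" + "TTGACA",
    "asv_rc": "GGCC" + reverse_complement("CTGTCTCTTATACACATCT") + "AA",
}

for hit in find_motif_hits(toy_asvs, motifs):
    print(hit)
```

Checking both orientations matters because the representative sequences are merged reads, so leftover adapter from the reverse read would show up reverse-complemented.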

Colin, unfortunately I do not have any positive controls.
I've used FastQC to look for Illumina adapters, but all the files are free of them. I also assigned taxonomy to all the ASVs generated, and only 39 out of 55,901 were unassigned. The rest seem to make biological sense.
Which database do you propose to use? I don't know how to recognize adapters within the reads.
Thank you!

Good question! There's a list of Illumina adapters over here:

As for taxonomy assignment, have you tried SILVA? I've gotten good results using that database. What database did you use?

Thank you, Colin, for the adapters file. I found a few of these adapters among the raw reads; however, there were no hits among the representative sequences.
Yes, I used SILVA for taxonomy assignment and, as I said, the results made biological sense.
Let's say my samples really are very diverse. Would it be a good idea to use MUMU (a C++ implementation of LULU) to cluster the ASVs? What do you think?

MUMU you say! I didn't know Frédéric had a new package out!

While we investigate why your ASVs are so numerous, and whether that's a problem, perhaps you could open an issue on the GitHub repo and link this thread. I would love to know if this is a good use-case for MUMU.