Extremely high number of ASV in bacteria data

CrisZ · June 14, 2022, 1:31pm

This may be a weird question, but when I compare my results with those of other researchers, it seems like something is wrong.

I have processed 284 soil samples, sequenced by MiSeq 2x275 paired-end (v3-v4 amplicon), whose quality plots looked like this:

After removing the adapters and assuming that the amplicon length is 428 nt, I used DADA2 with many different parameters to choose the best option:

--p-trunc-len-f 240
--p-trunc-len-r 240
--p-trim-left-f 0
--p-trim-left-r 0
Number of ASVs: 39,292

--p-trunc-len-f 246
--p-trunc-len-r 244
--p-trim-left-f 5
--p-trim-left-r 5
Number of ASVs: 35,981

--p-trunc-len-f 240
--p-trunc-len-r 240
--p-trim-left-f 15
--p-trim-left-r 15
Number of ASVs: 38,695

--p-trunc-len-f 230
--p-trunc-len-r 220
--p-trim-left-f 15
--p-trim-left-r 15
Number of ASVs: 54,497

--p-trunc-len-f 230
--p-trunc-len-r 220
--p-trim-left-f 0
--p-trim-left-r 0
Number of ASVs: 55,242

--p-trunc-len-f 246
--p-trunc-len-r 244
--p-trim-left-f 0
--p-trim-left-r 0
Number of ASVs: 55,901

The number of ASVs increased as I truncated the reads more and more, which is the goal I want to achieve. Here are the denoising statistics from the last set of denoising parameters:
NGS100-21-RUN-1_denoising-stats.qzv (1.2 MB)

However, I have observed that other researchers, even with very similar data, obtain far fewer ASVs, such as 1,000-2,000.
I have been able to assign taxonomy to these ASVs, but I am still confused.
Is this normal?
How can I check that I have not made any mistakes?

CrisZ · June 15, 2022, 9:54am

There is a mistake in the post; the last truncation parameters were:

--p-trunc-len-f 230
--p-trunc-len-r 210
--p-trim-left-f 0
--p-trim-left-r 0
Number of ASVs: 55,901

My apologies.

colinbrislawn · June 15, 2022, 1:11pm

Hello Cristina,

Welcome to the forums!

This is a great question!

I've summarized your results in this table so we can review.
I've added a column for the expected overlap.

trunc-len-f	trunc-len-r	trim-left-f	trim-left-r	ASVs	overlap
240	240	0	0	39292	52
246	244	5	5	35981	62
240	240	15	15	38695	52
230	220	15	15	54497	22
230	220	0	0	55242	22
230	210	0	0	55901	12

Some observations:

Trimming 15 from the start of both reads does not appear to change the total number of ASVs much compared to trimming zero.
Decreasing the expected overlap DOES appear to have an effect on number of ASVs!

240 + 240 - 428 = 52 bp expected overlap
230 + 210 - 428 = 12 bp expected overlap

While DADA2 is able to join reads with only 12 bp of overlap, I wonder if merging is not working as well with so little overlap.
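
The overlap arithmetic above can be sketched in a few lines of Python. This is only an illustration: it assumes every amplicon is exactly 428 nt, as stated in the original post, while real V3-V4 amplicons vary in length, so the true overlap will differ from read pair to read pair. The function name `expected_overlap` is made up for this example.

```python
def expected_overlap(trunc_len_f: int, trunc_len_r: int, amplicon_len: int) -> int:
    """Expected overlap (bp) between forward and reverse reads after truncation."""
    return trunc_len_f + trunc_len_r - amplicon_len


AMPLICON_LEN = 428  # amplicon length assumed in this thread, not a measured value

# (trunc-len-f, trunc-len-r) pairs taken from the table above
settings = [(240, 240), (246, 244), (230, 220), (230, 210)]

overlaps = {pair: expected_overlap(pair[0], pair[1], AMPLICON_LEN) for pair in settings}

for (trunc_f, trunc_r), bp in overlaps.items():
    print(f"trunc-len-f={trunc_f} trunc-len-r={trunc_r}: {bp} bp expected overlap")
```

Any amplicon longer than the assumed 428 nt would have correspondingly less overlap, which is why the 12 bp setting leaves so little margin.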

Maybe your samples are much more diverse!

You could compare the known composition of any positive controls you included on the run to their observed composition after processing. (Unfortunately, lots of people don't run positive controls. Do you have any?)

One cause of unintended feature inflation is barcodes getting into the ASVs. When you cluster these in a PCoA plot, do samples overlap at all, or is each ASV coming from one sample?
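
Besides the PCoA plot, a rough way to check this is to count how many samples each ASV appears in. Here is a minimal sketch, assuming the feature table has been exported into a plain dictionary that maps each ASV ID to its per-sample counts. The `toy_table` data and both function names are made up for illustration and are not part of QIIME 2.

```python
def asv_prevalence(table: dict[str, dict[str, int]]) -> dict[str, int]:
    """Return the number of samples in which each ASV has a non-zero count."""
    return {
        asv_id: sum(1 for count in counts.values() if count > 0)
        for asv_id, counts in table.items()
    }


def fraction_single_sample(table: dict[str, dict[str, int]]) -> float:
    """Return the fraction of ASVs that are observed in exactly one sample."""
    prevalence = asv_prevalence(table)
    if not prevalence:
        return 0.0
    singles = sum(1 for n_samples in prevalence.values() if n_samples == 1)
    return singles / len(prevalence)


# Hypothetical toy data: ASV ID -> {sample ID -> read count}
toy_table = {
    "ASV_1": {"S1": 10, "S2": 4, "S3": 0},
    "ASV_2": {"S1": 0, "S2": 7, "S3": 0},
    "ASV_3": {"S1": 3, "S2": 0, "S3": 9},
    "ASV_4": {"S1": 0, "S2": 0, "S3": 2},
}

print(asv_prevalence(toy_table))
print(fraction_single_sample(toy_table))
```

If nearly every ASV shows up in only one sample, that would be consistent with sample-specific sequence (such as leftover barcodes) ending up inside the ASVs, although very diverse soil samples could also produce many rare features.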

CrisZ · June 20, 2022, 6:38pm

Dear Colin J Brislawn
Thank you for the reply!
I have generated a PCoA plot and the result is the following:

I still don't understand what's going on. Can you think of anything else I can check?

colinbrislawn · June 20, 2022, 8:42pm

Do you have any positive controls? We should investigate those, if you have any.

We could also pull out some of your ASVs, align them to the database, and see if we can see any barcodes inside of them that have not been removed.

CrisZ · June 21, 2022, 9:55am

Colin, unfortunately I do not have any positive controls.
I've used FastQC to look for Illumina adapters; however, all the files are free of them. I also assigned taxonomy to all the ASVs generated, and only 39 out of 55,901 were unassigned. The rest seem to make biological sense.
Which database do you propose to use? I don't know how to recognize adapters within the reads.
Thank you!

colinbrislawn · June 21, 2022, 3:39pm

Good question! There's a list of Illumina adapters over here:

github.com

BioInfoTools/BBMap/blob/master/resources/adapters.fa

>Reverse_adapter
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Universal_Adapter
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>pcr_dimer
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG
>PCR_Primers
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTCAAGCAGAAGACGGCATACGAGCTCTTCCGATCT
>TruSeq_Adapter_Index_1_6
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_2
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_3
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_4
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_5
GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_6
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG

This file has been truncated.
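
To look for these adapters inside the representative sequences, one simple approach is an exact substring search for the start of each adapter, in both orientations. The sketch below is only that: it assumes exact matches of the first `k` bases (an arbitrary cutoff chosen for this example), whereas dedicated trimming tools also allow mismatches and partial matches. The function names and the toy ASV sequences are made up for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into {record name: uppercase sequence}."""
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def find_adapter_hits(asvs: dict[str, str], adapters: dict[str, str], k: int = 20):
    """List (asv_id, adapter_id) pairs where the first k bases of an adapter,
    or their reverse complement, occur exactly within the ASV sequence."""
    hits = []
    for asv_id, seq in asvs.items():
        for adapter_id, adapter in adapters.items():
            probe = adapter[:k]
            if probe in seq or reverse_complement(probe) in seq:
                hits.append((asv_id, adapter_id))
    return hits


# One adapter copied from the list above
adapters = parse_fasta(
    ">TruSeq_Universal_Adapter\n"
    "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT\n"
)

# Hypothetical ASVs: one with adapter sequence embedded, one without
asvs = {
    "ASV_contaminated": "ACGTACGT" + adapters["TruSeq_Universal_Adapter"][:20] + "GGCCTTAA",
    "ASV_clean": "ACGTTGCA" * 10,
}

hits = find_adapter_hits(asvs, adapters)
print(hits)
```

In practice the `asvs` dictionary would come from parsing the exported representative sequences FASTA file with `parse_fasta`, and `adapters` from the full adapter file.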

As for taxonomy assignment, have you tried SILVA? I've gotten good results using that database. What database did you use?

CrisZ · June 22, 2022, 11:33am

Thank you, Colin, for the adapters file. I found a few of these adapters among the raw reads; however, there were no hits among the representative sequences.
Yes, I used SILVA for taxonomy assignment and, as I said, the results made biological sense.
Let's say my samples are very diverse. I wonder whether it would be a good idea to use MUMU (which is the implementation of LULU for Linux) to cluster the ASVs. What do you think?

colinbrislawn · June 23, 2022, 12:28am

MUMU you say! I didn't know Frédéric had a new package out!

While we investigate why your ASVs are so numerous, and whether that's a problem, perhaps you could open an issue on the GitHub repo and link this thread. I would love to know if this is a good use case for MUMU.

system · July 24, 2022, 6:28am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.