DADA2 Trim Parameters on Low Quality Data

AlekseiSychterz · October 22, 2024, 5:15pm

Hi

Due to the low quality of my sequencing run, I'm trying to measure the applicability of the Trim function in DADA2 but I am having issues understanding the best way to move forward.

I recently ran a multimarker metabarcoding library on a NovaSeq 6000 system. Due to some over-clustering issues, the quality is much lower than expected, especially at the beginning of each read. Below I've attached a quality plot of 18S V4 (Stoeck et al., 2010) amplicons after Cutadapt where the primer sequence was removed.

Although I am using 18S, the problem is almost identical for my other two markers.

As you can see the first 55 bases in read 1 and 73 bases in read 2 have very stark drops in quality compared to the rest of the sequence.

Currently, I am testing DADA2 parameters with a smaller portion of my sequences. It is the Trim parameter specifically that I am the most interested in and the most frustrated by. Attached below is a table summarizing the # of sequences and # of ASVs with different Trim parameters.

Table 1: Key Summary Statistics of DADA2 outputs for different Trim Parameters

Trim Read 1	Trim Read 2	Read length	Mean % of seq after DADA2	# of ASVs
0	0	380-416	27.36	715
9	8	363-399	28.73666667	801
31	20	329-365	32.10333333	803
55	56	269-305	44.94	776
55	73	252-288	63.62333333	786

Trim locations were chosen to exclude as many low-quality locations as possible while retaining as many high-quality locations as possible.

As you can see, increasing the amount Trimmed increases the number of sequences passing the DADA2 filters yet the largest number of ASVs are identified at a moderate trim (31 Read 1, 20 Read 2).

I am worried about removing so much as to remove taxonomically distinctive areas but removing the low-quality regions has dramatically increased the number of sequences maintained and improved ASV retention to some level.

I have two questions:

At what point can I consider that too much has been trimmed? Running DADA2 with the Trim parameter higher than this causes the # of ASVs to drop significantly (only 2-3 total ASVs with 230 reads). Is this the indicator that too much has been trimmed? I have been unable to find documentation discussing this issue.
Should I choose Trim parameters based on the percentage of sequences that pass the filter or the # of ASVs identified? I would prefer to not miss features that are there but the # of ASVs dips before going up slightly with a higher trim percentage, this seems to suggest there are ASVs that are not being identified no matter which Trim method I use.

Thank you in advance.
Aleksei

Supplementary information:

There are two questions associated with these sequences. The first is comparing a multimarker metabarcoding approach to metatranscriptomics in eukaryotic identification. The second is characterizing the effects of freshwater connectivity in eukaryotic taxonomic composition in freshwater ponds under nutrient stress using a multi-marker metabarcoding approach. Both are environmental RNA collected from eukaryotic plankton that was reverse transcribed into cDNA before sequencing analysis.

All three markers were sequenced using the same Illumina IDT index. Cutadapt served a dual purpose, to both remove the primer sequence as well as demultiplex the three markers. In order to not have 70% of my sequences marked as undefined, I had to increase the number of allowed errors to 4 per primer sequence (p-error-rate of 0.18 for the 18S marker with slightly different for the other two to allow 4 errors per sequence).

Table 2: Summary Statistics of DADA2 outputs for different Trim Parameters

Trim Read 1	Trim Read 2	input	filtered	percentage of input passed filter	denoised	merged	percentage of input merged	non-chimeric	percentage of input non-chimeric	# of ASVs
0	0	616132	443696.33	73.48	438216.67	364489.67	62.15	159425.33	27.36	715
9	8	616132	460682.00	76.06	455396.67	380934.33	64.58	164189.67	28.74	801
31	20	616132	480805.00	79.18	475957.00	401643.00	67.80	170192.00	32.10	803
55	56	616132	517016.67	84.70	514436.67	447928.00	74.80	281215.33	44.94	776
55	73	616132	527313.00	86.32	524787.33	462345.00	77.10	384873.67	63.62	786

The table above includes the average of the # of sequences at each step of DADA2. Although each step is changed by the Trim parameters, it is the non-chimeric step which is impacted the most.

Table 3: Summary Statistics of DADA2 outputs for different Trunc Parameters when the Trim was (55 read 1, 73 read 2)

Trunc Read 1	Trunc Read2	input	filtered	percentage of input passed filter	denoised	merged	percentage of input merged	non-chimeric	percentage of input non-chimeric	# of ASVs
0	0	616132	527313.00	86.32	524787.33	462345.00	77.10	384873.67	63.62	786
230	225	616132	507007.3333	82.96333333	504584.6667	453351.3333	75.49666667	375198	61.91	821
213	210	616132	526114.3333	85.93333333	524009	450708	75.4	364475.6667	60.63666667	903

I also did some Truncation tests as seen in table 3. Although the # of sequences retained the # of ASVs increased dramatically so I will be using the Trunc parameter. That being said I’m less worried about the Trunc parameter as the end of each read is an overlap with its other read, as long as I have enough overlap there shouldn’t be any issue with Trunc.

I will be running initial taxonomic identification with PR2 using the different Trim parameters this week to see how it changes the identified taxa.

If there is anything else that I can get you please let me know.

Have a great day.

Oddant1 · October 22, 2024, 8:57pm

Hello @AlekseiSychterz,

Unfortunately, it looks like you are correct that there is no easy truncation and trim amounts for you to choose based on your data.

Regarding question 1, yes I would say that definitely means too much has been trimmed. My understanding is there is often a hard drop off at some point that you should not cross.

Regarding question 2, I would imagine this depends on exactly what you intend to do with this data further down stream.

AlekseiSychterz · October 23, 2024, 7:40pm

I see. Thank you for your reply!

Oddant1 · October 24, 2024, 5:08pm

You're welcome. If you still have any questions regarding ASVs vs percentage of sequences, we might be able to help answer those if you provide more information about what you are doing.

system · November 24, 2024, 11:09pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.