DADA2 Trim Parameters on Low Quality Data

Hi

Due to the low quality of my sequencing run, I'm trying to measure the applicability of the Trim function in DADA2 but I am having issues understanding the best way to move forward.

I recently ran a multimarker metabarcoding library on a NovaSeq 6000 system. Due to some over-clustering issues, the quality is much lower than expected, especially at the beginning of each read. Below I've attached a quality plot of 18S V4 (Stoeck et al., 2010) amplicons after Cutadapt where the primer sequence was removed.

Although I am using 18S, the problem is almost identical for my other two markers.

As you can see the first 55 bases in read 1 and 73 bases in read 2 have very stark drops in quality compared to the rest of the sequence.

Currently, I am testing DADA2 parameters with a smaller portion of my sequences. It is the Trim parameter specifically that I am the most interested in and the most frustrated by. Attached below is a table summarizing the # of sequences and # of ASVs with different Trim parameters.

Table 1: Key Summary Statistics of DADA2 outputs for different Trim Parameters

Trim Read 1 Trim Read 2 Read length Mean % of seq after DADA2 # of ASVs
0 0 380-416 27.36 715
9 8 363-399 28.73666667 801
31 20 329-365 32.10333333 803
55 56 269-305 44.94 776
55 73 252-288 63.62333333 786

Trim locations were chosen to exclude as many low-quality locations as possible while retaining as many high-quality locations as possible.

As you can see, increasing the amount Trimmed increases the number of sequences passing the DADA2 filters yet the largest number of ASVs are identified at a moderate trim (31 Read 1, 20 Read 2).

I am worried about removing so much as to remove taxonomically distinctive areas but removing the low-quality regions has dramatically increased the number of sequences maintained and improved ASV retention to some level.

I have two questions:

  1. At what point can I consider that too much has been trimmed? Running DADA2 with the Trim parameter higher than this causes the # of ASVs to drop significantly (only 2-3 total ASVs with 230 reads). Is this the indicator that too much has been trimmed? I have been unable to find documentation discussing this issue.

  2. Should I choose Trim parameters based on the percentage of sequences that pass the filter or the # of ASVs identified? I would prefer to not miss features that are there but the # of ASVs dips before going up slightly with a higher trim percentage, this seems to suggest there are ASVs that are not being identified no matter which Trim method I use.

Thank you in advance.
Aleksei


Supplementary information:

There are two questions associated with these sequences. The first is comparing a multimarker metabarcoding approach to metatranscriptomics in eukaryotic identification. The second is characterizing the effects of freshwater connectivity in eukaryotic taxonomic composition in freshwater ponds under nutrient stress using a multi-marker metabarcoding approach. Both are environmental RNA collected from eukaryotic plankton that was reverse transcribed into cDNA before sequencing analysis.

All three markers were sequenced using the same Illumina IDT index. Cutadapt served a dual purpose, to both remove the primer sequence as well as demultiplex the three markers. In order to not have 70% of my sequences marked as undefined, I had to increase the number of allowed errors to 4 per primer sequence (p-error-rate of 0.18 for the 18S marker with slightly different for the other two to allow 4 errors per sequence).

Table 2: Summary Statistics of DADA2 outputs for different Trim Parameters

Trim Read 1 Trim Read 2 input filtered percentage of input passed filter denoised merged percentage of input merged non-chimeric percentage of input non-chimeric # of ASVs
0 0 616132 443696.33 73.48 438216.67 364489.67 62.15 159425.33 27.36 715
9 8 616132 460682.00 76.06 455396.67 380934.33 64.58 164189.67 28.74 801
31 20 616132 480805.00 79.18 475957.00 401643.00 67.80 170192.00 32.10 803
55 56 616132 517016.67 84.70 514436.67 447928.00 74.80 281215.33 44.94 776
55 73 616132 527313.00 86.32 524787.33 462345.00 77.10 384873.67 63.62 786

The table above includes the average of the # of sequences at each step of DADA2. Although each step is changed by the Trim parameters, it is the non-chimeric step which is impacted the most.

Table 3: Summary Statistics of DADA2 outputs for different Trunc Parameters when the Trim was (55 read 1, 73 read 2)

Trunc Read 1 Trunc Read2 input filtered percentage of input passed filter denoised merged percentage of input merged non-chimeric percentage of input non-chimeric # of ASVs
0 0 616132 527313.00 86.32 524787.33 462345.00 77.10 384873.67 63.62 786
230 225 616132 507007.3333 82.96333333 504584.6667 453351.3333 75.49666667 375198 61.91 821
213 210 616132 526114.3333 85.93333333 524009 450708 75.4 364475.6667 60.63666667 903

I also did some Truncation tests as seen in table 3. Although the # of sequences retained the # of ASVs increased dramatically so I will be using the Trunc parameter. That being said I’m less worried about the Trunc parameter as the end of each read is an overlap with its other read, as long as I have enough overlap there shouldn’t be any issue with Trunc.

I will be running initial taxonomic identification with PR2 using the different Trim parameters this week to see how it changes the identified taxa.

If there is anything else that I can get you please let me know.

Have a great day.

1 Like

Hello @AlekseiSychterz,

Unfortunately, it looks like you are correct that there is no easy truncation and trim amounts for you to choose based on your data.

Regarding question 1, yes I would say that definitely means too much has been trimmed. My understanding is there is often a hard drop off at some point that you should not cross.

Regarding question 2, I would imagine this depends on exactly what you intend to do with this data further down stream.

2 Likes

I see. Thank you for your reply!

You're welcome. If you still have any questions regarding ASVs vs percentage of sequences, we might be able to help answer those if you provide more information about what you are doing.