Hi
Due to the low quality of my sequencing run, I'm trying to work out how best to use the Trim parameter (trimLeft in filterAndTrim) in DADA2, but I'm having trouble deciding on the best way to move forward.
I recently ran a multimarker metabarcoding library on a NovaSeq 6000 system. Due to some over-clustering issues, the quality is much lower than expected, especially at the beginning of each read. Below I've attached a quality plot of 18S V4 (Stoeck et al., 2010) amplicons after Cutadapt where the primer sequence was removed.
Although I am using 18S, the problem is almost identical for my other two markers.
As you can see, the first 55 bases of read 1 and the first 73 bases of read 2 show a stark drop in quality compared to the rest of the read.
Currently, I am testing DADA2 parameters on a subset of my sequences. It is the Trim parameter specifically that I am the most interested in, and the most frustrated by. The table below summarizes the # of sequences and # of ASVs for different Trim parameters.
Table 1: Key Summary Statistics of DADA2 outputs for different Trim Parameters
Trim Read 1 | Trim Read 2 | Merged length (bp) | Mean % of seq after DADA2 | # of ASVs |
---|---|---|---|---|
0 | 0 | 380-416 | 27.36 | 715 |
9 | 8 | 363-399 | 28.74 | 801 |
31 | 20 | 329-365 | 32.10 | 803 |
55 | 56 | 269-305 | 44.94 | 776 |
55 | 73 | 252-288 | 63.62 | 786 |
Trim locations were chosen to exclude as much of the low-quality region as possible while retaining as much of the high-quality region as possible.
As you can see, increasing the amount trimmed increases the number of sequences passing the DADA2 filters, yet the largest number of ASVs is identified at a moderate trim (31 for read 1, 20 for read 2).
I am worried about trimming so much that taxonomically distinctive regions are removed, but removing the low-quality regions has dramatically increased the number of sequences retained and improved ASV counts to some extent.
I have two questions:
- At what point can I consider that too much has been trimmed? Running DADA2 with Trim values higher than those in the table causes the # of ASVs to drop significantly (only 2-3 total ASVs with 230 reads). Is this the indicator that too much has been trimmed? I have been unable to find documentation discussing this issue.
- Should I choose Trim parameters based on the percentage of sequences that pass the filter, or on the # of ASVs identified? I would prefer not to miss features that are really there, but the # of ASVs dips before rising slightly again at the highest trim, which seems to suggest there are ASVs that go undetected no matter which Trim setting I use.
Thank you in advance.
Aleksei
Supplementary information:
Two research questions are associated with these sequences. The first compares a multimarker metabarcoding approach to metatranscriptomics for eukaryotic identification. The second characterizes the effects of freshwater connectivity on eukaryotic taxonomic composition in freshwater ponds under nutrient stress, using a multi-marker metabarcoding approach. Both use environmental RNA collected from eukaryotic plankton and reverse transcribed into cDNA before sequencing.
All three markers were sequenced using the same Illumina IDT index. Cutadapt served a dual purpose: removing the primer sequences and demultiplexing the three markers. To avoid having 70% of my sequences marked as undefined, I had to increase the number of allowed errors to 4 per primer sequence (--p-error-rate of 0.18 for the 18S marker, with slightly different values for the other two so that each allows 4 errors per primer).
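In case it helps anyone sanity-check their own -e values: as far as I understand, cutadapt permits floor(error_rate × match_length) mismatches for a full-length primer match. A quick sketch of that arithmetic (the 23-nt and 20-nt primer lengths below are purely illustrative, not my actual primers):

```python
import math

def allowed_errors(error_rate: float, primer_len: int) -> int:
    """Mismatches cutadapt permits for a full-length primer match:
    floor(error_rate * match_length)."""
    return math.floor(error_rate * primer_len)

def min_error_rate(errors: int, primer_len: int) -> float:
    """Smallest error rate that still permits `errors` mismatches."""
    return errors / primer_len

# Hypothetical primer lengths, for illustration only:
print(allowed_errors(0.18, 23))  # -> 4
print(allowed_errors(0.18, 20))  # -> 3
print(min_error_rate(4, 20))     # -> 0.2
```

So the rate needed for 4 errors depends on each primer's length, which is why the three markers needed slightly different -e values.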
Table 2: Summary Statistics of DADA2 outputs for different Trim Parameters
Trim Read 1 | Trim Read 2 | input | filtered | percentage of input passed filter | denoised | merged | percentage of input merged | non-chimeric | percentage of input non-chimeric | # of ASVs |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 616132 | 443696.33 | 73.48 | 438216.67 | 364489.67 | 62.15 | 159425.33 | 27.36 | 715 |
9 | 8 | 616132 | 460682.00 | 76.06 | 455396.67 | 380934.33 | 64.58 | 164189.67 | 28.74 | 801 |
31 | 20 | 616132 | 480805.00 | 79.18 | 475957.00 | 401643.00 | 67.80 | 170192.00 | 32.10 | 803 |
55 | 56 | 616132 | 517016.67 | 84.70 | 514436.67 | 447928.00 | 74.80 | 281215.33 | 44.94 | 776 |
55 | 73 | 616132 | 527313.00 | 86.32 | 524787.33 | 462345.00 | 77.10 | 384873.67 | 63.62 | 786 |
The table above reports the mean # of sequences at each step of DADA2. Although every step is affected by the Trim parameters, it is the chimera-removal (non-chimeric) step that is impacted the most.
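To put a number on that, here is a quick calculation (plain Python, counts copied from Table 2) of the fraction of merged reads flagged as chimeric at the two extreme Trim settings:

```python
# Mean counts from Table 2 for the two extreme Trim settings.
merged = {"trim 0/0": 364489.67, "trim 55/73": 462345.00}
nonchim = {"trim 0/0": 159425.33, "trim 55/73": 384873.67}

for setting in merged:
    lost = 1 - nonchim[setting] / merged[setting]
    print(f"{setting}: {lost:.1%} of merged reads flagged as chimeric")
```

Roughly 56% of merged reads are lost to chimera removal with no trimming versus roughly 17% at the 55/73 trim, which I read (tentatively) as the low-quality 5′ ends producing spurious variants that look chimeric.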
Table 3: Summary Statistics of DADA2 outputs for different Trunc Parameters when the Trim was (55 read 1, 73 read 2)
Trunc Read 1 | Trunc Read 2 | input | filtered | percentage of input passed filter | denoised | merged | percentage of input merged | non-chimeric | percentage of input non-chimeric | # of ASVs |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 616132 | 527313.00 | 86.32 | 524787.33 | 462345.00 | 77.10 | 384873.67 | 63.62 | 786 |
230 | 225 | 616132 | 507007.33 | 82.96 | 504584.67 | 453351.33 | 75.50 | 375198.00 | 61.91 | 821 |
213 | 210 | 616132 | 526114.33 | 85.93 | 524009.00 | 450708.00 | 75.40 | 364475.67 | 60.64 | 903 |
I also ran some truncation tests, shown in Table 3. Although the # of sequences retained dropped slightly, the # of ASVs increased dramatically, so I will be using the Trunc parameter. That being said, I'm less worried about the Trunc parameter, since the end of each read overlaps with its mate; as long as I keep enough overlap there shouldn't be any issue with Trunc.
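The remaining overlap can be sanity-checked with some quick arithmetic. Assuming truncLen is counted from the start of the post-cutadapt read (filterAndTrim outputs reads of length truncLen - trimLeft, with trimLeft taken off the 5′ end, so trimming shouldn't eat into the 3′ overlap) and DADA2's default mergePairs minOverlap of 12, a rough check against the 380-416 bp post-primer amplicon range from Table 1 looks like:

```python
# Worst-case 3' overlap for a truncLen choice, assuming truncLen is
# counted from the start of the raw (post-cutadapt) read, so
# overlap = truncLen1 + truncLen2 - amplicon_length.
MIN_OVERLAP = 12  # dada2::mergePairs default

def worst_case_overlap(trunc1: int, trunc2: int, max_amplicon_len: int) -> int:
    return trunc1 + trunc2 - max_amplicon_len

# Post-primer amplicon lengths from Table 1 (trim 0/0): 380-416 bp.
for t1, t2 in [(230, 225), (213, 210)]:
    ov = worst_case_overlap(t1, t2, 416)
    ok = "OK" if ov >= MIN_OVERLAP else "too short for the longest amplicons"
    print(f"truncLen = ({t1}, {t2}): worst-case overlap {ov} nt -> {ok}")
```

If this arithmetic holds, the 213/210 setting leaves only ~7 nt of overlap for the longest (416 bp) amplicons, so the longest sequences may be failing to merge at that setting, which could be worth checking before committing to it.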
I will be running initial taxonomic identification with PR2 using the different Trim parameters this week to see how it changes the identified taxa.
If there is anything else I can provide, please let me know.
Have a great day.