Data loss when trimming ITS

paulad · September 16, 2021, 2:53pm

Hi all,

I am analyzing some ITS sequences using this tutorial and I have some questions regarding trimming. After creating the demux.qza file, I used the qiime cutadapt trim-paired command to cut the adapters and got the demux-trimmed.qza file.

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-adapter-f GCATATCAATAAGCGGAGGA \
  --p-front-f GCATCGATGAAGAACGCAGC \
  --p-adapter-r GCTGCGTTCTTCATCGATGC \
  --p-front-r TCCTCCGCTTATTGATATGC \
  --o-trimmed-sequences demux-trimmed.qza
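
As a sanity check on the four sequences passed to the command above, here is a minimal Python sketch. It is not part of q2-cutadapt, and the helper names are hypothetical. It only tests whether each `adapter` sequence is the reverse complement of the opposite read's `front` sequence, which is the usual relationship when a read runs through the amplicon into the other primer. It does not say whether these are the right sequences for a particular sequencing run.

```python
# Sequences copied verbatim from the cutadapt command above; keys mirror the flag names.
SEQS = {
    "front_f": "GCATCGATGAAGAACGCAGC",
    "front_r": "TCCTCCGCTTATTGATATGC",
    "adapter_f": "GCATATCAATAAGCGGAGGA",
    "adapter_r": "GCTGCGTTCTTCATCGATGC",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def check_read_through_adapters(seqs: dict) -> dict:
    """Hypothetical helper: compare each 3' adapter with the reverse
    complement of the primer expected at the 5' end of the opposite read."""
    return {
        "adapter_f_is_revcomp_of_front_r":
            seqs["adapter_f"] == reverse_complement(seqs["front_r"]),
        "adapter_r_is_revcomp_of_front_f":
            seqs["adapter_r"] == reverse_complement(seqs["front_f"]),
    }


if __name__ == "__main__":
    for name, ok in check_read_through_adapters(SEQS).items():
        print(name, ok)
```

If either check came back false, a typo in one of the four sequences would be the first thing to rule out.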

Then I proceeded to the dada2 step with both the non-trimmed demux.qza and the trimmed demux-trimmed.qza to compare the results. My results from qiime dada2 denoise-paired were far from ideal (a high percentage of sequences failed to merge), so I also ran qiime dada2 denoise-single using only the forward reads. These analyses gave me a total of 4 options:

1. non-trimmed-single-end-table.qza
2. non-trimmed-paired-end-table.qza
3. trimmed-single-end-table.qza
4. trimmed-paired-end-table.qza

Table no. 1 (non-trimmed-single-end) has the "best-looking" data, meaning that I could keep the highest number of samples and use them in downstream analyses. With table no. 4 (trimmed-paired-end), by contrast, I can do almost nothing with the data because of the low feature count and the loss of too many samples in downstream analyses.

So my question is: may I proceed with my top-choice table, or is using only the forward reads with non-trimmed adapters generally not a good idea? Is trimming an essential step even if it results in a great loss of data?

For reference, my sequences are from the ITS2 region, amplified with the ITS3 and ITS4 primers. The quality of the sequences is pretty good (no need to trim or truncate in dada2).

non-trimmed-single-end-table.qzv (646.4 KB)
non-trimmed-paired-end-table.qzv (570.3 KB)
trimmed-single-end-table.qzv (459.9 KB)
trimmed-paired-end-table.qzv (444.0 KB)

thermokarst · September 20, 2021, 2:40pm

Hi @paulad!

There are two reasons to trim nucleotides off of a read:

1. Remove low quality data
2. Remove non-biological data

Item #1 is (for all intents and purposes) optional, while item #2 isn't - downstream analyses will mostly be incorrect if you leave non-biological data in your reads.

Let's take a step back and try to understand what is happening when running q2-cutadapt. Can you re-run the command above, but this time include the --verbose flag? Then copy and paste the results here, and we can use them to diagnose what is going on in this step.

Thanks! :qiime2:

paulad · September 21, 2021, 1:28pm

Hi @thermokarst,
thank you for your answer and explanation! So if I understood correctly, if I can successfully trim out the non-biological data using q2-cutadapt, then trimming and truncation in dada2 are not necessary (provided the reads are good quality, of course)?

I re-ran q2-cutadapt as you suggested, and here is the verbose output file.
verbose.txt (709.9 KB)

thermokarst · September 23, 2021, 7:33pm

Hi @paulad!

Yes! What matters is that the non-biological data is removed. For folks with uniform amplicon lengths, using the trimming features in q2-dada2 is usually simpler (it's one less step), but as you point out, with ITS it's a little more complicated - q2-cutadapt is a great option for that!

Thanks!

There are quite a few warnings in this log - did you see those? There are ~270 warnings that look like these two:

WARNING:
The adapter is preceded by "A" extremely often.
The provided adapter sequence could be incomplete at its 5' end.

WARNING:
The adapter is preceded by "G" extremely often.
The provided adapter sequence could be incomplete at its 5' end.

Is it possible that you have a typo in the q2-cutadapt command and that you've possibly not provided the complete adapter sequence? Perhaps you can check with your sequencing provider to confirm the adapters used?
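
To get a quick overview of which bases these warnings report, here is a minimal Python sketch that tallies them from the log text. The function name is hypothetical, and the pattern assumes every warning uses exactly the wording quoted above. The sample string contains only the two warnings shown in this post, not the contents of the attached log.

```python
import re
from collections import Counter

# Matches the warning line quoted above; the captured group is the base
# reported immediately before the adapter.
_PRECEDED_BY = re.compile(
    r'The adapter is preceded by "([ACGTN])" extremely often\.'
)


def tally_preceding_bases(log_text: str) -> Counter:
    """Hypothetical helper: count 'preceded by' warnings per reported base."""
    return Counter(_PRECEDED_BY.findall(log_text))


# The two warnings quoted in this post, used as sample input.
SAMPLE = """WARNING:
The adapter is preceded by "A" extremely often.
The provided adapter sequence could be incomplete at its 5' end.

WARNING:
The adapter is preceded by "G" extremely often.
The provided adapter sequence could be incomplete at its 5' end.
"""

if __name__ == "__main__":
    for base, count in sorted(tally_preceding_bases(SAMPLE).items()):
        print(base, count)
```

To run it on the full log, read the saved verbose output into a string (for example with `pathlib.Path("verbose.txt").read_text()`) and pass that to `tally_preceding_bases` instead of `SAMPLE`.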

Keep us posted! :qiime2:
