Cutadapt + DADA2: reads are removed in filtering

Dear qiime community,
I have recently run into an issue with primer removal followed by DADA2 denoising. I searched the forum for similar topics (to avoid being repetitive) and found many about read loss in DADA2 filtering and others about cutadapt on its own, but my problem seems to come from the combination of these two steps.

I am working with V3V4 paired-end reads. From my paired-end sequences (with no adapters) I removed the primers with the following command, as suggested in Remove Primer in paired-end demultiplexed file - #12 by SoilRotifer:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences paired-end-demux-A.qza \
  --p-cores 4 \
  --p-front-f CCTACGGGNGGCWGCAG \
  --p-front-r NACTACHVGGGTATCTAATCC \
  --p-match-adapter-wildcards \
  --p-match-read-wildcards \
  --p-discard-untrimmed \
  --o-trimmed-sequences trimmed-demux-A.qza

Here is the trimmed demux summary (after cutadapt): trimmed-demux-A.qzv (321.7 KB)
I assume this step worked well, as I still retained 91% of the starting forward and reverse reads.

However, when I performed the DADA2 denoising I lost all the reads in the first filtering step. See denoising-stats-A-6.qzv (1.2 MB)

I’ve read in this forum, and also observed in my own analyses, that read retention in the filtering step depends heavily on the truncation parameters. I tried different truncation lengths on the trimmed demux, based on the attached demux summary, but the read loss in DADA2 filtering was the same (or similar). I truncated the forward reads at 237 and the reverse reads at 227, 220 and 204.

Main question

Any idea what could be happening? I could truncate the reads even more (I would still have enough overlap for merging), as suggested in Lost of data with dada2 - #14 by benjjneb. However, I don’t understand why I should truncate additional positions when they appear to be of good quality in the attached demux-summary.qzv.
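For reference, here is a quick back-of-the-envelope overlap check (a sketch only: the ~427 bp insert length after removing the 341F/805R primers is my assumption, and your actual V3V4 amplicons may differ). It suggests all the truncation combinations tried above should still leave enough overlap for merging, pointing at the quality filter rather than the merge step:

```python
# Rough overlap check for DADA2 paired-end merging.
# ASSUMPTION: ~427 bp V3V4 insert after removing the 341F/805R primers.

def overlap_after_truncation(trunc_f, trunc_r, insert_len=427):
    """Bases of forward/reverse overlap left after truncation."""
    return trunc_f + trunc_r - insert_len

MIN_OVERLAP = 12  # DADA2's default minimum overlap for merging

for trunc_r in (227, 220, 204):
    ov = overlap_after_truncation(237, trunc_r)
    status = "OK" if ov >= MIN_OVERLAP else "too short to merge"
    print(f"trunc-len-f=237, trunc-len-r={trunc_r}: overlap ~ {ov} nt ({status})")
```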


Other questions

I tried other approaches that gave fine results (this “DADA2 filtering overkill” did not happen with them), but which I assume are methodologically incorrect (I would appreciate it if someone could confirm this).

  1. Not using the cutadapt-trimmed demux, but the paired-end demux directly, trimming the primer positions in DADA2 (truncation parameters based on the paired-end demux summary, not attached).

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux-A.qza \
  --p-trim-left-f 17 \
  --p-trim-left-r 21 \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 225 \
  --o-table dada2-A7/table-A-7.qza \
  --o-representative-sequences dada2-A7/rep-seqs-A-7.qza \
  --o-denoising-stats dada2-A7/denoising-stats-A-7.qza \
  --p-n-threads 24

This way, reads that do not start with the primer are kept and trimmed anyway, which may result in junk sequences (DADA2 vs Cutadapt - #3 by Mehrbod_Estaki).

  2. Running cutadapt without the --p-discard-untrimmed flag.

That would carry the same problem as option 1, I think. However, I don’t understand why this option has no issue with the DADA2 filtering while my main approach does.

  3. Running the denoising directly from the paired-end demux (not cutadapt-trimmed) and trimming only the first 5-13 positions (which are lower in quality). Nevertheless, 5-10 nucleotides of the primers would remain, and that would be incorrect, right? Or would it have no impact on the analysis?
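To make the caveat of option 1 concrete, here is a toy illustration (with made-up reads, not real data) of why positional trimming with --p-trim-left-f keeps junk at the 5' end of reads that never contained the primer, whereas cutadapt with --p-discard-untrimmed would drop them:

```python
# ASSUMPTION: hypothetical toy reads, for illustration only.
PRIMER_F = "CCTACGGGAGGCTGCAG"  # one concrete 17-nt variant of 341F

good_read = PRIMER_F + "TTAAGCGT"   # starts with the primer
junk_read = "T" * 17 + "AAGCGTAA"   # no primer at the start

# --p-trim-left-f 17 removes the first 17 bases regardless of content:
for read in (good_read, junk_read):
    print(read[17:])  # junk_read survives, primer-less, as a junk sequence
```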

I did more analyses, but I only attach the results that help in understanding the issue; the other options and parameters I assessed would only add confusion.

I’m sorry for the long post; I just wanted to make it easy to understand for anyone helping me.
Thanks for the help :slight_smile:

Hi!
I am not from the team, but I had similar issues. Based on your data, quality drops at the ends of your reads. Truncating your reads shorter may increase the merging rate by removing the bad parts of the reads. Also, I have the impression that DADA2 removes all reads shorter than the truncation length.
I would choose truncation parameters that remove as many low-quality bases at the end as possible while still keeping the overlapping region (if I am not mistaken, the minimum overlap was 20 in older DADA2 builds and is 12 by default in newer ones).
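The point about reads shorter than the truncation length can be sketched like this (my understanding of DADA2's filtering behaviour; the read lengths are made up):

```python
# DADA2 discards any read shorter than truncLen: it cannot truncate a read
# at a position the read never reaches.
read_lengths = [251, 240, 237, 230, 200]  # hypothetical raw read lengths
trunc_len = 237

kept = [length for length in read_lengths if length >= trunc_len]
print(kept)  # only reads at least trunc_len long survive this check
```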


Hi @timanix ,
Thanks a lot for your quick answer!
I ran DADA2 with increased truncation and it worked; I got reasonably good filtering stats! However, I still have two more questions you may be able to help with:

  1. I’ve seen that with this increased truncation (always keeping the overlapping region) more reads are kept in the denoising. My question: is it better to truncate more positions and keep more but slightly shorter reads, or to truncate less and have fewer but longer reads?

  2. Any advice on the other options I mentioned? As I am quite new, I am not completely sure whether they are correct or incorrect approaches.

Thanks again for your help!!


Hi again!

Since you are keeping the overlapping region, your final reads will not necessarily be shorter: you are removing low-quality bases in the overlapping region, which will be merged anyway. So it is better to truncate more if it increases the amount of good output reads.
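A small sketch of why the merged reads do not get shorter (again assuming a ~427 bp insert, which is my assumption for this V3V4 region): as long as the truncated reads still overlap by the minimum, the merged length equals the insert length, however much you truncate:

```python
MIN_OVERLAP = 12  # DADA2's default minimum overlap

def merged_length(trunc_f, trunc_r, insert_len=427):
    """Length of the merged read, or None if the pair can no longer overlap."""
    overlap = trunc_f + trunc_r - insert_len
    return insert_len if overlap >= MIN_OVERLAP else None

# Milder and harsher truncation give the same merged length:
print(merged_length(237, 227))  # 427
print(merged_length(230, 215))  # 427
# Truncating past the overlap breaks merging:
print(merged_length(220, 215))  # None
```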

You can try it, but I prefer to discard untrimmed sequences, since they will most likely be filtered out of your dataset anyway. If you have reason to think that your primers are not being removed by cutadapt, or you suspect that cutadapt is not working well on your dataset, you can still try option 3 that you mentioned earlier.

There is nothing wrong with this option either. With some datasets this approach worked better for me, but I usually use cutadapt to trim primers. It is better to try both (if you are not certain which one to choose) and decide based on the results.


Hi again @timanix !

But this option of running DADA2 directly on the untrimmed paired-end sequences would also carry the sequences that would be removed by the --p-discard-untrimmed flag, right? So it would be similar to trimming in DADA2, or to using cutadapt without that flag.

Thanks for helping!

Exactly! This gives you the opportunity to test different scenarios for processing the data.
