Losing about 40% of sequences in filtering step

fabipc · July 25, 2019, 6:41pm

Hello,

I have a quick question regarding DADA2. I am using primers (S-D-Bact-0341-b-S-17/S-D-Bact-0785-a-A-21) to amplify the 16S region. I have attached my file containing the demux-trimmed sequences, which I am using to choose the trim and trunc parameters, but with every setting I have tried, I keep losing a high number of sequences in the filtering step.

I am not 100% sure what I am doing wrong or how to properly choose parameters that result in fewer losses. It was suggested that I should increase my error rate when I run qiime cutadapt trim-paired, but I do not think that is the problem.

Do you have any suggestions for me? I am unsure how to proceed.

Bacteria-demux-trimmed.qzv (306.0 KB) Bacteria-stats-dada2-2.qzv (1.2 MB)

dimitely · July 26, 2019, 1:28am

Maybe your adapters and primers weren't completely removed. You could try trimming a few more bp and see whether the filtering result improves. I had a similar problem before: I trimmed about 20 bp from the forward sequences in the dada2 step, and the share of sequences lost to filtering dropped to less than 10%.

Mehrbod_Estaki · July 26, 2019, 7:03am

Hi @fabipc,
While these filtering rates are a bit higher than usual, they are not terrible. Your merging, chimera detection, etc. are looking good too. In the end you are still left with plenty of reads to carry on with your analysis, as a worst-case scenario.
My gut tells me there's something fishy going on with your forward reads: there appears to be a step-wise drop in your quality scores, which is not usual and doesn't happen in your reverse reads, and shorter reads start to appear earlier in your forward reads than they do in your reverse ones. I also notice that you are trimming from the 5' of your forward reads by 131 bp! Is there a specific reason for this? This is way too high; considering the great quality scores you have, I would back that off to somewhere between 0 and 20. No reason to get rid of so much info.
I've never tested this personally and I don't think this is the problem, but maybe, just maybe, the error model is acting funny when you start it so late into the quality scores?
I would try rerunning it and see what happens. Worst-case scenario, you end up with the same number of reads, but the reads are 131 bp longer.
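
To put a number on how much sequence that 5' trim costs, here is a minimal sketch. It assumes the forward truncation length of 270 that comes up elsewhere in this thread, and it relies on DADA2's documented behavior that, when both a truncation length and a left trim are given, filtered reads end up with length truncLen minus trimLeft. The helper name `retained_length` is made up for illustration.

```shell
# Length of each read that survives filtering when both a 3' truncation
# position and a 5' trim are given (DADA2: truncLen - trimLeft).
# ASSUMPTION: 270 is the forward truncation length used in the first run.
retained_length() {
  trunc_len=$1
  trim_left=$2
  echo $((trunc_len - trim_left))
}

retained_length 270 131   # prints 139 (trimming 131 bp from the 5' end)
retained_length 270 20    # prints 250 (trimming 20 bp)
retained_length 270 0     # prints 270 (no 5' trimming)
```

Under these assumptions, the 131 bp trim leaves only about half of each forward read for denoising and merging.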

Nicholas_Bokulich · July 26, 2019, 11:33am

Just echoing @Mehrbod_Estaki here, 40% is not too bad, and adding some more information for context:

I believe this sudden drop is because those reads are already trimmed by cutadapt — the reads longer than 270 nt are evidently so riddled with errors that cutadapt cannot trim them, so once you move beyond 270 nt (the max length of 90% of your reads), these "bad seeds" are left behind.

You also have a number of forward reads that are < 270 nt... around 10%. So by truncating the forward reads at 270, you are automatically losing these short reads. Instead of manually truncating, since you have paired-end reads you could try using the --p-trunc-q parameter to automatically trim reads where quality begins to drop too low. This may preserve some more of those short reads (though you might still lose them during merging?), while handling the "bad seeds" a bit more efficiently.

I agree! Usually the front trim parameters are used for removing primers, but you have already trimmed out your primers with cutadapt. So I recommend not trimming the forward or reverse reads (unless you know some adapter remains there). Continue to use the truncation parameters to truncate at the 3' ends.

Maybe reduce your trunc-len-r parameter to 220 or less if you can (if that still leaves enough overlap to join the reads), unless you use the --p-trunc-q parameter as I recommended above.
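
A quick sanity check on whether a pair of truncation lengths still leaves enough overlap to join the reads can be sketched as below. The amplicon length of 425 nt is an assumed placeholder for a primer-free V3-V4 amplicon, not a number from this thread, and real lengths vary between taxa. The 12 nt minimum is, to the best of my knowledge, the default minimum overlap in DADA2's merging step, so treat it as an assumption too. `expected_overlap` is a hypothetical helper.

```shell
# Expected overlap between forward and reverse reads after truncation:
# trunc_f + trunc_r - amplicon_len.
expected_overlap() {
  trunc_f=$1
  trunc_r=$2
  amplicon_len=$3
  echo $((trunc_f + trunc_r - amplicon_len))
}

AMPLICON_LEN=425   # ASSUMPTION: primer-free amplicon length, check your own data
MIN_OVERLAP=12     # ASSUMPTION: default minimum overlap required for merging

overlap=$(expected_overlap 270 220 "$AMPLICON_LEN")
if [ "$overlap" -ge "$MIN_OVERLAP" ]; then
  echo "overlap of $overlap nt: should be enough to merge"
else
  echo "overlap of $overlap nt: too short to merge"
fi
```

Because amplicon length varies, it is safer to leave a comfortable margin above the minimum rather than aim for exactly 12 nt.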

fabipc · July 26, 2019, 7:04pm

Thank you everyone for giving me advice on how to proceed with the analysis, and for helping me understand DADA2.

As @Nicholas_Bokulich mentioned, I have already removed the primers using cutadapt, but I did not know that p-trim-left was used to remove primers. So thank you very much for helping me understand that. So given this information, I will perform a run with only the trunc parameter and see how it turns out. I will keep you posted.

> I also notice that you are trimming from the 5’ of your forward reads by 131 bp! Is there a specific reason for this?

@Mehrbod_Estaki
I had chosen the first parameters to trim the 5' end at that point just because I wasn't sure whether I could use the region of the bar plot where there was only an error bar and no actual bar, so I skipped all of those positions on the first run, just to test it, which gave not-so-great results. I will go ahead and try trimming at 20 or so.

As of today, I have tested a few parameters, and trimming at the lower end of the 5' (45) gave me better results, but I am still losing quite a bit at the filtering step, so I was quite confused. Sorry I did not upload that file initially, but the code was still running. I was about to test more parameters today, kind of blindly, based on what I understood of DADA2, but thank you for all of your advice.

**I will go ahead and run with the --p-trunc-q parameter and then test one run with only trimming at the 3' end.**

Just to clarify: if I use the --p-trunc-q parameter, do I still specify the other parameters (trunc/trim)?
From what I see in the documentation for DADA2, --p-trunc-q is used to discard any reads with quality less than or equal to the stated value. It does say that the default is 2. The tutorials I had read before said that a quality score of >= 20 is a good one to choose as a parameter. Is this then the number that I should choose for this parameter?

**UPDATE:** After running my code with a trunc-q of 20 and losing 80% of sequences to filtering, I found this post (Dada2 denoise - why do I have so many reads filtered out - #11 by Mehrbod_Estaki), which had a great explanation of the trunc-q value that I should work with, given that the default is 2. I will rerun my code with a smaller value and check in with a separate post in this thread.

Again, thank you guys for all your help. I definitely could not get through all this work without the forum and without all of your advice and help.

Nicholas_Bokulich · July 27, 2019, 11:29am

No, use either --p-trunc-q or the trunc parameters (since both of these control where 3' truncation occurs). You can still use the trim parameters with either --p-trunc-q or trunc (since trim controls 5' trimming and will not conflict with the truncation settings).

Sounds like you have figured out this parameter! Good luck and let us know how things go.
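
For context on why a trunc-q of 20 behaves so differently from the default of 2, here is a small sketch. It assumes the usual Phred relation p = 10^(-Q/10) and the documented DADA2 behavior of truncating a read at the first base whose quality is less than or equal to the threshold; a truncated read that ends up shorter than the requested trunc-len is then discarded. The quality scores and the helper names (`phred_to_error`, `kept_length`) are invented for illustration.

```shell
# Approximate per-base error probability for a Phred score: p = 10^(-Q/10).
phred_to_error() {
  awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

# Number of bases a read keeps when it is cut at the first position whose
# quality score is <= the threshold (first argument). The remaining
# arguments are the per-base quality scores of one hypothetical read.
kept_length() {
  threshold=$1
  shift
  n=0
  for q in "$@"; do
    if [ "$q" -le "$threshold" ]; then
      break
    fi
    n=$((n + 1))
  done
  echo "$n"
}

phred_to_error 20                  # prints 0.0100
phred_to_error 2                   # prints 0.6310
kept_length 20 38 38 37 30 19 35   # prints 4
kept_length 2 38 38 37 30 19 35    # prints 6
```

With a threshold of 20, a single base of quality 19 cuts this toy read down to 4 bases, so it would fail any trunc-len above 4. With the default of 2, the read is left whole.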

fabipc · July 28, 2019, 10:16pm

Hi Nicholas,

If I run the code using only --p-trunc-q, I receive an error saying (1/2) Missing option "--p-trunc-len-f" and (2/2) Missing option "--p-trunc-len-r". So I had to enter both values, but I did not enter the trim parameters.

I just ran the code using
    qiime dada2 denoise-paired \
      --i-demultiplexed-seqs Bacteria-demux-trimmed.qza \
      --p-trunc-len-f 229 \
      --p-trunc-len-r 224 \
      --p-trunc-q 0 \
      --o-table Bacteria-table-dada2-14.qza \
      --o-representative-sequences Bacteria-RepSeqs-dada2-14.qza \
      --o-denoising-stats Bacteria-stats-dada2-14.qza

This is the output: Bacteria-stats-dada2-14.qzv (1.2 MB)

I can see that I increased the number of sequences passing the filter by 10%, so I am now at 70%. I feel like this is a good number, but I am testing out the code at q=0 just to see the output.

I do have a question: did I set up the code correctly?

Nicholas_Bokulich · July 30, 2019, 11:51pm

Oops, you're right, --p-trunc-len-f and --p-trunc-len-r are required. What I meant is to disable fixed-length truncation by setting both of them to 0.

Yes, 70% is good; you could just proceed with that.
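
Putting that clarification together, a run that leaves 3' truncation entirely to --p-trunc-q might look like the sketch below. The function only prints the command so the parameters can be reviewed first. The output file names are hypothetical, and the trunc-q value of 2 is just the default mentioned earlier in the thread, not a recommendation.

```shell
# Print a denoise-paired command in which fixed-length truncation is
# disabled (trunc-len of 0) and 3' truncation is left to --p-trunc-q.
# ASSUMPTION: output file names are placeholders chosen for this example.
denoise_cmd() {
  echo qiime dada2 denoise-paired \
    --i-demultiplexed-seqs Bacteria-demux-trimmed.qza \
    --p-trunc-len-f 0 \
    --p-trunc-len-r 0 \
    --p-trunc-q "$1" \
    --o-table table-truncq.qza \
    --o-representative-sequences rep-seqs-truncq.qza \
    --o-denoising-stats stats-truncq.qza
}

denoise_cmd 2
```

Removing the leading `echo` inside the function would run the command for real instead of printing it.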
