What is a normal # of features from dada2 output


I have a similar issue. Can I jump in here, please, or should it be a separate query? I have sequences from a MiSeq 2x300 bp V4 run. There are 12 human fecal samples and I got 2,567 features. I am using QIIME 2 2019.7. I have attached the paired-end demux summary here: Paired_end_demux.qzv (290.6 KB).
Then I did
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs $IMPORT/Paired_end_demux.qza \
  --p-trim-left-f 10 \
  --p-trim-left-r 10 \
  --p-trunc-len-f 293 \
  --p-trunc-len-r 293 \
  --o-representative-sequences $FILTER/rep-seqs-dada2.qza \
  --o-table $FILTER/table-dada2.qza \
  --o-denoising-stats $FILTER/stats-dada2.qza

I have attached the stats-dada2.qzv stats-dada2.qzv (1.2 MB) .

I am now trying a primer trim of 20 (which was advised in another post), truncating the forward reads at 280 and the reverse reads at 270, to check the difference.

So my questions are: isn't the number of features too high, i.e. is the DADA2 cutoff not stringent enough? Am I on the right track?

Also attached here is the taxonomy visualization, taxonomy_gg0.7.qzv (1.5 MB), if you have any comments please.

Thanks very much.

Hi @SetaPark,
I've moved your question to a new topic as it was different from the original thread.

I'm not sure what you mean by # of the features being too high. To see how many features your output has we would need the visualization summary of your rep-seqs file.
Which region are you targeting and what is the expected overlap size? If this was V3-V4 then I would say based on your quality score plots your truncating parameters are sensible.

Regarding your first run results, you do seem to have adequate reads by the end of DADA2 (enough to continue on anyway), though you do lose quite a bit through the initial filtering steps. This may be a good thing if there was a lot of junk and contaminants in your run.
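
To see where reads are lost, it can help to express each DADA2 stage as a percentage of the input. The following is only a minimal sketch: the read counts are made up for illustration, and the stage names are assumed to match the columns of the denoising stats table (input, filtered, denoised, merged, non-chimeric).

```python
# Stage names assumed to mirror the columns of a DADA2 denoising-stats table.
STAGES = ["input", "filtered", "denoised", "merged", "non-chimeric"]


def retention(counts):
    """Return the percentage of input reads surviving each stage."""
    total = counts["input"]
    if total == 0:
        return {stage: 0.0 for stage in STAGES}
    return {stage: round(100.0 * counts[stage] / total, 1) for stage in STAGES}


def biggest_loss(counts):
    """Return the stage at which the largest number of reads was lost."""
    losses = {
        later: counts[earlier] - counts[later]
        for earlier, later in zip(STAGES, STAGES[1:])
    }
    return max(losses, key=losses.get)


# Hypothetical counts for a single sample (not taken from the attached files).
example = {
    "input": 100000,
    "filtered": 60000,
    "denoised": 58000,
    "merged": 50000,
    "non-chimeric": 45000,
}

print(retention(example))
print(biggest_loss(example))
```

If the largest drop is at the filtering stage, as in this made-up sample, the truncation lengths are the first parameters worth revisiting.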

We have no information regarding your second run so can't really comment there whether that went well or not.

Looks like you have about 2.5k features, which may be pretty normal if you are looking at something like stool/gut tissues. But whether that is too high or too low really depends on your sample type and the treatments involved in your experiment.

Hello @Mehrbod_Estaki
Thanks for your reply.
I came a bit late into this study, so I am unaware of all the details. This targets 16S V4 only, the primers being 515F and 806R, amplicon 427 bp.

The second DADA2 run (trim left 20, right 20, truncate L280, R 280) yielded (stats-dada2-2) stats-dada2-2.qzv (1.2 MB)

Hello again

Also attached here are rep-seqs for both the DADA2 runs: rep-seqs-2.qzv (677.6 KB) and rep-seqs.qzv (600.0 KB)

Thanks for your help.

Sorry, rep-seqs2 for DADA2 run 1 is here: rep-seqs.qzv (600.0 KB)

Hi @SetaPark,
Thanks for the updates.
It looks like your second run is much better than the first, both in retaining more reads and # of unique features.
Sorry, I missed your original message saying this was the V4 region. The 515F and 806R primers actually give you an amplicon of about 290 bp, not 427 bp, so if for some reason yours are 427 bp, perhaps you do have the V3-V4 region? Worth double-checking.
Regardless, the second run is giving you better output because you discarded more junk tails of your reads and so retained more initial reads to denoise.
I would stick with the second run results for downstream analysis.

Hello @Mehrbod_Estaki

Thank you. I will use the second one for further analysis. I was told it's V4 for sure, and that the primers are 515F_Nextera and 806R_Nextera. Deciding the denoising limits and correctly analysing the stats.qzv is something I keep mulling over.
So I tried...
Version 3: L20, R20, truncF280, truncR270: stats-dada2-3.qzv (1.2 MB)

Version 4: L20, R20, truncF280, truncR255: stats-dada2-4.qzv (1.2 MB)

Version 5: L20, R20, truncF280, truncR220: stats-dada2-5.qzv (1.2 MB)

How does one decide which is the best?

Thanks again.

Hi @SetaPark,
If this is the V4 region then you have nearly complete overlap between your reads, which means you can be pretty strict with your trim/truncating parameters since there is no fear of reads failing to merge after.
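
As a rough check of how much overlap a given set of parameters leaves, the arithmetic can be sketched as below. This is a simplified model under stated assumptions: every amplicon has the same length (about 290 bp, the figure quoted earlier in this thread), truncation positions are counted from the start of the raw read, and 12 bases is taken as the minimum overlap needed to merge (the documented default of DADA2's `mergePairs`; whether it applies unchanged here is an assumption). The function names are hypothetical.

```python
MIN_OVERLAP = 12  # assumed merge threshold (default of DADA2's mergePairs)


def overlap(amplicon_len, trunc_f, trunc_r, trim_f=0, trim_r=0):
    """Estimate the number of amplicon positions covered by both reads.

    Positions are handled as half-open intervals along the amplicon.
    """
    # The forward read starts at the 5' end of the amplicon.
    fwd_start = trim_f
    fwd_end = min(trunc_f, amplicon_len)
    # The reverse read starts at the 3' end and runs back toward the 5' end.
    rev_start = max(amplicon_len - trunc_r, 0)
    rev_end = amplicon_len - trim_r
    return max(0, min(fwd_end, rev_end) - max(fwd_start, rev_start))


def can_merge(amplicon_len, trunc_f, trunc_r, trim_f=0, trim_r=0):
    """True if the estimated overlap reaches the assumed minimum."""
    return overlap(amplicon_len, trunc_f, trunc_r, trim_f, trim_r) >= MIN_OVERLAP


# Parameters of version 5 in this thread, with a ~290 bp amplicon.
print(overlap(290, 280, 220, trim_f=20, trim_r=20))
# The same parameters if the amplicon were really 427 bp.
print(overlap(427, 280, 220, trim_f=20, trim_r=20))
```

The second call illustrates why the amplicon length matters: with a longer amplicon the same truncation leaves far less room before merging starts to fail.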
In version 5 you retain more reads at the end of DADA2 (looking at the last column, non-chimeric), so I would consider this the best run so far. You may retain even more if you were to trim your forward reads more too. Truncating your reads earlier simply discards low-quality tails, which allows more reads to be retained/denoised. Your version 5 is plenty good, with your min reads/sample being over 37K! But if you really want to test the limits of trim/truncating power, you could try another run with the forward reads also truncated down to, say, 240-250. I suspect you will get a slightly better outcome. But as I said, this is by no means necessary; you have plenty of good reads to move forward with.
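
The comparison described above (look at the non-chimeric column and at the minimum reads per sample) can be sketched as follows. The run names, sample names and counts are all hypothetical, and ranking by the minimum first is just one reasonable choice, not an official criterion.

```python
def summarize(non_chimeric):
    """Summarize per-sample non-chimeric read counts for one run."""
    counts = list(non_chimeric.values())
    return {"min": min(counts), "total": sum(counts)}


def rank_runs(runs):
    """Order run names from best to worst.

    Runs are ranked by minimum reads per sample, then by total reads.
    """
    def key(name):
        summary = summarize(runs[name])
        return (summary["min"], summary["total"])

    return sorted(runs, key=key, reverse=True)


# Hypothetical non-chimeric counts per sample for two runs.
runs = {
    "run-A": {"s1": 30000, "s2": 41000, "s3": 35000},
    "run-B": {"s1": 37500, "s2": 45000, "s3": 40000},
}

for name in rank_runs(runs):
    print(name, summarize(runs[name]))
```

Ranking on the minimum rather than the mean keeps the worst sample in view, which matters if the table is later rarefied to an even depth.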

Hello again @Mehrbod_Estaki
I did further trimming, as you suggested, for the sake of completion.
So run 6 was L20, R20, truncF240, truncR220, with results: stats-dada2-6.qzv (1.2 MB)
and a final run 7 was L20, R20, truncF200, truncR200: stats-dada2-7.qzv (1.2 MB)
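
For anyone wanting to repeat this kind of sweep, the seven parameter sets reported in this thread can be kept in one table and turned into command lines. This is a sketch only: it builds the argument lists without running anything, the flags are the ones used in the first post, and the input path and output file names are placeholders rather than the actual files.

```python
# (trim-left-f, trim-left-r, trunc-len-f, trunc-len-r) as reported in the thread.
RUNS = {
    1: (10, 10, 293, 293),
    2: (20, 20, 280, 280),
    3: (20, 20, 280, 270),
    4: (20, 20, 280, 255),
    5: (20, 20, 280, 220),
    6: (20, 20, 240, 220),
    7: (20, 20, 200, 200),
}


def dada2_command(run_id, demux="Paired_end_demux.qza", outdir="."):
    """Build the denoise-paired argument list for one run (not executed)."""
    trim_f, trim_r, trunc_f, trunc_r = RUNS[run_id]
    # Output names are placeholders chosen for this sketch.
    return [
        "qiime", "dada2", "denoise-paired",
        "--i-demultiplexed-seqs", demux,
        "--p-trim-left-f", str(trim_f),
        "--p-trim-left-r", str(trim_r),
        "--p-trunc-len-f", str(trunc_f),
        "--p-trunc-len-r", str(trunc_r),
        "--o-representative-sequences", f"{outdir}/rep-seqs-dada2-{run_id}.qza",
        "--o-table", f"{outdir}/table-dada2-{run_id}.qza",
        "--o-denoising-stats", f"{outdir}/stats-dada2-{run_id}.qza",
    ]


for run_id in RUNS:
    print(" ".join(dada2_command(run_id)))
```

Each list could be passed to `subprocess.run` in an environment where QIIME 2 is installed; writing every run to its own output files keeps the stats comparable afterwards.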

I ended up using run 5 for further analysis :slight_smile:
Hope this helps other users who (like me) wonder what is the best trim criteria.

Regards, and thanks very much for your prompt help.

P.S. is there such a forum for Qiime 2-R users? Thanks again

Hi @SetaPark,
Thanks for reporting on the extra runs. I do think your run 7 is by far the best one for analysis, and I personally would always use the run with the most reads but it may not change the overall picture at all. It would be rather interesting if you compared your results between those 2 runs too, for sanity’s sake.
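
One simple way to do that sanity check is to compare, sample by sample, the relative-abundance profiles produced by the two runs. The sketch below computes a Bray-Curtis dissimilarity on relative abundances; it assumes both feature tables have first been collapsed to a shared taxonomic level (for example genus), and the counts shown are invented.

```python
def relative(counts):
    """Convert raw counts to relative abundances (empty if no reads)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: n / total for taxon, n in counts.items()}


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two relative-abundance profiles.

    0.0 means identical profiles; 1.0 means no taxa in common.
    """
    pa, pb = relative(a), relative(b)
    taxa = set(pa) | set(pb)
    shared = sum(min(pa.get(t, 0.0), pb.get(t, 0.0)) for t in taxa)
    return 1.0 - shared


# Hypothetical genus-level counts for the same sample in two runs.
run5 = {"Bacteroides": 500, "Faecalibacterium": 300, "Blautia": 200}
run7 = {"Bacteroides": 550, "Faecalibacterium": 300, "Blautia": 150}

print(round(bray_curtis(run5, run7), 3))
```

Values close to zero for every sample would suggest that the choice between the two runs does not change the overall picture.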
I am not aware of any Qiime2-R forum but you can always try your luck posting on the Other Bioinformatics tool channel here, there are enough R users here as well.
All the best!

Hi @Mehrbod_Estaki

Thanks very much. Will do for both.

Regards and best
