LosingSamples_dada2

Fatemah · May 22, 2019, 4:11am

Hello,

I am following the Moving Pictures tutorial for a set of data. I used:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 20 \
  --p-trim-left-r 20 \
  --p-trunc-len-f 219 \
  --p-trunc-len-r 200 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza

For both --p-trim-left-f and --p-trim-left-r I also tried 0 and 10 (the read quality is high in the forward reads). For --p-trunc-len-f I also tried 250, 248 and other integers, and for --p-trunc-len-r I also tried 250, 215 and other integers.

In all these cases one sample is excluded and a count of 0 is assigned to it (see attached). I do not want to lose any samples from the analysis.

1) May I know what the problem is?
2) In the attached file, what do filtered, denoised, merged and non-chimeric refer to? Which one will be used, or left at the end, for the analysis?
3) In the taxa bar plot there are too many k__Bacteria;__;__;__;__;__;__ entries in pink. What is the reason, and how should I fix this problem?

Thank you for your support

jwdebelius · May 22, 2019, 8:20am

Hi @Fatemah,

I'm going to answer your questions slightly out of order, if that's okay, to help work through things.

DADA2 goes through a series of steps to get your output table. First, it loads the sequences (input), then it does some quality filtering (filtered), followed by denoising (denoised), sequence merging (merged) and chimera removal (non-chimeric). The table is read left to right, meaning that the right-hand column holds the final sequence counts, which are what go into your analysis.
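To see where a sample drops out, you can read each row of the stats table left to right and note the first step where the count hits 0. Here is a minimal sketch, assuming the stats have been exported to a tab-separated file with the columns named above; the numbers and sample IDs are made up for illustration, and your real table may carry extra columns.

```shell
#!/bin/sh
# Hypothetical stats table (made-up numbers, tab-separated).
# Columns follow the steps described above: input, filtered, denoised, merged, non-chimeric.
make_example_stats() {
    printf 'sample-id\tinput\tfiltered\tdenoised\tmerged\tnon-chimeric\n'
    printf 'S1\t10000\t8000\t7900\t7000\t6500\n'
    printf 'S2\t5000\t40\t0\t0\t0\n'
}

# For every sample, print the first step whose read count is 0,
# or "ok" if reads survive through to the right-hand column.
first_zero_step() {
    awk -F '\t' '
        NR == 1 { for (i = 2; i <= NF; i++) step[i] = $i; next }  # remember step names
        /^#/    { next }                                          # skip comment/type lines if present
        {
            lost = "ok"
            for (i = 2; i <= NF; i++) if ($i + 0 == 0) { lost = step[i]; break }
            print $1 "\t" lost
        }
    '
}

make_example_stats | first_zero_step
```

With the made-up rows above, S1 is reported as ok and S2 as lost at the denoised step.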

As for why you're losing a sample, there are two answers. First, depending on how many samples you have, it's not necessarily a problem to lose samples with low sequencing depth or low quality. Despite the fact that it costs you power, it's often better to exclude a sample that is explicitly low depth or low quality (my rule of thumb is less than 1000 seqs/sample in my high biomass environments; people who work in other systems have their own thresholds). So, losing samples isn't necessarily a problem in and of itself, because it happens. A lot. (If you can get away with it, I'd recommend collecting 5-10% more samples than you think you'll need for your analysis, or running replicates if you have a limited number and need all of them to amplify.)
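A depth threshold like this can be checked against the same stats table. This is a rough sketch under two assumptions: the table is tab-separated, and its last column holds the final (non-chimeric) count. The example rows are invented, and the 1000 cutoff is just my rule of thumb from above, so adjust both for your data.

```shell
#!/bin/sh
# Hypothetical stats table (made-up numbers, tab-separated).
example_stats() {
    printf 'sample-id\tinput\tfiltered\tdenoised\tmerged\tnon-chimeric\n'
    printf 'S1\t10000\t8000\t7900\t7000\t6500\n'
    printf 'S2\t5000\t40\t0\t0\t0\n'
    printf 'S3\t3000\t2500\t2400\t1200\t950\n'
}

# usage: low_depth_samples THRESHOLD < stats.tsv
# Prints the ID and final count of every sample whose last column is below THRESHOLD.
low_depth_samples() {
    awk -F '\t' -v min="$1" '
        NR == 1 || /^#/ { next }               # skip the header and any comment/type lines
        $NF + 0 < min   { print $1 "\t" $NF }  # assumption: last column is the final count
    '
}

example_stats | low_depth_samples 1000
```

Here S2 and S3 would be flagged, while S1 clears the threshold.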

However, I think the specific issue in your data is that your reads are failing denoising. And my guess is that it comes from the first 20 or so base pairs and the last 25-50 of the reverse read. I'd try truncating your reverse reads at a shorter length to see if that solves your denoising problem.
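One way to try several shorter reverse truncation lengths without retyping the command each time is to generate the commands in a loop. The sketch below only prints them (a dry run). The lengths 190, 180 and 170 and the per-run output file names are hypothetical; the other values are taken from your post.

```shell
#!/bin/sh
# Print (do not run) one denoise-paired command per candidate reverse truncation length.
# The lengths passed in are hypothetical; choose them from your own quality plots.
print_denoise_commands() {
    for len in "$@"; do
        echo "qiime dada2 denoise-paired" \
             "--i-demultiplexed-seqs demux.qza" \
             "--p-trim-left-f 20" \
             "--p-trim-left-r 20" \
             "--p-trunc-len-f 219" \
             "--p-trunc-len-r $len" \
             "--o-representative-sequences rep-seqs-r$len.qza" \
             "--o-table table-r$len.qza" \
             "--o-denoising-stats stats-r$len.qza"
    done
}

print_denoise_commands 190 180 170
```

To actually run them, pipe the output into sh in a shell where QIIME 2 is active, then compare the stats from each run. Keep in mind that the forward and reverse reads still need to overlap after truncation, otherwise they will be lost at the merging step instead.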

As for the unclassified k__Bacteria entries in your taxa bar plot, this may depend on your classifier, region, and environment. If you used the Greengenes classifier, did you use the V4 515F-806R primers or a different primer set? Are you working in an environment that should be well classified by the database? (For instance, you might have more issues with extremophiles and should maybe look at SILVA for those.)

Best,
Justine
