After using various parameter combinations, how do I judge which Dada2 result is best?

Hello,

I hope I didn't overlook something, but I googled a lot and haven't found an answer to this question so far.

My QIIME 2 version is 2019.10.
My DADA2 commands look like this:
bsub -q verylong -n 4 -R 'rusage[mem=65000] span[hosts=1]' qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f ${values[0]} \
  --p-trim-left-r ${values[1]} \
  --p-trunc-len-f ${values[2]} \
  --p-trunc-len-r ${values[3]} \
  --p-max-ee-f ${values[4]} \
  --p-max-ee-r ${values[5]} \
  --p-n-threads 64 \
  --o-table "table_${params}.qza" \
  --o-representative-sequences "rep-seqs_${params}.qza" \
  --o-denoising-stats "stats_dada2_${params}.qza"

I have dual-barcoded paired-end data from the V3/V4 region from an Illumina MiSeq, imported via a manifest file. The demux quality plot shows 300 bases for the forward and reverse sequences. The mean quality value drops below 30 in the reverse reads at about position 280 or 290.

The appended screenshot shows a table with the results of multiple DADA2 runs. "params" encodes trim-left-f_trim-left-r_trunc-len-f_trunc-len-r_max-ee-f max-ee-r. The number of features and the total number are the values from the resulting DADA2 feature table (the first table in the visualization). "# under 10000" means how many samples are left after setting the sampling depth to 10,000. "Overall Length" is just the sum of bases left after trimming and truncation (calculated directly from the parameters). My assumption is that it is better to leave the forward and reverse sequences as long as possible, and that I get fewer reads after DADA2 when the sequences are left long, but that those reads are more correct.

The following table columns show information that I get back from the cluster job. I tried to find out with which parameter combinations my jobs run fastest and/or use the least memory. The last two columns therefore show the number of threads that I set in the QIIME 2 command and the number of cores I requested in the bsub command.
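(For clarity, "Overall Length" is computed directly from the four trim/trunc parameters; a minimal sketch in bash, using the 17_21_300_280 run from the table as an example:)

  # placeholders: the four trim/trunc parameters of one run (here 17_21_300_280)
  trim_left_f=17; trim_left_r=21; trunc_len_f=300; trunc_len_r=280
  # bases kept after trimming and truncation, summed over forward and reverse reads
  overall_length=$(( (trunc_len_f - trim_left_f) + (trunc_len_r - trim_left_r) ))
  echo "Overall Length: ${overall_length}"   # 283 + 259 = 542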

Most importantly, I want to know what my criteria should be for deciding which DADA2 result to use for the rest of the analysis. Do I take the one with the highest number of resulting features? The one where I can choose a sampling depth of 10,000 without losing too many samples?
Is the setting 0_0_300_300_22 or 0_0_300_300_88 better? 88 has fewer reads in total but more samples with enough reads to survive the 10,000 sampling depth. Should I choose one of those two because I didn't trim or truncate anything and the result is OK?

A second thing: I assume that I have to trim the sequences to remove the primers, so in one of the cases I chose trim-left values of 17 and 21, because the primers have these lengths. I expected to get more features out of this, but there are far fewer. But I cannot leave the primers in the sequences, can I?
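(For reference, primers can also be removed by sequence with q2-cutadapt before DADA2, instead of by fixed-length trim-left; a minimal sketch, where the primer sequences shown are placeholders and would need to be replaced with the actual 17 nt and 21 nt V3/V4 primers:)

  # sketch: trim primers by matching their sequences; substitute the real primers
  qiime cutadapt trim-paired \
    --i-demultiplexed-sequences demux.qza \
    --p-front-f CCTACGGGNGGCWGCAG \
    --p-front-r GACTACHVGGGTATCTAATCC \
    --o-trimmed-sequences demux-trimmed.qza
  # after this, --p-trim-left-f/r in denoise-paired could be left at 0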

Am I considering the wrong parameters? Is there a specific column in the denoising stats that I should check to compare the DADA2 outputs?
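(For reference, the per-step read counts in a denoising-stats artifact can be tabulated like this; the file name is just the output name from the command above:)

  qiime metadata tabulate \
    --m-input-file stats_dada2_${params}.qza \
    --o-visualization stats_dada2_${params}.qzv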

Thanks a lot!

Luisa

Hello Luisa,

Welcome to the forums! :qiime2:

I've got good news and bad news...

The good news: you are asking all the right questions! :+1:

I'm glad you are thinking about the tradeoff between read quality and sampling depth, and about sequencing artifacts that could sneak into your data.

The bad news: if you want :1st_place_medal: quality answers, you need positive controls.

How many positive controls did you include? What is their composition?

Colin

Hello Colin,

Thanks for the warm welcome and the quick reply to my post :slight_smile:.

As controls we use mock communities and water. The current dataset has one mock (with seven bacteria, which were only classified to level 5 with SILVA) and six water samples. The other samples are low-biomass samples.
Is a mock community what you mean by "positive control"? How many should we use?
For the next batch of data in this project we could use a Zymo mock community with about 21 strains.
For another project we have a synthetic DNA community of mouse strains.
As all the water samples have far fewer reads than the rest of the samples in the sequencing output, should I use the DADA2 result that represents this difference best? For example, in some cases the mock community has fewer than 10,000 reads, similar to the water samples. So can I assume that the trimming/truncation was not good when samples that should have far more reads than the water samples come close to their read numbers/feature counts?
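(For reference, per-sample read counts for each DADA2 table can be checked with a feature-table summary; a minimal sketch, with placeholder file names:)

  qiime feature-table summarize \
    --i-table table_${params}.qza \
    --o-visualization table_${params}.qzv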

When I look only at the trim/trunc conditions that allow the most samples to pass the 10,000-read threshold, I get the following table:


The seven samples with low read numbers are the six water samples plus one sample that also had few reads after sequencing. "depth low" means the first sample read number that is above 1,000 and "depth high" is the first sample read number that is above 10,000.
As all three parameter combinations seem to give good and quite similar results, I would take the longest one, so 17_21_300_280_22? But how can I get such similar results when I trim at 17_21 and at 70_70? And when I check the 50_50_230_230_22 combination together with those samples, I get the following:

So probably all my untrimmed versions give bad results because the primers are not excluded (?).
In the 50_50_230_230_22 combination, the water and mock samples are under 830 reads, while the one real sample that is very close to the water samples under other trim/trunc conditions has 7,079 reads here. So I would like this combination if the mock weren't so bad (186 reads). How can it happen that the mock community has high read numbers in one case and such a low read number in this one?

I hope I haven't described my analysis approach in too confusing a way...
Thanks a lot for your help!!

Luisa

Hello Luisa,

Perfect!

Yep, mock communities (or isolates) are two examples of positive controls. You can also use synthetic spike-ins like ERCC.

Even better! More controls, more options! Bring a truck load! :truck: :truck:

Now that I know what controls you have, we can use these as a benchmark to pick the method that works best, or understand the trade-off between settings.
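(One way to benchmark, once the mock sample has a taxonomy assigned, is the q2-quality-control plugin; a minimal sketch, where the expected and observed artifacts are relative-frequency tables for the mock sample that you would need to prepare, and the file names are placeholders:)

  qiime quality-control evaluate-composition \
    --i-expected-features mock-expected.qza \
    --i-observed-features mock-observed.qza \
    --o-visualization mock-eval.qzv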


Not necessarily... the mock community might be very different from the water samples.

I would start my analysis by focusing on the mock community, and choose settings that produce expected results. For example, since your mock was only classified to level 5 with SILVA, I might see if another database or classification method would give me deeper taxonomic resolution.
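(One option is to swap in a pre-trained classifier for another database; a minimal sketch, where the classifier file name is a placeholder for whichever classifier you choose:)

  qiime feature-classifier classify-sklearn \
    --i-classifier silva-132-99-nb-classifier.qza \
    --i-reads rep-seqs.qza \
    --o-classification taxonomy.qza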

When you start with the positive controls, you have a 'correct' answer you can aim at.

But if you start from the water samples...

No... maybe the water samples just sequenced well :woman_shrugging:


In your table, what are "depth low" and "depth high"? Are these the key metrics you are using to judge the success of sequencing your positive controls?

Once we have clearly defined success criteria for the positive controls, we can take aim :bow_and_arrow:

Colin
