Is my Dada2 output normal?

Hi all,

I'm very new to Qiime2 and want to ask some basic questions to gain a bit more confidence about what I'm doing (hopefully).

Context is: I'm working with semen samples, and sent a very small initial sample of 5 (including a mock community) to Novogene to see if I could get any meaningful data at all. DNA extraction process has been tricky, hence the small initial look-see.

First re: importing data in Qiime2. Novogene sent me the raw fastq files and a set of files with barcodes and primers trimmed, so I have tried to import these trimmed ones into Qiime2. In terms of the importing step, these are definitely Casava 1.8 paired end reads (I can tell from the fields in the fastq file), but they have been renamed by Novogene and so I can't use the
--input-format CasavaOneEightSingleLanePerSampleDirFmt \
command option.

Instead I used this option with a manifest file:
--input-format PairedEndFastqManifestPhred33V2
Does that sound right? I asked Novogene to clarify which version of the Illumina software they are using, to inform my choice of Phred offset, but I don't think they understood my question (or I didn't frame it explicitly enough). They told me this: "for basecalling we use RTX3, and demultiplexing was bcl2fastq".
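The Phred offset can also be checked from the files themselves. The sketch below is a heuristic, not an official QIIME 2 tool: it reads the quality lines of a FASTQ file and looks at the lowest and highest ASCII characters. The function name and thresholds are illustrative assumptions. Illumina software from CASAVA 1.8 onward, including bcl2fastq, writes Phred+33, so a Phred33 manifest format is the expected choice here.

```python
import gzip


def guess_phred_offset(fastq_path, max_records=10000):
    """Guess the Phred offset (33 or 64) from quality characters.

    Returns None when the characters seen are compatible with both encodings.
    Assumes a standard 4-line-per-record FASTQ file, optionally gzipped.
    """
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    lowest, highest = None, None
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number >= 4 * max_records:
                break
            if line_number % 4 != 3:  # only the 4th line of each record holds qualities
                continue
            codes = [ord(char) for char in line.rstrip("\r\n")]
            if not codes:
                continue
            lowest = min(codes) if lowest is None else min(lowest, min(codes))
            highest = max(codes) if highest is None else max(highest, max(codes))
    if lowest is None:
        raise ValueError("no quality lines found")
    if lowest < 59:
        # Characters below ';' (ASCII 59) cannot occur in a +64 encoding.
        return 33
    if highest > 74:
        # Characters above 'J' (ASCII 74) would mean Q > 41 in a +33 encoding.
        return 64
    return None
```

A result of 33 is strong evidence; a result of None only means the sampled reads were uniformly high quality and the file should be inspected further.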

In terms of using dada2, the forward and reverse reads are 220 bases long. Read quality appears to be good across the entire length (remember only 5 samples, so not much data to draw from here), so I didn't want to truncate them and specified 220 for both --p-trunc-len-f and --p-trunc-len-r.

In my denoising stats visualization file, I seem to get really low percentages of non-chimeric reads. For example, in one of my samples I began with 80463 reads, which dropped to 60334 post filtering, 59689 denoised, 2207 (!) merged, and 2177 (!) non-chimeric. So that's 2.71% non-chimeric overall.
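To see which step accounts for the loss, each stage can be expressed both as a fraction of the input and as a fraction of the previous stage. A minimal sketch using the counts quoted above (the function name is an illustrative choice, not part of QIIME 2):

```python
def retention(counts):
    """Return (stage, reads, % of input, % of previous stage) per stage.

    `counts` maps stage names to read counts, in pipeline order.
    """
    stages = list(counts.items())
    total = stages[0][1]
    previous = total
    rows = []
    for name, reads in stages:
        rows.append((name, reads, 100 * reads / total, 100 * reads / previous))
        previous = reads
    return rows


# Counts for the single sample described in the post.
counts = {
    "input": 80463,
    "filtered": 60334,
    "denoised": 59689,
    "merged": 2207,
    "non-chimeric": 2177,
}

for name, reads, pct_input, pct_previous in retention(counts):
    print(f"{name:>12}: {reads:>6}  {pct_input:6.2f}% of input  "
          f"{pct_previous:6.2f}% of previous stage")
```

Read this way, only about 3.7% of denoised reads survive merging, while over 98% of merged reads pass the chimera check, so the low final percentage reflects a merging problem, not a chimera problem.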

Does that sound normal, or have I made a mistake somewhere? The highest value I have from my 5 samples is 20.47%. Also, the report Novogene sent to me does something very different, and they seem to have much higher values of non-chimeric reads. It isn't very clear how they have used Dada2 - they do state that they've used dada2 for denoising, but their initial QC and chimera removal is done with FLASH and Vsearch. All of that is beyond me.

Thanks for reading. I hope it makes sense, and happy to provide further details.

Hello and welcome to the forum!

That makes sense; it looks like there are no issues with the import step.
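For reference, a PairedEndFastqManifestPhred33V2 manifest is a tab-separated file with the columns sample-id, forward-absolute-filepath and reverse-absolute-filepath. The sketch below builds one from a directory of renamed files. The suffixes `_1.fq.gz` and `_2.fq.gz` are assumptions for illustration only and should be changed to match the actual file names.

```python
import csv
from pathlib import Path


def write_manifest(fastq_dir, manifest_path,
                   fwd_suffix="_1.fq.gz", rev_suffix="_2.fq.gz"):
    """Write a paired-end V2 manifest for every forward/reverse pair found.

    The suffix defaults are hypothetical; the sample ID is taken to be the
    file name with the forward suffix removed.
    """
    fastq_dir = Path(fastq_dir).resolve()  # the manifest needs absolute paths
    rows = []
    for fwd in sorted(fastq_dir.glob(f"*{fwd_suffix}")):
        sample_id = fwd.name[: -len(fwd_suffix)]
        rev = fastq_dir / f"{sample_id}{rev_suffix}"
        if not rev.exists():
            raise FileNotFoundError(f"no reverse reads for {sample_id}: {rev}")
        rows.append((sample_id, str(fwd), str(rev)))
    with open(manifest_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["sample-id",
                         "forward-absolute-filepath",
                         "reverse-absolute-filepath"])
        writer.writerows(rows)
    return rows
```

The resulting file is what gets passed as --input-path together with --input-format PairedEndFastqManifestPhred33V2.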

To disable truncation, set --p-trunc-len-f and --p-trunc-len-r to 0 rather than to the full read length.

It looks like most of the reads passed the filters but were lost at the merging step. Which region was targeted? I guess it was V3-V4: its amplicons are quite long, and 220 bases for each read may not give enough overlap. Could you try adding the --p-min-overlap flag to see whether lowering the required overlap improves the merging output?
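The arithmetic behind this can be sketched as follows. The overlap between two reads is the sum of their lengths minus the amplicon length, and DADA2 requires at least 12 overlapping bases by default. The amplicon lengths below are hypothetical values chosen for illustration, not measurements from this dataset, and the sketch ignores mismatches within the overlap.

```python
def overlap(len_f, len_r, amplicon_len):
    """Number of bases shared by the forward and reverse reads."""
    return len_f + len_r - amplicon_len


def can_merge(len_f, len_r, amplicon_len, min_overlap=12):
    """True if the reads overlap by at least `min_overlap` bases."""
    return overlap(len_f, len_r, amplicon_len) >= min_overlap


# Hypothetical primer-trimmed V3-V4 amplicon lengths (assumption).
for amplicon_len in (400, 425, 430, 440):
    shared = overlap(220, 220, amplicon_len)
    print(f"amplicon {amplicon_len} bp: overlap {shared:>3} bp, "
          f"merges at 12: {can_merge(220, 220, amplicon_len)}, "
          f"merges at 4: {can_merge(220, 220, amplicon_len, min_overlap=4)}")
```

With 2x220 reads, an amplicon only a few bases longer than about 428 bp falls below the default threshold, which is why lowering the minimum overlap or using longer reads can change the merged fraction so sharply.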

Since it is a test run, you may consider sequencing a shorter region (V4, V1-V2) or sequencing with 2x300.

Thanks very much for your reply.
I will try all of these suggestions.

It was V3-4 that was targeted. I chose this because in the handful of studies on semen published so far, that is the usual region that is chosen, although that's no reason not to try something different.
I'll have a read up on the --p-min-overlap flag and play with this a little bit to see what happens.

I've had a look back at the report that Novogene sent me, and they have definitely merged reads with FLASH and VSearch prior to denoising with Dada2. Not sure what the rationale is for doing things this way but perhaps they get better merging. They have supplied me with some fastq files called "Clean Reads" which I think are post FLASH and VSearch, so I always have the option of importing these instead. I guess I just want to understand everything about the data I have, and what's been done with it (and why it was done that way).

Thanks again, I will report back.

So for a "big" study it would be better to sequence the V3-V4 region with 2x300, or to test a shorter region. V3-V4 varies in length across taxa, so even with 60-70% of reads merged one cannot be sure the data are unbiased: some reads may fail to merge because they derive from taxa with a longer V3-V4 region.

You could import the "Clean Reads" files, but I would prefer to run the pipeline in QIIME 2 from the raw data. For example, if they just concatenated the reads rather than merging them, it may affect the taxonomy annotation step.

If merging is still not working, another option would be to use only the forward or only the reverse reads throughout the whole pipeline.

Hi again,

thanks so much for your suggestions again.
I changed the --p-min-overlap to 4 and my results are now completely different. Getting 72% of reads now post filtering and merging on average (lowest 62% and highest 75%).
I take your point about the potential for bias using V3-4 and will put some more thought into an alternative smaller region for the larger study. Seems like the potential for bias is everywhere, so I'd like to reduce it whenever I can.

Can I also ask, as an aside, when training the feature classifiers, if there are specific parameters I should watch out for? For example, I'm thinking that the max length should be around 500.

Thanks once again for assisting a complete novice.

Glad that it worked!
For classifier training, I can recommend this wonderful tutorial and the RESCRIPt plugin for QIIME 2. It covers sequence length and also gives an example of how to train a targeted classifier with your primers.
