Sequence length report after DADA2 denoise-paired

koch · April 24, 2018, 7:28pm

Hi,
I'm using DADA2 to quality-filter and merge paired reads. Is there a way to see the lengths of the merged sequences? I know we can easily see the representative sequences, but I want an idea of how long the rest of the reads are. I tried the qiime feature-table summarize command, but that only gives me the frequencies of features. Thanks in advance, and sorry if I missed this information somewhere.

antgonza · April 24, 2018, 9:01pm

Hi @koch,

Just to be sure, do you mean the size of the overlap or the final sequence?

Anyway, remember that when you join sequences with DADA2, the denoising happens first and the joining afterwards, so your truncation parameters are really important: they effectively determine the size of the overlap. Perhaps this thread will help you if this is your question.
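As a rough sketch of that arithmetic: the truncated forward and reverse reads together have to span the amplicon, and whatever they cover beyond its length is the overlap. The function name and the numbers below are illustrative assumptions, not part of q2-dada2.

```python
def expected_overlap(trunc_len_f: int, trunc_len_r: int, amplicon_len: int) -> int:
    """Approximate overlap (nt) between truncated forward and reverse reads.

    Assumes both reads reach their full truncation length and that
    amplicon_len is the distance covered from the first sequenced base of
    the forward read to the first sequenced base of the reverse read.
    """
    return trunc_len_f + trunc_len_r - amplicon_len


# Hypothetical example: a 253 nt amplicon, reads truncated to 200 and 180 nt.
overlap = expected_overlap(200, 180, 253)
print(overlap)  # 127
```

A negative or very small result means the reads cannot be merged with those truncation lengths. As far as I know, DADA2's mergePairs asks for at least 12 nt of overlap by default, so it is worth leaving some margin for natural length variation in the amplicon.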

Now, once the sequences are denoised and merged, they become your "representative sequences", so the final length of each sequence is the length of its representative sequence.
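If that holds, a length distribution for the whole pool of merged reads can be approximated by weighting each representative sequence's length by its total frequency in the feature table. Here is a minimal sketch, assuming you have exported the representative sequences to FASTA (for example with qiime tools export) and collected per-feature total frequencies into a dictionary; the helper names and toy data are hypothetical.

```python
from collections import Counter
from io import StringIO


def read_fasta_lengths(handle):
    """Return {sequence_id: length} for a FASTA file-like object."""
    lengths = {}
    seq_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the feature ID.
            seq_id = line[1:].split()[0]
            lengths[seq_id] = 0
        elif seq_id is not None:
            # Sequences may be wrapped over several lines.
            lengths[seq_id] += len(line)
    return lengths


def weighted_length_counts(lengths, frequencies):
    """Count reads per sequence length, weighting each feature by its frequency."""
    counts = Counter()
    for seq_id, length in lengths.items():
        counts[length] += frequencies.get(seq_id, 0)
    return counts


# Toy data standing in for exported representative sequences and
# per-feature total frequencies (both hypothetical).
fasta = StringIO(">featA\nACGTACGTAC\n>featB\nACGTACGT\nAC\n>featC\nACGTACGTACGT\n")
frequencies = {"featA": 500, "featB": 300, "featC": 20}

counts = weighted_length_counts(read_fasta_lengths(fasta), frequencies)
for length in sorted(counts):
    print(length, counts[length])
```

With real data you would open the exported FASTA file instead of the StringIO object and fill the frequencies dictionary from your feature table.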

Hope this helps.

koch · April 25, 2018, 3:49pm

Thanks @antgonza. I tested different truncation and maxEE settings, so that part is fine with me. My question is rather about the lengths of the sequences that were merged and passed quality trimming. For example, in the summary table (after running the dada2 denoise-paired command) we see X as the "total frequency", which reflects how many pairs were merged and passed quality trimming. Can we know the lengths of these X sequences? I guess an easy way to put it is: what are the lengths of these fastq/fasta sequences? I read on another thread that we can't export a fastq file after the dada2 denoise step (again, I know we can get the representative sequences easily). A report or summary of the total pool of sequences would be useful, too.

From your reply, it looks like the representative sequences are all I can get (in terms of both the lengths and the actual sequences). Let me know if I've misinterpreted it.

Thanks very much for your help and explanation!

antgonza · April 26, 2018, 1:02pm

@thermokarst suggested looking at feature-table tabulate-seqs for a summary of your file; however, this doesn't give you any stats about sequence lengths or DADA2 runs. This sounds like a good possible addition for the future. What about a summary of the input length of the fastq/fasta, the overlap between reads after denoising, and the final length? Anything else?

koch · April 26, 2018, 6:52pm

That sounds great! I can't think of anything else for the report right now. However, if it's possible, a fastq file containing the DADA2-processed sequences would be useful, too. I know this was already mentioned in this thread. I think the file would give people some flexibility. Thanks for your help and the suggestions.

antgonza · April 27, 2018, 2:11pm

@koch, just FYI, I just opened a new issue.

koch · April 27, 2018, 4:24pm

Thanks for letting me know. Looking forward to any update :).
