Retaining reverse reads during denoising, and quality summary vs. denoising output of datasets

Hello all!

We have started working on a new project using QIIME2. The details of this project are summarised below.

Aim: Processing and analysis of multiple 16S rRNA amplicon metagenomic datasets using the QIIME2 pipeline.
Input datasets: Demultiplexed single-end and paired-end human metagenomic datasets in FASTQ format, sequenced on Roche pyrosequencing and Illumina platforms.
Processing and analysis method: QIIME2 pipeline (v. 2023.2), with the denoising step carried out using the DADA2 plugin.
QIIME2 parameters: All default parameters except for trimming and truncation values for reads during DADA2 denoising.
Desired output: Counts table of ASVs of each dataset - generated from good quality reads with minimal data loss.
Downstream analysis: Taxonomy mapping of identified ASVs and diversity analysis.

Since this is our first time working with multiple datasets in QIIME2, we would like some insight on how to go about this process. The hypothesis of the project depends strongly on the final counts table giving the number and accurate mapping of bacteria from the datasets, so any reads lost during the preprocessing and denoising steps would directly affect the inference of the results and may bias the outcomes. We would like minimal read loss across our datasets and want to know the best way to achieve this.
The issues we have been facing include:

  1. Poor quality reverse reads in paired-end data
  2. Poor merging output despite having above-average forward and reverse read quality in paired-end data
  3. A significant number of reads rejected as chimeras despite a good merging output

The questions we have are:

  1. Why are reads not merging despite having bases of high quality and a good overlap region?
  2. Should the default parameters of DADA2 be tweaked based on each dataset (which will lead to more variability)? Or can any other external tool be used for merging?
  3. Is it acceptable to proceed with forward reads alone when faced with low-quality reverse reads, especially when processing a large number of datasets? Many forum answers suggest going ahead with forward reads alone when working with a single dataset, but is that acceptable for a project of larger scale? Some paired-end datasets give high-quality output after denoising whereas others do not, and we are unable to predict this outcome from the demux summary alone; a high-quality summary does not always lead to good output (we have tried different trim/truncation/overlap values).
  4. We understand that the denoising output ultimately depends on the inherent data quality but we are unable to reject datasets based on the quality summary alone since the denoising output is sometimes superior for average-quality datasets and poor for some high-quality datasets. What is a good way to determine which datasets to work with?

Any help/suggestions would be highly appreciated. Thanks in advance!

Hello @B_603,

Welcome to the forum!

  1. Why are reads not merging despite having bases of high quality and a good overlap region?

To begin helping you with this, we would need to see the relevant artifacts (demux summary, DADA2 denoising stats).
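For reference, those artifacts come from commands along these lines; the input file names here are hypothetical placeholders, and the commands are only echoed in this sketch rather than executed:

```shell
# Sketch: producing the visual summaries we'd want to see.
# "demux.qza" and "stats.qza" are placeholder file names; drop the
# echo to actually run the commands against your own artifacts.
summarize_cmd="qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux.qzv"
tabulate_cmd="qiime metadata tabulate \
  --m-input-file stats.qza \
  --o-visualization stats.qzv"
echo "$summarize_cmd"
echo "$tabulate_cmd"
```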

  2. Should the default parameters of DADA2 be tweaked based on each dataset (which will lead to more variability)? Or can any other external tool be used for merging?

Yes, they should. Adjusting the DADA2 parameters to account for the peculiarities of each specific run, rather than applying one blanket set of parameters across different datasets, will introduce less variability into your downstream data, not more.
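One way to keep per-run tuning manageable is to hold each run's truncation values in a small table and loop over it. The run IDs, file names, and truncation values below are hypothetical, and the sketch only echoes each command instead of running it:

```shell
# Hypothetical per-run truncation table: run ID, trunc-len-f, trunc-len-r.
# Each run's values would be chosen from its own demux quality plot;
# drop the echo (or eval "$cmd") to actually run DADA2.
while read -r run trunc_f trunc_r; do
  cmd="qiime dada2 denoise-paired \
    --i-demultiplexed-seqs ${run}-demux.qza \
    --p-trunc-len-f ${trunc_f} \
    --p-trunc-len-r ${trunc_r} \
    --o-table ${run}-table.qza \
    --o-representative-sequences ${run}-rep-seqs.qza \
    --o-denoising-stats ${run}-stats.qza"
  echo "$cmd"
done <<'EOF'
runA 240 200
runB 230 180
EOF
```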

  3. Is it acceptable to proceed with forward reads alone when faced with low-quality reverse reads, especially when processing a large number of datasets?

Yes, but this is usually done after all options for retaining the reverse reads have been exhausted.
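If it does come to that, the forward-only fallback is a denoise-single run; to our knowledge denoise-single will also accept a paired-end artifact and use only the forward reads. The file name and truncation value here are hypothetical, and the command is only echoed in this sketch:

```shell
# Sketch: denoising forward reads only after giving up on the reverse
# reads. "run-demux.qza" and --p-trunc-len 240 are placeholders; drop
# the echo to run the command for real.
cmd="qiime dada2 denoise-single \
  --i-demultiplexed-seqs run-demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 240 \
  --o-table run-table.qza \
  --o-representative-sequences run-rep-seqs.qza \
  --o-denoising-stats run-stats.qza"
echo "$cmd"
```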

Regarding the questions about comparing results across datasets: differences in library preparation, sequencing technology, and analysis tools definitely need to be taken into account, but they do not necessarily invalidate all comparisons.

Regarding the apparent lack of correspondence between demux quality and denoising quality, there are too many variables at play to give a useful general answer. Among other things, you need to make sure that all the necessary quality control has been performed, that the amplicons and library prep were designed to allow reads to merge, and that you are configuring DADA2 properly. You mentioned an issue with chimeric sequences, for example; this is sometimes caused by primer sequences remaining in the reads.
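On the merging point specifically: DADA2 needs roughly 12 nt of overlap remaining after truncation, so a quick arithmetic check against the expected amplicon length can rule out impossible truncation combinations before you ever run the pipeline. The amplicon and truncation lengths below are hypothetical examples:

```python
def overlap_after_trunc(amplicon_len, trunc_len_f, trunc_len_r):
    """Bases of forward/reverse overlap remaining after truncation."""
    return trunc_len_f + trunc_len_r - amplicon_len

# A ~460 nt V3-V4 amplicon from a 2x300 run, truncated to 240/200:
print(overlap_after_trunc(460, 240, 200))  # -20: reads can never merge
# Looser truncation restores a workable overlap:
print(overlap_after_trunc(460, 280, 220))  # 40: comfortably above ~12 nt
```

A negative or near-zero value here means no choice of other parameters will rescue merging; the truncation lengths themselves have to change.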

Many of these questions are too general for us to be able to give much insight, sorry. You will probably get more useful answers to more specific questions or examples of problems.
