Hello all!
We have started working on a new project using QIIME2. The details of this project are summarised below.
Aim: Processing and analysis of multiple 16S rRNA amplicon metagenomic datasets using the QIIME2 pipeline.
Input datasets: Demultiplexed single-end and paired-end human metagenomic datasets in FASTQ format, sequenced on Roche pyrosequencing and Illumina platforms.
Processing and analysis method: QIIME2 pipeline (v. 2023.2), with the denoising step carried out using the DADA2 plugin.
QIIME2 parameters: All default parameters except for trimming and truncation values for reads during DADA2 denoising.
Desired output: An ASV count table for each dataset, generated from good-quality reads with minimal data loss.
Downstream analysis: Taxonomy mapping of identified ASVs and diversity analysis.
Since this is our first time working with multiple datasets in QIIME2, we would appreciate some insights on how to go about this. The project's hypothesis depends strongly on the final count table providing accurate counts and taxonomic assignments of the bacteria in each dataset, so any reads lost during preprocessing and denoising would directly affect the inferences and could bias the outcome. We would like to keep read loss minimal across our datasets and want to know the best way to achieve this.
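To see where reads are actually being lost, it can help to tabulate the per-sample DADA2 stats rather than eyeballing the `.qzv`. A minimal sketch, assuming the stats have been exported to a TSV with `qiime tools export --input-path denoising-stats.qza` (the column names below follow the usual q2-dada2 stats layout, but check them against your own export):

```python
# Sketch: quantify read loss at each DADA2 stage from an exported stats TSV.
# Column names ("input", "filtered", "denoised", "merged", "non-chimeric")
# are assumed from the typical q2-dada2 denoising-stats layout.
import csv

STAGES = ["filtered", "denoised", "merged", "non-chimeric"]

def read_loss_per_stage(rows):
    """Return {stage: fraction of input reads surviving}, summed over samples."""
    totals = {"input": 0, **{s: 0 for s in STAGES}}
    for row in rows:
        totals["input"] += int(row["input"])
        for s in STAGES:
            totals[s] += int(row[s])
    return {s: totals[s] / totals["input"] for s in STAGES}

def load_stats(path):
    """Parse a QIIME2-exported stats.tsv, skipping the '#q2:types' row."""
    with open(path) as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        return [r for r in reader if not r["sample-id"].startswith("#")]

# Made-up example counts: 80% of reads survive filtering, but retention
# collapses to 50% at merging, which points at the overlap/truncation
# settings rather than raw read quality.
example = [
    {"input": 1000, "filtered": 800, "denoised": 780,
     "merged": 500, "non-chimeric": 470},
]
print(read_loss_per_stage(example))
# → {'filtered': 0.8, 'denoised': 0.78, 'merged': 0.5, 'non-chimeric': 0.47}
```

Comparing these fractions across datasets makes the loss pattern (filtering vs. merging vs. chimera removal) explicit, which narrows down which parameters to revisit.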
The issues we have been facing include:
- Poor-quality reverse reads in paired-end data
- Poor merging output despite above-average forward and reverse read quality in paired-end data
- A significant number of reads rejected as chimeras despite a good merging output
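One common cause of the second issue is arithmetic rather than quality: DADA2's merging step requires a minimum overlap (12 nt by default), so if truncation leaves the forward and reverse reads too short to span the amplicon plus that overlap, merging fails even for perfect bases. A quick sanity check, where the ~460 bp amplicon length is an assumption (e.g. a V3-V4 16S region) to be replaced with your own primer set's expected length:

```python
# Overlap arithmetic behind paired-end merging failures.
MIN_OVERLAP = 12  # DADA2's default minimum overlap for merging

def expected_overlap(trunc_len_f, trunc_len_r, amplicon_len):
    """Bases of overlap left after truncation (negative means a gap)."""
    return trunc_len_f + trunc_len_r - amplicon_len

def can_merge(trunc_len_f, trunc_len_r, amplicon_len, min_overlap=MIN_OVERLAP):
    return expected_overlap(trunc_len_f, trunc_len_r, amplicon_len) >= min_overlap

# 240 + 200 = 440 < 460: the reads cannot even span the amplicon.
print(can_merge(240, 200, 460))  # → False
# 270 + 220 = 490 leaves 30 nt of overlap: enough to merge.
print(can_merge(270, 220, 460))  # → True
```

Running this check before each DADA2 run keeps truncation choices consistent with the amplicon length instead of tuning trunc values blindly per dataset.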
The questions we have are:
- Why do reads fail to merge despite high-quality bases and a good overlap region?
- Should DADA2's default parameters be tuned per dataset (which would introduce variability across datasets), or can an external tool be used for merging instead?
- Is it acceptable to proceed with forward reads alone when reverse reads are low quality, especially when processing a large number of datasets? Many forum answers suggest forward reads alone when working with a single dataset, but is that acceptable for a larger-scale project? Some paired-end datasets give high-quality output after denoising while others do not, and we cannot predict this from the demux summary alone: a high-quality summary does not always produce good denoising output (we have tried different trim/truncation/overlap values).
- We understand that the denoising output ultimately depends on the inherent data quality, but we cannot reject datasets on the basis of the quality summary alone, since the denoising output is sometimes better for average-quality datasets and poor for some high-quality ones. What is a good way to decide which datasets to keep?
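Since the demux summary alone does not predict the outcome, one option is to decide per dataset *after* denoising, based on measured retention. The rule and thresholds below (70% / 50% of input reads surviving to the final non-chimeric table) are purely illustrative assumptions, not QIIME2 defaults:

```python
# Hypothetical triage rule: keep the paired-end result, fall back to
# forward reads only, or flag the dataset for manual review, based on
# the fraction of input reads retained in the final feature table.
# The 0.70 / 0.50 cutoffs are illustrative assumptions.

def triage(retained_paired, retained_forward_only,
           keep_at=0.70, fallback_at=0.50):
    """retained_* are fractions of input reads surviving denoising."""
    if retained_paired >= keep_at:
        return "keep paired-end result"
    if retained_forward_only >= fallback_at:
        return "use forward reads only"
    return "flag for manual review"

datasets = {
    "studyA": (0.82, 0.90),  # paired-end retention is fine
    "studyB": (0.35, 0.78),  # poor reverse reads: forward-only salvages it
    "studyC": (0.30, 0.40),  # low either way: inspect before including
}
for name, (paired, fwd_only) in datasets.items():
    print(name, "->", triage(paired, fwd_only))
```

Applying one explicit, pre-registered rule like this across all datasets keeps the mixed paired-end/forward-only decisions reproducible, rather than ad hoc per dataset.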
Any help/suggestions would be highly appreciated. Thanks in advance!