dada2 denoise issue

Hi everyone,

I’m running a bioinformatics pipeline with QIIME2 on 16S and ITS data. I’ve hit an issue when trying to run the DADA2 denoising step for my 16S data. The script starts successfully but fails during the execution of DADA2 with an error.

What I’ve done so far:

  1. Data Preparation: I’ve uploaded raw FASTQ files, run trimming, and prepared the data for DADA2.
  2. Pipeline Setup: The pipeline is designed to dynamically load the metadata and process paired-end FASTQ files for 16S and ITS.
  3. DADA2 Execution: I’m calling DADA2 using QIIME2 with truncation at 250 bp.

The log shows that the average read length for my 16S data is 283.0 bp, and it proceeds to DADA2 with truncation. However, I get the following error when DADA2 starts:

ERROR conda.cli.main_run:execute(125): `conda run python /home/XXXX/scripts/dada2_denoise.py` failed. (See above for error)

Things I’ve already checked:

  • QIIME2 Environment: QIIME2 is properly installed in the qiime2-amplicon-2024.10 environment.
  • Input Files: The input QZA files (e.g., demux-paired-16S.qza) exist and are correctly placed in the specified directory.
  • Output Paths: All output paths are valid, and I’ve confirmed the necessary directories exist.

Questions:

  • Has anyone faced a similar issue with DADA2 failing to run properly, even with valid input files and paths?
  • Are there any specific QIIME2 or DADA2 configuration settings I may have missed that could cause this error?
  • How can I capture more detailed error messages to diagnose the issue further?

I appreciate any help or suggestions. Thanks in advance!


Hello Colin,

Welcome to the forums! :qiime2:

Thank you for including as much detail as possible. You are right, we are going to need to find that log file to learn more about the error.

Can you ask the folks who built this pipeline to help you find it? Or you can post a link to the code here and we can take a look.

Yes! Once we have the log file, we can search the forums for that error!
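In the meantime, one way to capture the full error is to save stderr from the qiime call to a log file. A minimal sketch, assuming your pipeline invokes QIIME 2 via subprocess (the function name and log path are just placeholders):

```python
import subprocess

def run_and_log(cmd, log_path):
    """Run a command and save its combined stdout/stderr to a log file.

    Returns the exit code so the caller can decide how to proceed.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    with open(log_path, "w") as fh:
        fh.write(result.stdout)
        fh.write(result.stderr)
    return result.returncode

# e.g. run_and_log(["qiime", "dada2", "denoise-paired", "--verbose", <your options>],
#                  "dada2_denoise.log")
```

Adding `--verbose` to the qiime command also makes the plugin print the underlying DADA2/R log, which is usually where the real error message is.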

Hi, thanks for the reply! I’m the one working on building the pipeline.

I’ve just realized that the raw FASTQ file I’m using is an interleaved FASTQ file, meaning it combines both R1 and R2 into a single file. Previously, I tried splitting the data into R1 and R2, which is why the DADA2 denoise step wasn’t working properly due to low read quality. :sweat_smile:

At this point, I’ve been trying to use single-end demux, and I’m wondering how to handle this with DADA2. Should I split the interleaved data into R1 and R2 pairs (which would give paired-end demux) before denoising with DADA2? Or should I just run the interleaved demux directly through DADA2? The issue with interleaved is that it takes quite a bit of time to generate the denoised files.

I’m not sure how to proceed and would appreciate any advice on the best practice for handling this.


Yes, this is what I would do! While the split may take some extra time and disk space, separate R1/R2 files are a well supported format within the Qiime2 ecosystem. (Some programs have great support for interleaved data, others do not!)

I am not sure the Qiime2 plugin for dada2 supports interleaved input, so the plugin may process these as single-end reads, causing big problems!
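If you do go the splitting route, here is a minimal deinterleaving sketch. It assumes strict 4-line FASTQ records with R1 first in each pair (an assumption about your file); dedicated tools like seqtk or BBMap's reformat.sh handle this too:

```python
import gzip
from itertools import islice

def deinterleave(interleaved_path, r1_path, r2_path):
    """Split an interleaved FASTQ (R1/R2 records alternating) into two files.

    Assumes strict 4-line FASTQ records with R1 first in each pair.
    """
    with gzip.open(interleaved_path, "rt") as src, \
         gzip.open(r1_path, "wt") as r1, \
         gzip.open(r2_path, "wt") as r2:
        while True:
            # each FASTQ record is exactly 4 lines: header, seq, "+", quality
            record1 = list(islice(src, 4))
            record2 = list(islice(src, 4))
            if not record1 or not record2:
                break
            r1.writelines(record1)
            r2.writelines(record2)
```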

Oh.. ok.. thanks..

I used Bio.SeqIO (Biopython) to split the data into R1 and R2 for both sets of data, and Trimmomatic to do the trimming.. Based on FastQC, my ITS per-base sequence quality seems to plateau around Q30. The data are as follows:

ITS

Measure Value
Filename SRR32378015_ITS.fastq.gz
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 400134
Total Bases 120 Mbp
Sequences flagged as poor quality 0
Sequence length 300
%GC 48

and for 16s

Measure Value
Filename SRR32586856_16s.fastq.gz
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 211158
Total Bases 51.9 Mbp
Sequences flagged as poor quality 0
Sequence length 241-251
%GC 57

So for the next step, trimming, I have this code:

def get_dynamic_trimmomatic_settings(file_type="16S"):
    """
    Returns dynamic trimming parameters based on file type (16S or ITS).
    :param file_type: Type of data (e.g., '16S' or 'ITS')
    """
    if file_type == "16S":
        # For 16S data, use the standard quality threshold
        return "LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:150"
    elif file_type == "ITS":
        # For ITS data, use a more lenient sliding-window threshold (Q10)
        # to retain more of the lower-quality reads
        return "LEADING:3 TRAILING:3 SLIDINGWINDOW:4:10 MINLEN:150"
    else:
        # Default settings for any unclassified data type
        return "LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:150"
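For context, these settings get spliced into my Trimmomatic call roughly like this (a simplified sketch; `build_trimmomatic_cmd` and the output file names are placeholders, and adapter clipping / threads are omitted):

```python
def build_trimmomatic_cmd(r1, r2, out_prefix, settings):
    """Assemble a Trimmomatic paired-end command line.

    A sketch: output names are placeholders, and the settings string is
    whatever get_dynamic_trimmomatic_settings() returns for the file type.
    """
    return ["trimmomatic", "PE", "-phred33", r1, r2,
            f"{out_prefix}_R1_trimmed.fastq.gz",
            f"{out_prefix}_R1_unpaired.fastq.gz",
            f"{out_prefix}_R2_trimmed.fastq.gz",
            f"{out_prefix}_R2_unpaired.fastq.gz"] + settings.split()
```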

And for denoising I have the following code:

# Construct the DADA2 command
dada2_cmd = [
    "qiime", "dada2", "denoise-paired",
    "--i-demultiplexed-seqs", input_qza,
    "--p-trunc-len-f", str(trunc_len_f),  # Truncation length for forward reads
    "--p-trunc-len-r", str(trunc_len_r),  # Truncation length for reverse reads
    "--p-max-ee-f", str(max_ee_f),  # Max expected errors for forward
    "--p-max-ee-r", str(max_ee_r),  # Max expected errors for reverse
    "--p-trunc-q", "2",  # Truncate reads at the first base with quality <= 2
    "--p-min-overlap", "12",  # Minimum overlap for merging forward and reverse reads
    "--o-representative-sequences", output_denoised_qza,
    "--o-table", os.path.join(data_dir, f"{dataset_label}_table.qza"),
    "--o-denoising-stats", os.path.join(data_dir, f"{dataset_label}_denoising_stats.qza")
]

and

# Run DADA2 denoise for 16S & ITS (paired data)

def process_denoise():
    # Check if the input QZA files for 16S and ITS exist
    if os.path.exists(output_qza_16S):
        run_dada2_denoise(output_qza_16S, output_denoised_qza_16S, "16S",
                          trunc_len_f="150", trunc_len_r="150",
                          max_ee_f="2", max_ee_r="2")

    if os.path.exists(output_qza_ITS):
        run_dada2_denoise(output_qza_ITS, output_denoised_qza_ITS, "ITS",
                          trunc_len_f="240", trunc_len_r="240",
                          max_ee_f="3", max_ee_r="3")

Not sure if I'm doing the method correctly..


Thank you for sharing your code! Are you looking for a general code review or is there a specific part you have more questions about?

Qiime2 does not have a single SOP so many of these choices are left up to you!
The overview tutorial shows common workflows, which I often use as a starting point.

Thanks, appreciate that.. Anyway, I managed to generate the full ASV table.. However, there is only 1 sample with a value in it.

Not sure why there is only 1 sample with a value instead of multiple samples with different values..

Could the problem be in the feature table, or during taxonomy assignment?

Please enlighten me if anything went wrong..

I guess it depends on how you made that table!

The taxonomy classification tables look a bit like that; they only include taxonomy information, not per-sample counts.

It's possible to post qiime2 artifact files on the forums, or you can run
qiime tools peek file.qza
to show its type. Because Qiime2 artifacts use well-defined data structures, we can tell exactly what should be inside each file by its type alone!

I see... Since the qiime2 artifacts are built from the metadata.tsv, I wonder if that could be the issue? I only saw two lines of metadata for the interleaved data. :sweat_smile:

Could the reason be how I generate the metadata.tsv in my code?
generate_metadata.py (4.2 KB)

Anyway, the two files I used are not related to each other, as I'm only using them to test my pipeline. I'm wondering if there is any source where I can download both 16S and ITS data from soil analysis.

=)

Perhaps! Can you tell me more about what you are looking for? Like positive control samples with a known composition for testing out your pipeline?

I'm looking for two Illumina soil samples, covering both 16S and ITS. I'm wondering where I can download the fastq files.

Try GitHub - caporaso-lab/mockrobiota: A public resource for microbiome bioinformatics benchmarking using artificially constructed (i.e., mock) communities.

thanks Colin..

Anyway, I have generated a new metadata.tsv, as attached.
metadata.tsv (85.5 KB)

While generating the manifest.tsv files, I'm facing an issue with the absolute file paths.
manifest-16S.tsv (507.0 KB)
manifest-ITS.tsv (11.2 KB)

with the following error:
:x: QIIME2 Import for 16S failed:
There was a problem importing /home/colin/pycharm_projects/FarmXseed_respo/data/manifest-16S.tsv:

/home/colin/pycharm_projects/FarmXseed_respo/data/manifest-16S.tsv is not a(n) PairedEndFastqManifestPhred33V2 file:

Filepath on line 1 and column "forward-absolute-filepath" could not be found (/home/colin/pycharm_projects/FarmXseed_respo/data/processed_data/SRR32586856_537_R1_trimmed.fastq.gz) for sample "SRR32586856_537".

I wonder if I created the metadata.tsv correctly. Since I'm using interleaved data split into R1 and R2, the sample IDs are generated based on the reads in those R1 and R2 files.

However, the issue is that I don't seem to be able to import my manifest files into paired-end demux: each absolute file path is either not found or already registered.

Sorry for asking so many questions; it has already taken me 2 days trying to get multiple samples into the ASV table :expressionless:

Building a pipeline takes time! I've been there too!

This is a classic file path issue. (There's a different error for invalid metadata!)

# view folder
ls -alh /home/colin/pycharm_projects/FarmXseed_respo/data/processed_data/

# view file
gzip -dc /home/colin/pycharm_projects/FarmXseed_respo/data/processed_data/SRR32586856_537_R1_trimmed.fastq.gz | head
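You can also check every path in a manifest at once before importing. A sketch, assuming the V2 manifest layout from your error message (`sample-id`, `forward-absolute-filepath`, `reverse-absolute-filepath` columns):

```python
import csv
import os

def missing_manifest_paths(manifest_tsv):
    """Return (sample-id, path) pairs from a V2 manifest whose files
    do not exist on disk. Column names follow the
    PairedEndFastqManifestPhred33V2 format."""
    missing = []
    with open(manifest_tsv) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            for col in ("forward-absolute-filepath", "reverse-absolute-filepath"):
                path = row.get(col, "")
                if path and not os.path.exists(path):
                    missing.append((row["sample-id"], path))
    return missing
```

Run it on your manifest before calling `qiime tools import`; an empty list means every file path resolves.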

Hmm.. ok.. I created my metadata.tsv by extracting from the 16S and ITS trimmed fastq files, and I got the sample IDs based on my R1 and R2. However, the issue I mentioned earlier is the absolute file path, as you can see from the picture..

This is where I got stuck: the absolute file path is created based on my sample ID, which does not exist.
manifest-16S.tsv (507.0 KB)
manifest-ITS.tsv (11.2 KB)

I'd appreciate it if you can share how I can proceed from here. You can refer to my code for reference..
import_qiime2.py (5.6 KB)
Thanks in advance =)

I think I see the issue:
In the gzip file path, it says SR0000_537_R1_trimmed.fastq.gz
while the files in the folder say SR0000_R1_trimmed.fastq.gz

Yup, correct. Just a question.. is the full ASV table based on the sample IDs in the fastq files, so that all the samples are displayed in it? Or am I wrong on the concept? Sorry to ask, as I'm pretty new to the bioinformatics field..

Or is the real ASV table based on ASV_ID: each ASV (Amplicon Sequence Variant) is uniquely identified by an ID, typically a sequence hash or unique identifier (e.g., cf5520c79, 454e2afd9), ensuring that each ASV is distinguished by its unique sequence?

Good question!

I often pull data from the PD-mouse tutorial for examples.

The feature table has counts of features in samples. Samples have IDs and features also have IDs. This is also called a crosstab or contingency table. It's a dataframe in the wide format.

The sample IDs are made when the data is imported. It's common for the sample ID to be in the file name, but it does not have to be!

These days, features are usually ASVs and the feature IDs are often the md5 hash of the sequence, but it could be asv_1 or otu_1 or whatever.

Observability is important here. Did you know each qiime2 .qza and .qzv file is just a .zip archive? Download some files from PD-mice and open them up. You can see the sample IDs and ASV IDs for yourself.

There are a lot of moving pieces in bioinformatics, so let us know if you have more questions. Feel free to share a little more about your background as well!
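For example, here's a sketch that pulls the semantic type straight out of an artifact by treating it as a zip archive and reading the top-level metadata.yaml (`qiime tools peek` is the supported way to do this; the parsing here is deliberately naive):

```python
import zipfile

def qza_type(qza_path):
    """Read the semantic type out of a QIIME 2 artifact by opening it
    as a zip archive and parsing the top-level metadata.yaml."""
    with zipfile.ZipFile(qza_path) as zf:
        # metadata.yaml sits directly under a single UUID-named top directory
        meta_name = next(n for n in zf.namelist()
                         if n.count("/") == 1 and n.endswith("metadata.yaml"))
        for line in zf.read(meta_name).decode().splitlines():
            if line.startswith("type:"):
                return line.split(":", 1)[1].strip()
```

Poking around the `data/` directory of the same archive shows the actual table with its sample IDs and ASV IDs.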

Oh, thanks.. I briefly read through.. So in the mouse tutorial, there are numerous fastq files, which require generating the metadata with a different sample ID for each file?

In my current situation, I only have 1 file each for 16S and ITS, so technically I will only have 2 sample IDs in my ASV table?

Am I correct to say that?

Correct!

n=2 is perfect for testing and dev but too small to do statistics. I imagine your pipeline may process more samples later!