Issues during denoising with NextSeq 16S data.

Hi all,

  • I've run into an issue with NextSeq data in QIIME2: we are losing 75%+ of our sequences during the DADA2 denoising step.

  • We have 2x300bp sequencing data for the V3-V4 16S region, amplified with the 341F/805R primers.

  • The target amplicon size should be 464bp, with an insert size of 426bp following primer trimming.

  • As part of our SOP, I check read quality and trim primers outside of QIIME2.

  • The reads are of acceptable quality, with the reverse read quality declining toward the end of the read.

  • The one sample of very poor quality is a negative control; see the MultiQC output below:

  • Here is the interactive quality plot of the demux.qzv object:

  • The QIIME2 plots look unusual, and I understand this stems from the different quality score binning strategy used in NextSeq and NovaSeq systems vs. the MiSeq.
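One quick way to confirm the binning is to count the distinct quality characters in a raw FASTQ: binned NextSeq/NovaSeq runs typically carry only ~4 distinct Phred values, while MiSeq runs carry ~40. A minimal sketch, using a tiny made-up `demo.fastq` as a stand-in for a real R1 file:

```shell
# Count the distinct Phred quality characters in a FASTQ file.
# Binned NextSeq/NovaSeq data typically shows only ~4 values;
# MiSeq data shows ~40. Replace demo.fastq with a real R1 file.
cat > demo.fastq <<'EOF'
@read1
ACGTACGT
+
FFFF,FFF
@read2
ACGTACGT
+
F:FFFF#F
EOF
awk 'NR % 4 == 0' demo.fastq | fold -w1 | LC_ALL=C sort -u
```

For gzipped files, pipe through `zcat` first: `zcat sample_R1.fastq.gz | awk 'NR % 4 == 0' | fold -w1 | LC_ALL=C sort -u`.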

  • I know this has been discussed on the QIIME2 and DADA2 forums, with the suggestion to either modify and run the R version of DADA2 directly, or use Deblur, which should not be affected by these issues.

  • I haven't gone down the road of using base R DADA2 yet, as I wanted to see whether I could make this work using q2-dada2. I'm also not convinced the poor results are down to the binning issue alone: I've run the workflow multiple times at various truncation, minimum-overlap, and maximum-error thresholds, and the best result retained only around 20-30% of my reads. The table below was produced using this command (part of a shell script):

TRUNCLENF=273
TRUNCLENR=175

time qiime dada2 denoise-paired --i-demultiplexed-seqs demux.qza \
--p-trunc-len-f $TRUNCLENF \
--p-trunc-len-r $TRUNCLENR \
--p-n-threads 0 \
--o-representative-sequences rep-seqs-$TRUNCLENF-$TRUNCLENR.qza \
--o-table table-$TRUNCLENF-$TRUNCLENR.qza \
--o-denoising-stats stats-dada2-$TRUNCLENF-$TRUNCLENR.qza \
--p-min-overlap 6 \
--p-max-ee-f 4 \
--p-max-ee-r 4 \
--verbose
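For reference, the expected merge overlap for a given truncation pair can be sanity-checked with simple arithmetic (insert size of 426 bp and truncation values taken from the command above):

```shell
# Sanity-check the expected merge overlap for a truncation pair.
# overlap = (trunc-len-f + trunc-len-r) - insert length after primer trimming
INSERT=426        # 464 bp V3-V4 amplicon minus 17 + 21 bp primers
TRUNCLENF=273
TRUNCLENR=175
OVERLAP=$(( TRUNCLENF + TRUNCLENR - INSERT ))
echo "Expected overlap: ${OVERLAP} nt"   # prints: Expected overlap: 22 nt
```

Note the nominal overlap here is 22 nt, but the V3-V4 amplicon varies in length between taxa, so longer-than-average variants overlap less and can fail to merge even with `--p-min-overlap 6`.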

  • It seems that a large proportion of the reads is being lost at the merging and chimera-checking steps. I usually choose truncation sites from the quality plots and the required overlap, but here I used the FIGARO tool to select optimal truncation sites; again I tried a few combinations, with no improvement.
  • I've also tried using Deblur on merged reads, but the results were no better, and probably worse. I tried various truncation parameters, which only slightly improved the outcome; around 90% of the reads were being lost.
  • It's worth noting that when I proceed to classification, the data look broadly as expected (these are rumen samples), but given that we are losing such a high proportion of the sequence data, I don't have confidence in it.
  • I've copied my full q2-dada2 workflow below; it's split into 3 shell scripts. Environment variables were not rendering correctly in the forum post, so they have been removed, but the scripts all run correctly.
  • Sorry for the long post, but hopefully it's clear. Has anyone else run into issues like this using NextSeq (or maybe NovaSeq) data in QIIME2, and found a solution?

TIA! - Eoin

Platform - HPC server w/3TB RAM and 396 cores.
QIIME2 v2023.5

#!/bin/bash
# Part 1: primer trimming and import

# Set other environment variables. This may require exploring the data before running this script.
R1TRIM=17
R2TRIM=21
NTHREADS=16
MAPFILE=${dir_docs}"/q2_mapping.txt"
CLASSIFIER=${dir_silva_db}"/silva-138-classifier-16S-V3V4.qza"

# 1. Removing the sequencing primers.

echo "####################################################"
echo ""
echo "1. Trimming sequencing primers using BBDUK at `date`"
echo "Trimming ${R1TRIM} nt from the FWD reads and ${R2TRIM} nt from the RVS reads"
echo ""
echo "####################################################"

# First do the forward reads (R1):

cd $dir_fastq_raw

for R1 in *R1*

do

    R1out=${R1//.fastq.gz/_trimmed.fastq.gz}

    bbduk.sh in=$R1 out=$dir_fastq_trimmed/$R1out forcetrimleft=$R1TRIM threads=$NTHREADS

done

# Next do the reverse reads (R2):

for R2 in *R2*

do

    R2out=${R2//.fastq.gz/_trimmed.fastq.gz}

    bbduk.sh in=$R2 out=$dir_fastq_trimmed/$R2out forcetrimleft=$R2TRIM threads=$NTHREADS

done

echo "###########################################################"
echo ""
echo "Finished trimming sequencing primers using BBDUK at `date`"
echo ""
echo "###########################################################"

# 2. Import reads as QIIME2 object and inspect

# QIIME2 data import and qc
source activate qiime2-2023.5

echo "###########################################################"
echo ""
echo "2. Importing data into QIIME2 at `date`"
echo ""
echo "###########################################################"

#   All further analysis takes place in the "results" directory.
cd $dir_results

time qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-path $MAPFILE --output-path demux.qza --input-format PairedEndFastqManifestPhred33V2

# Visually inspect data

time qiime demux summarize --i-data demux.qza --o-visualization demux.qzv


echo "Finished data import at `date`"

echo "###########################################################"
echo ""
echo "q2_gr2301_v3v4_shell_part1_updated completed at `date`. Inspect the demultiplexed output to inform denoising parameters in part 2."
echo ""
echo "###########################################################"

# Part 2: Denoising

# Set other environment variables. This may require exploring the data before running this script.
TRUNCLENF=273
TRUNCLENR=175
NTHREADS=16
MAPFILE=${dir_docs}"/q2_mapping.txt"
CLASSIFIER=${dir_silva_db}"/silva-138-classifier-16S-V3V4.qza"

cd $dir_results

echo "###########################################################"
echo ""
echo "1. Commencing denoising of data into ASVs using Dada2 at `date`"
echo ""
echo "###########################################################"

time qiime dada2 denoise-paired --i-demultiplexed-seqs demux.qza \
--p-trunc-len-f $TRUNCLENF \
--p-trunc-len-r $TRUNCLENR \
--p-n-threads 0 \
--o-representative-sequences rep-seqs-$TRUNCLENF-$TRUNCLENR.qza \
--o-table table-$TRUNCLENF-$TRUNCLENR.qza \
--o-denoising-stats stats-dada2-$TRUNCLENF-$TRUNCLENR.qza \
--p-min-overlap 6 \
--p-max-ee-f 4 \
--p-max-ee-r 4 \
--verbose

# Summarise outputs 
time qiime metadata tabulate --m-input-file stats-dada2-$TRUNCLENF-$TRUNCLENR.qza --o-visualization stats-dada2-$TRUNCLENF-$TRUNCLENR.qzv
time qiime feature-table summarize --i-table table-$TRUNCLENF-$TRUNCLENR.qza --o-visualization table-$TRUNCLENF-$TRUNCLENR.qzv
time qiime feature-table tabulate-seqs --i-data rep-seqs-$TRUNCLENF-$TRUNCLENR.qza --o-visualization rep-seqs-$TRUNCLENF-$TRUNCLENR.qzv


echo "###########################################################"
echo ""
echo "Finished denoising of data into ASVs using Dada2 at `date`"
echo ""
echo "###########################################################"

# Part 3: Taxonomic classification and visualization

# Set other environment variables. This may require exploring the data before running this script.
TRUNCLENF=283
TRUNCLENR=165
NTHREADS=16
MAPFILE=${dir_docs}"/q2_mapping.txt"
CLASSIFIER=${dir_silva_db}"/silva-138-classifier-16S-V3V4.qza"


echo "###########################################################"
echo ""
echo "1. Building phylogenetic trees using MAFFT at `date`"
echo ""
echo "###########################################################"


time qiime phylogeny align-to-tree-mafft-fasttree --i-sequences  rep-seqs.qza --o-alignment aligned-rep-seqs.qza --o-masked-alignment masked-aligned-rep-seqs.qza --o-tree unrooted-tree.qza --o-rooted-tree rooted-tree.qza

echo "Finished building phylogenetic trees using MAFFT at `date`"


echo "###########################################################"
echo ""
echo "2. Commencing taxonomic classification with SILVA at `date`"
echo ""
echo "###########################################################"

time qiime feature-classifier classify-sklearn --i-classifier $CLASSIFIER --i-reads rep-seqs.qza --p-n-jobs $NTHREADS --o-classification asv-w-taxonomy-dada2.qza --verbose

echo "Taxonomic classification with SILVA finished at `date`"

echo "###########################################################"
echo ""
echo "3. Building barcharts at `date`"
echo ""
echo "###########################################################"

time qiime metadata tabulate --m-input-file asv-w-taxonomy-dada2.qza --o-visualization asv-w-taxonomy-dada2.qzv
 
time qiime taxa barplot --i-table table.qza --i-taxonomy asv-w-taxonomy-dada2.qza --m-metadata-file $MAPFILE --o-visualization taxa-bar-plots.qzv


echo "###########################################################"
echo ""
echo "QIIME2 analysis finished at `date`. Results written to $dir_results. Error and output written to $dir_docs"
echo ""
echo "###########################################################"

Welcome to the forum!

First of all, I remember reading your paper regarding rumen microbiome and A. taxiformis application. It is a good paper, congrats!

I don't think you are facing issues because of NextSeq binning; it looks like the root of the problem is somewhere else.

The V3-V4 hypervariable region is quite variable in length, so my first suggestion would be to adjust the truncation parameters. Could you try setting F to 260 and R to 230?

Did you try skipping any QC steps before DADA2? I suspect they can affect the error-learning step in DADA2. I would also try trimming the primers with Cutadapt inside QIIME2, discarding any sequences in which no primer is found.
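For the Cutadapt suggestion, a minimal sketch with q2-cutadapt might look like the following. The primer sequences are the standard 341F/805R pair and the file names match the original post, but both are assumptions; check them against your own protocol:

```shell
# Sketch: primer removal inside QIIME2 with q2-cutadapt.
# --p-discard-untrimmed drops read pairs in which the primer was not found.
# Primer sequences are the standard 341F/805R pair (an assumption);
# verify against your wet-lab protocol before use.
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f CCTACGGGNGGCWGCAG \
  --p-front-r GACTACHVGGGTATCTAATCC \
  --p-discard-untrimmed \
  --p-cores 16 \
  --o-trimmed-sequences demux-trimmed.qza
```

For this to make sense, `demux.qza` would need to be imported from the untrimmed reads, i.e. skipping the BBDuk `forcetrimleft` step in Part 1.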

Hopefully, the steps above will increase the merging rate and lower the chimera rate in your reads.

Best,

