I have human nose swab samples. My samples were pooled before sequencing (16S and 18S together). My data are fastq.gz samples (R1 and R2) for each of my samples.
I do the following:
remove my adapters (using cutadapt)
split my data in 16S and 18S based on the primers and remove the primers (with cutadapt)
using a manifest file I create .qza files with my data
following I look at my quality and decide to trunc. I only trunc my reverse reads of the 16S as the others look ok (see picture).
When I look at my denoising statistics, I see that I only keep around 3% of my input and I don't understand why (see picture). Can someone help me?
FWD_ADAPTER="TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG"
REV_ADAPTER="GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG"
# Zoek alle forward reads (_R1.fastq.gz) in RUN205 en submappen
find "$INPUT_DIR" -type f -name "*_R1.fastq.gz" | while read FORWARD; do
# Vind het bijbehorende reverse read-bestand (_R2.fastq.gz)
REVERSE="${FORWARD/_R1.fastq.gz/_R2.fastq.gz}"
# Controleer of het reverse bestand bestaat
if [ -f "$REVERSE" ]; then
BASENAME=$(basename "$FORWARD" _R1.fastq.gz)
echo "Verwerken: $BASENAME"
# Cutadapt uitvoeren voor adapter trimming
cutadapt \
-a "$FWD_ADAPTER" -A "$REV_ADAPTER" \
-o "$OUTPUT_DIR/${BASENAME}_trimmed_R1.fastq.gz" \
-p "$OUTPUT_DIR/${BASENAME}_trimmed_R2.fastq.gz" \
"$FORWARD" "$REVERSE"
echo "Klaar: $BASENAME"
else
echo "Geen bijbehorende R2 gevonden voor $FORWARD! Overslaan..."
fi
done
splitting:
find "$INPUT_DIR" -type f -name "*R1.fastq.gz" | while read FORWARD; do
# Vind het corresponderende reverse bestand
REVERSE="${FORWARD/R1.fastq.gz/R2.fastq.gz}"
# Controleer of het reverse bestand bestaat
if [ -f "$REVERSE" ]; then
# Extracteer de basisnaam zonder subpad
BASENAME=$(basename "$FORWARD" _R1.fastq.gz | sed 's/_trimmed//')
echo "Processing $BASENAME..."
# Run Cutadapt voor 16S
cutadapt \
-g "CCTACGGGNGGCWGCAG" \
-G "GACTACHVGGGTATCTAATCC" \
-o "$OUTPUT_DIR/${BASENAME}_16S_R1.fastq.gz" \
-p "$OUTPUT_DIR/${BASENAME}_16S_R2.fastq.gz" \
"$FORWARD" "$REVERSE"
# Run Cutadapt voor 18S
cutadapt \
-g "CAGCAGCCGCGGTAATTCC" \
-G "CCCGTGTTGAGTCAAATTAAGC" \
-o "$OUTPUT_DIR/${BASENAME}_18S_R1.fastq.gz" \
-p "$OUTPUT_DIR/${BASENAME}_18S_R2.fastq.gz" \
"$FORWARD" "$REVERSE"
else
echo "Waarschuwing: Geen bijbehorende R2 voor $FORWARD gevonden!"
fi
done
This is basically exactly what I would have done given those quality plots, so at this point all I can recommend is setting your trunc-len a bit more aggressively to see if we can get through filtering.
You might try 280 on all of them just to see? I don't love this because again, your quality scores look great.
Otherwise, we need to consider tuning parameters like max-ee.
Hello!
I had a dataset with similar issue and similar quality plots (the same drop in the middle). So the workarounds were for me (choose any that works):
My guess is that you are losing the vast majority of your 16S reads due to the interaction between the sequence length after trimming the primers and your reverse truncation length. It looks like you have 300bp sequences, and your reverse 16S primer is 21 nts long, so if it present on the reads and being trimmed off by cutadapt, then the output reads are 279 bps, and so will get thrown away when you set --p-trunc-len-r 280. The truncation length in dada2 both truncates longer reads to the specified length, and throws away all reads that are shorter than the specified length.
Are you alse seeing this in the 18S data, where you aren't truncating?