I'm using cutadapt on 4 datasets of V3-V4 amplicons to trim them and keep only the V4 region.
All the datasets are in single-end format after merging with FLASH.
Three of the 4 datasets are trimmed as expected with the primer GTGYCAGCMGCCGCCGCGGTAA (515).
However, for the fourth dataset cutadapt did not find this primer: every time I searched for it at the 5' end, it reported only a handful of sequences containing it (5, 10, 3, and so on). I tried different parameters, again with no luck.
But in a paper (doi: 10.1128/mSphere.01202-20) I found the primer GACTACHVGGGTATCTAATCC, which is the reverse primer when sequencing V3-V4.
With this one, cutadapt recognized the primer at the 5' end and reported, for example, 112365 matching sequences per sample. So it works!
Is it correct to assume that the sequences I have now are from the V4 region?
You have four sets of paired-end reads, each of which has been merged, so you now have four sets of merged reads.
You are using a single primer that matches in the middle of the sequence to extract the V4 region, and you want to keep everything downstream of it (i.e. towards the 3' end).
This is working as you expected for three of the sets but is failing for the fourth.
Is this all correct?
Were these four sets of reads generated using precisely the same primer sets? Were they merged using precisely the same parameters to the same merging algorithm/software?
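One quick way to sanity-check the situation described above is to count how many merged reads contain the primer as written, and how many contain its reverse complement. The sketch below is only an illustration: `revcomp`, `iupac_to_regex`, and `count_hits` are hypothetical helper names, not part of cutadapt or QIIME 2, and the matching is exact, so the counts will differ from cutadapt's, which tolerates mismatches.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a rough primer-orientation check (exact matches only).

# Reverse-complement a primer, including IUPAC ambiguity codes.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTRYKMBVDHSWN' 'TGCAYRMKVBHDSWN'
}

# Expand IUPAC ambiguity codes into a grep -E pattern.
iupac_to_regex() {
    printf '%s\n' "$1" | sed \
        -e 's/R/[AG]/g'  -e 's/Y/[CT]/g'  -e 's/S/[CG]/g'  -e 's/W/[AT]/g' \
        -e 's/K/[GT]/g'  -e 's/M/[AC]/g'  -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
        -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Count reads whose sequence line (every 4th line, starting at line 2) matches.
count_hits() {
    awk 'NR % 4 == 2' "$2" | grep -c -E "$1" || true
}

# Usage: ./check_primer.sh merged.fastq   (the file name is an assumption)
if [ -n "${1:-}" ]; then
    primer="GACTACHVGGGTATCTAATCC"
    echo "as written:         $(count_hits "$(iupac_to_regex "$primer")" "$1")"
    echo "reverse complement: $(count_hits "$(iupac_to_regex "$(revcomp "$primer")")" "$1")"
fi
```

If most hits are for the primer as written near the start of the reads, that would suggest the merged reads in that set are in the opposite orientation to the other three.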
Yes, I have 4 datasets. All of them contain V3-V4 paired-end reads, and they were generated with different protocols and different sequencing technologies (MiSeq, HiSeq, and so on).
Each dataset was merged with FLASH separately.
Then I used the primer GTGYCAGCMGCCGCCGCGGTAA (515) to keep the V4 region in all datasets, and it worked for 3 of them.
But for one dataset, this primer is not recognized at the 5' end of the sequences.
Specifically, this dataset was generated as follows:
The V3-V4 hypervariable region of the bacterial 16S ribosomal RNA (rRNA) gene was amplified from the DNA samples with the barcoded forward primers 341F (5'-CCTACGGGNBGCASCAG-3') and the reverse primers 806R (5'-GGACTACNVGGGTWTCTAAT-3') using KAPA HiFi HotStart ReadyMix (KAPA Biosystems, United States).
So when I use the reverse primer for V4, it is recognized at the 5' end of the sequences, and I don't know why that happens.
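# $input_directory, $output_directory, $forward and $reverse must be set beforehand
for fastq in "$input_directory"/*.fastq
do
    # Keep the original file name for the trimmed output
    cut=$(basename "$fastq")
    output_file="$output_directory/$cut"
    # -g trims a 5' primer, -a trims a 3' adapter
    cutadapt -a "$reverse" -g "$forward" -o "$output_file" "$fastq"
done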
And this is the demux.qzv from one of those datasets after I had used cutadapt outside of qiime2.
I would recommend using qiime2 to perform the trimming, merging, and demultiplexing. This is the only way for us to really be able to provide useful support--by seeing the provenance (history) of the things you've done with your data, and looking for clues that could explain the results you're seeing.
Given that your goal is to eventually extract the V4 region from all of your amplicons I think you should probably:
run dada2 on each dataset separately
use feature-classifier extract-reads to extract your region of interest
use vsearch cluster-features-de-novo with a --p-perc-identity of 1 to update your feature table with the newly extracted sequences
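The three steps above could look roughly like the sketch below. Treat it as a starting point under stated assumptions rather than a tested recipe: the file names and the `run`/`extract_v4` helpers are hypothetical, the truncation lengths of 0 (no truncation) are placeholders to tune per dataset, and the primers are copied as written in this thread and should be checked against each protocol. By default the script only prints the commands; set `RUN=1` to execute them in a QIIME 2 environment.

```shell
#!/usr/bin/env bash
# Dry-run wrapper (hypothetical helper): print the command, run it only if RUN=1.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Sketch of the suggested workflow for ONE dataset; repeat for each dataset.
extract_v4() {
    local demux="$1"                          # demultiplexed paired-end artifact (assumed name)
    local f_primer="GTGYCAGCMGCCGCCGCGGTAA"   # forward primer as written in this thread
    local r_primer="GGACTACNVGGGTWTCTAAT"     # 806R as quoted from the protocol

    # 1. Denoise; the trunc lengths here are placeholders, not recommendations.
    run qiime dada2 denoise-paired \
        --i-demultiplexed-seqs "$demux" \
        --p-trunc-len-f 0 --p-trunc-len-r 0 \
        --output-dir dada2_out

    # 2. Extract the region of interest from the representative sequences.
    run qiime feature-classifier extract-reads \
        --i-sequences dada2_out/representative_sequences.qza \
        --p-f-primer "$f_primer" --p-r-primer "$r_primer" \
        --o-reads v4_seqs.qza

    # 3. Collapse sequences that became identical after extraction.
    run qiime vsearch cluster-features-de-novo \
        --i-table dada2_out/table.qza \
        --i-sequences v4_seqs.qza \
        --p-perc-identity 1 \
        --o-clustered-table v4_table.qza \
        --o-clustered-sequences v4_rep_seqs.qza
}

extract_v4 "demux.qza"
```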
I'm a little confused about the demux visualizations you've uploaded--you should have four of these, correct? Have you been uploading the problematic one only?
I would begin by running dada2 denoise-paired on this demux (and all others), and then looking at the dada2 stats output as a first troubleshooting step.