I started using QIIME 2 (2021.11) a couple of months ago. I'm working on ITS2 (variable target length) V2*300 MiSeq data and completed the analysis up to barplots and tree creation, but I ended up with ~950 sequences even though there are only 25-30 species. I went back to the first step of the analysis, i.e. cutadapt, and found that most of the sequences still contain either the Fwd_primer or the complement of the Rev_primer. I searched this forum for help and tried various commands, but they didn't help much. I'd greatly appreciate any help. Thanks
Below are the commands I used:
nohup cat CS_complete/CScomple_libnames | while read line; do cutadapt -g ACGTCTGGTTCAGGGTTGTT -n 3 -m 1 -G TTAGTTTCTTTTCCTCCGCT -n 3 -m 1 -o TrimmedCS1/$line"_R1_trim.fastq.gz" -p TrimmedCS1/$line"_R2_trim.fastq.gz" CS_complete/$line"_R1.fastq" CS_complete/$line"_R2.fastq"; done > out-V1-10Jcss.txt>out-V1-10Jcss.err
qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-path CScomplete-manifest.txt --output-path CS_wf1/demux-trimmed.qza --input-format PairedEndFastqManifestPhred33V2
After all of the above, the primers were still present in >80% of the seqs. I then found the command below, but after running it no sequences were left; all were discarded.
qiime cutadapt trim-paired --p-cores 16 --i-demultiplexed-sequences CS_wf1/demux.qza --p-adapter-f ACGTCTGGTTCAGGGTTGTT --p-front-f AGCGGAGGAAAAGAAACTAA --p-adapter-r TTAGTTTCTTTTCCTCCGCT --p-front-r AACAACCCTGAACCAGACGT --p-overlap 3 --p-error-rate 0.3 --p-match-read-wildcards --p-match-adapter-wildcards --p-discard-untrimmed --o-trimmed-sequences CS_wf1/trimmed-demuxv1.qza --verbose > cutadapt-log-2novi-cssv1.txt
Just to clarify, did you get ~950 after running the 'qiime cutadapt trim-paired' command?
A possible explanation is that your primer sequences are not in your dataset, and the '--p-discard-untrimmed' option you are using (good choice to use it!) is letting through only the few sequences with a random match.
I would suggest double-checking your primer sequences, maybe by asking your sequencing provider or whoever did the first amplification for the library preparation.
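One quick orientation check, as a minimal sketch: the snippet below reverse-complements the two primers from your commands using the standard rev and tr tools. The helper name revcomp is made up for illustration, and it assumes plain uppercase A/C/G/T with no degenerate bases.

```shell
# Reverse-complement a DNA sequence (uppercase A/C/G/T only).
# "revcomp" is an illustrative helper name, not part of cutadapt or QIIME 2.
revcomp() {
  printf '%s\n' "$1" | rev | tr ACGT TGCA
}

revcomp ACGTCTGGTTCAGGGTTGTT   # forward primer -> AACAACCCTGAACCAGACGT
revcomp TTAGTTTCTTTTCCTCCGCT   # reverse primer -> AGCGGAGGAAAAGAAACTAA
```

The two results are exactly the sequences passed to --p-front-r and --p-front-f in your trim-paired command, so it may be worth checking which orientation each option expects.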
Many thanks @llenzi for the quick response.
After the cutadapt step there were >4000 seqs; 950 were left after downstream analysis (length- and class-based filtering, sequence depth, etc.), and most of the filtered-out seqs also contained primers.
The sequencing facility didn’t remove the primers, as I can still map both the fwd and rev primers in almost all seqs before the cutadapt step.
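For anyone wanting to reproduce this kind of check, here is a minimal sketch that counts how many reads in an uncompressed FASTQ file contain a primer, and how many start with it. The function name count_primer_hits is hypothetical, it assumes standard 4-line FASTQ records, and it only finds exact matches (no mismatches or degenerate bases), so it will undercount compared with cutadapt.

```shell
# Count reads whose sequence starts with, or contains, an exact primer match.
# Usage: count_primer_hits reads.fastq PRIMER
# "count_primer_hits" is an illustrative name; input must be uncompressed 4-line FASTQ.
count_primer_hits() {
  awk -v p="$2" '
    NR % 4 == 2 {                 # second line of each record is the sequence
      total++
      pos = index($0, p)          # 0 if absent, otherwise 1-based position of first match
      if (pos == 1) at_start++
      if (pos > 0)  anywhere++
    }
    END { printf "total=%d at_start=%d anywhere=%d\n", total, at_start, anywhere }
  ' "$1"
}

# Example call (the file name is a placeholder):
# count_primer_hits CS_complete/sample_R1.fastq ACGTCTGGTTCAGGGTTGTT
```

Running it on the raw reads and again on the trimmed reads gives a rough before/after comparison for each primer.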
How many samples do you have in total? I am still convinced that the sequences you are using are not the correct primers. These sequences match ITS2 in nematodes; are these the organisms you are looking for?
Thanks again @llenzi
Yes, the primers are for nematodes, and the ITS2 length varies from 267–380 bp among species. There were 227 samples, but after all filtering and screening steps the remaining 950 seqs came from <200 samples. A few seqs in the set of 950 still had primers, but as I mentioned above, >80% of seqs had primers after the cutadapt step. The number of seqs with primers varies between code 1 and code 2, while no sequences were left when using code 3 written above.
Thanks for your ongoing support.
Sorry I could not follow up yesterday, and thanks to @SoilRotifer for joining!
I am not familiar with these primers or this region; do you know the expected region length on average?
More precisely, would you expect the reads to run through the primer from the opposite direction?
Have you ever used these settings with cutadapt on your sequences before? In particular, I noticed you changed the number of times cutadapt looks for a primer (you set it to 3), allowed a more relaxed search (error rate 0.3), and in one case searched anywhere in the read rather than only in the initial region.
I had a quick look at the cutadapt logs. It looks as if many reads contain the primer and pass the initial filters, but very few were actually written to the final output (~15%), I think because they became too short after the primer was removed so many times.
Did you try the trimming using the default overlap and error rate?
Can you pass the log for these also, please?
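In case it helps, a default-settings run could look like the sketch below. It only prints the command so you can inspect it first. It assumes the forward primer sits at the start of R1 and the reverse primer at the start of R2, reuses the demux.qza path from your earlier command, and uses a placeholder output name. Error rate, overlap and the number of passes are simply left unset so the plugin defaults apply.

```shell
# Primers as given in this thread.
FWD=ACGTCTGGTTCAGGGTTGTT
REV=TTAGTTTCTTTTCCTCCGCT

# Print (not run) a trim-paired call that leaves --p-error-rate, --p-overlap
# and --p-times at their defaults. "default_trim_cmd" is an illustrative helper name.
default_trim_cmd() {
  echo qiime cutadapt trim-paired \
    --i-demultiplexed-sequences "$1" \
    --p-front-f "$FWD" \
    --p-front-r "$REV" \
    --p-discard-untrimmed \
    --o-trimmed-sequences "$2" \
    --verbose
}

# The output name below is a placeholder.
default_trim_cmd CS_wf1/demux.qza CS_wf1/trimmed-defaults.qza
```

Once the printed command looks right, it can be pasted into the terminal, with the output redirected to a log file as before.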
My apologies for the delayed response. The target length is variable, and I think that is the main issue here. I am expecting various nematode species in my samples, and each has a different ITS2 target length (between 287 and 380 bp) with the above primer set. I have not used the above settings with cutadapt before, except option #1. After using option 1, I still found primers and common overhang seqs (adapters) in some of the seqs, so I found options 2 & 3 on this forum and used them. I have attached a consensus sequence assembled from 2 selected reads (Fwd and Rev) from one sample, with the primers and adapter mapped, if that helps.
Also, after several attempts I used the command below and it seems to have worked, but after the denoising step only 216/227 samples were left and the feature frequency was also significantly reduced (image provided below) :(. Could you please also have a look at the verbose output?