Hello @MJ_Estrella
Before delving into the code, it's important to note that ITS sequences vary in length much more than 16S sequences, and that your read length may be longer than your amplicons. A brief discussion of this can be found here.
From the looks of it you are sequencing ITS2.
Because some of your sequences may be shorter than the read length, you shouldn't use the `--p-minimum-length` parameter (or should use it very carefully). Furthermore, because your reverse primer, indices, and adapter may be included in those reads, you'll also want to look for the reverse complement of your reverse primer. If these portions are left in the sequences, feature classification gets funky, since you're including non-biological information. Depending on the quality of your sequencing data, you'll want to adjust `--p-error-rate`, `--p-no-indels`, `--p-no-match-read-wildcards`, `--p-overlap`, and `--p-discard-untrimmed`. At a minimum, I like to include `--p-discard-untrimmed`. To get a sense of how the various parameters affect your data, compare the per-sample fastq counts of your demultiplexed raw samples to the counts after adapter removal.
```shell
# --p-front: your forward primer
# --p-adapter: the reverse complement of your reverse primer
qiime cutadapt trim-single \
  --i-demultiplexed-sequences demux-filtered-trim.qza \
  --p-front GTGAATCATCGAATCTTTGAA \
  --p-adapter GCATATCAATAAGCGGAGGA \
  --p-discard-untrimmed \
  --o-trimmed-sequences noadapters-all.qza
```
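To make that before/after comparison, you can summarize each artifact and check the per-sample read counts in the resulting visualizations. A quick sketch — the filename `demux.qza` is just a placeholder for your own raw demultiplexed artifact:

```shell
# Summarize the raw demultiplexed reads
qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux.qzv

# Summarize the reads after primer/adapter removal
qiime demux summarize \
  --i-data noadapters-all.qza \
  --o-visualization noadapters-all.qzv
```

Viewing both `.qzv` files (e.g. at view.qiime2.org) shows how many reads each cutadapt setting discards per sample.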
Regarding sequence clustering, you shouldn't need to use both `dada2` and `vsearch`. `dada2` groups your sequences into ASVs (amplicon sequence variants) after correcting your reads. It would be redundant, and could complicate taxonomic assignment (especially for ITS analysis), to further cluster your ASVs into OTUs (operational taxonomic units). The only case where I could see this being relevant would be grouping sequences at 100% identity, to prevent variations in sequence length introduced by adapter trimming and quality truncation from inflating ASV counts, but I'm not sure that's regularly done.
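If you did want to collapse such length variants at 100% identity, it would look something like the following — a sketch only, with placeholder filenames for your DADA2 outputs, and again I'm not sure this step is commonly recommended:

```shell
# Cluster DADA2 ASVs at 100% identity to merge length variants
qiime vsearch cluster-features-de-novo \
  --i-sequences rep-seqs.qza \
  --i-table table.qza \
  --p-perc-identity 1.00 \
  --o-clustered-table table-100.qza \
  --o-clustered-sequences rep-seqs-100.qza
```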
> I know that if I use closed-reference I won't have the problem with Unassigned sequences but I need to understand why it happens.
Closed-reference OTU clustering discards sequences that don't match the reference database. Your unassigned sequences may still be biologically relevant; you'd just be discarding them by going closed-reference. Also, it would be redundant to run `classify-consensus-blast` after `cluster-features-open-reference` using the same database.
To start, I would redo `cutadapt` and pick a lower identity threshold for sequence clustering. You could also consider using another classifier, such as `classify-sklearn`, to identify unknown OTUs with a trained classifier. You may want to take another look at the ITS analysis tutorial.
Hope this helps