Elevated Unassigned sequences with ITS data

Hello everyone,

I have a problem with my ITS data. When I analyzed it I got a lot of Unassigned sequences (more than 50%) and I don't know where the problem is. This is my current code, although I have changed it a lot while looking for an answer to this issue (I haven't found anything).

qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path se-33-manifest-mine-H \
  --output-path single-end-demux-H.qza \
  --input-format SingleEndFastqManifestPhred33

qiime quality-filter q-score \
  --i-demux single-end-demux-H.qza \
  --p-min-quality 19 \
  --o-filtered-sequences demux-filtered-trim.qza \
  --o-filter-stats demux-filter-stats-trim.qza

I think the big problem is here, because when I cat the forward and reverse reads together I can only trim one primer...

qiime cutadapt trim-single \
  --i-demultiplexed-sequences demux-filtered-trim.qza \
  --p-minimum-length 250 \
  --p-error-rate 0 \
  --o-trimmed-sequences noadapters-all.qza

qiime dada2 denoise-single \
  --i-demultiplexed-seqs noadapters-all.qza \
  --p-trunc-len 0 \
  --p-max-ee 2 \
  --p-n-threads 0 \
  --p-chimera-method pooled \
  --o-table table-demux-trim.qza \
  --o-representative-sequences rep-seq-trim.qza \
  --o-denoising-stats denoising-demux-ytom.qza

I know that if I use closed-reference I won't have the problem with Unassigned sequences but I need to understand why it happens.

qiime vsearch cluster-features-open-reference \
  --i-sequences rep-seq.qza \
  --i-table table-demux.qza \
  --i-reference-sequences heimer-seq.qza \
  --p-perc-identity 0.97 \
  --o-clustered-table table-open.qza \
  --o-clustered-sequences rep-open.qza \
  --o-new-reference-sequences new-open.qza

qiime feature-classifier classify-consensus-blast \
  --i-query rep-open.qza \
  --i-reference-reads heimer-seq.qza \
  --i-reference-taxonomy heimer-taxonomy.qza \
  --p-perc-identity 0.97 \
  --o-classification taxonomy-full-open-trim.qza

qiime taxa barplot \
  --i-table table-open.qza \
  --i-taxonomy taxonomy-full-open.qza \
  --m-metadata-file metadata.tab \
  --o-visualization taxa-bar-plots.qzv

Perhaps the name of a file does not correspond to the output of the previous step... I've tried so many things that it's crazy, but I do correct the names before executing.

I have tried paired-end, forward reads only, and cat-ing forward and reverse together (I did that with QIIME 1 and had no problem). More things too: deleting adapters / not deleting adapters, using the quality filter / not using it, and many more. Maybe the problem is that I use so many plugins...

I also have a doubt... Suppose I have actually done everything right (I doubt it) and I still have many Unassigned sequences. For the diversity analyses there is no problem, since I use all the features. But I keep thinking, maybe too much, that if the taxonomy does not reflect the biodiversity, the interpretation of the data is wrong, isn't it?

Sorry for this long, long post, but I have read a lot about this topic (on this forum and in other places) and I am not making progress...


Hello @MJ_Estrella

Before delving into code, it's important to note that ITS sequences have a much larger range of lengths than 16S sequences, and your read length may be greater than your amplicon length. A brief discussion of this can be found here.

From the looks of it you are sequencing ITS2.

Because your sequences may be shorter than the read length, you shouldn't use the --p-minimum-length parameter (or should be very careful with it). Furthermore, because your reverse primer, indices, and adapter may be included in those reads, you will also want to look for the reverse complement of your reverse primer. If these portions are left in the sequences, feature classification gets funky, since you're including non-biological information. Depending on the quality of your sequencing data, you'll want to adjust --p-error-rate, --p-no-indels, --p-no-match-read-wildcards, --p-overlap, and --p-discard-untrimmed; at a minimum, I like to include --p-discard-untrimmed. To get a sense of how the various parameters affect your data, compare the fastq read counts of your raw demultiplexed samples to the counts after adapter removal.
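As a quick way to do that comparison outside of QIIME, you can count fastq records directly on the command line; the directory names below are placeholders, not files from this thread. A minimal sketch:

```shell
# A fastq record is 4 lines, so reads = line count / 4
count_reads() {
  echo $(( $(zcat "$1" | wc -l) / 4 ))
}

# Placeholder paths: compare each sample's raw vs. trimmed read counts
# for f in raw/*.fastq.gz; do
#   sample=$(basename "$f")
#   echo "$sample: raw=$(count_reads "$f") trimmed=$(count_reads trimmed/"$sample")"
# done
```

A large drop after trimming usually means the primer sequence or cutadapt parameters need another look.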

# --p-front: your forward primer
# --p-adapter: the reverse complement of your reverse primer
qiime cutadapt trim-single \
  --i-demultiplexed-sequences demux-filtered-trim.qza \
  --p-front GTGAATCATCGAATCTTTGAA \
  --p-adapter GCATATCAATAAGCGGAGGA \
  --o-trimmed-sequences noadapters-all.qza

Regarding sequence clustering, you shouldn't need to use both dada2 and vsearch. dada2 groups your sequences into ASVs (amplicon sequence variants) after correcting your reads. It would be redundant, and could confound taxonomic assignment (especially for ITS analysis), to further cluster your ASVs into OTUs (operational taxonomic units). The only case where I could see this being relevant would be grouping sequences at 100% identity, to prevent variations in sequence length introduced by adapter trimming and quality truncation from inflating ASV counts, but I'm not sure this is regularly done.

I know that if I use closed-reference I won't have the problem with Unassigned sequences but I need to understand why it happens.

Closed reference OTU assignment discards sequences that are not identified in the database. Thus your unidentified sequences may still be biologically relevant, but you'd just discard them by doing closed-reference OTU assignment. Also it would be redundant to do classify-consensus-blast after cluster-features-open-reference using the same database.

I would redo cutadapt and pick a lower identity threshold for sequence clustering to start. You could also consider using another classifier such as classify-sklearn to identify unknown OTUs with a trained classifier. You may want to take another look at the ITS analysis tutorial.
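For reference, a classify-sklearn invocation would look roughly like this; the classifier artifact name is a placeholder (you would need a classifier trained on an ITS reference such as UNITE), not a file from this thread:

```shell
# Sketch only: unite-classifier.qza is a placeholder for a trained classifier
qiime feature-classifier classify-sklearn \
  --i-classifier unite-classifier.qza \
  --i-reads rep-seq-trim.qza \
  --o-classification taxonomy-sklearn.qza
```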

Hope this helps


Welcome to the forum, @MJ_Estrella !

Just to add some notes to @jsogin574 's excellent advice:

What is this file? The number of unassigned seqs is simply an indication that most do not have matches to your reference sequences, so if your reference sequences are incomplete that would explain it.

This is sort of true, but not entirely; closed-reference OTU clustering (the first step of open-reference clustering) finds the closest match in the reference database and that sequence (and its ID) is taken as the representative sequence in the output. Thus, the reference taxonomy can be used directly as annotations for those full-length reference sequences. However:

  1. sequences that do not hit a reference sequence are subject to de novo OTU clustering, for which the reference taxonomy does not provide a pre-packaged annotation. Taxonomy classification must be performed on these de novo OTUs.
  2. closed-ref OTU clustering just takes the first top hit as the representative, but what if there are multiple top hits? The number of top hits is not reported, so taking the reference sequence and taxonomy as representatives can be a bit presumptuous. The clustered sequences that map to each reference sequence are also output... they are given the reference sequence's ID so you could use the reference taxonomy, but classifying these fragments with q2-feature-classifier will give you a more conservative prediction of the taxonomic group(s) that it could belong to.

@jsogin574 is correct, this is in general the better method for taxonomic classification. But if classify-consensus-blast is reporting so many unknowns, it usually indicates an issue with the database or query sequences and classify-sklearn probably will not do much better. Nevertheless, switching methods is always a good first step to troubleshooting classification issues, to make sure that it is not an issue with the method.

Good luck @MJ_Estrella and thanks for the great post @jsogin574 !


Thanks for your input. Sorry for taking so long to answer, but I tried to do the same thing on two computers (RAM problems) because I couldn't use qiime feature-classifier classify-sklearn on my daily computer (I can't train a naive Bayes classifier on either machine, but I saw some pre-trained classifiers on the forum, so... no problem).

The file named heimer corresponded to the original database; I just gave it a geeky name for short. Sorry for not changing it in the post.

This is my taxa bar plot file. After DADA2 I used feature-classifier classify-sklearn and that's it. But I have the following question. denoising-demux.qzv (1.2 MB)

The majority group is classified only as Fungi, with no phylum (k__Fungi;;...). I have read on the forum that you described something similar to someone as junk/non-target DNA. Could I just delete these sequences without affecting the rest of the analysis, or should I try a more updated database? taxa-bar-plots-filtered.qzv (349.0 KB)

Any ideas?

What is the original database? UNITE? Which version?

Do you have non-fungal reference sequences in your database? It is true that you will often get classification of non-target reads at kingdom level only, if you have just one kingdom represented in the database. E.g., what you have is very likely plant DNA that is classifying as "Fungi" because you do not have plants in your reference database.

Classification otherwise looks good (i.e., you have mostly genus/species classifications of other sequences) so it seems like this is probably non-target DNA (e.g., plant or host DNA).

If they are non-target sequences, yes. But check first. Grab a few unassigned reads and blast them (on NCBI blast) just to be sure. If they are plant/host/non-fungal DNA, remove them and move on!
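The check-then-remove workflow above can be sketched as follows; the table and taxonomy artifact names are placeholders for your own files, and --p-include p__ keeps only features that have at least a phylum-level annotation:

```shell
# Export representative sequences to fasta so a few can be checked on NCBI BLAST
qiime tools export \
  --input-path rep-seq-trim.qza \
  --output-path exported-rep-seqs

# Drop the k__Fungi;;... kingdom-only features by keeping only
# features whose taxonomy string contains a phylum (p__).
# table.qza and taxonomy.qza are placeholders for your artifacts.
qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-include p__ \
  --o-filtered-table table-filtered.qza
```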

