qiime2 issue with unassigned reads

Hi, I am running QIIME 2 version 2021.11, installed with conda.

I have 176 paired-end samples that are Illumina-processed, Casava-formatted files. I am attempting to look at fungal communities in plant root and soil samples. I am training my own classifier using the UNITE database, and my primers are f/GITS7 and ITS4.
The problem I keep running into is that when I attempt to assign taxonomy, all but two of my sample features come back as unidentified. I have been troubleshooting this problem for the last week and have tried many of the fixes recommended to others who have posted about similar problems.


The code I've been running is this:
IMPORT DATA:
qiime tools import \
--type 'SampleData[PairedEndSequencesWithQuality]' \
--input-path RSD \
--input-format CasavaOneEightSingleLanePerSampleDirFmt \
--output-path demux-paired-end.qza

DEMULTIPLEX: samples are already demultiplexed and only need to be summarized and viewed

Quality filter and cluster the data into ESVs:

qiime dada2 denoise-paired \
--i-demultiplexed-seqs demux-paired-end.qza \
--p-trunc-len-f 265 \
--p-trunc-len-r 233 \
--o-table table.qza \
--o-representative-sequences rep-seqs.qza \
--o-denoising-stats denoising-stats.qza \
--p-n-threads 20

SEE DATA DISTRIBUTION:

qiime feature-table summarize \
--i-table table.qza \
--o-visualization table.qzv \
--m-sample-metadata-file MetadataRGP.tsv

qiime feature-table tabulate-seqs \
--i-data rep-seqs.qza \
--o-visualization rep-seqs.qzv

qiime metadata tabulate \
--m-input-file denoising-stats.qza \
--o-visualization stats-dada2.qzv

ASSIGN TAXONOMY:
qiime feature-classifier extract-reads \
--i-sequences sh_refs_qiime_ver8_99_s_10.05.2021.qza \
--p-f-primer GTGAATCATCGAATCTTTG \
--p-r-primer TCCTCCGCTTATTGATATGC \
--o-reads ref-seqs.qza

qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads ref-seqs.qza \
--i-reference-taxonomy sh_taxonomy_qiime_ver8_99_s_10.05.2021.qza \
--o-classifier classifier.qza

qiime feature-classifier classify-sklearn \
--i-classifier classifier.qza \
--i-reads rep-seqs.qza \
--o-classification taxonomy.qza

qiime metadata tabulate \
--m-input-file taxonomy.qza \
--o-visualization taxonomy.qzv

I've tried multiple variations of this, adding steps or swapping in other commands such as classify-consensus-vsearch, and ended up with the same results. This is the first project I've used QIIME 2 for. I really appreciate any help, thank you!

Hi @RissaGP and welcome to the forum!

One notable clue is that the "unidentified" annotation is present in the reference taxonomy itself, so you are basically getting a species-level hit but to a sequence (or sequences) that have useless annotation(s). q2-feature-classifier would just return blank annotations or "Unclassified" if it were failing to classify for some reason.

So we can eliminate the classification step from troubleshooting. The issue is clearly with the input database.
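One quick way to gauge how much of the reference carries this label is to export the taxonomy artifact and count the entries whose species rank is `s__unidentified`. The following is a minimal sketch, assuming the exported file is a tab-separated `taxonomy.tsv` with one header line, the taxon string in the second column, and UNITE-style rank prefixes. The function name `count_unidentified` and the directory name `ref-taxonomy` are made up for illustration.

```shell
# Export the reference taxonomy first (assumed to write taxonomy.tsv into ref-taxonomy/):
#   qiime tools export \
#     --input-path sh_taxonomy_qiime_ver8_99_s_10.05.2021.qza \
#     --output-path ref-taxonomy

# count_unidentified FILE
# Prints two numbers: <total entries> <entries ending in s__unidentified>
count_unidentified() {
  awk -F'\t' '
    NR == 1 { next }                          # skip the header line (assumed present)
    { total++ }                               # every remaining line is one reference entry
    $2 ~ /s__unidentified$/ { unident++ }     # species rank carries no real annotation
    END { printf "%d %d\n", total, unident }
  ' "$1"
}
```

Usage would look like `count_unidentified ref-taxonomy/taxonomy.tsv`. A large second number relative to the first suggests that many best hits will inevitably come back as "unidentified".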

On the one hand, we already know that the UNITE+INSDC database contains some bad annotations (and misannotated sequences). Removing these from the database, along with any abnormally short sequences, should improve classification to some extent, and we show this with UNITE here. RESCRIPt is the QIIME 2 plugin that can be used for filtering these out; see the examples linked in this article and also the tutorial on the forum.

On the other hand, we also recommend not trimming UNITE with primer sequences using the extract-reads method, as this can degrade performance if the reference sequences already have the primers removed. See the note here:
https://docs.qiime2.org/2022.2/tutorials/feature-classifier/#classification-of-fungal-its-sequences

As a starting point, I would suggest checking how many sequences you have before and after trimming, and inspecting the sequence length profiles (see RESCRIPt for a visualizer that gives you the counts, lengths, and length histogram). You would expect to lose many sequences (as many will be ITS1 only), but losing the majority would be unexpected and would point to a problem (e.g., the primers are already trimmed from the reference sequences). If you do this, could you share your QZVs from before and after trimming? I would be curious to see how many are lost, as I have not tested this primer set before.
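For a rough command-line check alongside the RESCRIPt visualizer, the before/after comparison can also be sketched with plain `awk` on exported FASTA files. This assumes each artifact has been exported with `qiime tools export` and that the export contains a FASTA file named `dna-sequences.fasta`; the function name `seq_length_summary` and the directory names in the usage note are hypothetical.

```shell
# seq_length_summary FASTA
# Prints: <number of sequences> <min length> <max length> <mean length>
# Handles sequences that are wrapped across several lines.
seq_length_summary() {
  awk '
    function record(l) {
      count++; sum += l
      if (count == 1 || l < min) min = l
      if (l > max) max = l
    }
    /^>/ { if (seen) record(len); seen = 1; len = 0; next }   # new header: close previous record
    { sub(/\r$/, ""); len += length($0) }                     # accumulate sequence length
    END {
      if (seen) record(len)                                   # close the final record
      if (count) printf "%d %d %d %.1f\n", count, min, max, sum / count
      else print "0 0 0 0.0"
    }
  ' "$1"
}
```

Running `seq_length_summary untrimmed/dna-sequences.fasta` and `seq_length_summary trimmed/dna-sequences.fasta` side by side shows how many references survive `extract-reads` and whether the surviving lengths look plausible for the amplicon.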

Ultimately I think the thing to do will be to just drop the extract-reads step and train your classifier on the full INSDC database (after filtering abnormally short sequences and bad annotations).
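As a rough sketch of that workflow (not a tested recipe), the steps could be chained as below. Everything specific here is an assumption for illustration: the output file names, the 100 bp minimum length, and the `p__unidentified` exclusion pattern are placeholders to adjust for your data, and the `qiime rescript` command requires the RESCRIPt plugin to be installed. The small `run` helper is only there so the commands can be printed with `DRY_RUN=1` before anything is executed.

```shell
# run CMD...: execute the command, or just print it when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# train_untrimmed_classifier REF_SEQS.qza REF_TAXONOMY.qza
# File names, the length threshold, and the exclusion pattern are illustrative only.
train_untrimmed_classifier() {
  # 1. Drop abnormally short reference sequences (threshold is an assumption).
  run qiime rescript filter-seqs-length \
    --i-sequences "$1" \
    --p-global-min 100 \
    --o-filtered-seqs ref-seqs-minlen.qza \
    --o-discarded-seqs ref-seqs-too-short.qza

  # 2. Drop references with uninformative annotations (pattern is an assumption;
  #    a broader pattern such as "unidentified" would remove many more).
  run qiime taxa filter-seqs \
    --i-sequences ref-seqs-minlen.qza \
    --i-taxonomy "$2" \
    --p-exclude p__unidentified \
    --o-filtered-sequences ref-seqs-filtered.qza

  # 3. Train on the filtered, untrimmed references (no extract-reads step).
  run qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads ref-seqs-filtered.qza \
    --i-reference-taxonomy "$2" \
    --o-classifier classifier-untrimmed.qza
}
```

For example, `DRY_RUN=1; train_untrimmed_classifier sh_refs_qiime_ver8_99_s_10.05.2021.qza sh_taxonomy_qiime_ver8_99_s_10.05.2021.qza` prints the three commands for review; unsetting `DRY_RUN` runs them.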

Good luck!


Hi @Nicholas_Bokulich, I really appreciate your advice!

What I ended up doing was using the dev UNITE database, which I then filtered using RESCRIPt. I did not trim the dev UNITE database to my primer sequences, and it did end up identifying the majority of the samples!
classified-metadata.tsv (1.7 MB)

Thank you for all your help!
