Feature-IDs in Taxonomy File Don't Match Classified Sequences


I'm trying to get VT numbers to line up as feature-IDs in my feature table after training a classifier with Maarjam files. I attached my feature-table and taxonomy files. My code is as follows:

qiime tools import \
  --type 'FeatureData[Taxonomy]' \
  --input-format HeaderlessTSVTaxonomyFormat \
  --input-path maarjam-tax-file2.txt \
  --output-path ref-taxonomy1.qza

qiime tools import \
  --type 'FeatureData[Sequence]' \
  --input-path maarjam-seqs.txt \
  --output-path maarjam-seqs.qza

qiime feature-classifier extract-reads \
  --i-sequences maarjam-seqs.qza \
  --o-reads ref-seqs.qza

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs.qza \
  --i-reference-taxonomy ref-taxonomy1.qza \
  --o-classifier classifier.qza

qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads rep-seqs-dn-97.qza \
  --o-classification 97datataxonomy.qza

qiime taxa filter-seqs \
  --i-sequences rep-seqs-dn-97.qza \
  --i-taxonomy 97datataxonomy.qza \
  --p-exclude Unassigned \
  --o-filtered-sequences rep-seqs-dn-97-no-unassigned.qza

qiime taxa filter-table \
  --i-table table-dn-97.qza \
  --i-taxonomy 97datataxonomy.qza \
  --p-exclude Unassigned \
  --o-filtered-table table-dn-97-no-unassigned.qza

The feature-table is almost exactly what I need, but the identifiers are the random string of characters from the classifier instead of the IDs in the source taxonomy file. I made sure that the IDs from the taxonomy file match the reference sequence file, so that is not the issue. Even if I wanted to manually convert the character-string IDs into the taxonomy IDs, there is no way to effectively do so as the classification in the taxonomy file isn't the same as the representative sequences.

table-dn-97-no-unassigned.qzv (1.3 MB) 97datataxonomy.qzv (1.3 MB) maarjam-tax-file2.txt (26.1 KB)

Hi @lnovak4 ,

It sounds like you want your feature IDs (i.e., the IDs of the sequences you observe in your samples) to be relabeled with the ID of the closest match in MaarjAM. This is possible, but not with the workflow you are using.

To clarify, the feature IDs in your table do not come from the classifier at all, nor are they random.

They are unique identifiers (MD5 hashes) assigned to each query sequence during denoising, not during taxonomy classification.

During classification the original feature ID is retained; the classifier does nothing with the feature IDs.
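To illustrate why such IDs look random but are not, here is a minimal sketch that hashes a sequence with MD5. It assumes the ID is simply the hex digest of the sequence string as described above; the function name and the two sequences are made up for illustration, and some workflows may use a different hash function.

```python
import hashlib


def md5_feature_id(sequence: str) -> str:
    """Return the hex MD5 digest of a sequence (32 hex characters)."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


# Hypothetical query sequences that differ by a single base.
seq_a = "ACGTACGTACGTTGCA"
seq_b = "ACGTACGTACGTTGCC"

id_a = md5_feature_id(seq_a)
id_b = md5_feature_id(seq_b)

# The same sequence always yields the same ID; a one-base change yields
# a completely different-looking ID.
print(id_a)
print(id_b)
```

The ID is therefore a deterministic fingerprint of the observed sequence, and it carries no information about any reference database entry.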

The classifier compares each query to the reference sequences to find the most likely taxonomic classification. Often, the query sequence matches (or is equally similar to) multiple reference sequences, so a consensus or most confident classification is given. This is why you see many, e.g., phylum-level classifications: most likely, that phylum cannot be resolved further using the marker gene that you selected. (Either that, or these query sequences do not match anything in the database very well because they are not AMF, are not fungi, etc.)

So what you are doing now is the intended operation, and is standard/best practice. But if you really want to do something inadvisable (i.e., something likely to lead to false positives or false assumptions about your data), it is possible to simply map your reads to the top hit in MaarjAM using closed-reference clustering. See the OTU clustering tutorial at docs.qiime2.org for how this is done and for more description, and see the overview tutorial for some of the caveats of this method.

Keep in mind that, based on the classification results you have shared, your query sequences clearly resemble many top hits that belong to different orders. So doing closed-reference clustering and taking the top hit's taxonomy and feature ID would give you very imprecise results, and I would strongly discourage this approach. I recommend sticking with your current approach; you can manually inspect the alignments to see what is going wrong.

Good luck!


Thank you! I will have to manually compare the unique IDs to feature IDs then.

Another question I have: I went through a handful of sequences from the final filtering and cross-referenced them to Maarjam by manually blasting. The blast results were different than the NB results and there were many more unassigned sequences when I manually blasted. I know that Maarjam has the tendency to over-classify things that are not AMF, but is there a reason this is only an issue with the trained classifier?

Yes. The model is trained on whatever sequences you feed it (e.g., only AMF sequences) and looks at k-mer frequency profiles, not sequence alignments. So unless you include outgroups in the reference database, it does not know what a non-fungal sequence looks like, and it can classify some sequences to a basal rank if there is some k-mer similarity to sequences in that clade.

BLAST, on the other hand, uses local alignment with a specific percent similarity threshold to rule out sequences that do not align well. This is a much slower process, but it checks actual percent similarity.
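As a rough illustration of what a k-mer profile is, the sketch below counts overlapping k-mers and reports what fraction of a query's k-mers also occur in a reference. This is a toy, not the naive Bayes classifier itself: the function names, the choice of k=4, and the sequences are all assumptions made for the example.

```python
from collections import Counter


def kmer_profile(seq: str, k: int = 4) -> Counter:
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmer_fraction(query: str, reference: str, k: int = 4) -> float:
    """Fraction of the query's k-mers that also occur in the reference."""
    q = kmer_profile(query, k)
    r = kmer_profile(reference, k)
    shared = sum((q & r).values())  # multiset intersection of k-mer counts
    total = sum(q.values())
    return shared / total if total else 0.0


# Hypothetical sequences: the query shares one short block (AGCTTAGG)
# with the reference and is otherwise unrelated.
reference = "ACGTTGCAAGCTTAGGCTA"
query = "TTTTTTTTAGCTTAGGTTTTTTT"

print(shared_kmer_fraction(query, reference))
```

A query like this still shares some k-mers with the reference even though it would align poorly end to end, which is the kind of partial similarity that can pull a non-target sequence into a basal classification when the reference contains no outgroups.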

With the naive Bayes classifier, it is often best to treat sequences that classify only to very shallow ranks (kingdom or phylum) as non-target sequences (e.g., host or other organisms) and discard them... unless you really expect to find a novel phylum in your samples!
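The idea of discarding shallow classifications can be sketched as follows. This assumes taxonomy strings are semicolon-delimited from kingdom downward, which may not match how your taxonomy file is formatted; the feature IDs, the helper names, and the depth cutoff of 3 are hypothetical.

```python
def classification_depth(taxon: str) -> int:
    """Count the non-empty ranks in a semicolon-delimited taxonomy string."""
    levels = [level.strip() for level in taxon.split(";")]
    # Skip empty levels and bare rank prefixes such as "p__".
    return sum(1 for level in levels if level and not level.endswith("__"))


def keep_deep_classifications(taxonomy: dict, min_depth: int = 3) -> dict:
    """Keep only features classified to at least min_depth ranks."""
    return {
        feature_id: taxon
        for feature_id, taxon in taxonomy.items()
        if classification_depth(taxon) >= min_depth
    }


# Hypothetical classification results keyed by feature ID.
taxonomy = {
    "feature-1": "Fungi;Glomeromycota;Glomeromycetes;Glomerales;Glomeraceae;Glomus",
    "feature-2": "Fungi;Glomeromycota",
    "feature-3": "Fungi",
    "feature-4": "Unassigned",
}

# With min_depth=3, anything resolved only to kingdom or phylum is dropped.
kept = keep_deep_classifications(taxonomy, min_depth=3)
print(sorted(kept))
```

The IDs that survive could then be used to subset the feature table and representative sequences; where to set the cutoff is a judgment call that depends on the database and marker gene.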


Wonderful, thank you!

I have some final questions: To get rid of non-target sequences, would the best approach be to filter by alignment if I'm using my own classifier, or would it be better to just filter by taxa? If I used alignment, is there a preference to using BLAST or vsearch in my situation?

Thanks so much for your help!

Using an alignment filter would probably be the safest approach. To be honest, both approaches probably lead to the same outcome (unless you really do expect to discover novel phyla), but that's based on my experience with 16S and with fungal ITS using the UNITE database. With MaarjAM the outcome might be different, so alignment would be the safer option.

We have just the method for it... check out the documentation at qiime quality-control exclude-seqs --help

qiime quality-control exclude-seqs can use BLAST or vsearch for alignment. I would recommend vsearch if only because it will be faster.

Good luck!
