Unable to map to 12S reference database

Running Qiime 24.5

Hello,
I'm attempting to map my reads to a reference database I created from 12S barcoded fish sequences. It works quite well, but when I make my barplot, several chunks are classified only to the family/genus level. However, the ASVs that remain at the family/genus level BLAST at 99-100% identity to known species, and those species and their barcodes are in my reference database. Why is Qiime failing to assign these reads to a reference sequence that is over 99% identical, leaving them at the family or genus level?

Commands used:
#For importing - I trimmed low quality reads and merged my paired end reads in GENEIOUS before importing to Qiime, hence the single end

qiime tools import \
  --input-path /Users/dakotabetz/Desktop/eDNA_AllDates/MARCH.tsv \
  --output-path fj-joined-demux2.qza \
  --type 'SampleData[SequencesWithQuality]' \
  --input-format SingleEndFastqManifestPhred33V2

#For denoising

qiime dada2 denoise-single \
  --i-demultiplexed-seqs /Users/dakotabetz/Desktop/eDNA_forward/fj-joined-demux2.qza \
  --p-trunc-len 180 \
  --o-table table-dada180.qza \
  --o-representative-sequences rep-seqs-dada180.qza \
  --o-denoising-stats denoising-stats-dada180.qza

#for taxonomy

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqsGENBANKDakota222.qza \
  --i-reference-taxonomy ref-taxonomyGENBANKDakotaFinal.qza \
  --o-classifier ref-classifierDakota.qza

qiime feature-classifier classify-sklearn \
  --i-classifier ref-classifierDakota.qza \
  --i-reads rep-seqs-dada180.qza \
  --o-classification taxonomyDakotaNEWNEW.qza

qiime taxa barplot \
  --i-table table-dada180.qza \
  --i-taxonomy taxonomyDakotaFInal2.qza \
  --o-visualization taxa-bar-plotsDakotaNEWFinal2.qzv

Hi @Dakota_Betz ,

Welcome to the forum!

Good question. The short answer: this is exactly how this classifier is intended to work when there are multiple close hits in the database. The classifier estimates a "confidence" of assignment for each possible taxon, rather than mapping to the closest match. This confidence score is based on the similarity of kmer profiles between your query sequence and the reference sequences. A low-confidence hit (resulting in classification only to genus, family, or an even coarser rank) occurs when the kmer profile matches multiple different clades. This can happen because many clades have similar kmer profiles, because of misannotated reference data, or because of low-quality query sequences (e.g., sequence errors or ambiguous nucleotides). When the confidence score for the top hit is below the threshold set by the confidence parameter (0.7 by default), no result is reported for that taxonomic rank, and the procedure repeats at the next higher rank (e.g., if no confident species hit is found, the process is repeated at the genus level, and so on).
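
To make the cutoff behaviour concrete, here is a toy sketch in shell/awk. It is not the actual q2-feature-classifier code, only a simplified model of the rank-by-rank cutoff described above. The function name truncate_lineage, the taxon labels, and the confidence values are all made up for illustration.

```shell
# Toy model of the confidence cutoff, NOT the real classifier implementation.
# Input (stdin): one tab-separated "label<TAB>confidence" line per rank,
#                ordered from the broadest rank down to species.
# $1:            the confidence threshold (e.g. 0.7).
# Output:        the lineage, cut off just above the first rank whose
#                confidence falls below the threshold.
truncate_lineage() {
  awk -F'\t' -v threshold="$1" '
    $2 + 0 < threshold + 0 { exit }            # first low-confidence rank: stop
    { lineage = lineage sep $1; sep = "; " }   # otherwise keep this rank
    END { print lineage }
  '
}

# Hypothetical ASV whose kmer profile fits two species of one genus about
# equally well, so confidence is high at genus level but low at species level.
printf 'f__FamilyA\t0.99\ng__GenusA\t0.97\ns__SpeciesA1\t0.52\n' |
  truncate_lineage 0.7
# prints: f__FamilyA; g__GenusA
```

With the same made-up input, a threshold of 0.5 would keep the species label, which is why the choice of threshold matters so much for closely related taxa.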

You can read more about the confidence score in the corresponding article about this method (see qiime feature-classifier --citations).

Your ASVs probably also align with high % identity to multiple species/clades. So a high % identity for the top hit does not, on its own, mean that it is a confident match.

If you really want to operate in a way similar to NCBI BLAST and take the top hit, it is possible to disable the confidence threshold parameter (see the help documentation for more details). This will then report whichever species has the highest confidence score. But will you be confident in such a result? :thinking:
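
For reference, a minimal sketch of how the confidence setting can be changed, assuming the classifier and representative-sequence files from the commands earlier in this thread. The --p-confidence option of classify-sklearn accepts either a number between 0 and 1 or the string disable. The function name and the output file names are hypothetical placeholders.

```shell
# Sketch only: wraps the classify-sklearn call from this thread so that the
# confidence setting can be varied. Input file names are taken from the
# commands above; the function name and output names are hypothetical.
#   $1 = value for --p-confidence (a number from 0 to 1, or "disable")
#   $2 = output taxonomy artifact (.qza)
classify_with_confidence() {
  qiime feature-classifier classify-sklearn \
    --i-classifier ref-classifierDakota.qza \
    --i-reads rep-seqs-dada180.qza \
    --p-confidence "$1" \
    --o-classification "$2"
}
```

Calling classify_with_confidence disable taxonomy-top-hit.qza would report the top-scoring taxon for every ASV, while something like classify_with_confidence 0.5 taxonomy-conf05.qza would only relax the cutoff. Either result still needs to be checked against sequences of known origin before it is trusted.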

I hope that helps clarify!

Hi Nicholas,
Thank you for your very helpful reply. So is there any way to further classify these family-level identities, or would it be best to just add BLAST results for the ASVs binned in these "__Family" classifications?

You could try adjusting the confidence setting. The default was optimized for 16S rRNA gene data; 12S rRNA might have different characteristics, and a lower confidence threshold may be more appropriate. However, you would need a ground truth to validate this (the RESCRIPt plugin has some actions that can automate simple testing).

But I would also try cleaning up the database a bit. I suspect that misannotated or incompletely annotated reference sequences may also be present, which could create this issue.
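
As a rough starting point, the two steps could look something like the sketch below. It assumes the RESCRIPt plugin is installed in the active QIIME 2 environment and reuses the reference file names from the fit-classifier command earlier in the thread. The function names and all output names are hypothetical placeholders.

```shell
# Sketch only: requires the RESCRIPt plugin. Input names come from the
# fit-classifier command in this thread; function names and output names
# are hypothetical placeholders.

# Collapse identical reference sequences that carry identical taxonomy.
# In "uniq" mode, identical sequences with different taxonomy are all kept,
# which makes conflicting annotations easier to spot afterwards.
dereplicate_reference() {
  qiime rescript dereplicate \
    --i-sequences ref-seqsGENBANKDakota222.qza \
    --i-taxa ref-taxonomyGENBANKDakotaFinal.qza \
    --p-mode uniq \
    --o-dereplicated-sequences ref-seqs-derep.qza \
    --o-dereplicated-taxa ref-taxonomy-derep.qza
}

# Train a classifier on the dereplicated reference and classify those same
# reference sequences, as a best-case check of how deeply they resolve.
evaluate_reference() {
  qiime rescript evaluate-fit-classifier \
    --i-sequences ref-seqs-derep.qza \
    --i-taxonomy ref-taxonomy-derep.qza \
    --o-classifier ref-classifier-derep.qza \
    --o-evaluation ref-classifier-evaluation.qzv \
    --o-observed-taxonomy ref-observed-taxonomy.qza
}
```

Reference sequences whose observed taxonomy comes back shallower than their expected taxonomy are good candidates for a closer look, since the classifier cannot separate them from their neighbours even in this best-case test.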

As for adding BLAST results for those ASVs, I would discourage this for the same reasons stated above. A 99-100% match between your query and a reference sequence does not make that assignment correct, as there may be multiple top hits.

Good luck!
