How to resolve missing OTUs in Qiime2 analysis despite finding through BLAST?

qiime version: qiime2-2022.11
conda environment

Hi everyone,

Starting from the sequences obtained from an Illumina MiSeq 300x2 paired end run, I performed an analysis on eDNA using qiime2 and the 16s primer for invertebrates. I also performed the same analysis using another software (OBItools3) on the same dataset.
What I noticed is that:

  1. the results from qiime2 are more "abundant" than those from obi, meaning I am able to obtain a greater number of identified OTUs.
  2. the results from obitools, in terms of sequence quantity, are one-fifth of those from qiime2.

However, a problem emerged: a key OTU in my dataset (Chamelea gallina, which was present in the obi results) never appears in the qiime analysis. This is strange because I am certain that the sequence of that species is present in my dataset, and it is also quite abundant.

To verify this, I extracted the unassigned sequences from my dataset after denoising with dada2:

qiime taxa filter-seqs
--i-sequences rep-seqs.qza
--i-taxonomy Vsearch-taxonomy.qza
--p-include "Unassigned"
--o-filtered-sequences unassigned_seqs.qza

I then blasted (blastn 2.13.0+) a Chamelea query obtained from the obi pipeline using the unassigned sequences as a reference database. The result is a 100% match with no gaps and multiple matches. Furthermore, I verified that the sequence of the OTU was present in the reference database used for taxonomic identification.

Starting from the denoised sequences, my commands were:

qiime dada2 denoise-paired
--i-demultiplexed-seqs seqs.qza
--p-trim-left-f 33
--p-trim-left-r 33
--p-trunc-len-f 150
--p-trunc-len-r 150
--p-max-ee-f 2.0
--p-max-ee-r 2.0
--p-chimera-method 'consensus'
--p-n-reads-learn 1000000
--p-n-threads 7
--o-table table.qza
--o-representative-sequences rep-seqs-150f-150r-miseq.qza
--o-denoising-stats denoising-stats-miseq.qza

The value of '--p-trunc-len' is set to 150 to compare this dataset with another Illumina 150x2 paired-end dataset. In any case, the sequences of interest fall within the 300bp range. I performed this step on a subset by setting all possible values, and 150 should not qualitatively influence the results.

qiime feature-classifier classify-consensus-vsearch
--i-query rep-seqs-150f-150r-miseq.qza
--i-reference-reads ncbi-16s-derep-extracted-14-03-2023.qza
--i-reference-taxonomy ncbi-16s-taxa-derep-14-03-2023.qza
--p-perc-identity 0.97
--p-threads 7
--o-classification taxonomy-vsearch-150f-150r-miseq.qza
--o-search-results vsearch_tophits-150f-150r-miseq.qza

I cannot figure out how my classify-consensus-vsearch is missing this OTU.

Additionally, I hypothesized that one of the most abundant OTUs in the dataset could somehow be confused with Chamelea, leading to misidentification, but the two species only match at 60-70%.Is there something stupid that I'm missing?

I'm sorry for my long message, but there could be some useful details.
Thank you for any help.

Hi @Francesco_Martino

And what is the coverage?

One thought is that that you are setting the %id too high in QIIME 2/vsearch:

97% is very high, esp. if the coverage is < 100% (keep in mind that you are comparing the local aligner BLAST vs. the global aligner VSEARCH). You can also adjust the coverage threshold with classify-consensus-vsearch (but first I would inspect the coverage in the blast results)

1 Like

Thank you for the reply @Nicholas_Bokulich!

This is the output of the blast (%id 0.97):
chamelea_inunassignedb_97.txt (9.3 KB)

In this analysis we want an high id% to discriminate between similar species. By the way, in OBItools the %id was the same. Could it be a problem with global allignment? In the obi analysis the coverage for the Chamelea was pretty high, so even if i didn't fully trust the obi results, it seems odd to me that is completly different in the Qiime2 output.

Hi @Francesco_Martino ,

Thanks for sharing the blast output. Your QIIME 2 unclassified sequences are ~220 nt long, but the query Chamelea gallina sequence is only 167 nt long (and yet only aligns to 100-160 nt of each reference, so does not have 100% coverage). The question now is how much coverage you have between the QIIME 2-output sequences and the NCBI reference sequences.

You can use qiime feature-classifier vsearch-global to generate a report on all hits to the unassigned sequences. Adjust --p-query-cov to reduce the coverage threshold and use the default %id. This will be useful for troubleshooting.

The %id is the lower threshold for accepting a hit, it is not an upper threshold. Vsearch will rank hits by kmer similarity before aligning to find the best hits. So setting a lower %id will only lead to lower-quality hits if a higher-ranked hit does not exist. So don't be afraid of setting a lower %id. Nevertheless, the issue here seems to be coverage, not %id.

What was the coverage with obitools? (not %id)

Not a problem — just a difference (i.e., if you are not enforcing a % coverage threshold with BLAST then this could lead to an even bigger but cryptic problem, i.e., of misclassification due to high %id on a low % coverage sequence). You should adjust the % coverage lower threshold (but first test as described above). You could also run classify-consensus-blast if you want to perform taxonomic classification with blast + LCA consensus.

1 Like

Thank you so much for the detailed replies!

The query in the reference database is 169 nt long. The unassigned one is 219 nt long, but only 161 nt match with 99%id... you were completly right!

Based on my observation, to identify this specific query the coverage should be about 74%, so I'm going to set the '--p-query-cov' to 0.7, and try mantaining the same %id. If doesn't work, i'll try with the default percentage.

Thank you very much! You have been immensely helpful.

1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.