Low level taxonomic assignment when using open-reference clustering

Hi,

I am trying to cluster the unique sequences produced by DADA2 with the vsearch plugin. Here is my command:
#------------ open ref --------------
qiime vsearch cluster-features-open-reference \
  --i-table ../table.qza \
  --i-sequences ../rep-seqs.qza \
  --i-reference-sequences ../classifierTrain/gg_13_8_otus/rep_set/85_otus.qza \
  --p-perc-identity 0.85 \
  --o-clustered-table ./openRef/gg-85/table-ref-85.qza \
  --o-clustered-sequences ./openRef/gg-85/cluster-rep-seq-ref-85.qza \
  --o-new-reference-sequences ./openRef/gg-85/new-rep-seq-ref-85.qza

I then classify the clustered features with a classifier trained on the Greengenes files 85_otus.fasta and 85_otu_taxonomy. The classifier was trained with:

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ../classifierTrain/gg_13_8_otus/rep_set/85-otus-seqs.qza \
  --i-reference-taxonomy ../classifierTrain/gg_13_8_otus/taxonomy/85-otu-taxonomy.qza \
  --o-classifier ../classifierTrain/classifier-85-paired.qza

The problem is that after assignment, all OTUs are assigned only at the 'Bacteria' level, with no deeper levels (e.g., genus), even for OTUs that come from the Greengenes database. See the following:

With closed-reference or open-reference clustering at 97%, the assignments all look fine. Any idea why the assignment doesn't work with open-reference clustering at 85%?

Thanks in advance.

Dong

Hi @lindd,
The reference sequences used for feature classification do not need to match the OTU % id used for OTU picking. In fact, I would never use 85% OTU representative sequences for sequence classification of real data (we use the 85% in the tutorial to speed things up, but not for real analysis). I would always use the 99% OTUs for training classifiers.

So your results could be simply due to training a very uninformative classifier. Classify all query sequences with the 99% OTUs and see if that improves matters.
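In case it is useful, here is a rough sketch of what that could look like. The function name `train_and_classify`, the output file names, and all paths are placeholders I made up for illustration, so adjust them to your own files; only the `qiime` subcommands and their options are real.

```shell
# Sketch only: train a naive Bayes classifier on 99% OTU references, then
# classify the clustered representative sequences with it.
# All paths passed in are placeholders, not the poster's actual files.
train_and_classify() {
  local ref_seqs=$1    # 99% OTU reference reads (.qza)
  local ref_tax=$2     # matching 99% OTU taxonomy (.qza)
  local query_seqs=$3  # clustered representative sequences (.qza)
  local out_dir=$4     # directory for the classifier and the taxonomy

  mkdir -p "$out_dir" &&
  qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads "$ref_seqs" \
    --i-reference-taxonomy "$ref_tax" \
    --o-classifier "$out_dir/classifier-99.qza" &&
  qiime feature-classifier classify-sklearn \
    --i-classifier "$out_dir/classifier-99.qza" \
    --i-reads "$query_seqs" \
    --o-classification "$out_dir/taxonomy-99.qza"
}
```

You would call it as, e.g., `train_and_classify 99-otus-seqs.qza 99-otu-taxonomy.qza cluster-rep-seq-ref-85.qza ./classified` (the first two file names are hypothetical). Note that the query sequences can still be your 85% clusters; only the classifier's training set changes.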

I hope that helps!

Thanks @Nicholas_Bokulich. I have tried classifying the 85% clustered OTUs with a classifier trained on the 99% OTUs from the database, and got the following results. It looks like the OTUs that come from the database are classified well, but the others still cannot be classified to a fine level. Do you have any explanation for this?

Thanks.

Dong

Sounds like you are having similar issues to this forum post. The 99% classifier is obviously working, as many of your features are being classified deeply.

There is also a very clear pattern here that suggests that you have the same problem as described in that other thread (non-target DNA): the closed-reference OTUs (these have numbers as IDs) are all classified, but none of the open-reference OTUs (IDs are long strings of letters/numbers) are being classified.
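If you want to check that pattern with numbers rather than by eye, something like the following might help. It is only a sketch: it assumes you have exported the taxonomy artifact to a tab-separated `taxonomy.tsv` (for example with `qiime tools export`), that the file has a header row with feature ID in column 1 and taxon in column 2, and that purely numeric IDs are the reference-derived OTUs. The function name `summarize_by_id_type` is made up.

```shell
# Sketch: count features by ID type and how many are classified below the
# top rank. Assumes a tab-separated taxonomy file with one header line,
# feature ID in column 1 and a ";"-delimited taxon string in column 2.
summarize_by_id_type() {
  awk -F'\t' '
    NR > 1 {
      # Purely numeric IDs are treated as reference OTUs, anything else
      # (long letter/number strings) as de novo clusters.
      kind = ($1 ~ /^[0-9]+$/) ? "reference" : "de_novo"
      total[kind]++
      # More than one rank in the taxon string counts as "deep".
      if (split($2, ranks, ";") > 1) deep[kind]++
    }
    END {
      for (k in total) printf "%s\t%d\t%d\n", k, total[k], deep[k]
    }
  ' "$1" | sort
}
```

Each output line is `ID type`, `number of features`, `number classified below the top rank`. If the de novo row shows almost nothing classified deeply while the reference row is nearly complete, that matches the pattern described above.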

I recommend BLASTing some of the unclassified OTUs to see if they match non-target DNA such as host DNA (e.g., if your samples are from human, mouse, or plant). PhiX contamination is another possibility. Check out that other forum post for a good discussion of:

  1. different problems that can cause this issue
  2. how to diagnose the problem
  3. how to fix the problem by filtering
  4. how to prevent the problem to begin with (e.g., with deblur or dada2)
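For the diagnosing and filtering steps, a minimal sketch could look like this. It assumes Greengenes-style rank prefixes, so that "has a phylum annotation" can be approximated by the taxon string containing `p__`. The two function names and the output file names are invented for illustration, and I have not run this on your data.

```shell
# Sketch: keep only features whose taxonomy contains a phylum label.
# Assumes Greengenes-style "p__" prefixes in the taxonomy strings.
keep_phylum_annotated() {
  local table=$1 taxonomy=$2 out=$3
  qiime taxa filter-table \
    --i-table "$table" \
    --i-taxonomy "$taxonomy" \
    --p-include p__ \
    --o-filtered-table "$out"
}

# Sketch: collect the sequences WITHOUT a phylum label and export them
# to a directory containing a FASTA file that can be submitted to BLAST.
export_unclassified_for_blast() {
  local seqs=$1 taxonomy=$2 out_dir=$3
  mkdir -p "$out_dir" &&
  qiime taxa filter-seqs \
    --i-sequences "$seqs" \
    --i-taxonomy "$taxonomy" \
    --p-exclude p__ \
    --o-filtered-sequences "$out_dir/unclassified-seqs.qza" &&
  qiime tools export \
    --input-path "$out_dir/unclassified-seqs.qza" \
    --output-path "$out_dir/fasta"
}
```

I would run `export_unclassified_for_blast` first and look at a handful of BLAST hits before filtering anything, so that you know what you are throwing away.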

I hope that helps!
