Must I classify again after clustering with cluster-features-open-reference?

Dear QIIME 2 community,

I have a question about taxonomy classification after clustering with cluster-features-open-reference.

I use cluster-features-open-reference after dada2 to cluster sequences, with gg_13_8 as the reference. Do I need to classify the whole dataset against gg_13_8 again? I understand why the new sequences need a taxonomy classifier, but what about those already identified in the reference?

I have tried the following three classifiers:

  1. classify-consensus-blast,
  2. classify-consensus-vsearch,
  3. classify-sklearn (using the Greengenes 13_8 99% OTUs full-length sequences classifier downloaded from the QIIME 2 data resources as the trained classifier).

The results from the first two are quite similar, but the taxonomy they assign is not the same as the taxonomy that the feature's Greengenes ID points to.

From our wet lab information, we know that some specific samples contain spike-ins (the two bacteria below):

  1. k__Bacteria;p__Bacteroidetes;c__Flavobacteriia;o__Flavobacteriales;f__Flavobacteriaceae;g__Imtechella;s__halotolerans;
  2. k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae;

Their IDs in the Greengenes database are 4416544 and 361710, respectively.

We can see both of them after cluster-features-open-reference. But after classify-consensus-vsearch or classify-consensus-blast, the same IDs are assigned to:

  1. k__Bacteria;p__Bacteroidetes;c__Flavobacteriia;o__Flavobacteriales;f__Flavobacteriaceae;
  2. k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus

The families are the same in both cases, but the assignments are clearly different.

This may not matter much in research projects, but it really counts in clinical tests, and ours is the latter.

(BTW, the third one does not perform as well: it leaves many OTUs unidentified, so I don't think we will choose it.)

I have two questions:

  1. Why is the taxonomy information different after cluster-features-open-reference and after classify-consensus-vsearch? Which one should we believe?
  2. Given this, do we have to use the results from classify-consensus-vsearch? Or can we combine the following as our final taxonomy result?
    1). features that have a reference ID from cluster-features-open-reference
    2). features unassigned in open-reference, but identified by classify-consensus-vsearch
    3). the remaining bacteria, left unassigned by both cluster-features-open-reference and classify-consensus-vsearch

Thank you so much in advance!

Hello @JoyHe,

Welcome to the forums! I can explain what's going on, and why you may want to reconsider your taxonomy results, especially for clinical testing.

cluster-features-open-reference is a pipeline that builds OTUs in two passes:

  1. Count reads that match OTUs in the reference database
  2. Cluster the reads that do not match de novo into new OTUs

That first pass only returns OTUs already in your database, along with their existing taxonomy.
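
To make the two passes concrete, here is a minimal toy sketch in plain Python. It is not how QIIME 2 or vsearch implement the pipeline: the `identity` function, the 0.9 threshold, the reference IDs `refA`/`refB`, and the short sequences are all made-up assumptions for illustration.

```python
def identity(a, b):
    """Toy stand-in for alignment: fraction of matching positions.

    Assumption: real tools align sequences; here unequal lengths score 0.
    """
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def open_reference(reads, reference, threshold=0.9):
    """Return {otu_id: read_count} using a two-pass open-reference scheme."""
    counts = {}
    unmatched = []

    # Pass 1: count reads that match an OTU already in the reference.
    for read in reads:
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in reference.items():
            score = identity(read, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            counts[best_id] = counts.get(best_id, 0) + 1
        else:
            unmatched.append(read)

    # Pass 2: greedily cluster the leftover reads de novo into new OTUs.
    centroids = {}
    for read in unmatched:
        for name, seq in centroids.items():
            if identity(read, seq) >= threshold:
                counts[name] += 1
                break
        else:
            name = f"denovo{len(centroids)}"
            centroids[name] = read
            counts[name] = 1

    return counts


# Hypothetical reference and reads (placeholder sequences, not real 16S data).
reference = {"refA": "ACGTACGTAC", "refB": "TTTTGGGGCC"}
reads = ["ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC", "GGGGGGGGGG", "GGGGGGGGGG"]
result = open_reference(reads, reference)
print(result)  # {'refA': 2, 'refB': 1, 'denovo0': 2}
```

Only `refA` and `refB` carry taxonomy from the database here; `denovo0` is a new OTU with no taxonomy until a classifier assigns one.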

All three classify-* methods you mention ignore this existing taxonomy and provide a new classification.

As you noticed, some of these new classifications do not go to the same classification level as found in the database. :scream_cat:
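
One way to see where the two sources of taxonomy diverge is to compare the strings rank by rank. Below is a minimal sketch in plain Python, assuming Greengenes-style `k__...;p__...` strings. The helper names `ranks` and `compare` are hypothetical, and the two example pairs are the spike-in taxonomies quoted in the question above.

```python
def ranks(taxonomy):
    """Split a Greengenes-style string into its non-empty ranks."""
    parts = [p.strip() for p in taxonomy.split(";")]
    # Drop blanks and empty ranks such as 'g__' or 's__'.
    return [p for p in parts if len(p) > 3]


def compare(reference_tax, classifier_tax):
    """Describe how a classifier's taxonomy relates to the reference one."""
    ref, new = ranks(reference_tax), ranks(classifier_tax)
    shared = 0
    for r, n in zip(ref, new):
        if r != n:
            break
        shared += 1
    if ref == new:
        return "same"
    if shared == len(new):
        return "classifier_shallower"  # classifier stops at a higher rank
    if shared == len(ref):
        return "classifier_deeper"  # classifier goes further than the reference
    return "conflict"  # the two disagree at some shared rank


# Feature 4416544: reference taxonomy vs. consensus classifier result.
spike_1 = compare(
    "k__Bacteria;p__Bacteroidetes;c__Flavobacteriia;o__Flavobacteriales;"
    "f__Flavobacteriaceae;g__Imtechella;s__halotolerans;",
    "k__Bacteria;p__Bacteroidetes;c__Flavobacteriia;o__Flavobacteriales;"
    "f__Flavobacteriaceae;",
)

# Feature 361710: reference taxonomy vs. consensus classifier result.
spike_2 = compare(
    "k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae;",
    "k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus",
)

print(spike_1, spike_2)  # classifier_shallower classifier_deeper
```

Neither spike-in is a true conflict in this sketch: the classifier and the database agree on every rank they both report and differ only in how deep they go.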

This is the question! Check out the benchmark paper for these classifiers:

But surely you can trust the classification in the database?! :exploding_head: :point_down:

~20% of taxonomy annotations in SILVA and Greengenes are wrong
Accuracy of taxonomy prediction for 16S rRNA and fungal ITS sequences [PeerJ]

This is the core challenge of marker gene classification. Check out those papers and let us know if they convince you a method is better.
