Understanding q2-clawback weights learning

Hi, all

I have a classifier, sequence, and taxonomy files obtained from the GTDB database for 16S V4 microbiome analysis. Call them gtdb_clf.qza, gtdb_seqs.qza, and gtdb_tax.qza.

I want to use this classifier to classify human gut samples. As people on this forum suggested, this could be done with the q2-clawback tool, which can adjust the classifier's weights so that, when analyzing human samples, I would practically never get species that have never been found in humans.

As I understand from the tutorial "Using q2-clawback to assemble taxonomic weights", I need to run something like this:

qiime clawback assemble-weights-from-Qiita \
  --i-classifier gtdb_clf.qza \
  --i-reference-taxonomy gtdb_tax.qza \
  --i-reference-sequences gtdb_seqs.qza \
  --p-metadata-key empo_3 \
  --p-metadata-value "Animal distal gut" \
  --p-context <Illumina_16S_V4_samples> \
  --o-class-weight gtdb_animal_gut_clf-weights.qza

As I understand it, this will classify all <Illumina_16S_V4_samples> with gtdb_clf.qza, compute statistics on the classifications, and from those statistics produce the weights gtdb_animal_gut_clf-weights.qza, from which a bespoke classifier can be built.

My understanding of the idea is that species which live only in plants, for example, will get zero weight, so when classifying human samples there will be zero probability of that kind of misclassification.

So my first question: have I understood the idea and the process correctly?

My second question concerns possible misclassification between "Animal distal gut" and "Human gut". Among the empo_3 habitat types (used in the q2-clawback tutorial) I did not find a human-specific habitat type, only animal ones. Would that lead to misclassification? Maybe some other metadata could be used instead of empo_3 for my purposes. I am open to all your suggestions :slightly_smiling_face:

Thank you for your attention

Hi @biojack, thanks for your interest in q2-clawback.

Your understanding is basically correct, except that no prior weights are set to exactly zero; they’re just made very small, so there is still a possibility of classification as any of the taxa in the database. It is also impossible to reduce the probability of misclassification to zero, but using weights certainly helps.
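
To illustrate the "very small but never zero" point, here is a minimal Python sketch. It is not q2-clawback's actual normalization; the function name, the pseudocount value, and the taxon names are all assumptions made up for this example. It only shows how adding a small constant to every taxon before normalizing keeps unobserved taxa at a non-zero weight.

```python
from collections import Counter


def smoothed_weights(observed_counts, all_taxa, pseudocount=1e-3):
    """Turn observed taxon counts into weights that sum to 1.

    Every taxon in all_taxa gets `pseudocount` extra mass (an assumed value),
    so taxa that were never observed keep a small, non-zero weight.
    """
    raw = {taxon: observed_counts.get(taxon, 0) + pseudocount
           for taxon in all_taxa}
    total = sum(raw.values())
    return {taxon: value / total for taxon, value in raw.items()}


# Hypothetical counts from classified gut samples; names are illustrative only.
counts = Counter({"Bacteroides": 700, "Faecalibacterium": 300})
reference_taxa = ["Bacteroides", "Faecalibacterium", "PlantOnlyTaxon"]

weights = smoothed_weights(counts, reference_taxa)
for taxon, weight in weights.items():
    print(f"{taxon}: {weight:.8f}")
```

In this toy case the taxon that was never observed ends up with a tiny weight rather than zero, so a read that matches it very well could still be assigned to it.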

As it turns out, we have some pre-calculated weights using only human stool samples for GTDB classifiers in our online repository here. For instance, GTDB human stool weights for 515f-806r amplicons are available here.

If that doesn’t match the version that you’re using or you have any other issues, you can build human stool weights from scratch using the very last command in the tutorial that you already mentioned. (Down below the Stilton example.)

I hope that helps, please don’t hesitate to ask if you have any further questions.

@BenKaehler yes, that's already helped a lot, thank you!

In fact I use the GTDB-202 release, but I would like to have a look at the weights you linked too (GTDB-89). How can I view the weights as a plain text file? I tried qiime tools export, but it said that the file is not a QIIME artifact.

I would certainly like to build weights for GTDB-202. I looked at the Stilton example and tried to follow it. I ran
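
A QIIME 2 artifact (.qza) is a zip archive, so one way to diagnose a "not a QIIME artifact" message is to check whether the downloaded file is a valid zip at all. The sketch below uses only Python's standard zipfile module; the file name is hypothetical. One possible cause, stated here as an assumption, is that a web page was saved instead of the raw file, in which case the first bytes would look like HTML.

```python
import os
import zipfile


def describe_artifact(path):
    """Return the member names of a .qza file, or None if it is not a zip archive."""
    if not zipfile.is_zipfile(path):
        return None
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


def first_bytes(path, n_bytes=200):
    """Read the start of a file, e.g. to spot an HTML page saved by mistake."""
    with open(path, "rb") as handle:
        return handle.read(n_bytes)


if __name__ == "__main__":
    path = "downloaded_weights.qza"  # hypothetical file name
    if os.path.exists(path):
        members = describe_artifact(path)
        if members is None:
            print("Not a zip archive; file starts with:", first_bytes(path))
        else:
            print("\n".join(members))
```

If the file is a real artifact, the listing should show a top-level directory named after the artifact's UUID, with the payload under its data/ subdirectory. The payload is not necessarily plain text.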

redbiom search metadata "where host_taxid==9606 and (sample_type=='stool' or sample_type=='Stool')" > sample_ids

then

redbiom fetch samples --from sample_ids --context Deblur-Illumina-16S-V4-150nt-780653 --output samples.biom

and got

ValueError: Unknown context: Deblur-Illumina-16S-V4-150nt-780653
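
One possible cause of this error, and it is only an assumption, is that the context name from the tutorial no longer exists on the server. The redbiom CLI can list the contexts that are currently available with `redbiom summarize contexts`. The Python sketch below wraps that command and filters the listed names by keyword. The helper names are hypothetical, and the parser assumes the context name is the first whitespace-separated field on each line, with a header line starting with `ContextName`.

```python
import subprocess


def list_contexts():
    """Run `redbiom summarize contexts` and return its text output."""
    result = subprocess.run(
        ["redbiom", "summarize", "contexts"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def matching_contexts(summary_text, keywords):
    """Return context names that contain every keyword (case-insensitive)."""
    names = []
    for line in summary_text.splitlines():
        fields = line.split()
        # Skip blank lines and the assumed header line.
        if not fields or fields[0] == "ContextName":
            continue
        name = fields[0]
        if all(keyword.lower() in name.lower() for keyword in keywords):
            names.append(name)
    return names


# Example usage (requires redbiom to be installed and network access):
# print(matching_contexts(list_contexts(), ["Deblur", "16S", "V4", "150nt"]))
```

A name returned by this filter could then be passed to --context in the redbiom fetch command above.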

Hi, all

People on the forum advised that a bespoke classifier (a classifier with non-uniform class weights, where the weights are proportional to the observed taxon frequencies in the domain of interest) could give much better taxon detection at the species level.

For example, the article "Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2’s q2-feature-classifier plugin" (PMC) reports excellent results on mock communities. Moreover, there is the q2-clawback plugin, which could help build such a classifier (see "Using q2-clawback to assemble taxonomic weights").

So this looks like a great opportunity. But I have some concerns:
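
To make the role of class weights concrete, here is a toy Python sketch of Bayes' rule with two taxa. It is not the q2-feature-classifier implementation; the taxon names, likelihoods, and weights are invented for illustration. The class weights act as priors: the posterior for each taxon is proportional to the likelihood of the read times the weight.

```python
def posterior(likelihoods, weights):
    """Combine per-taxon likelihoods with class weights (priors) via Bayes' rule."""
    unnormalized = {taxon: likelihoods[taxon] * weights[taxon]
                    for taxon in likelihoods}
    total = sum(unnormalized.values())
    return {taxon: value / total for taxon, value in unnormalized.items()}


def best_taxon(likelihoods, weights):
    """Return the taxon with the highest posterior probability."""
    post = posterior(likelihoods, weights)
    return max(post, key=post.get)


# Hypothetical weights: uniform versus weights learned from gut samples.
uniform = {"gut_taxon": 0.5, "soil_taxon": 0.5}
bespoke = {"gut_taxon": 0.99, "soil_taxon": 0.01}

# An ambiguous read: the sequence evidence only slightly favors the soil taxon.
ambiguous_read = {"gut_taxon": 0.40, "soil_taxon": 0.60}
# A clear read: the sequence evidence strongly favors the soil taxon.
clear_read = {"gut_taxon": 0.001, "soil_taxon": 0.999}

print(best_taxon(ambiguous_read, uniform))
print(best_taxon(ambiguous_read, bespoke))
print(best_taxon(clear_read, bespoke))
```

With these made-up numbers the weights flip the call for the ambiguous read, but the clear read is still assigned to the low-weight taxon, because its weight is small rather than zero.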

  • Could the dataset used to obtain the weights also introduce a technical or biological batch effect into those weights?
  • Suppose we have a dataset called A. We analyze dataset A with a uniform classifier and then construct a bespoke classifier from the read counts. What happens if we repeat this process on dataset A, but with the bespoke classifier as input? Will the process stabilize after, say, 10 such iterations? Has anyone checked this?
  • Would it be possible to catch rare pathogens with a bespoke classifier?

Also, an optional question related to the article mentioned above. The article says:

Where we have set the class weights to the known taxonomic composition of a sample, we have labeled the results “bespoke”

As I understand it, this means that non-zero weights were assigned only to the species known in advance to be in the mock community (20 species). So there is almost no way for the classifier to go wrong. Is that a correct understanding? If so, do you still think the results of this article confirm the benefits of a bespoke classifier? In a real analysis there is no such prior knowledge.

Thank you very much for your attention.

@Nicholas_Bokulich, hi!

Could you move my previous reply back to the topic I created, "Concerns about bespoke classifier"?

These are two different topics: the current one is about the technical understanding of q2-clawback and its capabilities, and the new one is about the general limitations of, and concerns about, the bespoke approach. Mixing the two is not the best idea.