I have classifier, sequence, and taxonomy files built from the GTDB database for 16S V4 microbiome analysis. Call them gtdb_clf.qza, gtdb_seqs.qza, and gtdb_tax.qza.
What I want is to use this classifier to classify human gut samples. As people on this forum suggested, this could be done with the q2-clawback tool, which can adjust the classifier's weights so that, when analyzing human samples, I would practically never get species that have never been observed in humans.
As I understand it, it will classify all <Illumina_16S_V4_samples> with gtdb_clf.qza, compute statistics from that classification, and from those statistics produce the weights gtdb_animal_gut_clf-weights.qza, from which a bespoke classifier can be built.
So, as I understand it, the idea is that species that live only in plants, for example, will have zero weight, so when classifying human samples there is zero probability of such a misclassification.
So my first question is: have I understood the idea and the process correctly?
My second question concerns possible misclassification between "Animal distal gut" and "Human gut". Among the empo_3 habitat types (used in the q2-clawback tutorial) I couldn't find a human-specific habitat type, only animal ones. Would that lead to misclassification? Maybe for my purposes some other metadata could be used instead of empo_3. I'm open to all your suggestions.
Hi @biojack, thanks for your interest in q2-clawback.
Your understanding is basically correct, except that no prior weights are set to exactly zero, they’re just made very small, so there is still a possibility of classification as any of the taxa in the database. It is also impossible to reduce the probability of misclassification to zero, but using weights certainly helps.
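To make the point about small but non-zero weights concrete, here is a minimal sketch of Bayes' rule with two taxa. The taxon names, likelihoods, and weights are all made up for illustration; this is not how q2-clawback or the classifier is implemented internally, only the arithmetic behind "posterior is proportional to likelihood times prior weight".

```python
def posterior(likelihoods, weights):
    """Bayes' rule: posterior is proportional to likelihood times prior weight."""
    unnormalized = {t: likelihoods[t] * weights[t] for t in likelihoods}
    total = sum(unnormalized.values())
    return {t: v / total for t, v in unnormalized.items()}


# Hypothetical prior weights (assumptions, not real GTDB values).
uniform = {"gut_taxon": 0.5, "plant_taxon": 0.5}
# Habitat-specific weights: the plant taxon is made very unlikely, not impossible.
bespoke = {"gut_taxon": 0.999, "plant_taxon": 0.001}

# An ambiguous read: the sequence evidence slightly favours the plant taxon.
ambiguous_read = {"gut_taxon": 0.010, "plant_taxon": 0.012}
post_uniform = posterior(ambiguous_read, uniform)
post_bespoke = posterior(ambiguous_read, bespoke)

# A read whose sequence evidence overwhelmingly favours the plant taxon.
clear_plant_read = {"gut_taxon": 1e-6, "plant_taxon": 0.012}
post_clear = posterior(clear_plant_read, bespoke)

for name, post in [("uniform", post_uniform),
                   ("bespoke", post_bespoke),
                   ("bespoke, clear plant read", post_clear)]:
    print(name, {t: round(p, 4) for t, p in post.items()})
```

With uniform weights the ambiguous read leans towards the plant taxon. With the bespoke weights it is called as the gut taxon, but the plant taxon keeps a small non-zero posterior. The last read is still called as the plant taxon despite its small weight, because the sequence evidence is strong enough.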
As it turns out, we have some pre-calculated weights using only human stool samples for GTDB classifiers in our online repository here. For instance, GTDB human stool weights for 515f-806r amplicons are available here.
If that doesn’t match the version that you’re using, or you have any other issues, you can build human stool weights from scratch using the very last command in the tutorial that you already mentioned (down below the Stilton example).
I hope that helps. Please don’t hesitate to ask if you have any further questions.
@BenKaehler yes, that's already helped a lot, thank you!
In fact I use GTDB-202, but I would like to look at the weights from your link too (GTDB-89). How can I view the weights as a text file? I tried qiime tools export, but it said that the file is not a QIIME artifact.
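A QIIME 2 artifact (.qza) is a zip archive, so one quick check is whether the downloaded file is a valid zip at all. The sketch below uses only the Python standard library; the function names are hypothetical helpers, not part of QIIME 2. One possible cause of the error, which is a guess and not confirmed in this thread, is that the download saved something other than the artifact itself (for example a web page) under a .qza name.

```python
import zipfile


def list_artifact_files(path):
    """Return the file names inside a .qza, or raise if it is not a zip archive."""
    if not zipfile.is_zipfile(path):
        raise ValueError(f"{path} is not a zip archive, so it cannot be a QIIME 2 artifact")
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()


def read_artifact_member(path, suffix):
    """Return the text of the first archive member whose name ends with `suffix`."""
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            if name.endswith(suffix):
                return archive.read(name).decode("utf-8")
    raise KeyError(f"no member ending with {suffix!r} in {path}")
```

For example, `list_artifact_files("weights.qza")` (a placeholder file name) lists the members, and `read_artifact_member("weights.qza", "metadata.yaml")` shows the artifact's declared type. The payload under `data/` may be a BIOM table and not plain text, in which case it would need converting before it can be read as text.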
I would certainly like to build weights for GTDB-202. I looked at the Stilton example and tried to follow it. I ran:
redbiom search metadata "where host_taxid==9606 and (sample_type=='stool' or sample_type=='Stool')" > sample_ids
On the forum, people advised that a bespoke classifier (a classifier with non-uniform class weights, where the weights are proportional to the observed taxon frequencies in the domain of interest) could give much better taxon detection at the species level.
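As a rough illustration of "weights proportional to observed taxon frequencies", here is a minimal sketch that turns read counts into normalized class weights. The taxon names and counts are invented, and the small `pseudocount` given to unobserved taxa is an assumption of this sketch, chosen so that no weight is exactly zero. It is not q2-clawback's actual code.

```python
def class_weights(observed_counts, all_taxa, pseudocount=1e-6):
    """Return weights proportional to observed frequency, with a small floor.

    observed_counts: dict mapping taxon -> number of reads assigned to it
    all_taxa: every taxon in the reference database
    pseudocount: hypothetical small weight added so unobserved taxa stay non-zero
    """
    total = sum(observed_counts.get(t, 0) for t in all_taxa)
    if total == 0:
        raise ValueError("no reads were assigned to any taxon in all_taxa")
    raw = {t: observed_counts.get(t, 0) / total + pseudocount for t in all_taxa}
    norm = sum(raw.values())
    return {t: v / norm for t, v in raw.items()}


# Hypothetical reference taxa and read counts from habitat-specific samples.
taxa = ["Bacteroides_A", "Faecalibacterium_B", "Plant_only_C"]
counts = {"Bacteroides_A": 700, "Faecalibacterium_B": 300}

weights = class_weights(counts, taxa)
for taxon, weight in weights.items():
    print(f"{taxon}\t{weight:.6g}")
```

The taxon never seen in the habitat ends up with a tiny weight, not zero, and the weights sum to one.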
So it looks like a great opportunity. But my concerns are:
Could the dataset used to obtain the weights also introduce a technical or biological batch effect into those weights?
Suppose we have a dataset called A. We analyze dataset A with a uniform classifier and then construct a bespoke classifier from the read counts. What happens if we repeat this process on dataset A, but with the bespoke classifier as input? Will the process stabilize after, say, 10 such iterations? Has anyone checked this?
Would it be possible to catch rare pathogens with a bespoke classifier?
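The re-estimation loop described in the second concern can be written down as a toy, just to pin down what "repeat with the bespoke classifier as input" means. Everything here is an assumption: two hypothetical taxa, three reads with invented likelihoods, hard argmax calls, and a small pseudocount for unobserved taxa. It says nothing about how real data or q2-clawback behave.

```python
def reestimate(reads, weights, pseudocount=1e-6):
    """One round: call each read by argmax(weight * likelihood), then rebuild weights."""
    taxa = list(weights)
    counts = {t: 0 for t in taxa}
    for read in reads:
        best = max(taxa, key=lambda t: weights[t] * read[t])
        counts[best] += 1
    raw = {t: counts[t] / len(reads) + pseudocount for t in taxa}
    norm = sum(raw.values())
    return {t: v / norm for t, v in raw.items()}


def iterate(reads, taxa, rounds=10):
    """Start from uniform weights and record the weights after every round."""
    weights = {t: 1.0 / len(taxa) for t in taxa}
    history = [weights]
    for _ in range(rounds):
        weights = reestimate(reads, weights)
        history.append(weights)
    return history


# Hypothetical per-read likelihoods under taxa "A" and "B".
reads = [
    {"A": 0.90, "B": 0.10},
    {"A": 0.80, "B": 0.20},
    {"A": 0.45, "B": 0.55},  # weak evidence for the rarer taxon
]
history = iterate(reads, ["A", "B"])
for i, w in enumerate(history[:4]):
    print(i, {t: round(v, 6) for t, v in w.items()})
```

In this toy the weights stop changing after the second round, but only after the third read, which weakly favours the rarer taxon, has been pulled over to the dominant one. Whether anything similar happens on real data would have to be checked empirically.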
Also, an optional question related to the article mentioned above. The article says:
Where we have set the class weights to the known taxonomic composition of a sample, we have labeled the results “bespoke”
As I understand it, this means that non-zero weights were assigned only to the species known in advance to be in the mock community (20 species). So there is almost no way for the classifier to go wrong. Is that understanding correct? If so, do you still think the results of this article confirm the benefits of a bespoke classifier? In a real analysis there is no such prior knowledge.
These are two different topics: the current one is about the technical understanding of q2-clawback and what it can do, and the new one is about the general limitations of, and concerns about, the bespoke approach. Mixing the two is not the best idea.