I have a question regarding machine-learning-based classification of samples by metadata category.
Let's say I want to classify some upcoming unlabelled samples with an already created/trained classifier, at the ASV level.
As ASV names are randomly assigned when clustering of sequences is performed, if I run these new samples in q2 separately from the samples used to train the classifier, chances are that, sequence-wise, the same ASVs will end up with different names. Is there a way to mitigate this problem? Or, to ask it better: is it somehow possible to obtain the same ASV names across different q2 runs?
If I have used 500+ samples to create the classifier and have to re-run q2 on all of them each time new samples come in, that seems like a tedious task.
Another question: if I agglomerate sequences to a specific taxonomic level (e.g., genus) and train the classifier on that, would this ensure interoperability of the classifier across different q2 runs?
Randomly assigned names are the case if OTU clustering is used, but not with ASVs (i.e., features obtained by denoising; note that they are no longer ASVs if you cluster them).
It sounds like you are using OTU clustering. In that case the same OTU IDs can be obtained by either:
using a reference-based clustering approach (in which case the OTU IDs are not random, but are adopted from the closest reference sequence)
using an open-reference clustering approach (see the docs), which allows you to input the OTUs from a previous run as the reference. This can be done in an iterative fashion (i.e., run 1's OTUs get passed to run 2 for clustering, run 2's to run 3, etc.); a rough sketch follows below.
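For concreteness, a minimal sketch of the iterative open-reference route with q2-vsearch (the file names and the 97% identity threshold are placeholders; adjust them to your reference and data):

```bash
# Run 1: cluster against an external reference; IDs of hits are taken from the reference,
# and de novo OTUs are added to an expanded reference output.
qiime vsearch cluster-features-open-reference \
  --i-table table-run1.qza \
  --i-sequences rep-seqs-run1.qza \
  --i-reference-sequences reference-otus.qza \
  --p-perc-identity 0.97 \
  --o-clustered-table otu-table-run1.qza \
  --o-clustered-sequences otu-seqs-run1.qza \
  --o-new-reference-sequences new-ref-after-run1.qza

# Run 2: reuse run 1's expanded reference so shared sequences keep the same OTU IDs.
qiime vsearch cluster-features-open-reference \
  --i-table table-run2.qza \
  --i-sequences rep-seqs-run2.qza \
  --i-reference-sequences new-ref-after-run1.qza \
  --p-perc-identity 0.97 \
  --o-clustered-table otu-table-run2.qza \
  --o-clustered-sequences otu-seqs-run2.qza \
  --o-new-reference-sequences new-ref-after-run2.qza
```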
Sure, this is another option. It sort of depends, though... different runs with the same primers/variable region and similar lengths (after trimming) should receive similar taxonomic assignments, but runs from different variable regions might obtain different taxonomic classifications. So it can get messy, but for sure this is an option.
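If you go the collapsing route, it could look roughly like this (the taxonomy artifact name and the level are assumptions; level 6 corresponds to genus in Greengenes/SILVA-style taxonomy strings):

```bash
# Collapse the feature table to genus level; feature IDs become taxonomy strings,
# which stay comparable across runs as long as the same taxonomic reference is used.
qiime taxa collapse \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-level 6 \
  --o-collapsed-table table-genus.qza
```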
Actually I am using denoising (wrong wording on my part, I guess). Would you care to elaborate on how two completely different DADA2 runs can obtain the same ASV labelling?
Hello!
Hope it is OK that I join the discussion.
In DADA2, ASV IDs are actually MD5 hashes of the sequences themselves, so the same sequence will always get the same ID.
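As a quick illustration (the sequence below is just a made-up example), hashing the exact ASV sequence should reproduce the feature ID you see in the q2 outputs:

```bash
# MD5 of the literal sequence string (no trailing newline, hence -n);
# identical sequences from separate DADA2 runs therefore get identical feature IDs.
echo -n "TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAG" | md5sum
```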
If you have several sequencing runs with the same primers, it is better to run DADA2 on each run separately, using the same (!) DADA2 parameters, and then merge the representative sequences and feature tables.
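Merging is straightforward with q2-feature-table (file names are placeholders):

```bash
# After denoising each sequencing run separately with identical DADA2 parameters,
# merge the per-run feature tables and representative sequences.
qiime feature-table merge \
  --i-tables table-run1.qza \
  --i-tables table-run2.qza \
  --o-merged-table merged-table.qza

qiime feature-table merge-seqs \
  --i-data rep-seqs-run1.qza \
  --i-data rep-seqs-run2.qza \
  --o-merged-data merged-rep-seqs.qza
```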
In my case, even with this approach I was still able to predict the sequencing run with 80-90% accuracy. Collapsing to a taxonomic level can decrease sequencing-run predictability, but one needs to check how it will influence the prediction of other (more important, I guess) metadata categories.
Thanks for qiiming in. I just came across this thread, which provides the same explanation. Sorry for the duplicate thread.
"In DADA2, ASV IDs are actually MD5 hashes of the sequences themselves, so the same sequence will always get the same ID."
I actually didn't know this!
But do I even need to merge anything (now that I know that ASV IDs are hash-based and I run DADA2 with the same parameters) if I just want to predict the metadata of a new sample? As I understand it, I would only need to run the already established model against the newly obtained feature table?
It depends on the workflow you are going to implement. You can run the already established classifier on the new feature table, or merge the files and train a model on the merged inputs.
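For example, with q2-sample-classifier it could look something like this (file names, the metadata column, and the estimator settings are placeholders; I am assuming you kept the SampleEstimator artifact, e.g. the sample_estimator.qza that classify-samples writes out):

```bash
# Option 1: apply the already trained model to the new run's feature table.
qiime sample-classifier predict-classification \
  --i-table table-new-run.qza \
  --i-sample-estimator sample_estimator.qza \
  --o-predictions new-predictions.qza \
  --o-probabilities new-probabilities.qza

# Option 2: merge the old and new tables (as above) and retrain on the combined data.
qiime sample-classifier classify-samples \
  --i-table merged-table.qza \
  --m-metadata-file sample-metadata.tsv \
  --m-metadata-column condition \
  --p-estimator RandomForestClassifier \
  --p-random-state 123 \
  --output-dir retrained-classifier
```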