Hi @szitenberg ,
Thanks for the suggestion — and welcome to the forum!
I recognize the use case, but existing functionality already covers it:
If you have new data that you are merging with old data/ASVs, you can simply filter out those old ASVs. For example, let's imagine that you have two artifacts containing ASVs:
A: from the original run
B: from a new run, which presumably contains new ASVs but also lots of ASVs already observed in A
You can filter B to contain only the unseen ASVs like so:

```shell
qiime feature-table filter-seqs \
  --i-data B.qza \
  --m-metadata-file A.qza \
  --p-exclude-ids \
  --o-filtered-data B-unseen.qza
```
Then classify the output (`B-unseen.qza`). Meanwhile, artifacts A and B can be merged directly; duplicates will be dropped automatically.
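To make those two steps concrete, a sketch of classifying the unseen sequences and merging the two artifacts; `classifier.qza` and the output filenames are placeholders for your own:

```shell
# Classify only the previously unseen ASVs
# (classifier.qza is a placeholder for your trained classifier)
qiime feature-classifier classify-sklearn \
  --i-reads B-unseen.qza \
  --i-classifier classifier.qza \
  --o-classification B-unseen-taxonomy.qza

# Merge the two sequence artifacts directly; duplicate ASVs
# are dropped automatically during the merge
qiime feature-table merge-seqs \
  --i-data A.qza \
  --i-data B.qza \
  --o-merged-data merged-seqs.qza
```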
The metadata file used for filtering can also be a taxonomy artifact. For example, you can merge 1000 `FeatureData[Sequence]` artifacts into a single one, then filter out the sequences that have already been taxonomically classified, then proceed. In other words, this filtering operation does not need to be repeated on each set of new data individually if you have multiple new runs/datasets that you wish to compare.
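As a sketch of that multi-run workflow (all filenames here are placeholders; pass one `--i-data` per artifact you want to merge):

```shell
# Merge many FeatureData[Sequence] artifacts into a single one
qiime feature-table merge-seqs \
  --i-data run1-seqs.qza \
  --i-data run2-seqs.qza \
  --o-merged-data all-seqs.qza

# Use an existing taxonomy artifact as the metadata file to
# exclude already-classified sequences, keeping only the new ones
qiime feature-table filter-seqs \
  --i-data all-seqs.qza \
  --m-metadata-file existing-taxonomy.qza \
  --p-exclude-ids \
  --o-filtered-data unclassified-seqs.qza
```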
`classify-sklearn` could also have an option to perform this same operation under the hood and save you a step, but there are a few reasons not to add one:
- Doing so would make provenance slightly less clear: one would need to dig quite deeply into the details, and know the API, to confirm that a filtering step was performed. Using `filter-seqs` leaves no ambiguity about which operations were performed.
- `classify-sklearn` is not the only action that performs taxonomy classification, so exposing such a filtering option in multiple actions would be somewhat repetitive and add to the maintenance burden.
So the benefits (which would indeed be convenient, but only for a fraction of users) do not really outweigh the costs.
Would the workaround above work for you? Very happy to get more input from you/others on this!