taxonomy.qza update function

szitenberg · September 1, 2021, 6:23am

Hi

Would it be possible to create an updater version of "feature-classifier classify-sklearn" that works like the original function but also accepts an existing taxonomy.qza file as input and classifies only the ASVs that are not already represented in that file?

Taxonomy assignment can be time-consuming, so it would be nice to classify only the new ASVs when the data is extended mid-project.

Thanks.

Nicholas_Bokulich · September 1, 2021, 6:55am

Hi @szitenberg ,

Thanks for the suggestion — and welcome to the forum!

I recognize the use case, but there are other existing solutions:

If you are merging new data with old data/ASVs, you can just filter out the old ASVs. For example, let's imagine that you have two artifacts containing ASVs:
A: from the original run
B: from a new run, which presumably contains new ASVs but also lots of ASVs already observed in A

You can filter B to contain only unseen ASVs like so:

qiime feature-table filter-seqs --i-data B.qza --m-metadata-file A.qza --p-exclude-ids --o-filtered-data B-unseen.qza

Then classify the output (B-unseen.qza). Meanwhile artifacts A and B can be merged directly and will drop duplicates automatically.
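
To make the ID logic concrete, here is a minimal sketch in plain Python. It is only an illustration of what the filtering and merging steps amount to, not the QIIME 2 API: dictionaries stand in for the sequence artifacts, and the feature IDs, sequences, and helper names (filter_unseen, merge_features) are made up for the example. It assumes that merging keeps the first record seen for each feature ID.

```python
# Plain-Python illustration of the ID logic only; this is not the QIIME 2 API.
# Dicts map feature ID -> sequence; all IDs and sequences here are hypothetical.


def filter_unseen(new_seqs, seen_ids):
    """Keep only the features whose IDs are not in seen_ids
    (the idea behind filtering B against A with --p-exclude-ids)."""
    seen = set(seen_ids)
    return {fid: seq for fid, seq in new_seqs.items() if fid not in seen}


def merge_features(*feature_sets):
    """Merge feature sets by ID; the first occurrence of an ID is kept,
    so features shared between runs appear only once."""
    merged = {}
    for features in feature_sets:
        for fid, value in features.items():
            merged.setdefault(fid, value)
    return merged


a_seqs = {"asv1": "ACGT", "asv2": "GGCC"}  # A: original run
b_seqs = {"asv2": "GGCC", "asv3": "TTAA"}  # B: new run, shares asv2 with A

b_unseen = filter_unseen(b_seqs, a_seqs)   # only this subset needs classifying
merged = merge_features(a_seqs, b_seqs)    # A and B merged, duplicates dropped

print(sorted(b_unseen))
print(sorted(merged))
```

In this toy case only asv3 would go to the classifier, while the merged set still contains every feature from both runs.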

The metadata file used for filtering can also be a taxonomy artifact. So, for example, you can merge 1000 FeatureData[Sequence] artifacts into a single one, filter out the sequences that have already been taxonomically classified, and then proceed. This means the filtering operation does not need to be done on each set of new data individually if you have multiple new runs/datasets that you wish to compare.
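
As a rough sketch of that multi-run case, again in plain Python rather than QIIME 2 itself: a dictionary of existing classifications plays the role of the taxonomy artifact, and a list of dictionaries plays the role of the per-run sequence artifacts. All IDs, sequences, and taxonomy strings below are invented for illustration.

```python
# Illustration only: plain dicts stand in for the taxonomy and sequence
# artifacts. Every ID, sequence, and taxonomy string is hypothetical.

existing_taxonomy = {
    "asv1": "k__Bacteria; p__Firmicutes",
    "asv2": "k__Bacteria; p__Bacteroidetes",
}

runs = [
    {"asv1": "ACGT", "asv3": "TTAA"},
    {"asv2": "GGCC", "asv4": "CCGG"},
    {"asv3": "TTAA", "asv5": "AATT"},
]

# Step 1: merge every run into one set of sequences keyed by feature ID.
all_seqs = {}
for run in runs:
    for fid, seq in run.items():
        all_seqs.setdefault(fid, seq)

# Step 2: a single filtering pass against the IDs that already have taxonomy.
to_classify = {
    fid: seq for fid, seq in all_seqs.items() if fid not in existing_taxonomy
}

print(sorted(to_classify))
```

The point of the design is that the exclusion list is just a set of feature IDs, so one pass over the merged sequences replaces a separate filtering step per run.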

Sure, classify-sklearn could also have an option to perform this same operation under the hood and save you a step... but there are a few reasons not to:

Doing so would lead to slightly less clear provenance. One would need to dig quite deeply into the details and know the API to check that a filtering step was performed. Using filter-seqs leaves no ambiguity about what operations were performed.
classify-sklearn is not the only action that performs taxonomy classification, so exposing such a filtering function in several different actions would be somewhat repetitive and add to the maintenance burden.

So the benefits (which would be convenient indeed, but only to a fraction of users) do not really outweigh the costs.

Would the workaround above work for you? Very happy to get more input from you/others on this!

szitenberg · September 1, 2021, 7:12am

Great, and thanks for the detailed response.