We have downloaded marine invertebrate sequence and taxonomy data from NCBI (loosely following the tutorial "Building a COI database from NCBI references") and then used feature-table merging (as per "Merge separate sequence and taxonomy artifacts output from RESCRIPt"), because we had to download the data in small chunks to avoid failed downloads.
We first tried downloading at a high taxonomic rank; if that download was too big, we instead downloaded each taxon at the next rank down separately. As a result, the downloads are at various taxonomic levels, depending on the size of each taxon.
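To make the download strategy concrete, here is a minimal Python sketch of the "try the high rank, otherwise descend one rank" logic. It is only an illustration: the function name `plan_downloads`, the `max_records` threshold, and the toy tree and record counts are all made up for this example and are not our actual data or anything provided by RESCRIPt or NCBI.

```python
def plan_downloads(taxon, record_counts, children, max_records):
    """Return the list of taxa to download, descending a rank when a taxon is too big.

    All inputs are hypothetical: record_counts maps taxon -> number of records,
    children maps taxon -> list of taxa at the next rank down.
    """
    # Download the taxon as one chunk if it is small enough,
    # or if there is no lower rank to split it into.
    if record_counts[taxon] <= max_records or not children.get(taxon):
        return [taxon]
    plan = []
    for child in children[taxon]:
        plan.extend(plan_downloads(child, record_counts, children, max_records))
    return plan


# Toy tree and counts, invented for illustration only.
toy_children = {
    "Metazoa": ["Mollusca", "Cnidaria"],
    "Mollusca": ["Gastropoda", "Bivalvia"],
    "Gastropoda": ["Caenogastropoda", "Heterobranchia"],
}
toy_counts = {
    "Metazoa": 1000, "Mollusca": 700, "Cnidaria": 300,
    "Gastropoda": 500, "Bivalvia": 200,
    "Caenogastropoda": 300, "Heterobranchia": 200,
}

download_plan = plan_downloads("Metazoa", toy_counts, toy_children, max_records=300)
print(download_plan)
```

In this toy case the plan ends up mixing ranks (two gastropod subgroups, Bivalvia, and Cnidaria), which mirrors how our real downloads ended up at different taxonomic levels.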
We then constructed our classifier using feature-classifier fit-classifier-naive-bayes and used it with feature-classifier classify-sklearn with our sequence data.
This results in our ASVs being identified either to the sub-taxa where they belong or to the top rank included in the taxonomy (Metazoa). This is not unexpected as such, but it would be nice to have unidentified things end up in the nearest higher-level rank. For example, we found that Mollusca was too big to download in one group, so it was split at the next taxonomic rank. Within that, Gastropoda was also too large, and was split up into its lower groups. Now when we have an unidentified gastropod, it is not assigned to Gastropoda or even Mollusca, but to Metazoa (we think). As far as I can tell, this is by necessity, as the information needed for it to be classified as unidentified Gastropoda or Mollusca is not included in our classifier (as it only includes lower taxonomic ranks).
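One way to check whether the intermediate ranks really are missing from the merged reference taxonomy would be to count how many reference lineages pass through each rank. The Python sketch below assumes the taxonomy strings are semicolon-delimited lineages; the function name `lineage_prefix_counts` and the toy lineages are hypothetical and only stand in for an exported taxonomy table.

```python
from collections import Counter


def lineage_prefix_counts(lineages):
    """Count how many lineages pass through each rank prefix.

    Assumes each lineage is a semicolon-delimited string (an assumption
    about the taxonomy format, not something checked here).
    """
    counts = Counter()
    for lineage in lineages:
        ranks = [r.strip() for r in lineage.split(";") if r.strip()]
        # Register every prefix, e.g. "Metazoa", "Metazoa; Mollusca", ...
        for depth in range(1, len(ranks) + 1):
            counts["; ".join(ranks[:depth])] += 1
    return counts


# Toy lineages, invented for illustration only.
toy_lineages = [
    "Metazoa; Mollusca; Gastropoda; Littorinidae",
    "Metazoa; Mollusca; Gastropoda; Patellidae",
    "Metazoa; Mollusca; Bivalvia; Mytilidae",
    "Metazoa; Arthropoda",
]

prefix_counts = lineage_prefix_counts(toy_lineages)
for prefix, n in sorted(prefix_counts.items()):
    print(n, prefix)
```

If a prefix such as "Metazoa; Mollusca; Gastropoda" shows up with a non-zero count, the reference lineages do carry that rank, and if it is absent, that would be consistent with our suspicion that the classifier has no information about it.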
Is there a way to construct a classifier from piecemeal NCBI downloads that would "know" which higher taxon to assign an unidentified ASV to? Should we download unclassified data for the taxonomic ranks that are too large in addition to the identified lower ranks?
To explain why we want these unidentified ASVs placed at a higher rank: the vast majority of our reads are currently being classified as "Metazoa", so we believe it would at least help us determine approximately what we have in our data.