I’m giving a first whack at training a classifier on my custom COI database and am following the tutorial guidelines. One caveat - I’m training on two separate databases: one that contains all ~2 million arthropod records (ALL) and another that contains only references with at least taxonomic information (O-plus).
There are actually four training sessions happening - two on each database. For each database, I’m testing a trimmed and an untrimmed reference set. The trimming is still ongoing, but the classifiers for the untrimmed references have been trained and I’ve generated output.
Regarding that output: am I reading the classifier’s documentation correctly that there is a default number of features that are trained? I was surprised after exporting the equivalent of the --o-classification taxonomy.qza file from this part of the tutorial: I was expecting a list of 2 million taxa. Instead, I get just over 10,000.
An almost identical number of trained taxa is present in both the ALL and O-plus outputs, which makes me think there is a default I’m not picking up on. I thought it might be --p-feat-ext--n-features, but the default for that parameter isn’t the number I’m seeing.
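For anyone wanting to check the same counts, here is a minimal sketch of one way to tally rows and distinct taxon strings in an exported taxonomy table. It assumes a tab-separated file with a header row and a column named "Taxon"; the column name, the function name count_taxa, and the sample records are illustrative assumptions, not taken from my actual output.

```python
import csv
import io


def count_taxa(tsv_handle, taxon_column="Taxon"):
    """Return (number of rows, number of distinct taxon strings).

    Assumes a tab-separated table with a header row; the column
    name "Taxon" is an assumption about the exported file.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    rows = 0
    distinct = set()
    for record in reader:
        rows += 1
        distinct.add(record[taxon_column].strip())
    return rows, len(distinct)


# Tiny in-memory table standing in for an exported taxonomy file
# (made-up records, for illustration only).
example = io.StringIO(
    "Feature ID\tTaxon\tConfidence\n"
    "seq1\tk__Animalia;p__Arthropoda;c__Insecta\t0.99\n"
    "seq2\tk__Animalia;p__Arthropoda;c__Insecta\t0.97\n"
    "seq3\tk__Animalia;p__Arthropoda;c__Arachnida\t0.91\n"
)
rows, distinct = count_taxa(example)
print(rows, distinct)
```

On a real export, the same function could be called on an opened file, e.g. open("taxonomy.tsv", newline=""), where the path is hypothetical. Running it on both the ALL and O-plus exports would show whether the totals refer to rows or to distinct taxon strings.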
In addition to this specific question, I was wondering which parameters experienced users might suggest tweaking. Neither computational resources nor time is an issue here - I’d prefer accuracy even if it takes a week.