Alright, here are the results. I did end up figuring out how to train a new version of bold_anml_classifier.qza. It is in the linked folder at the bottom, named bold_anml_classifier2.qza to avoid confusion, though it should be exactly the same as the original. It should be usable with QIIME 2 v2021.2.0 and feature-classifier v2021.2.0; I can’t make any promises past that. More details on how I trained it using less RAM are below (tl;dr: use feature-classifier instead of rescript).
I got a little carried away and compared classifiers trained on four datasets: the full anml dataset (anml, ~740,000 records), the anml dataset filtered to records listing “United States” or “Canada” as their location (anmlUSCA, ~300,000 records), and two (untrimmed) training datasets I had already compiled. The untrimmed datasets aren’t a perfect comparison since they were downloaded and filtered differently, but I think they’re at least a little informative.
The untrimmed datasets were downloaded directly from the BOLD online database in March 2021 and then filtered with a Python script I’ll link at the bottom. The main filters are: remove duplicate records, remove any records with unallowed characters, and remove records that aren’t COI-5P. One dataset is all records from a search for “United States” & “Canada” (USCA, ~200,000 records); the other is all records from a search for a list of eastern states and provinces, which is actually the larger dataset (EastUSCAOnt, ~280,000 records). Exact search terms for each are linked below. I had to download some in chunks, which possibly affects which records were included.
All comparisons are based on a library of ~1000 bird fecal samples (plus blanks from both DNA extraction & PCR) denoised with DADA2 (~3500 ASVs). COI was amplified using ZBJ primers. I ran each naive Bayes classifier at confidence thresholds of 0.7, 0.5, and 0.3. As a kind of reference, I also used BLAST to classify the data against each of the reference datasets at a few percent identity thresholds.
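For anyone who wants the gist of the three main filters without opening the linked script, here is a minimal Python sketch. It is not the actual script: the record field names (`id`, `marker`, `sequence`), the choice to treat duplicates as repeated record IDs, and the allowed alphabet are all assumptions made for illustration.

```python
# Illustrative sketch only; field names and rules below are assumptions,
# not taken from the linked filtering script.
ALLOWED_BASES = set("ACGTN")  # assumed definition of "allowed characters"


def filter_records(records):
    """Keep COI-5P records with clean sequences, dropping repeated IDs.

    `records` is an iterable of dicts with hypothetical keys
    'id', 'marker', and 'sequence'.
    """
    seen_ids = set()
    kept = []
    for rec in records:
        seq = rec["sequence"].upper()
        # Filter 1: keep only the COI-5P marker.
        if rec["marker"] != "COI-5P":
            continue
        # Filter 2: drop empty sequences or ones with unallowed characters.
        if not seq or set(seq) - ALLOWED_BASES:
            continue
        # Filter 3: drop duplicates (assumed here to mean repeated record IDs).
        if rec["id"] in seen_ids:
            continue
        seen_ids.add(rec["id"])
        kept.append({**rec, "sequence": seq})
    return kept
```

Swapping the duplicate rule to compare sequences instead of IDs would only change what goes into `seen_ids`.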
Comparing anml to anmlUSCA:
In general, anmlUSCA classified features to a finer taxonomic level (more of them down to species), but it depended a bit on the confidence parameter. At 0.7 the two were almost exactly the same (anml = 59% to species, anmlUSCA = 60% to species). At 0.5 it was 71% and 76%, and at 0.3 it was 81% and 91% to species for anml and anmlUSCA respectively. There’s a comparison qzv linked at the bottom if you want to look at more details. The same general trend holds for BLAST. It looks to me like adding training data from outside the US and Canada reduces the confidence of the classifier by adding a bunch of sequences from taxa that don’t actually occur in the study area, but potentially that’s real uncertainty which the filtered dataset doesn’t capture.
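The "% to species" numbers above are just the share of features whose assigned lineage reaches species level. A rough Python sketch of that calculation follows. It assumes the taxonomy strings are semicolon-delimited, that a species-level lineage has 7 levels, and that less confident classifications are truncated to fewer levels; all of these depend on how the reference taxonomy is formatted, so treat `species_depth` as a hypothetical parameter to adjust.

```python
def fraction_to_species(taxa, species_depth=7):
    """Return the fraction of taxonomy strings that reach species depth.

    `taxa` is a list of semicolon-delimited lineage strings. The default
    depth of 7 is an assumption about the reference taxonomy format.
    """
    if not taxa:
        return 0.0

    def depth(lineage):
        # Count non-empty levels in the lineage string.
        return len([lvl for lvl in lineage.split(";") if lvl.strip()])

    reached = sum(1 for lineage in taxa if depth(lineage) >= species_depth)
    return reached / len(taxa)
```

Running this on the Taxon column exported from each classification result, once per confidence threshold, would give numbers comparable to the ones quoted above, provided the depth assumption matches the taxonomy format.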
Comparing to untrimmed data (also filtered differently):
At the species level, the naive Bayes classifiers trained on trimmed data classify more features, but at higher taxonomic levels the picture is more mixed. BLAST is the other way around, classifying more features to species with the untrimmed data. My guess is that those are mostly incorrect BLAST classifications (matches to sequence outside the target region), but it’s hard to know.
Training bold_anml_classifier2.qza:
Using feature-classifier I was able to train a naive Bayes classifier with less computing power. I don’t know the exact stats, but it took about 8.5 hours and maxed out at <65 GB of RAM (the total memory of the server I was on).
qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads bold_anml_seqs.qza \
  --i-reference-taxonomy bold_anml_taxa.qza \
  --p-classify--chunk-size 2000 \
  --o-classifier bold_anml_classifier2.qza
Though I used a chunk size of 2000, a chunk size of 4000 also looked like it was going to peak at <64 GB RAM before I stopped it for unrelated reasons. A chunk size of 5000 got killed because it hit my ~64 GB RAM limit.
Last, here’s a link to all the files, code, etc., but first a note. This probably goes without saying, but the data I’m using to compare classifiers is my thesis data, so anyone can feel free to use it for tinkering with classifiers, but nothing else. If you’re interested in it past that, feel free to reach out. The link: OSF | COI Database Cont.