Sorry for struggling with this…
I want to assign taxonomy to my reads, say myguanoseqs.qza.
Am I correct that I start by training the classifier first, using my reference sequences, say COIdbSeqs.qza, and reference taxonomy, COIdbTax.qza?
To do this, I run qiime feature-classifier fit-classifier-naive-bayes with the COIdb*.qza files as inputs, and generate a classifier.qza file.
The next step in the tutorial talks about testing the classifier. Maybe that’s where I’m mistaken. I wanted to do that, but I also want to actually classify my guano sequences, myguanoseqs.qza.
I have mock communities as well. Are each of these tests just separate inputs for this function?
qiime feature-classifier classify-sklearn
My concern about doing this wrong is that, thus far, I’ve only supplied the reference sequences as my input for all steps (either 2 million or about 1.6 million, depending on the database), yet in each case I got back a test result with 10,000 sequences. If I trained with 2 million sequences and tested with 2 million, shouldn’t my results include 2 million comparisons?
Thanks for the help!
P.S. The full code executed for a particular database was as follows:
REFSEQ=/path/to/COIdbSeqs.qza
REFTAX=/path/to/COIdbTax.qza
## train the classifier
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads "$REFSEQ" \
--i-reference-taxonomy "$REFTAX" \
--o-classifier classifier_all_raw.qza
## test the classifier
qiime feature-classifier classify-sklearn \
--i-classifier classifier_all_raw.qza \
--i-reads "$REFSEQ" \
--o-classification classifierTax_all_raw.qza
## export data for analysis
qiime tools export --input-path classifierTax_all_raw.qza --output-path classifierTax_all_raw
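In case it helps pin down the 10,000 vs. 2 million question, here is a minimal sketch of how the number of classified sequences in the exported table could be checked. It assumes the export directory contains a file named taxonomy.tsv with a single header line; count_rows is only an illustrative helper and not part of QIIME 2.

```shell
EXPORT_DIR=classifierTax_all_raw
TSV="$EXPORT_DIR/taxonomy.tsv"   # assumed name of the exported table

count_rows() {
  # Print the number of non-empty lines after the header line
  tail -n +2 "$1" | grep -c .
}

if [ -f "$TSV" ]; then
  echo "classified sequences: $(count_rows "$TSV")"
else
  echo "no $TSV found; check the export directory" >&2
fi
```

If the assumption about the file layout holds, the printed count should match the number of sequences passed to --i-reads.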