Hi @Newt,
There are quite a few tutorials in this forum on how to make your own COI classifier.
- Older:
- Newer:
Otherwise, if you want a quick way to import the pre-curated RDP-formatted files that you linked to, all you need to do is run the following commands:
- Extract FASTA header (taxonomy) from RDP FASTA file:
This command only processes the FASTA header lines. It removes the '>' character, replaces the first space character with a tab '\t' character, and then writes the ID and taxonomy to a file. There are many other ways to do this; I just quickly did this as an example.
grep '^>' mytrainseq.fasta | sed 's/>//' | sed 's/ /\t/' > taxonomy.tsv
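As one of those "other ways", here is a rough awk version of the same header extraction. It may be handy if your sed does not understand '\t' in the replacement (GNU sed does; some other versions may not). The two-record FASTA below is made up purely for illustration, and the temporary file names are my own, not anything from the RDP files.

```shell
# Hypothetical two-record FASTA, invented only to show the transformation.
tmpdir=$(mktemp -d)
printf '>seq1 Eukaryota;Arthropoda;Insecta\nacgtACGT\n>seq2 Eukaryota;Chordata;Mammalia\nggccTTAA\n' > "$tmpdir/example.fasta"

# On header lines only: drop the leading '>', then turn the first space into a tab.
awk '/^>/ { sub(/^>/, ""); sub(/ /, "\t"); print }' "$tmpdir/example.fasta" > "$tmpdir/taxonomy.tsv"

# Show the resulting ID<TAB>taxonomy rows.
cat "$tmpdir/taxonomy.tsv"
```

For the real data you would point awk at mytrainseq.fasta and write to taxonomy.tsv instead of the temporary files.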
- Import taxonomy into QIIME 2:
qiime tools import \
--input-path taxonomy.tsv \
--type 'FeatureData[Taxonomy]' \
--input-format 'HeaderlessTSVTaxonomyFormat' \
--output-path taxonomy.qza
- Import RDP FASTA file:
We'll use 'MixedCaseDNAFASTAFormat' because these files contain lower case nucleotide characters, which are not standard IUPAC characters.
qiime tools import \
--input-path mytrainseq.fasta \
--type 'FeatureData[Sequence]' \
--input-format 'MixedCaseDNAFASTAFormat' \
--output-path mytrainseq.qza
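Before going further, it can be worth checking that every sequence ID in the FASTA file has exactly one row in the taxonomy file, since both imports need to agree on IDs. This is only a sketch on made-up example files; the variable names and temporary paths are my own assumptions.

```shell
# Hypothetical example files; swap in mytrainseq.fasta and taxonomy.tsv for real use.
workdir=$(mktemp -d)
fasta="$workdir/example.fasta"
tax="$workdir/taxonomy.tsv"
printf '>seq1 Eukaryota;Arthropoda\nacgt\n>seq2 Eukaryota;Chordata\nggcc\n' > "$fasta"
printf 'seq1\tEukaryota;Arthropoda\nseq2\tEukaryota;Chordata\n' > "$tax"

# Sequence IDs: header text between '>' and the first space.
grep '^>' "$fasta" | sed 's/^>//' | cut -d ' ' -f 1 | sort > "$workdir/fasta_ids.txt"
# Taxonomy IDs: first tab-separated column.
cut -f 1 "$tax" | sort > "$workdir/tax_ids.txt"

# Any ID listed more than once in the taxonomy file.
dup_ids=$(uniq -d "$workdir/tax_ids.txt")

# The two sorted ID lists should be identical, with no duplicated IDs.
if cmp -s "$workdir/fasta_ids.txt" "$workdir/tax_ids.txt" && [ -z "$dup_ids" ]; then
  id_check="ok"
else
  id_check="mismatch"
fi
echo "ID check: $id_check"
```

If this reports a mismatch on your real files, comparing the two sorted ID lists should show which IDs are missing or duplicated.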
- Optionally perform further curation with RESCRIPt.
For example, dereplicate the sequence data (to reduce data size), extract an amplicon region, etc. Something like this:
qiime rescript dereplicate \
--p-rank-handles 'disable' \
--p-threads 8 \
--i-sequences mytrainseq.qza \
--i-taxa taxonomy.qza \
--o-dereplicated-sequences derep-seqs.qza \
--o-dereplicated-taxa derep-taxa.qza
- Train and use the classifier:
Optionally use other methods like classify-consensus-blast or classify-consensus-vsearch. Or replace the inputs with the optional dereplication outputs from the previous command.
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads mytrainseq.qza \
--i-reference-taxonomy taxonomy.qza \
--o-classifier classifier.qza
qiime feature-classifier classify-sklearn \
--i-classifier classifier.qza \
--i-reads rep-seqs.qza \
--o-classification assigned-taxonomy.qza
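To make the "replace the inputs" option concrete, here is roughly what training on the dereplicated outputs could look like. The output name derep-classifier.qza and the guard around the command are my own additions; the guard just skips the step when qiime or the dereplicated files are not present.

```shell
# Inputs assumed to come from the optional RESCRIPt dereplication step.
ref_reads="derep-seqs.qza"
ref_taxa="derep-taxa.qza"

if command -v qiime >/dev/null 2>&1 && [ -f "$ref_reads" ] && [ -f "$ref_taxa" ]; then
  # Same training command as above, pointed at the dereplicated artifacts.
  if qiime feature-classifier fit-classifier-naive-bayes \
      --i-reference-reads "$ref_reads" \
      --i-reference-taxonomy "$ref_taxa" \
      --o-classifier derep-classifier.qza; then
    train_status="trained"
  else
    train_status="failed"
  fi
else
  train_status="skipped"
  echo "Skipping: qiime or the dereplicated artifacts were not found."
fi
echo "Training status: $train_status"
```

You would then pass derep-classifier.qza to classify-sklearn in place of classifier.qza.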