RDP COI Classifier

Newt · April 18, 2023, 4:40pm

Hello,

I am working on a diet analysis using the COI region and am researching different reference database options. I am interested in potentially using the Porter database (terrimporter/CO1Classifier on GitHub, a repository of CO1 reference sets that can be used with the RDP Classifier, BLAST, or SINTAX to classify COI metabarcode sequences). I believe I read that QIIME 2 doesn't support the RDP Classifier format (is that correct?). If I wanted to use this database with the classify-sklearn method, could I instead use the reference FASTA and taxonomy files from the database that were used to train the RDP Classifier? Would any reformatting be required? I am a little confused about this, so any insight is greatly appreciated. Thank you!

SoilRotifer · April 18, 2023, 5:33pm

Hi @Newt,

There are quite a few tutorials in this forum on how to make your own COI classifier.

Older:
- COI BOLD
- COI NCBI
Newer:
- Suggested COI workflow using the extract-query-segment approach

Otherwise, if you want a quick way to import the pre-curated, RDP-formatted files that you linked to, all you need to do is run the following commands:

Extract FASTA header (taxonomy) from RDP FASTA file:
This command processes only the FASTA header lines. It removes the leading > character, replaces the first space with a tab ('\t'), and writes the resulting ID and taxonomy columns to taxonomy.tsv. There are many other ways to do this; this is just a quick example.

grep '^>' mytrainseq.fasta | sed 's/>//' | sed 's/ /\t/' > taxonomy.tsv
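
A portable variant of the same step, as a minimal sketch: some sed implementations (for example the BSD sed shipped with macOS) may not interpret '\t' in the replacement, so awk is used here to do the substitution. The file names demo.fasta and demo-taxonomy.tsv and the two records are made up for illustration; they are not taken from the Porter reference set.

```shell
# Toy RDP-style FASTA (hypothetical records, for illustration only)
printf '>SEQ1 Root;Eukaryota;Arthropoda\nACGTacgt\n>SEQ2 Root;Eukaryota;Chordata\nTTGGccaa\n' > demo.fasta

# For each header line: drop the leading '>', then replace only the
# first space with a tab, leaving any later spaces in the taxonomy intact.
awk '/^>/ { sub(/^>/, ""); sub(/ /, "\t"); print }' demo.fasta > demo-taxonomy.tsv

cat demo-taxonomy.tsv
```

Each output line should hold a sequence ID, a tab, and the taxonomy string, which is the layout that 'HeaderlessTSVTaxonomyFormat' expects.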

Import taxonomy into QIIME 2:

qiime tools import \
  --input-path taxonomy.tsv \
  --type 'FeatureData[Taxonomy]' \
  --input-format 'HeaderlessTSVTaxonomyFormat' \
  --output-path taxonomy.qza

Import RDP FASTA file:
We'll use 'MixedCaseDNAFASTAFormat' because these files contain lowercase nucleotide characters, which the default DNA FASTA import format does not accept.

qiime tools import \
    --input-path mytrainseq.fasta \
    --type 'FeatureData[Sequence]' \
    --input-format 'MixedCaseDNAFASTAFormat' \
    --output-path mytrainseq.qza
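
Before training, it may be worth confirming that every sequence ID has exactly one taxonomy entry, since the classifier is trained on sequence/taxonomy pairs. The following is a minimal sketch using standard shell tools. The check.fasta and check-taxonomy.tsv files are toy stand-ins (hypothetical) for mytrainseq.fasta and taxonomy.tsv.

```shell
# Toy inputs (hypothetical); in practice use mytrainseq.fasta and taxonomy.tsv
printf '>SEQ1 Root;A\nACGT\n>SEQ2 Root;B\nacgt\n' > check.fasta
printf 'SEQ1\tRoot;A\nSEQ2\tRoot;B\n' > check-taxonomy.tsv

# Sorted sequence IDs: header lines, minus '>', up to the first space
grep '^>' check.fasta | sed 's/^>//' | cut -d ' ' -f 1 | sort > fasta-ids.txt
# Sorted taxonomy IDs: first tab-separated column
cut -f 1 check-taxonomy.tsv | sort > tax-ids.txt

# IDs present in only one of the two files (empty output means they match)
comm -3 fasta-ids.txt tax-ids.txt > id-mismatches.txt
# Duplicate IDs in the taxonomy file (should also be empty)
uniq -d tax-ids.txt > duplicate-ids.txt

wc -l < id-mismatches.txt
wc -l < duplicate-ids.txt
```

If either file is non-empty, the listed IDs are the ones to look at before running the training step.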

Optionally, perform further curation with RESCRIPt.
For example, you could dereplicate the sequence data (to reduce its size) or extract an amplicon region. Something like this:

qiime rescript dereplicate \
    --p-rank-handles 'disable' \
    --p-threads 8 \
    --i-sequences mytrainseq.qza \
    --i-taxa taxonomy.qza \
    --o-dereplicated-sequences derep-seqs.qza \
    --o-dereplicated-taxa derep-taxa.qza

Train and use the classifier:
Optionally, use other methods such as classify-consensus-blast or classify-consensus-vsearch instead. You can also replace the inputs below with the dereplicated outputs from the previous command.

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads mytrainseq.qza \
  --i-reference-taxonomy taxonomy.qza \
  --o-classifier classifier.qza

qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification assigned-taxonomy.qza
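
To get a rough sense of how deep the assignments go, the classification can be exported and summarized. This is only a sketch: it assumes the exported table is a TSV named taxonomy.tsv with 'Feature ID', 'Taxon', and 'Confidence' columns and semicolon-separated ranks, and it uses made-up rows in place of real results.

```shell
# In practice the table would come from something like:
#   qiime tools export --input-path assigned-taxonomy.qza --output-path exported
# Here a toy stand-in (hypothetical rows) is written instead.
mkdir -p exported
printf 'Feature ID\tTaxon\tConfidence\n' > exported/taxonomy.tsv
printf 'f1\tRoot;Eukaryota;Arthropoda;Insecta\t0.98\n' >> exported/taxonomy.tsv
printf 'f2\tRoot;Eukaryota\t0.71\n' >> exported/taxonomy.tsv
printf 'f3\tRoot;Eukaryota;Arthropoda;Insecta\t0.85\n' >> exported/taxonomy.tsv

# Skip the header, count the semicolon-separated ranks in each Taxon,
# and tally how many features reached each depth.
awk -F '\t' 'NR > 1 { n = split($2, ranks, ";"); depth[n]++ }
     END { for (d in depth) print d "\t" depth[d] }' exported/taxonomy.tsv \
  | sort -n > depth-counts.tsv

cat depth-counts.tsv
```

The first column is the number of ranks in the assignment and the second is how many features stopped at that depth, which can help when comparing this classifier with one built by another approach.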

Newt · April 18, 2023, 8:12pm

Thank you! I will give this a try. I was intimidated by the idea of making my own COI classifier. It is tricky with diet studies because there is such a broad range of potential taxa! However, I will give that a try as well. Would you suggest the newer method for this application?

SoilRotifer · April 18, 2023, 8:44pm

I think the simplest and quickest option would be to use the existing files with the commands I provided. Then you can experiment with other approaches.
