Importing sequence data with lower-case nucleotide characters. Constructing an RDP classifier as an example.

With the recent qiime2-2022.11 release, we can now import DNA and RNA sequence files that contain lower-case sequence characters. When imported with one of the new MixedCase* formats, these nucleotide characters are converted to the standard upper-case IUPAC characters. A few examples of these formats are listed below:

  • MixedCaseAlignedDNAFASTAFormat
  • MixedCaseAlignedRNAFASTAFormat
  • MixedCaseDNAFASTAFormat
  • MixedCaseRNAFASTAFormat
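
To get a feel for what this conversion amounts to, here is a minimal shell sketch. It is not how QIIME 2 implements the MixedCase* formats; it only illustrates the idea of upper-casing sequence lines while leaving header lines alone. The function name `uppercase_fasta` and the file `example_mixed.fa` are made up for illustration.

```shell
# Hypothetical helper (not part of QIIME 2): upper-case the sequence
# lines of a FASTA file and pass '>' header lines through unchanged.
uppercase_fasta() {
    awk '/^>/ { print; next } { print toupper($0) }' "$1"
}

# Tiny made-up input with mixed-case sequence characters.
printf '>seq1 example\nacgtNNacgu\n>seq2\nACgt\n' > example_mixed.fa

# Prints the same records with upper-case sequence lines.
uppercase_fasta example_mixed.fa
```

In practice you do not need to do this yourself: choosing a MixedCase* format at import time takes care of it.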

Example use case: The Ribosomal Database Project (RDP)
We'll import a recent version of the RDP SSU reference files, which have been generally pre-formatted for QIIME, and are available here. Specifically, the file.

In the past, it was not possible to natively import these files into :qiime2:, as they contain lower-case nucleotide characters. We can now do so, following the procedure below:

Download and unzip file :inbox_tray:
Note that different platforms may use slightly different commands for unzipping.


unzip RDP_Classifier_TrainingData/

Import representative sequence file :arrow_backward:
Adjust the file paths to match your download location.

qiime tools import \
    --input-path RefOTUs.fa \
    --output-path rdp_ref_seqs.qza \
    --type 'FeatureData[Sequence]' \
    --input-format 'MixedCaseDNAFASTAFormat'

Import taxonomy file :arrow_backward:
Adjust the file paths to match your download location.

qiime tools import \
    --input-path Ref_taxonomy.txt \
    --output-path rdp_ref_taxonomy.qza \
    --type 'FeatureData[Taxonomy]' \
    --input-format 'HeaderlessTSVTaxonomyFormat'
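
Before training, it can be worth checking that every sequence ID has a matching taxonomy entry. The sketch below is one rough way to do that on the raw files. It assumes the taxonomy file is tab-separated with the sequence ID in the first column and no header line, and that each FASTA ID is the text after '>' up to the first whitespace. The helper name `missing_taxonomy_ids` and the demo files are hypothetical, not part of the RDP download.

```shell
# Hypothetical helper: print FASTA IDs that have no row in the taxonomy file.
# Assumes: tab-separated taxonomy, ID in column 1, no header line.
missing_taxonomy_ids() {
    fasta="$1"
    tax="$2"
    # FASTA IDs: first whitespace-delimited field of each header, minus '>'.
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$fasta" | sort -u > fasta_ids.tmp
    # Taxonomy IDs: first tab-separated column.
    cut -f1 "$tax" | sort -u > tax_ids.tmp
    # Lines found only in the first (FASTA) list.
    comm -23 fasta_ids.tmp tax_ids.tmp
    rm -f fasta_ids.tmp tax_ids.tmp
}

# Made-up demo data: id3 has a sequence but no taxonomy row.
printf '>id1 desc\nacgt\n>id2\nacgt\n>id3\nacgt\n' > demo_seqs.fa
printf 'id1\tBacteria;Firmicutes\nid2\tBacteria;Proteobacteria\n' > demo_tax.txt

missing_taxonomy_ids demo_seqs.fa demo_tax.txt
```

Empty output means every sequence ID was found in the taxonomy file. To run it on the real data, substitute your own paths for the demo files.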

:building_construction: From here you can use RESCRIPt for any further reference sequence and taxonomy curation (e.g., extracting a specific amplicon region). For now, we'll skip straight to making our RDP classifier.

Let's train our RDP classifier :train:

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads rdp_ref_seqs.qza \
    --i-reference-taxonomy rdp_ref_taxonomy.qza \
    --o-classifier rdp_classifier.qza 

There you go!

For more information, please see RDP Staff and please cite the following:

  • Wang, Q., G. M. Garrity, James M. Tiedje, and J. R. Cole. 2007. “Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy.” Applied and Environmental Microbiology 73 (16): 5261–67.
  • Robeson, Michael S., 2nd, Devon R. O’Rourke, Benjamin D. Kaehler, Michal Ziemski, Matthew R. Dillon, Jeffrey T. Foster, and Nicholas A. Bokulich. 2021. “RESCRIPt: Reproducible Sequence Taxonomy Reference Database Management.” PLoS Computational Biology 17 (11): e1009581.

Happy :qiime2: -ing!

