Using qiime2-2022.11 or later, we can now import DNA and RNA sequence files that contain lower-case sequence characters. Upon import, these nucleotide bases are converted to the standard upper-case IUPAC format by the new MixedCase* import formats, a few of which are listed below:
MixedCaseAlignedDNAFASTAFormat
MixedCaseAlignedRNAFASTAFormat
MixedCaseDNAFASTAFormat
MixedCaseRNAFASTAFormat
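To make the conversion concrete, here is a rough shell sketch of the effect these formats have on a FASTA file: sequence lines are upper-cased and header lines are left alone. This is not how QIIME 2 implements the conversion. The `uppercase_fasta` helper and the toy records are made up for illustration, and leaving the headers untouched is an assumption of the sketch.

```shell
# Conceptual sketch only, not part of QIIME 2.
# uppercase_fasta (hypothetical helper): print a FASTA file with its
# sequence lines upper-cased; header lines (starting with '>') pass through.
uppercase_fasta() {
    awk '/^>/ { print; next } { print toupper($0) }' "$1"
}

# Toy input invented for this example.
demo_fasta=$(mktemp)
printf '>seq1 example record\nacgtNNacgt\n>seq2\nACGTacgu\n' > "$demo_fasta"

uppercase_fasta "$demo_fasta"
rm -f "$demo_fasta"
```

With the MixedCase* formats there is no need to run anything like this by hand; the import step takes care of it.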
Example use case: The Ribosomal Database Project (RDP)
We'll import a recent version of the RDP SSU reference files, which have been largely pre-formatted for QIIME and are available from the RDP Classifier project on SourceForge. Specifically, we'll use the RDPClassifier_16S_trainsetNo19_QiimeFormat.zip file.
In the past, it was not possible to natively import these files into QIIME 2, as they contain lower-case nucleotide characters. We can now do so by following the procedure below:
Download and unzip file
Note that different platforms may use slightly different commands for unzipping.
wget https://sourceforge.net/projects/rdp-classifier/files/RDP_Classifier_TrainingData/RDPClassifier_16S_trainsetNo19_QiimeFormat.zip
unzip RDPClassifier_16S_trainsetNo19_QiimeFormat.zip
cd RDPClassifier_16S_trainsetNo19_QiimeFormat
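Before importing, it can be worth confirming that the sequence file really does contain lower-case characters. The following is a minimal sketch, assuming a plain-text FASTA file; `count_lowercase_seq_lines` is a hypothetical helper name, and the toy file stands in for the real download.

```shell
# count_lowercase_seq_lines (hypothetical helper): count the sequence
# lines (non-header lines) that contain at least one lower-case letter.
count_lowercase_seq_lines() {
    awk '!/^>/ && /[a-z]/ { n++ } END { print n + 0 }' "$1"
}

# Toy file invented for this example: only the second record has lower case.
demo_fasta=$(mktemp)
printf '>a\nACGT\n>b\nacgT\n>c\nNNNN\n' > "$demo_fasta"

count_lowercase_seq_lines "$demo_fasta"
rm -f "$demo_fasta"

# On the unzipped RDP files this would be run as:
#   count_lowercase_seq_lines RefOTUs.fa
```

A non-zero count means the plain DNAFASTAFormat would likely reject the file, which is the case the MixedCase* formats are meant to handle.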
Import representative sequence file
Adjust the file paths to match your download location.
qiime tools import \
--input-path RefOTUs.fa \
--output-path rdp_ref_seqs.qza \
--type 'FeatureData[Sequence]' \
--input-format 'MixedCaseDNAFASTAFormat'
Import taxonomy file
Adjust the file paths to match your download location.
qiime tools import \
--input-path Ref_taxonomy.txt \
--output-path rdp_ref_taxonomy.qza \
--type 'FeatureData[Taxonomy]' \
--input-format 'HeaderlessTSVTaxonomyFormat'
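Classifier training expects the reference sequences and the taxonomy to describe the same set of IDs, so a quick consistency check on the raw files can save time. Here is a minimal sketch, assuming the taxonomy file is a headerless two-column TSV (ID, then taxonomy string) and that each FASTA ID is the header text up to the first whitespace. `ids_missing_taxonomy` is a hypothetical helper, and the toy files are invented for the demonstration.

```shell
# ids_missing_taxonomy (hypothetical helper): print every FASTA ID that has
# no matching entry in the first column of the taxonomy TSV.
# Usage: ids_missing_taxonomy <fasta> <taxonomy.tsv>
ids_missing_taxonomy() {
    awk -F '\t' '
        # First file (taxonomy): remember each ID from column 1.
        FILENAME == ARGV[1] { seen[$1] = 1; next }
        # Second file (FASTA): check the ID on each header line.
        /^>/ {
            id = substr($0, 2)        # drop the leading ">"
            sub(/[ \t].*/, "", id)    # drop any description after the ID
            if (!(id in seen)) print id
        }
    ' "$2" "$1"
}

# Toy files invented for this example: id3 has no taxonomy entry.
demo_dir=$(mktemp -d)
printf '>id1\nacgt\n>id3 some description\nacgt\n' > "$demo_dir/seqs.fa"
printf 'id1\tBacteria;Firmicutes\nid2\tBacteria;Proteobacteria\n' > "$demo_dir/tax.txt"

ids_missing_taxonomy "$demo_dir/seqs.fa" "$demo_dir/tax.txt"
rm -rf "$demo_dir"

# On the unzipped RDP files this would be run as:
#   ids_missing_taxonomy RefOTUs.fa Ref_taxonomy.txt
```

Empty output means every sequence has a taxonomy entry. Any IDs that are printed would need to be resolved before training.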
From here you can make use of RESCRIPt for any further reference sequence and taxonomy curation (e.g., extracting a specific amplicon region). For now, we'll skip straight to making our RDP classifier.
Let's train our RDP classifier
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads rdp_ref_seqs.qza \
--i-reference-taxonomy rdp_ref_taxonomy.qza \
--o-classifier rdp_classifier.qza
There you go!
For more information, please see RDP Staff and please cite the following:
- Wang, Q., G. M. Garrity, James M. Tiedje, and J. R. Cole. 2007. “Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy.” Applied and Environmental Microbiology 73 (16): 5261–67. http://dx.doi.org/10.1128/AEM.00062-07
- Wang, Qiong, and James R. Cole. 2024. “Updated RDP Taxonomy and RDP Classifier for More Accurate Taxonomic Classification.” Microbiology Resource Announcements, e0106323. https://doi.org/10.1128/mra.01063-23
If you curate with RESCRIPt:
- Robeson, Michael S., 2nd, Devon R. O’Rourke, Benjamin D. Kaehler, Michal Ziemski, Matthew R. Dillon, Jeffrey T. Foster, and Nicholas A. Bokulich. 2021. “RESCRIPt: Reproducible Sequence Taxonomy Reference Database Management.” PLoS Computational Biology 17 (11): e1009581.
Happy QIIME-ing!