Hi @biojack,
There are indeed a variety of files to choose from. Part of the issue is that many microbial taxa contain multiple copies of the 16S rRNA gene, and these copies can be quite different from one another too! From what I gather, the `all` files contain every copy of the 16S rRNA gene from a given genome. If you read the File Descriptions, you'll see that the `rep` files are curated differently from the `all` files in terms of what they include and exclude.
Researchers often simply increment a number, or use the sequence positions, to differentiate the various 16S rRNA gene copies; e.g. SILVA uses the start and stop positions of the 16S rRNA gene. GTDB appears to use a `~` annotation, as in `RS_GCF_000213495.1~NZ_...`.
This is basically what the `rep` files are, sort of. That is, if you read the File Descriptions linked above, you'll see that the `*_ssu_reps_<release>.tar.gz` sequence file is a:
> FASTA file of 16S rRNA gene sequences identified within the set of bacterial representative genomes. The longest identified 16S rRNA sequence is selected for each representative genomes. ...
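If you ever wanted to mimic that "longest copy per genome" selection yourself, starting from one of the `all` files, something like the sketch below might work. It is only an illustration, not GTDB's actual procedure: the function name `longest_per_genome` is made up, and it assumes the sequence IDs look like `<genome>~<contig>` (as in the `RS_GCF_000213495.1~NZ_...` example), so please check the headers in your own files first.

```shell
# Hypothetical helper (not part of QIIME 2, RESCRIPt, or GTDB).
# Assumption: FASTA IDs look like "<genome>~<contig>"; everything before "~"
# is treated as the genome accession.
longest_per_genome() {
    awk '
        # Store the record collected so far if it is the longest seen for its genome.
        function keep_best() {
            if (hdr == "") return
            if (!(genome in best_len)) order[++n] = genome
            if (!(genome in best_len) || length(seq) > best_len[genome]) {
                best_len[genome] = length(seq)
                best_hdr[genome] = hdr
                best_seq[genome] = seq
            }
        }
        /^>/ {
            keep_best()
            hdr = $0
            split(substr($1, 2), parts, "~")   # first word of the header, without ">"
            genome = parts[1]
            seq = ""
            next
        }
        { seq = seq $0 }                       # join wrapped sequence lines
        END {
            keep_best()
            # Print one record per genome, in order of first appearance.
            for (i = 1; i <= n; i++) {
                print best_hdr[order[i]]
                print best_seq[order[i]]
            }
        }
    ' "$1"
}
```

Ties go to whichever copy appears first in the file, and each kept sequence is written on a single line. In practice the `rep` files already give you this, so this is mostly useful for understanding what the two file sets have in common.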
Below is one way you might consider importing and preparing the GTDB data:
Download and uncompress the files:
# Bacteria
wget https://data.gtdb.ecogenomic.org/releases/release202/202.0/bac120_taxonomy_r202.tsv.gz
gunzip bac120_taxonomy_r202.tsv.gz
wget https://data.gtdb.ecogenomic.org/releases/release202/202.0/genomic_files_reps/bac120_ssu_reps_r202.tar.gz
tar -xvf bac120_ssu_reps_r202.tar.gz
# Archaea
wget https://data.gtdb.ecogenomic.org/releases/release202/202.0/ar122_taxonomy_r202.tsv.gz
gunzip ar122_taxonomy_r202.tsv.gz
wget https://data.gtdb.ecogenomic.org/releases/release202/202.0/genomic_files_reps/ar122_ssu_reps_r202.tar.gz
tar -xvf ar122_ssu_reps_r202.tar.gz
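Before importing, it may be worth confirming that every sequence ID in the FASTA file has a matching row in the taxonomy file, since classifier training generally expects a label for each reference sequence. Here is a rough sketch of such a check; the function name `ids_missing_taxonomy` is hypothetical, and it assumes the sequence ID is the first word of each FASTA header and the first tab-separated column of the taxonomy file.

```shell
# Hypothetical sanity check (not a QIIME 2 or RESCRIPt command).
# Usage: ids_missing_taxonomy <sequences.fna> <taxonomy.tsv>
# Prints every FASTA sequence ID that has no row in the taxonomy TSV.
ids_missing_taxonomy() {
    awk -F'\t' '
        # First file read (the taxonomy TSV): remember each ID in column 1.
        NR == FNR { has_tax[$1] = 1; next }
        # Second file read (the FASTA): check the first word of each header.
        /^>/ {
            split(substr($0, 2), words, " ")
            if (!(words[1] in has_tax)) print words[1]
        }
    ' "$2" "$1"
}
```

For example, `ids_missing_taxonomy bac120_ssu_reps_r202.fna bac120_taxonomy_r202.tsv` printing nothing would suggest the two files line up. Note that the check is not meaningful if the taxonomy file is empty.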
Then import the files:
# Bacteria
qiime tools import \
--input-path bac120_ssu_reps_r202.fna \
--type 'FeatureData[Sequence]' \
--output-path bact_seqs.qza
qiime tools import \
--input-path bac120_taxonomy_r202.tsv \
--type 'FeatureData[Taxonomy]' \
--input-format 'HeaderlessTSVTaxonomyFormat' \
--output-path bact_tax.qza
# Archaea
qiime tools import \
--input-path ar122_ssu_reps_r202.fna \
--type 'FeatureData[Sequence]' \
--output-path arch_seqs.qza
qiime tools import \
--input-path ar122_taxonomy_r202.tsv \
--type 'FeatureData[Taxonomy]' \
--input-format 'HeaderlessTSVTaxonomyFormat' \
--output-path arch_tax.qza
Merge the files together:
qiime feature-table merge-taxa \
--i-data bact_tax.qza arch_tax.qza \
--o-merged-data gtdb_tax.qza
qiime feature-table merge-seqs \
--i-data bact_seqs.qza arch_seqs.qza \
--o-merged-data gtdb_seqs.qza
Run any quality control and/or curation via RESCRIPt; see this example, which uses SILVA.
When you are done, you are ready to train your classifier:
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads gtdb_seqs.qza \
--i-reference-taxonomy gtdb_tax.qza \
--o-classifier gtdb_classifier.qza
There are likely other ways to do this, but this should get you started. Note: we could likely parse the taxonomy from the FASTA file directly, but for now it is much easier to import the individual taxonomy and sequence files separately into QIIME 2. We hope to have a built-in parser for GTDB within RESCRIPt soon.
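For the curious, here is roughly what parsing the taxonomy straight from the FASTA headers could look like. This is only a sketch under assumptions I have not verified against every release: that each header is laid out as `>ID taxonomy-string [key=value] [key=value] ...`, with the taxonomy string possibly containing spaces (e.g. species names). The function name `fasta_headers_to_taxonomy` is made up for illustration.

```shell
# Hypothetical helper (not a RESCRIPt feature).
# Assumption: headers look like ">ID d__...;s__Genus species [key=value] ..."
# Prints "<ID><TAB><taxonomy>" for each header line.
fasta_headers_to_taxonomy() {
    awk '
        /^>/ {
            id = substr($1, 2)                     # sequence ID without ">"
            tax = ""
            if (NF > 1) {
                tax = $0
                sub(/^[^ \t]+[ \t]+/, "", tax)     # drop the leading ID
                sub(/[ \t]*\[.*$/, "", tax)        # drop trailing [key=value] fields
            }
            print id "\t" tax
        }
    ' "$1"
}
```

If the output looks sensible, redirecting it to a `.tsv` file should give you something that can be imported with the same `HeaderlessTSVTaxonomyFormat` command shown above. Headers without any description come out with an empty taxonomy column, so it is worth eyeballing the result before using it.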