Hi,
Recently, I was exploring the data downloaded with the command below, per the tutorial.
qiime rescript get-gtdb-data \
--p-version '214' \
--p-domain 'Both' \
--o-gtdb-taxonomy gtdb-214-both-tax.qza \
--o-gtdb-sequences gtdb-214-both-seqs.qza \
--verbose
It would seem that there is only one sequence per species?
I would presume that this is the reference species for each cluster in the GTDB?
Is there a way to download all available SSU (namely 16S) sequences for all species?
Hi @Maurice_Barrett,
This has been on our radar to provide as an option. We've opened up an issue for this here. We've also noted that there is a corrected version of 214, i.e. 214.1, and opened an issue to include this database version too. But in the interim, the following set of commands should help set up the GTDB files for import, and any further curation you'd like to do with RESCRIPt:
Download and extract ssu_all_r214.tar.gz
wget https://data.gtdb.ecogenomic.org/releases/release214/214.1/genomic_files_all/ssu_all_r214.tar.gz
tar -xzf ssu_all_r214.tar.gz
Grab taxonomy from FASTA header
The following command extracts the header lines, removes the >, replaces the space before d__ with a tab, and strips the [...] annotations that follow the taxonomy. Finally, it writes the output to a TSV file.
egrep '^>' ssu_all_r214.fna | tr -d '>' | sed 's/ d__/\td__/' | sed 's/\[.*//' > ssu_all_r214_taxonomy.tsv
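To see what each step of that pipeline does, here is a small sketch that runs it on a single header. The header is made up for illustration (the accession, taxonomy, and bracketed annotations are assumptions, not copied from the GTDB file), and `grep -E` stands in for `egrep`. Like the original command, it relies on GNU sed to interpret `\t` as a tab.

```shell
# Hypothetical FASTA header, invented for this example; real GTDB headers may differ.
header='>RS_GCF_000000000.1~NZ_EXAMPLE.1 d__Bacteria;p__P;c__C;o__O;f__F;g__Genus;s__Genus species [locus_tag=EXAMPLE_1] [location=1..1500]'

# Same steps as the extraction command: keep header lines, drop ">",
# turn the space before "d__" into a tab, cut everything from the first "[".
result=$(printf '%s\n' "$header" \
  | grep -E '^>' \
  | tr -d '>' \
  | sed 's/ d__/\td__/' \
  | sed 's/\[.*//')

# Print the row with the tab made visible as "<TAB>".
printf '%s\n' "$result" | sed 's/\t/<TAB>/'

# Count the tab-separated columns; a well-formed row has two (ID and taxonomy).
printf '%s\n' "$result" | awk -F'\t' '{ print "columns:", NF }'
```

With this made-up header, a single space is left at the end of the taxonomy string, because the last sed pattern starts at the `[` rather than at the space before it. Whether that matters downstream is worth checking on your own data.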
Import taxonomy
qiime tools import \
--input-path ssu_all_r214_taxonomy.tsv \
--input-format HeaderlessTSVTaxonomyFormat \
--type 'FeatureData[Taxonomy]' \
--output-path ssu-all-r214-taxonomy.qza
Import sequence
qiime tools import \
--input-path ssu_all_r214.fna \
--input-format DNAFASTAFormat \
--type 'FeatureData[Sequence]' \
--output-path ssu-all-r214-sequence.qza
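Before moving on, it may be worth confirming that the IDs in the taxonomy table line up with the IDs in the FASTA file. The sketch below does this on a tiny made-up FASTA (the file names, accessions, and sequences are all hypothetical), assuming the sequence ID is the header text up to the first space. It also lists duplicated IDs; the thread does not say whether the real file contains any, so treat that as a check rather than an expectation. As with the extraction command, GNU sed is assumed for `\t`.

```shell
# Throwaway directory with a two-record example FASTA (not real GTDB data).
workdir=$(mktemp -d)
cat > "$workdir/example.fna" <<'EOF'
>RS_GCF_000000001.1~NZ_EXAMPLE1.1 d__Bacteria;p__P;c__C;o__O;f__F;g__G;s__G one [locus_tag=A] [location=1..8]
ACGTACGT
>GB_GCA_000000002.1~EXAMPLE2.1 d__Archaea;p__P;c__C;o__O;f__F;g__H;s__H two [locus_tag=B] [location=1..8]
TTGGCCAA
EOF

# Build the taxonomy table the same way as in the extraction step.
grep -E '^>' "$workdir/example.fna" | tr -d '>' | sed 's/ d__/\td__/' | sed 's/\[.*//' \
  > "$workdir/example_taxonomy.tsv"

# Sequence IDs: header text up to the first space.
grep '^>' "$workdir/example.fna" | sed 's/^>//' | cut -d ' ' -f 1 | sort > "$workdir/seq_ids.txt"

# Taxonomy IDs: first tab-separated column.
cut -f 1 "$workdir/example_taxonomy.tsv" | sort > "$workdir/tax_ids.txt"

# IDs found in only one list, and IDs that occur more than once.
mismatches=$(comm -3 "$workdir/seq_ids.txt" "$workdir/tax_ids.txt")
duplicates=$(uniq -d "$workdir/seq_ids.txt")

echo "mismatched IDs: ${mismatches:-none}"
echo "duplicated IDs: ${duplicates:-none}"
```

To run the same check on the real data, point the two file paths at ssu_all_r214.fna and ssu_all_r214_taxonomy.tsv instead of the example files.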
Perform optional curation with RESCRIPt.
qiime rescript ...
Train classifier
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads ssu-all-r214-sequence.qza \
--i-reference-taxonomy ssu-all-r214-taxonomy.qza \
--o-classifier ssu-all-r214-classifier.qza
Let us know if this works for you.
-Cheers!
-Mike
Hi Mike,
Thank you for your time. This seems like an ideal solution for me. Appreciate the continued work you all do on qiime2.
Kind regards,
Maurice
No worries @Maurice_Barrett! I'm glad we can help. 
I just wanted to let everyone know that we now support downloading GTDB reference data that has not been clustered into species representatives. Please see the updated tutorial. Just install the latest version of RESCRIPt.