Only one sequence per species in the SSU reference taxonomy and sequences from GTDB

Hi,

Recently, I was exploring the data downloaded with the command below, as per the tutorial.

qiime rescript get-gtdb-data \
    --p-version '214' \
    --p-domain 'Both' \
    --o-gtdb-taxonomy gtdb-214-both-tax.qza \
    --o-gtdb-sequences gtdb-214-both-seqs.qza \
    --verbose

It would seem that there is only one sequence per species?

I would presume that this is the reference species for each cluster in the GTDB?

Is there a way to download all available SSU sequences (namely 16S) for all species?

Hi @Maurice_Barrett,

This has been on our radar to provide as an option. We've opened up an issue for this here. We've also noted that there is a corrected version of 214, i.e. 214.1, and opened an issue to include this database version too. But in the interim, the following set of commands should help set up the GTDB files for import, and any further curation you'd like to do with RESCRIPt:

Download and unzip ssu_all_r214.tar.gz

wget https://data.gtdb.ecogenomic.org/releases/release214/214.1/genomic_files_all/ssu_all_r214.tar.gz
tar -xzf ssu_all_r214.tar.gz

Grab taxonomy from FASTA header
The following command extracts the header lines, removes the >, replaces the space before d__ with a tab, and removes the [] ... annotations that follow the taxonomy. Finally, it writes the output to a TSV file.

egrep '^>' ssu_all_r214.fna | tr -d '>' | sed 's/ d__/\td__/' | sed 's/\[.*//' > ssu_all_r214_taxonomy.tsv
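To make the effect of each step concrete, here is a minimal sketch that runs the same pipeline on a single header line. The header is hypothetical: it only mimics the layout described above (ID, a space, the taxonomy starting at d__, then bracketed annotations) and is not copied from the GTDB file. Two details differ from the command above and are my own assumptions: a literal tab is passed to sed because \t in the replacement is a GNU sed extension, and the last sed also strips the space in front of the first bracket.

```shell
# Hypothetical header line; the ID, taxonomy and annotations are made up.
header='>RS_GCF_000000000.1~NZ_EXAMPLE.1 d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli [locus_tag=EXAMPLE_1] [location=1..1500]'

# A literal tab character, so the sed call does not depend on GNU's \t.
tab=$(printf '\t')

row=$(printf '%s\n' "$header" \
    | grep '^>' \
    | tr -d '>' \
    | sed "s/ d__/${tab}d__/" \
    | sed 's/ *\[.*//')

# Print the resulting two-column row: ID, tab, taxonomy string.
printf '%s\n' "$row"
```

The result is one row with the sequence ID in the first column and the taxonomy string, ending at the species name, in the second.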

Import taxonomy

qiime tools import \
    --input-path ssu_all_r214_taxonomy.tsv \
    --input-format HeaderlessTSVTaxonomyFormat \
    --type 'FeatureData[Taxonomy]' \
    --output-path ssu-all-r214-taxonomy.qza

Import sequence

qiime tools import \
    --input-path ssu_all_r214.fna \
    --input-format DNAFASTAFormat \
    --type 'FeatureData[Sequence]' \
    --output-path ssu-all-r214-sequence.qza
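If either import complains, a quick consistency check on the two input files may help narrow things down. The sketch below is only a suggestion: check_reference_files is a hypothetical helper, not part of QIIME 2 or RESCRIPt, and I have not verified whether the GTDB file actually contains duplicated IDs. It compares the number of FASTA records with the number of taxonomy rows and counts IDs that appear more than once.

```shell
# Hypothetical helper: succeeds only if the FASTA file and the taxonomy TSV
# have the same number of records and no ID is repeated in the TSV.
check_reference_files() {
    fasta=$1
    taxonomy=$2
    # Number of FASTA records (header lines).
    n_seqs=$(grep -c '^>' "$fasta")
    # Number of taxonomy rows; tr strips padding that some wc versions add.
    n_tax=$(wc -l < "$taxonomy" | tr -d ' ')
    # Number of distinct IDs that occur more than once in the first column.
    n_dup=$(cut -f1 "$taxonomy" | sort | uniq -d | wc -l | tr -d ' ')
    echo "sequences: $n_seqs, taxonomy rows: $n_tax, duplicated IDs: $n_dup"
    [ "$n_seqs" -eq "$n_tax" ] && [ "$n_dup" -eq 0 ]
}

# Run the check only if the files from the steps above are present.
if [ -f ssu_all_r214.fna ] && [ -f ssu_all_r214_taxonomy.tsv ]; then
    check_reference_files ssu_all_r214.fna ssu_all_r214_taxonomy.tsv \
        || echo "Record counts differ or IDs are duplicated."
fi
```

A non-zero duplicate count would suggest deduplicating or renaming the affected records before importing.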

Perform optional curation with RESCRIPt.

qiime rescript ...

Train classifier

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads ssu-all-r214-sequence.qza \
    --i-reference-taxonomy ssu-all-r214-taxonomy.qza \
    --o-classifier ssu-all-r214-classifier.qza

Let us know if this works for you.

-Cheers!
-Mike


Hi Mike,

Thank you for your time. This seems like an ideal solution for me. Appreciate the continued work you all do on qiime2.

Kind regards,
Maurice


No worries @Maurice_Barrett! I'm glad we can help. :)



I just wanted to let everyone know that we now support downloading GTDB reference data that has not been clustered into species representatives. Please see the updated tutorial. Just install the latest version of RESCRIPt.
