GTDB files for Taxonomy build

SoilRotifer · May 19, 2022, 2:17pm

Those IDs are not repeating. As I noted in my initial reply, the full ID is the one shown below. That is, the standard FASTA header format treats everything before the first space as the full ID.
>RS_GCF_000213495.1~NZ_AFHD01000036.1
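
To see what counts as the ID, here is a minimal sketch that pulls it out of a single header line. The taxonomy string after the space is made up for illustration; only the ID comes from the header above.

```shell
# Hypothetical header line: real ID from above, made-up taxonomy after the space
header='>RS_GCF_000213495.1~NZ_AFHD01000036.1 d__Bacteria;p__Example'

# Drop the leading '>' and keep everything before the first space
seq_id=$(printf '%s\n' "$header" | sed 's/^>//' | cut -d ' ' -f1)

echo "$seq_id"   # prints RS_GCF_000213495.1~NZ_AFHD01000036.1
```

So the genome accession and the contig accession, joined by '~', together form one ID.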

Yes, this is intended for the rep seqs, as outlined here.

Here is another approach to import everything for use as a classifier:

Download and extract full ssu file:

wget https://data.gtdb.ecogenomic.org/releases/release202/202.0/genomic_files_all/ssu_all_r202.tar.gz

tar -xvf ssu_all_r202.tar.gz

Extract and parse the FASTA header and write to file:

# Pull the header, keep the first two items (seqID and Taxonomy label), remove '>', and replace ' ' (space) with '\t' (tab)
grep '^>' ssu_all_r202.fna | cut -d ' ' -f1,2 | sed 's/>//; s/ /\t/' > ssu_all_r202_tax.tsv
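
Before importing, it may be worth checking that the resulting table has exactly two tab-separated columns per row and no repeated IDs. The sketch below runs that check on a toy FASTA file. The file names, IDs, and taxonomy strings are all made up, and awk is used for the header parsing only so that the tab is written the same way regardless of which sed you have.

```shell
# Work in a scratch directory so nothing real gets overwritten
workdir=$(mktemp -d)

# Toy FASTA with made-up records (hypothetical IDs and taxonomy strings)
cat > "$workdir/toy_ssu.fna" <<'EOF'
>ID_A~contig1 d__Bacteria;p__PhylumA [note=example]
ACGT
>ID_B~contig2 d__Archaea;p__PhylumB [note=example]
TTGA
EOF

# Keep the first two space-separated header fields, strip '>', join with a tab
grep '^>' "$workdir/toy_ssu.fna" \
  | awk '{ sub(/^>/, "", $1); print $1 "\t" $2 }' > "$workdir/toy_tax.tsv"

# Count rows that do not have exactly two tab-separated columns
bad_rows=$(awk -F '\t' 'NF != 2' "$workdir/toy_tax.tsv" | wc -l)

# List any sequence ID that appears more than once
dup_ids=$(cut -f1 "$workdir/toy_tax.tsv" | sort | uniq -d)

echo "malformed rows: $bad_rows"
echo "duplicate IDs: ${dup_ids:-none}"
```

To run the same two checks on the real file, point them at ssu_all_r202_tax.tsv instead of the toy table.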

Then import as a taxonomy file:

qiime tools import \
    --input-path ssu_all_r202_tax.tsv \
    --type 'FeatureData[Taxonomy]' \
    --input-format 'HeaderlessTSVTaxonomyFormat' \
    --output-path ssu_all_r202_tax.qza

Then import the FASTA file as is:

qiime tools import \
    --input-path ssu_all_r202.fna \
    --type 'FeatureData[Sequence]' \
    --output-path ssu_all_r202_seqs.qza
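
Since the classifier needs a taxonomy entry for each reference sequence, one optional sanity check is to compare the ID sets of the two files before going further. This is only a sketch on toy data, with made-up file names and records:

```shell
workdir=$(mktemp -d)

# Toy FASTA and matching toy taxonomy table (all values are hypothetical)
cat > "$workdir/toy.fna" <<'EOF'
>ID_A~contig1 d__Bacteria;p__PhylumA
ACGT
>ID_B~contig2 d__Archaea;p__PhylumB
TTGA
EOF
printf 'ID_A~contig1\td__Bacteria;p__PhylumA\nID_B~contig2\td__Archaea;p__PhylumB\n' \
  > "$workdir/toy_tax.tsv"

# Sorted list of IDs from the FASTA headers (text before the first space)
grep '^>' "$workdir/toy.fna" | sed 's/^>//' | cut -d ' ' -f1 | sort > "$workdir/fasta_ids.txt"

# Sorted list of IDs from the first column of the taxonomy table
cut -f1 "$workdir/toy_tax.tsv" | sort > "$workdir/tax_ids.txt"

# comm -3 prints only the IDs found in one file but not the other
mismatched=$(comm -3 "$workdir/fasta_ids.txt" "$workdir/tax_ids.txt")

if [ -z "$mismatched" ]; then
  echo "ID sets match"
else
  echo "mismatched IDs:"
  echo "$mismatched"
fi
```

On the real data, the same comparison would use ssu_all_r202.fna and ssu_all_r202_tax.tsv in place of the toy files.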

Perform QA/QC through RESCRIPt if needed. Then build classifier:

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads ssu_all_r202_seqs.qza \
    --i-reference-taxonomy ssu_all_r202_tax.qza \
    --o-classifier gtdb_classifier.qza

Test classifier:

qiime feature-classifier classify-sklearn \
  --i-classifier gtdb_classifier.qza \
  --i-reads rep-seqs.qza \
  --p-n-jobs 4 \
  --o-classification taxonomy.qza

qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

I just tested this locally and it appears to work. Let us know if this works for you too.