How to use GTDB

Hello,
I want to create an OTU table using GTDB.
Can I use the sequences in the “ssu_all_r95_last.fna” file together with the taxonomy given in each sequence header?

Thank you.

Hello,

I assume you mean you want to train a classifier to use the GTDB taxonomy with a 16S rRNA reference database? If you have the GTDB representative sequences and their associated taxonomy file, then it is just a matter of ensuring they are both in the correct format and importing them into QIIME2 as FeatureData[Sequence] and FeatureData[Taxonomy] types, respectively.

For the taxonomy file, you need a headerless TSV file with two tab-separated columns (the sequence ID, then the taxonomy string) in the following format:

ACCESSION/ID	d__Bacteria;p__taxa;c__taxa;o__taxa;f__;g__;s__

Unless you're using an older version, in which case it's D_0__;D_1__ etc. (you can find more in this post: Importing FeatureData[Taxonomy])

Your reference database would just be a normally formatted fasta file with the format of:

>seq_id
sequence
>seq_id2
sequence
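If the taxonomy is only available in the FASTA headers, something like the following could pull it out into the two files described above. This is a rough sketch, not an official GTDB or QIIME2 tool: it assumes each header looks like ">ID taxonomy [key=value] ...", with the ID first, then the taxonomy string, then optional bracketed annotations. The function names are made up for illustration, so check a few headers in your own download before relying on it.

```python
import re


def split_gtdb_header(header):
    """Split one FASTA header into (seq_id, taxonomy).

    Assumed layout (verify against your file):
    '>ID d__...;p__...;s__Genus species [key=value] [key=value]'
    """
    header = header.lstrip(">").strip()
    # Everything up to the first space is treated as the sequence ID.
    seq_id, _, rest = header.partition(" ")
    # Drop bracketed annotations such as [location=...] if present.
    taxonomy = re.sub(r"\s*\[[^\]]*\]", "", rest).strip()
    return seq_id, taxonomy


def convert(fasta_lines):
    """Return (taxonomy_rows, fasta_out) from an iterable of FASTA lines.

    taxonomy_rows: headerless 'ID<TAB>taxonomy' lines.
    fasta_out: the same sequences with headers reduced to '>ID'.
    """
    taxonomy_rows = []
    fasta_out = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            seq_id, taxonomy = split_gtdb_header(line)
            taxonomy_rows.append(f"{seq_id}\t{taxonomy}")
            fasta_out.append(f">{seq_id}")
        elif line:
            fasta_out.append(line)
    return taxonomy_rows, fasta_out
```

You would read the downloaded .fna file line by line, pass the lines to convert(), and write the two returned lists to a .tsv and a .fasta file. Because both outputs use the same ID, the sequences and taxonomy stay in sync.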

After you have imported these files, it should just be a matter of training and using the classifier (hopefully).

EDIT: I should clarify that the "seq_id" and "ACCESSION/ID" need to be identical for each sequence and its associated taxonomy, or else it won't work.
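
One way to check this before importing is to compare the two sets of IDs. The snippet below is only a sketch with hypothetical function names; it assumes the FASTA ID is the first whitespace-delimited token after ">" and that the taxonomy file is a headerless TSV with the ID in the first column.

```python
def fasta_ids(lines):
    """Collect sequence IDs (first token after '>') from FASTA lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    }


def taxonomy_ids(lines):
    """Collect IDs from the first column of a headerless taxonomy TSV."""
    return {line.split("\t", 1)[0].strip() for line in lines if line.strip()}


def compare_ids(fasta_lines, taxonomy_lines):
    """Return (IDs only in the FASTA, IDs only in the taxonomy), both sorted."""
    seq = fasta_ids(fasta_lines)
    tax = taxonomy_ids(taxonomy_lines)
    return sorted(seq - tax), sorted(tax - seq)
```

If both returned lists are empty, every sequence has a taxonomy entry and vice versa. Anything left over is an ID to fix or drop before importing.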

Thank you for your answer.
I am curious about which files need to be downloaded from GTDB.
There are so many files.
I am wondering whether I should use ssu_all_r95_last.fna or bac120_ssu_reps_r95.tar.gz.

Thank you.

Based on GTDB's file description (which can be found here: GTDB Data - /releases/release95/95.0/), the file you probably want is the bac120_ssu_reps file found within the "genomic_files_reps" folder:

bac120_ssu_reps_.tar.gz
FASTA file of 16S rRNA gene sequences identified within the set of bacterial representative genomes. The longest identified 16S rRNA sequence is selected for each representative genome. The assigned taxonomy reflects the GTDB classification of the genome. Sequences are identified using nhmmer with the 16S rRNA model (RF00177) from the RFAM database. Only sequences with a length >=200 bp and an E-value <= 1e-06 are reported. In a small number of cases, the 16S rRNA sequences are incongruent with this taxonomic assignment as a result of contaminating 16S rRNA sequences.

But this will depend entirely on the question you want answered. I'd suggest taking a look at their release notes, file description, and method files to get a sense of what each file offers and which might be most appropriate for your work.

Cheers
