GTDB files for Taxonomy build

biojack · May 18, 2022, 1:40pm

Hi, all

I want to try use GTDB files for building classifier on 16S biome analysis. I'm bit confusing what files exactly should I take

I'm interested in 202 release of GTDB

So it looks like for taxonomy I should take files ar122_taxonomy_r202.tsv and bac120_taxonomy_r202.tsv

and sequences from ssu_all_r202.tar.gz

But what confusing me is some unmatching of taxonomy and sequence tags. For example in Tax file I have line

RS_GCF_000213495.1	d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia flexneri

And in fasta file I have lines

>RS_GCF_000213495.1~NZ_AFHD01000036.1 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia flexneri [location=3567..5105] [ssu_len=1538] [contig_len=293796]

and

>RS_GCF_000213495.1~NZ_AFHD01000041.1 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia flexneri [location=3566..5104] [ssu_len=1538] [contig_len=142194]

So in fact for one record RS_GCF_000213495.1 in taxonomy file I have multiple records in fasta file. What should I do with them? Remain only random one or could remail all repeating IDs? Or probably I took wrong (not best) files from GTDB ? Also do they have NR99 version of sequences or this routine should be on user side?

I also looked at file bac120_ssu_reps_r202.tar.gz

it looks like keeping non-repeating IDs, but some IDs are lost there

Thank you much for your attention

SoilRotifer · May 18, 2022, 6:20pm

Hi @biojack,

There are indeed a variety of files to use... Part of the issue is that many microbial taxa contain multiple copies of the 16S rRNA gene. These copies can be quite different from one another too! From what I gather the all files contain all of the copies of the 16S rRNA gene from a given genome. If you read the File Descriptions, you'll see that they curate the rep files differently than the all files based on what they include / exclude. Often researchers will simply increment the number, or use the sequence positions to differentiate the various 16S rRNA gene copies, e.g. SILVA uses the start and stop positions for the 16S rRNA gene. GTDB appears to use the ~ annotation of RS_GCF_000213495.1~NZ_... in this example:

This is what the rep files basically are... sort of... that is, if you read the above linked File Descriptions you'll see that the *_ssu_reps_<release>.tar.gz sequence file is a :

FASTA file of 16S rRNA gene sequences identified within the set of bacterial representative genomes. The longest identified 16S rRNA sequence is selected for each representative genomes. ...

Below is one way you might consider importing and preparing the GTDB data:

Download and uncompress the files:

# Bacteria
wget https://data.gtdb.ecogenomic.org/releases/release202/202.0/bac120_taxonomy_r202.tsv.gz
gunzip bac120_taxonomy_r202.tsv.gz

wget https://data.gtdb.ecogenomic.org/releases/release202/202.0/genomic_files_reps/bac120_ssu_reps_r202.tar.gz
tar -xvf bac120_ssu_reps_r202.tar.gz

#Archaea
wget https://data.gtdb.ecogenomic.org/releases/release202/202.0/ar122_taxonomy_r202.tsv.gz
! gunzip ar122_taxonomy_r202.tsv.gz

wget https://data.gtdb.ecogenomic.org/releases/release202/202.0/genomic_files_reps/ar122_ssu_reps_r202.tar.gz
! tar -xvf  ar122_ssu_reps_r202.tar.gz

Then import the files:

# Bacteria
qiime tools import \
    --input-path bac120_ssu_reps_r202.fna \
    --type 'FeatureData[Sequence]' \
    --output-path bact_seqs.qza

qiime tools import \
    --input-path bac120_taxonomy_r202.tsv \
    --type 'FeatureData[Taxonomy]' \
    --input-format 'HeaderlessTSVTaxonomyFormat' \
    --output-path bact_tax.qza


# Archaea
qiime tools import \
    --input-path ar122_ssu_reps_r202.fna \
    --type 'FeatureData[Sequence]' \
    --output-path arch_seqs.qza

qiime tools import \
    --input-path ar122_taxonomy_r202.tsv \
    --type 'FeatureData[Taxonomy]' \
    --input-format 'HeaderlessTSVTaxonomyFormat' \
    --output-path arch_tax.qza

Merge the files together:

qiime feature-table merge-taxa \
    --i-data bact_tax.qza arch_tax.qza \
    --o-merged-data gtdb_tax.qza

qiime feature-table merge-seqs \
    --i-data bact_seqs.qza arch_seqs.qza \
    --o-merged-data gtdb_seqs.qza

Run any quality control and/or curation via RESCRIPt, see this example, which uses SILVA.

When you are done, you are ready to train your classifier:

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads gtdb_seqs.qza \
    --i-reference-taxonomy gtdb_tax.qza \
    --o-classifier gtdb_classifier.qza

There are likely other ways to do this, but this should get you started. Note: We could likely parse the taxonomy from the FASTA file directly. But for now, it is much easier to import the individual taxonomy and sequence files separately into QIIME 2. We hope to have a builtin parser for GTDB within RESCRIPt soon.

biojack · May 18, 2022, 6:48pm

@SoilRotifer Thank you much, very clean and useful answer.

I will try go with *_ssu_reps_<release> as your suggested. I assumed that this file could be NR99, I just been consused that some genes just lost ( I mean totally lost, not just selecting one longest from all set ). Just as an example - there are no RS_GCF_003689455.1 gene

SoilRotifer · May 18, 2022, 6:56pm

Perhaps it failed some other quality control step or was clustered within another representative?

Likely a good question to ask the GTDB folks.

biojack · May 18, 2022, 8:25pm

@SoilRotifer good point, it would be better to ask GTDB support

One common thing I wish to clarify in the end. Is that problem in general for classifier if there some mismathes between tax file and sequence file ? I mean if I have gene in tax but I don't have this gene in fasta file or if I have multiple genes in fasta ( because of multiple copies as your explained yearly ). Does tax and fasta file MUST have absolute match between genes and therefore the same amount of records or this is not necessary for classifier learning correctly ?

SoilRotifer · May 18, 2022, 8:29pm

The sequence IDs can be a subset of the taxonomy IDs. Not the other way around.

biojack · May 19, 2022, 11:11am

@SoilRotifer got it. But what about repeating IDs? Classifier will be constructed normally if there some IDs just repeating in taxonomy or fasta? It's practical question because for example in all file from GTDB there are repeating IDs in fasta ( as I described in topic )

One more thing. I tried reps file and I totally dislike it. Because it has just only one sequence for each species, not only for each gene. For example there are only one sequence for s__Akkermansia muciniphila

but in tax file there are a lot of different genes

So looks like rep file just lost enormous amount of information. Much more than NR99 should lost. At least if we will look on NR99 file on Silva DB to check the same things

SoilRotifer · May 19, 2022, 2:17pm

Those IDs are not repeating. I noted my initial reply, here is the ID below. That is, the standard FASTA header format considers anything prior to the first space the full ID.
>RS_GCF_000213495.1~NZ_AFHD01000036.1

Yes, this is intended for the rep seqs, as outlined here.

Here is another approach to import everything for use as a classifier:

Download and extract full ssu file:

wget https://data.gtdb.ecogenomic.org/releases/release202/202.0/genomic_files_all/ssu_all_r202.tar.gz

tar -xvf ssu_all_r202.tar.gz

Extract and parse the FASTA header and write to file:

# Pull the header, keep the first two items (seqID and Taxonomy label), remove '>', and replace ' ' (space) with '\t' (tab)
egrep '^>' ssu_all_r202.fna | cut -d ' ' -f1,2 | sed 's/>//; s/ /\t/' > ssu_all_r202_tax.tsv

Then import as a taxonomy file:

qiime tools import \
    --input-path ssu_all_r202_tax.tsv \
    --type 'FeatureData[Taxonomy]' \
    --input-format 'HeaderlessTSVTaxonomyFormat' \
    --output-path ssu_all_r202_tax.qza

Then import the FASTA file as is:

qiime tools import \
    --input-path ssu_all_r202.fna \
    --type 'FeatureData[Sequence]' \
    --output-path ssu_all_r202_seqs.qza

Perform QA/QC through RESCRIPt if needed. Then build classifier:

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads ssu_all_r202_seqs.qza \
    --i-reference-taxonomy ssu_all_r202_tax.qza \
    --o-classifier gtdb_classifier.qza

Test classifier:

qiime feature-classifier classify-sklearn \
  --i-classifier gtdb_classifier.qza \
  --i-reads rep-seqs.qza \
  --p-n-jobs 4 \
  --o-classification taxonomy.qza

qiime metadata tabulate \
  --m-input-file taxonomy.qza \
  --o-visualization taxonomy.qzv

I just tested this locally and it appears to work. Let us know if this works for you too.

biojack · May 19, 2022, 2:43pm

right, but in Tax file there is only RS_GCF_000213495.1 id. So you need either change ID in fasta to RS_GCF_000213495.1 or change and multiply IDs in taxa file. First approach looks more elegant but then there will be repeated IDs in fasta

You suggested to generate taxa file from fasta description instead. It's absolutely fine if description in fasta totally corresponding to taxa file information. But I don't have such knowledge. Looks like that's true but IMO usage of original tax file would be a little bit safer. Anyway thank you much for your support

SoilRotifer · May 19, 2022, 3:02pm

Again, this has to do with how they are picking the representative 16S rRNA gene, as I mentioned earlier (e.g. species clusters etc..).

Why wouldn't this be the case? The taxonomy contained within the FASTA header is valid. If it wasn't they would not provide it within the sequence header. Many other databases follow this approach.

What do you base this on? The original taxonomy file serves a different purpose, it is a list of GTDB taxonomy for all bacterial genomes assigned to a GTDB species cluster, whereas the taxonomy in the FASTA header of the ssu file is for all the 16S rRNA gene copies in all genomes. Which is more inclusive. That is, you can think of the repset as a subset / species cluster version of the full ssu FASTA file.

That is, the individual archaeal and bacterial taxonomy files only contain 258,408 entires combined, where as the ssu file with all 16S rRNA gene copies contains 486,035 entries.

So basically, we have two approaches... 1) the repseq approach, and the 2) full approach. Neither is necessarily more valid than the other. It depends on what you are asking of the data.

biojack · May 19, 2022, 4:22pm

@SoilRotifer you absolutely right. My only concern was that tax file naming may be more valid or prepared for further processing ( like containing only two words on species level instead of full strain name or smth like this). Of cource if it is a general approach to produce taxonomic from fasta file header - I should also try it. I will run all file and will return if there will be smth interesting

SoilRotifer · May 19, 2022, 4:57pm

Ahh, gotcha!

Well, you can use rescript edit-taxonomy ... to fix those cases.

Here is one example use case:

biojack · May 23, 2022, 7:55am

@SoilRotifer Hi

Just need to note, that there is seems a mistake in your suggestion of constructing taxonomy from fasta

SoilRotifer:

# Pull the header, keep the first two items (seqID and Taxonomy label), remove '>', and replace ' ' (space) with '\t' (tab)
egrep '^>' ssu_all_r202.fna | cut -d ' ' -f1,2 | sed 's/>//; s/ /\t/' > ssu_all_r202_tax.tsv

for example there line in fasta

RS_GCF_001571485.1~NZ_LPTV01000250.1 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia flexneri [location=183..1721] [ssu_len=1538] [contig_len=5037]

corresponding tax line should be

RS_GCF_001571485.1~NZ_LPTV01000250.1    d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia flexneri

but after your script line in "ssu_all_r202_tax.tsv" tax file is

RS_GCF_001571485.1~NZ_LPTV01000250.1    d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia

so species name has been lost ( expected "s__Escherichia flexneri", got "s__Escherichia" )

SoilRotifer · May 23, 2022, 1:55pm

Good catch! Yeah, I wrote that quickly and "off the cuff"...

I forgot to add a 3 to the cut command. Basically, the cut command was returning only the first two strings separated by a space:

RS_GCF_001571485.1~NZ_LPTV01000250.1 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia

and not the first three:

RS_GCF_001571485.1~NZ_LPTV01000250.1 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia flexneri

This should work:

egrep '^>' ssu_all_r202.fna | cut -d ' ' -f1,2,3 | sed 's/>//; s/ /\t/' | head

I now get:

RS_GCF_001571485.1~NZ_LPTV01000250.1 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia flexneri

Though the command below might be a more clean and safe approach. It will remove, the >, then remove anything after the [ (space bracket) then replace the first (space) with a \t (tab). The other command is the same except uses [location as part of the search term.

egrep '^>' ssu_all_r202.fna | sed 's/^>//; s/ \[.*//; s/ /\t/' > ssu_all_r202_tax.tsv
# or
egrep '^>' ssu_all_r202.fna | sed 's/^>//; s/ \[location.*//; s/ /\t/' > ssu_all_r202_tax.tsv

This:

RS_GCF_001571485.1~NZ_LPTV01000250.1 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia flexneri [location=183..1721] [ssu_len=1538] [contig_len=5037]

Will become:

RS_GCF_001571485.1~NZ_LPTV01000250.1 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia flexneri

Again, I've not gone through and looked at all possible issues that can arise. But this can get you started. Just double check this text editor and confirm.

biojack · May 23, 2022, 2:30pm

I already created python script which converted fasta to tax. I wrote about issue to be sure that somebody else on forum will not going wrong with this.

I tried your last one

egrep '^>' ssu_all_r202.fna | sed 's/^>//; s/ \[location.*//; s/ /\t/' > ssu_all_r202_tax.tsv

output looks nice. Of cource as you I also cannot guarantee that any issue may arise. But look pretty nice

SoilRotifer · May 23, 2022, 2:36pm

Great!

We'll be implementing a more formal bit of python code within RESCRIPt to handle GTDB data. Hopefully, it'll be sooner rather than later.