Hi all,
I would like to use GTDB files to build a classifier for 16S microbiome analysis, but I'm a bit confused about exactly which files I should take.
I'm interested in release 202 of GTDB.
It looks like for the taxonomy I should take the files ar122_taxonomy_r202.tsv and bac120_taxonomy_r202.tsv,
and the sequences from ssu_all_r202.tar.gz.
What confuses me is a mismatch between the identifiers in the taxonomy and sequence files. For example, in the taxonomy file I have the line
RS_GCF_000213495.1 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia flexneri
and in the FASTA file I have the lines
>RS_GCF_000213495.1~NZ_AFHD01000036.1 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia flexneri [location=3567..5105] [ssu_len=1538] [contig_len=293796]
and
>RS_GCF_000213495.1~NZ_AFHD01000041.1 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia flexneri [location=3566..5104] [ssu_len=1538] [contig_len=142194]
So for the single record RS_GCF_000213495.1 in the taxonomy file there are multiple records in the FASTA file. What should I do with them? Should I keep just one at random, or can I keep all of the records that share the same genome ID? Or did I perhaps pick the wrong (not the best) files from GTDB? Also, does GTDB provide an NR99 version of the sequences, or is that dereplication step left to the user?
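To make the mismatch concrete, here is a minimal Python sketch of how the FASTA headers could be grouped by genome ID. It assumes the header layout shown above, i.e. that the part before `~` is the genome accession used in the taxonomy file and the part after it is the contig. The function name `group_ssu_headers_by_genome` is made up for illustration, and the taxonomy strings in the example headers are shortened.

```python
from collections import defaultdict


def group_ssu_headers_by_genome(header_lines):
    """Map genome accession -> list of contig IDs found in FASTA header lines.

    Assumption: each header looks like '>GENOME~CONTIG taxonomy [key=value] ...',
    as in the two example records from ssu_all_r202 quoted above.
    """
    groups = defaultdict(list)
    for line in header_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        tokens = line[1:].split(None, 1)
        if not tokens:
            continue  # bare '>' with no identifier
        # The identifier is the first whitespace-delimited token.
        genome_id, _, contig_id = tokens[0].partition("~")
        groups[genome_id].append(contig_id)
    return dict(groups)


# Headers shortened from the two records above (taxonomy string truncated here).
example_headers = [
    ">RS_GCF_000213495.1~NZ_AFHD01000036.1 d__Bacteria;p__Proteobacteria [ssu_len=1538]",
    ">RS_GCF_000213495.1~NZ_AFHD01000041.1 d__Bacteria;p__Proteobacteria [ssu_len=1538]",
]

grouped = group_ssu_headers_by_genome(example_headers)
for genome_id, contigs in grouped.items():
    print(genome_id, len(contigs), contigs)
```

With the real file one would pass the opened FASTA file object instead of `example_headers`; every genome ID with more than one contig is a case like the one I describe.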
I also looked at the file bac120_ssu_reps_r202.tar.gz.
It appears to contain only non-repeating IDs, but some IDs from the taxonomy file are missing there.
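A rough sketch of how the missing IDs could be listed, in Python. It assumes the taxonomy file is tab-separated with the genome accession in the first column, and that each FASTA header starts with that accession (optionally followed by `~contig`). The function name `ids_missing_from_fasta` and the second accession in the example are hypothetical, used only to show the idea.

```python
def ids_missing_from_fasta(taxonomy_lines, fasta_lines):
    """Return sorted genome IDs present in the taxonomy lines but absent from the FASTA headers."""
    # Assumption: accession is the first tab-separated column of the taxonomy file.
    tax_ids = {
        line.split("\t", 1)[0].strip()
        for line in taxonomy_lines
        if line.strip()
    }
    fasta_ids = set()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        tokens = line[1:].split(None, 1)
        if tokens:
            # Drop an optional '~contig' suffix so IDs compare by genome accession.
            fasta_ids.add(tokens[0].partition("~")[0])
    return sorted(tax_ids - fasta_ids)


# Toy input: the second accession is invented purely for this example.
taxonomy_example = [
    "RS_GCF_000213495.1\td__Bacteria;p__Proteobacteria",
    "RS_GCF_000000000.1\td__Bacteria;p__Proteobacteria",
]
fasta_example = [
    ">RS_GCF_000213495.1 d__Bacteria;p__Proteobacteria",
    "ACGT",
]

missing = ids_missing_from_fasta(taxonomy_example, fasta_example)
print(missing)
```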
Thank you very much for your attention.