GTDB classifier building and training

Kapaiden · October 26, 2021, 3:51pm

Dear QIIME2 community,
I would like to use the GTDB database for a 16S microbiota analysis. I downloaded the ssu_all_r202 file, imported it into QIIME 2 and extracted the V3V4 region of the sequences. I downloaded the ar122_taxonomy_r202 and bac120_taxonomy_r202 files, merged them and then imported the output into QIIME 2. When I ran the qiime feature-classifier fit-classifier-naive-bayes command, I got the error "not enough values to unpack (expected 2, got 0)". This is presumably because the sequence IDs are not the same as the IDs in the taxonomy file:

RS_GCF_001571485.1~NZ_LPTV01000250.1 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia flexneri [location=183..1721] [ssu_len=1538] [contig_len=5037]
TGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGCAGCTTGCTGCTTCGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAA

"RS_GCF_014075335.1" "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia; s__Escherichia flexneri"

I can remove everything after the "~" sign in the sequence IDs, but then some IDs are no longer unique.
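One quick way to see which IDs would collide after trimming is sketched below. It is only an illustration, assuming FASTA headers shaped like the example above (genome accession, then "~" and a contig accession, then a description). The function name duplicate_genome_ids is made up for this example and is not part of QIIME 2 or GTDB.

```shell
# Print every genome accession that occurs more than once in a FASTA file
# once the header is cut at the first "~" or whitespace.
# Usage: duplicate_genome_ids <sequences.fna>
# (hypothetical helper, for illustration only)
duplicate_genome_ids() {
    grep '^>' "$1" \
        | sed -e 's/^>//' -e 's/[~[:space:]].*$//' \
        | LC_ALL=C sort \
        | uniq -d
}
```

Running `duplicate_genome_ids ssu_all_r202.fna | wc -l` would then give the number of accessions that appear on more than one sequence.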

Nicholas_Bokulich said in an answer to another post that "GTDB works well with QIIME 2, the files are already in an appropriate format", so I assume I am missing something, but I don't know what.

The tutorial "Training feature classifiers with q2-feature-classifier" was so helpful for me that I wish there were an equivalent for the GTDB or even the SILVA database!

Some help would be appreciated!

jwdebelius · October 28, 2021, 8:38am

Hi @Kapaiden,

Welcome to the QIIME 2 forum!

I have good news and bad news. The GTDB files at r89 and before are formatted to work with QIIME 2. (There may still be tree issues, but let's deal with building the classifier first.) It looks like the later versions don't have that specific setup. I've been working with r89 as a reference database for metagenomic annotation and it works pretty well for my fecal samples, but I am also limited by resource issues. So, I think my first suggestion would be to switch to r89.

Once you get the sequences imported, you can follow the training a feature classifier tutorial. If you're interested in Silva, you might want to explore the RESCRIPt tutorial for Silva specifically.

Best,
Justine

Kapaiden · October 28, 2021, 10:30am

Thank you for this clear answer! So I will switch to SILVA with RESCRIPt. Thank you again!

Nicholas_Bokulich · October 28, 2021, 3:31pm

Hi @Kapaiden ,

This may be a better/easier solution overall!

But you got me curious about this, and concerned that GTDB is suddenly not compatible. So I tested and as far as I can tell the current GTDB release is still compatible. Training the classifier (code below) works for me but I have not attempted to test classification.

I think your error message probably relates to the part of your workflow where you merge the two taxonomy files.

When merging, you might be introducing a formatting error (e.g., you concatenate files that are not separated by a line break, or you introduce a blank line).

The error message "not enough values to unpack (expected 2, got 0)" is consistent with such a formatting issue.
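A merged taxonomy file can be checked for this kind of problem before importing it. The sketch below assumes the headerless layout used in this thread (one ID, a tab, one taxonomy string per line). The function name check_taxonomy_tsv is hypothetical. It flags blank lines and lines where two records were glued together, but it does not reproduce QIIME 2's own validation.

```shell
# Report lines of a headerless taxonomy TSV that are not exactly two
# non-empty tab-separated fields (ID <TAB> taxonomy string).
# Exits non-zero if any such line is found.
# Usage: check_taxonomy_tsv <taxonomy.tsv>
# (hypothetical helper, for illustration only)
check_taxonomy_tsv() {
    awk -F '\t' '
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d: expected 2 fields, got %d\n", NR, NF
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

A blank line is reported with 0 fields, and two records joined without a line break are reported with 3 fields.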

This code works:

qiime tools import \
    --type 'FeatureData[Sequence]' \
    --input-path bac120_ssu_reps_r202.fna \
    --output-path bac120_ssu_reps_r202.qza

qiime tools import \
    --type 'FeatureData[Taxonomy]' \
    --input-path bac120_taxonomy_r202.tsv \
    --output-path bac120_taxonomy_r202.qza \
    --input-format HeaderlessTSVTaxonomyFormat

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads bac120_ssu_reps_r202.qza \
    --i-reference-taxonomy bac120_taxonomy_r202.qza \
    --o-classifier bac120_classifier_r202_q2-2021.8.qza

Kapaiden · November 3, 2021, 1:14pm

Dear all,

Nicholas_Bokulich's answer gave me the solution: I merged the ar122_ssu_reps_r202.fna and bac120_ssu_reps_r202.fna files for the sequences (instead of using the ssu_all_r202.fna file) and still used the merged ar122_taxonomy_r202 and bac120_taxonomy_r202 files for the taxonomy. Building the classifier with these two resulting files worked! Now I am happy that I can compare the results obtained with the SILVA and GTDB databases.

Thank you for your help.
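To confirm that a given sequence file lines up with a taxonomy file before training, the IDs on both sides can be compared directly. This is a rough sketch, assuming the FASTA ID is the header text up to the first whitespace and the taxonomy ID is the first tab-separated column. The function name missing_ids is invented for this example.

```shell
# List sequence IDs in a FASTA file that have no row in a taxonomy TSV.
# Usage: missing_ids <sequences.fna> <taxonomy.tsv>
# (hypothetical helper, for illustration only)
missing_ids() {
    fasta_ids=$(mktemp)
    tax_ids=$(mktemp)
    # FASTA ID: header text after ">" up to the first whitespace
    grep '^>' "$1" | sed 's/^>//' | awk '{print $1}' | LC_ALL=C sort -u > "$fasta_ids"
    # Taxonomy ID: first tab-separated column
    cut -f1 "$2" | LC_ALL=C sort -u > "$tax_ids"
    # comm -23 keeps only lines unique to the first (FASTA) list
    LC_ALL=C comm -23 "$fasta_ids" "$tax_ids"
    rm -f "$fasta_ids" "$tax_ids"
}
```

An empty result from `missing_ids full_seq_GTDB_r202.fna full_taxonomy_r202.tsv` would mean every sequence has a taxonomy entry. IDs containing "~", as in the ssu_all_r202 example earlier in the thread, would be listed as missing.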

Nicholas_Bokulich · November 3, 2021, 4:54pm

Hi @Kapaiden ,
Any chance you could share your solution (code)? This will help others follow in your footsteps instead of the partial solution that I posted.

We plan to add GTDB support in RESCRIPt some day, so when that day comes it will be even easier to auto-format the GTDB database for use with QIIME 2.

Thanks!

Kapaiden · November 17, 2021, 3:33pm

Dear all,
Here is the code I used to build the classifier. It could probably be improved, for example by adding a quality filtering step. Any suggestions are welcome.

cat ar122_ssu_reps_r202.fna bac120_ssu_reps_r202.fna > full_seq_GTDB_r202.fna

cat ar122_taxonomy_r202.tsv bac120_taxonomy_r202.tsv > full_taxonomy_r202.tsv

qiime tools import \
    --type 'FeatureData[Sequence]' \
    --input-path full_seq_GTDB_r202.fna \
    --output-path full_seq_GTDB_r202.qza

qiime tools import \
    --type 'FeatureData[Taxonomy]' \
    --input-path full_taxonomy_r202.tsv \
    --output-path full_taxonomy_r202.qza \
    --input-format HeaderlessTSVTaxonomyFormat

qiime feature-classifier extract-reads \
    --i-sequences full_seq_GTDB_r202.qza \
    --p-f-primer CCTACGGGNGGCWGCAG \
    --p-r-primer GGACTACHVGGGTWTCTAAT \
    --p-n-jobs 2 \
    --p-read-orientation 'forward' \
    --o-reads V3V4_GTDB_r202.qza

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads V3V4_GTDB_r202.qza \
    --i-reference-taxonomy full_taxonomy_r202.qza \
    --o-classifier GTDBclassifierV3V4.qza
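The two cat steps in this workflow rely on each input file ending with a line break. If one does not, the last record of one file and the first record of the next end up on the same line, which is the kind of formatting error mentioned earlier in the thread. A slightly more defensive merge could look like the sketch below. The function name merge_with_newlines is hypothetical, and this is not something the thread's author reported needing.

```shell
# Concatenate files to stdout, adding a newline after any file whose
# last byte is not already a newline.
# Usage: merge_with_newlines <file1> <file2> ... > merged
# (hypothetical helper, for illustration only)
merge_with_newlines() {
    for f in "$@"; do
        cat "$f"
        # $(...) strips a trailing newline, so a non-empty result means
        # the file ends in some other character
        if [ -n "$(tail -c 1 "$f")" ]; then
            printf '\n'
        fi
    done
}
```

For example: `merge_with_newlines ar122_taxonomy_r202.tsv bac120_taxonomy_r202.tsv > full_taxonomy_r202.tsv`.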
