train database for V3-V4 region

elina2410 · October 24, 2023, 6:33am

I want to train my own V3-V4 database. I read the tutorial 'Training feature classifiers with q2-feature-classifier', but I am still not sure which files to use. I know that I need the reference sequences and the corresponding taxonomic classifications. But which files from this site, Index of /greengenes_release/2022.10, are the reference sequences and the corresponding taxonomic classifications?
I took 2022.10.backbone.full-length.fna.qza and 2022.10.taxonomy.asv.tsv.qza, but after training and testing the database, my sequences were assigned only to Bacteria, without any further classification. I assume I took the wrong files.

qiime tools import --type 'FeatureData[Sequence]' --input-path dna-sequences.fasta --output-path 2022.10.backbone.full-length.fna.qza

qiime tools import --type 'FeatureData[Taxonomy]' --input-format HeaderlessTSVTaxonomyFormat --input-path taxonomy.tsv --output-path 2022.10.taxonomy.asv.tsv.qza

qiime feature-classifier extract-reads --i-sequences 2022.10.backbone.full-length.fna.qza --p-f-primer CCTACGGGNGGCWGCAG --p-r-primer GACTACHVGGGTATCTAATCC --o-reads ref-seqs.qza

qiime feature-classifier fit-classifier-naive-bayes --i-reference-reads ref-seqs.qza --i-reference-taxonomy 2022.10.taxonomy.asv.tsv.qza --o-classifier gg_2022_10_FL_trained_classifier.qza

qiime feature-classifier classify-sklearn --i-classifier ~/qiime-2023.09_database/gg_2022_10_fulllength_trained/gg_2022_10_FL_trained_classifier.qza --i-reads qiime_2_23/rep-seq-meta-dada2_223.qza --o-classification taxonomy_trained_gg_FL.qza
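
A side note on the primers in the extract-reads command above: both contain IUPAC ambiguity codes (N, W, H, V). Below is a minimal sketch, assuming plain POSIX sh with tr and sed, of how such a primer expands into a regular expression. The function name iupac_to_regex is hypothetical and not part of QIIME 2, and unlike extract-reads, a pattern like this only finds exact matches.

```shell
# Hypothetical helper, not part of QIIME 2: expand IUPAC ambiguity codes
# in a primer into a regular expression (exact matching only).
iupac_to_regex() {
    printf '%s\n' "$1" | tr 'a-z' 'A-Z' | sed \
        -e 's/R/[AG]/g' -e 's/Y/[CT]/g' \
        -e 's/S/[CG]/g' -e 's/W/[AT]/g' \
        -e 's/K/[GT]/g' -e 's/M/[AC]/g' \
        -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
        -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' \
        -e 's/N/[ACGT]/g'
}

# Forward primer from the extract-reads command above.
iupac_to_regex CCTACGGGNGGCWGCAG    # prints CCTACGGG[ACGT]GGC[AT]GCAG
```

The resulting pattern could then be passed to grep -E against an exported FASTA file to spot-check whether the primer site is present, keeping in mind that sequences wrapped over several lines would need to be unwrapped first.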

SoilRotifer · October 24, 2023, 1:57pm

Hi @elina2410,

Usually this happens when most, or all, of your sequences are oriented differently compared to your reference data. That is, both your reference sequences and your data need to be oriented in the 5'-3' direction.
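
To illustrate what a differently oriented read looks like, here is a minimal sketch in plain POSIX sh (tr and awk only). The revcomp function is a hypothetical helper, not a QIIME 2 command; it returns the reverse complement of a sequence, which is how a read in the opposite orientation would appear relative to the reference.

```shell
# Hypothetical helper, not a QIIME 2 command: reverse-complement a DNA
# sequence, including IUPAC ambiguity codes (S, W and N are their own
# complements, so tr leaves them unchanged).
revcomp() {
    printf '%s\n' "$1" \
        | tr 'ACGTRYKMBDHVacgtrykmbdhv' 'TGCAYRMKVHDBtgcayrmkvhdb' \
        | awk '{
            out = ""
            # Walk the complemented string backwards to reverse it.
            for (i = length($0); i > 0; i--) out = out substr($0, i, 1)
            print out
        }'
}

revcomp AAGC    # prints GCTT
```

Comparing a few of your representative sequences, and their reverse complements, against a reference sequence is one rough way to see which orientation your data are in.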

For more information see here.

One quick sanity-check, to see if this is the case, is to try feature-classifier classify-consensus-vsearch... as this approach does not care about orientation.

That being said, it might not be a good idea to construct a phylogeny as we're unsure if most, or all, of your reads are oriented similarly.

elina2410 · October 25, 2023, 4:53am

Hi @SoilRotifer
I ran the classification with the pre-trained, ready-to-use classifiers (Data resources — QIIME 2 2023.9.2 documentation), both Silva and GG, full-length and 515-806, to compare the results. I obtained a much better classification, e.g.:

qiime feature-classifier classify-sklearn --i-classifier /home/usr/qiime-2023.09_database/silva-138-99-nb-classifier.qza --i-reads rep_seqs_merged.qza --o-classification taxonomy_silva-138-99-nb-classifier

Is this proof that my data are oriented properly?
Best
Ewelina

elina2410 · October 25, 2023, 12:11pm

I tried to train the Silva database using the files below, from Data resources — QIIME 2 2023.9.2 documentation

and it worked well.

Now I see that yesterday, when I tried to train the GreenGenes database, I used a taxonomy file which has a header, while in the command I wrote --input-format HeaderlessTSVTaxonomyFormat. Maybe this is the cause?
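
A minimal sketch of how the header question above could be checked before importing, assuming the taxonomy file is tab-separated and that a header, if present, starts with the columns Feature ID and Taxon. The function name taxonomy_import_format is hypothetical; it only prints which of the two --input-format values looks appropriate for the file.

```shell
# Hypothetical helper: inspect the first line of a taxonomy TSV and print
# the --input-format value that matches it.
# Assumption: a header row starts with "Feature ID<TAB>Taxon".
taxonomy_import_format() {
    awk -F'\t' 'NR == 1 {
        if ($1 == "Feature ID" && $2 == "Taxon")
            print "TSVTaxonomyFormat"
        else
            print "HeaderlessTSVTaxonomyFormat"
        exit
    }' "$1"
}
```

For example, taxonomy_import_format taxonomy.tsv could be run first, and its output used as the --input-format value in the qiime tools import command.
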
Sorry for my mistake..
Ewelina

SoilRotifer · October 25, 2023, 1:26pm

Looks like you have things working!

I have not used the new GreenGenes database much myself. Did you obtain any classification by simply using the full-length GreenGenes 2 database? I wonder if something went wrong at the feature-classifier extract-reads step?

Potentially. Just to be clear, are you saying that you got your V3V4 GreenGenes database to work?

elina2410 · October 27, 2023, 8:08am

Yes, I obtained a classification with the default, ready-to-use trained classifier downloaded from Data resources — QIIME 2 2023.9.2 documentation

And yes, I tried again to train the GG database with the same files as before, and this time it worked.
So the problem is solved. Thank you for your assistance!

Here are my results comparing the GreenGenes database I trained myself (V3-V4 region) with the full-length one (the default, ready-to-use trained db downloaded from the QIIME site):
GreenGenes, my own trained db (V3-V4): [image]
GreenGenes, full-length db from the QIIME site: [image]

and Silva:
Silva, my own trained db (V3-V4): [image]
Silva, full-length db from the QIIME site: [image]

I see better resolution after training my own database, both Silva and GreenGenes.
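
One rough way to put a number on "resolution" is to count how many taxonomic ranks each assignment reaches. Below is a minimal sketch, assuming a taxonomy.tsv as written by qiime tools export (tab-separated, header row starting with Feature ID, ranks separated by semicolons with prefixes such as d__ or p__). The function name taxonomy_depth_histogram is hypothetical.

```shell
# Hypothetical helper: print "depth<TAB>count" for a taxonomy TSV, where
# depth is the number of non-empty ranks in the Taxon column.
# Assumption: ranks look like "d__Bacteria; p__Firmicutes; ...".
taxonomy_depth_histogram() {
    awk -F'\t' '
        NR == 1 && $1 == "Feature ID" { next }    # skip the header row
        {
            n = split($2, ranks, ";")
            depth = 0
            for (i = 1; i <= n; i++) {
                r = ranks[i]
                gsub(/^[ \t]+|[ \t]+$/, "", r)    # trim whitespace
                sub(/^[a-z]__/, "", r)            # drop the rank prefix
                if (r != "") depth++              # empty ranks such as "s__" do not count
            }
            # A label without a prefix (e.g. "Unassigned") counts as depth 1.
            count[depth]++
        }
        END { for (d in count) print d "\t" count[d] }
    ' "$1" | sort -n
}
```

Running this on the exported taxonomy from each classifier would show how many features stop at depth 1 (domain only) and how many reach deeper ranks.
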
Thank you for your assistance!
Best,
Ewelina

SoilRotifer · October 28, 2023, 5:32pm

Awesome. I'm glad you got it working!
