How to build the greengenes equivalent db for 12S rRNA


As the subject says, I need to create the equivalent of the Greengenes reference database, but for 12S.

I’ve looked at the QIIME 2 documentation and it’s not very explicit.

The first step was to get all the needed mitochondrial genomes and their taxon IDs.

Next, I should build the OTUs and taxonomy. Here I think I should extract the 12S region from each genome, but I’m not quite sure how to approach this.

Once I have the OTUs I can use scikit-learn to build the classifier.

I’m quite new at this. Any hints would be appreciated.

Thank you

Hi @Cornel,
Sounds like quite an ambitious project.

q2-feature-classifier assumes that you already have a fasta file of sequences ready to use for database construction. It does not document how to re-generate greengenes because it does not assume (nor require) that level of effort. You could simply provide it with a fasta of full-length 12S gene sequences and let it do the rest. @BenKaehler anything to add?
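To make "a fasta file ready to use" a bit more concrete, here is a minimal sketch of the two plain-text inputs you would prepare: a FASTA of reference sequences and a tab-separated file mapping each sequence ID to a taxonomy string. The record IDs, lineages, and sequences below are made up for illustration, and the helper names `format_fasta` and `format_taxonomy` are hypothetical, not part of QIIME 2.

```python
# Hypothetical records: (sequence ID, taxonomy string, sequence).
# These values are placeholders, not real reference data.
records = [
    ("ref1", "k__Animalia; p__Chordata; c__Actinopterygii", "ACGTACGT"),
    ("ref2", "k__Animalia; p__Chordata; c__Mammalia", "TTGGCCAA"),
]


def format_fasta(records):
    """Return FASTA text with one header line and one sequence line per record."""
    return "".join(f">{seq_id}\n{sequence}\n" for seq_id, _, sequence in records)


def format_taxonomy(records):
    """Return headerless TSV text: sequence ID, a tab, then the taxonomy string."""
    return "".join(f"{seq_id}\t{taxonomy}\n" for seq_id, taxonomy, _ in records)


fasta_text = format_fasta(records)
taxonomy_text = format_taxonomy(records)
```

Written to disk, these two files are the kind of input the tutorial imports as `FeatureData[Sequence]` and `FeatureData[Taxonomy]` before training the classifier. The IDs in the two files must match.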

You could also just pull 12S gene sequences from genbank, instead of full mt genomes.

Extracting the region of interest is described in the tutorial. You just need primer sequences for either end of the domain of interest.
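If it helps to see the idea behind primer-based extraction, here is a rough pure-Python sketch. It is not how QIIME 2 implements it: it only does exact matching (with IUPAC degenerate bases), assumes the sequence is in the forward orientation, and trims the primers from the result. The function names and the primers in the usage line are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also covers the degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement of a DNA string (IUPAC codes allowed)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a primer into a regex that expands degenerate bases."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_region(sequence, fwd_primer, rev_primer):
    """Return the region between the two primers, or None if either is not found.

    Both primers are given 5'->3'; the reverse primer is searched for as its
    reverse complement, downstream of the forward primer.
    """
    sequence = sequence.upper()
    fwd = re.search(primer_regex(fwd_primer), sequence)
    if fwd is None:
        return None
    downstream = sequence[fwd.end():]
    rev = re.search(primer_regex(reverse_complement(rev_primer)), downstream)
    if rev is None:
        return None
    return downstream[:rev.start()]


# Made-up sequence and primers, for illustration only.
region = extract_region("GGGACGTACTTTTCCCCTGATCCAAA", "ACGTAY", "GGATCA")
```

Real primers rarely match every reference exactly, so in practice you would want a tool that tolerates mismatches, which is what the tutorial step is for.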

OTUs will speed things up but are not necessary. The quickest and easiest route is to download all full-length 12S sequences from GenBank and follow the tutorial instructions.

If you really want to create the 12S equivalent of Greengenes, you should contact the Greengenes authors about that. They did a lot more than the steps you describe, particularly for handling taxonomies. When collapsing into OTUs, you need to decide how to handle conflicts in taxonomy labels between different members of a cluster. So just using full-length 12S reference sequences should be adequate and actually easier than the steps you propose.
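To illustrate the taxonomy-conflict problem, here is one possible policy, sketched in Python: label the cluster with the deepest lineage that all of its members agree on. This is only an assumption about how you might do it, not what the Greengenes authors did. The function name `consensus_taxonomy` and the example lineages are hypothetical.

```python
def consensus_taxonomy(lineages):
    """Return the longest rank-by-rank prefix shared by every lineage.

    Each lineage is a semicolon-separated string such as
    "k__Animalia; p__Chordata". Ranks are compared from the top down and
    the result stops at the first rank where the members disagree.
    """
    split = [[rank.strip() for rank in lineage.split(";")] for lineage in lineages]
    consensus = []
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        consensus.append(ranks[0])
    return "; ".join(consensus)


# Two made-up cluster members that disagree at the order level.
cluster = [
    "k__Animalia; p__Chordata; c__Actinopterygii; o__Cypriniformes",
    "k__Animalia; p__Chordata; c__Actinopterygii; o__Perciformes",
]
label = consensus_taxonomy(cluster)
```

Other policies are possible, such as a majority vote at each rank. The point is that clustering forces you to pick one, whereas using the full-length sequences as they are avoids the question.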

Hope that helps. Let us know if you have any further questions.
