Many questions about creating my own Taxonomy classifiers

KonradV · November 7, 2023, 1:31pm

I read the following documents:
Training feature classifiers with q2-feature-classifier — QIIME 2 2023.9.2 documentation
Data resources — QIIME 2 2023.9.1 documentation

Since my samples are mock bacterial communities of known composition, as I understand it, shouldn't it be possible to create a taxonomy classifier (.qza file) that contains only the bacteria in my communities?

The standard I am using is the D6305 product from Zymo Research. They provide a full set of 16S sequence FASTA files and genome sequence FASTA files. When I opened them, I found multiple 16S sequences per species, which is normal but bothers me.

My first question: in the tutorial documentation, which uses the 85_otus.fasta and 85_otu_taxonomy files, each classified species corresponds to only one feature sequence. So how do I make a species correspond to multiple sequences? In other words, how do I make use of FASTA files that contain multiple sequences per species?

My second question, which may be a bit out of place here: where do databases like Greengenes and SILVA source their data from? I only use NCBI to look up the information I am interested in. For example, there are tens of thousands of E. coli genome sequences on NCBI; which one do Greengenes and SILVA use as their reference sequence?

One last question: I stumbled upon the fact that the rrnD sequence of E. coli K-12 is vastly different from the sequences of the other six 16S rrn copies! Wouldn't it be more appropriate for each species to correspond to multiple 16S rRNA sequences? Or is everyone already doing this?

SoilRotifer · November 7, 2023, 7:48pm

Hi @KonradV,

Let's see if we can help you out.

Yes, it is possible, but it is not a good idea. Off-target DNA is quite often sequenced, for example mitochondria, chloroplasts, host DNA, etc. Having "out-group" or "decoy" reference sequences that are not bacterial will help you identify reads that do not come from bacteria, so that you can remove them.

Without these out-group reference sequences, much of your data might be spuriously classified as "Bacteria" when it is actually DNA from non-bacteria.
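As a rough illustration of that clean-up step (a minimal Python sketch, not a QIIME 2 command): assuming the classifier output has been exported to pairs of feature ID and taxon string, off-target features can be dropped by matching on taxon labels. The function name, the label list, and the example assignments below are all hypothetical and only meant to show the idea.

```python
# Hypothetical labels that mark off-target (non-bacterial) hits.
OFF_TARGET = ("mitochondria", "chloroplast", "eukaryota", "unassigned")


def drop_off_target(assignments, off_target=OFF_TARGET):
    """Keep only features whose taxon string contains none of the off-target labels."""
    kept = {}
    for feature_id, taxon in assignments.items():
        lowered = taxon.lower()
        if not any(label in lowered for label in off_target):
            kept[feature_id] = taxon
    return kept


# Made-up assignments, as a classifier with decoy references might produce them.
example = {
    "feat1": "d__Bacteria; p__Proteobacteria; g__Escherichia-Shigella",
    "feat2": "d__Bacteria; p__Cyanobacteria; o__Chloroplast",
    "feat3": "d__Eukaryota",
    "feat4": "Unassigned",
}

print(sorted(drop_off_target(example)))
```

This only works because the decoy sequences were in the reference to begin with; with a bacteria-only classifier, the off-target features would never receive labels that could be filtered on.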

Why is this bothersome? There is inherent natural variation in DNA sequences within any given species. In fact, there are many microbes in which the multiple copies of the 16S rRNA gene can vary substantially, even beyond the traditional 97% similarity cutoff! Intragenomic variation, as outlined here and here, should be considered.
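To get a feel for how much the copies within one genome differ, a simple check is the percent identity between two already-aligned 16S copies. The sketch below assumes the sequences were aligned beforehand with a tool of your choice, and it skips gapped columns, which is only one of several common conventions. The function name and the toy fragments are made up for illustration.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # ignore columns containing a gap in either sequence
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy, made-up fragments standing in for two 16S copies from one genome.
copy_1 = "ACGTACGTAC"
copy_2 = "ACGTTCGTAC"
print(percent_identity(copy_1, copy_2))  # 90.0
```

Running this over every pair of 16S copies from a single genome would show whether any pair falls below a cutoff such as 97%.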

Having several reference sequences per taxon helps your classifier to better identify your sequences.
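In practice, "several sequences per taxon" just means that several sequence IDs in the taxonomy file point to the same taxon string. Here is a minimal Python sketch of building such a table from a FASTA file with multiple 16S copies per species. The header convention (`<Genus>_<species>_16S_<copy>`), the helper names, and the lineage strings are assumptions for illustration; the real Zymo FASTA headers may look different, so the ID parsing would need adapting. The `Feature ID` / `Taxon` header row follows what I understand the QIIME 2 TSV taxonomy format to expect.

```python
import io


def read_fasta_ids(handle):
    """Yield the ID (first whitespace-delimited token) of each FASTA record."""
    for line in handle:
        if line.startswith(">"):
            yield line[1:].split()[0]


def species_from_id(seq_id):
    """Hypothetical convention: the first two underscore-separated fields name the species."""
    return "_".join(seq_id.split("_")[:2])


def build_taxonomy_rows(seq_ids, species_to_taxon, species_of=species_from_id):
    """Map every sequence ID to the taxon string of its species."""
    return [(seq_id, species_to_taxon[species_of(seq_id)]) for seq_id in seq_ids]


# Made-up FASTA content with two 16S copies for one species.
fasta = io.StringIO(
    ">Escherichia_coli_16S_1\nACGT\n"
    ">Escherichia_coli_16S_2\nACGA\n"
    ">Bacillus_subtilis_16S_1\nTTGA\n"
)

# Illustrative lineage strings; use the naming scheme of your chosen database.
taxa = {
    "Escherichia_coli": "d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; "
                        "o__Enterobacterales; f__Enterobacteriaceae; "
                        "g__Escherichia; s__Escherichia_coli",
    "Bacillus_subtilis": "d__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; "
                         "f__Bacillaceae; g__Bacillus; s__Bacillus_subtilis",
}

rows = build_taxonomy_rows(list(read_fasta_ids(fasta)), taxa)

# Print a tab-separated taxonomy table with a header row.
print("Feature ID\tTaxon")
for seq_id, taxon in rows:
    print(f"{seq_id}\t{taxon}")
```

Both E. coli copies end up with the same taxon string, which is all the classifier needs in order to treat them as alternative references for one species.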

Do not use these files. They are for example purposes only and allow the tutorial to run quickly. By clustering the reference data at 85%, we drastically reduce the file size and run time. This is why there is only one representative sequence per species, genus, etc. You should be making use of the premade files on the Data resources page.

Alternatively, you can make use of RESCRIPt to curate your own reference database. Check out the tutorials.

This information is provided in the respective links below. Often, the GenBank ID, or any other valid ID from the International Nucleotide Sequence Database Collaboration (INSDC), is used.

Greengenes2:
- paper
- website
SILVA:
- paper
- website

See my earlier comment on intragenomic variation. But generally, yes. Most people use reference sequence-taxonomy databases in which there are many sequences for a given taxon.
