What are the differences between the files in the SILVA database that are used to train the classifier?

Hi all,
I have some questions that I would like to ask and confirm regarding taxonomy and feature-classifier:

  1. What are the differences between the files in the SILVA database that are used to train the classifier?

image
I was able to understand the difference between 7 levels and all levels after searching the forum. But I am confused about the difference between "consensus_taxonomy_7_levels" and "taxonomy_7_levels". Which file should I use to train the classifier if I want to do 7_levels classification? Can you please give me some recommendations?

  2. I want to confirm my understanding of fit-classifier-naive-bayes.
    Based on my understanding, if I decide to use this function to construct a pre-trained classifier, I cannot use the files that end with "all_levels" as shown in the picture above, since they have an uneven number of taxonomic levels and this can confuse the classifier. I should always choose the "7_levels" files. Please correct me if my understanding is wrong.

Thank you so much for your help!!!


Hi @Lei,
It looks like a README file released as part of the SILVA 132 release explains these files (note we do not have anything to do with the release/maintenance of SILVA). This is straight from the README:

A user of the Silva119 data pointed out that the taxonomy with the SILVA119 release is based only upon the taxonomy string of the representative sequence for the cluster of reads, which could lead to incorrect confidence in taxonomy assignments at the fine level (genus/species). To address this, I have endeavoured to create taxonomy strings that are either consensus (all taxa strings must match for every read that fell into the cluster) or majority (greater than or equal to 90% of the taxonomy strings for a given cluster). If a taxonomy string fails to be consensus or majority, then it becomes ambiguous, moving up the levels of taxonomy until consensus/majority taxonomy strings are met.

The meaning of "taxonomy_7_levels" is not explained, but from that description I would assume that this taxonomy consists of the taxonomic affiliation of the representative sequence for each cluster of reads.

Which one you use is up to you — I would recommend consensus or majority instead of raw or "taxonomy_7_levels".
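To make the consensus/majority rule from the README concrete, here is a minimal Python sketch. It is not the script used to build the SILVA release; the function name `collapse_taxonomy`, the "Ambiguous_taxa" fill label, and the assumption that every member string has the same number of semicolon-separated levels are mine, chosen only for illustration.

```python
from collections import Counter


def collapse_taxonomy(taxa_strings, threshold=1.0, fill="Ambiguous_taxa"):
    """Collapse the taxonomy strings of one cluster into a single string.

    threshold=1.0 mimics "consensus" (all members must agree);
    threshold=0.9 mimics "majority" (>= 90% of members must agree).
    Assumes every string has the same number of ';'-separated levels.
    """
    split = [t.split(";") for t in taxa_strings]
    depth = len(split[0])
    if any(len(s) != depth for s in split):
        raise ValueError("all taxonomy strings must have the same depth")

    result = []
    for level in range(depth):
        # Count full lineages down to this level, not just the level name.
        prefixes = Counter(tuple(s[: level + 1]) for s in split)
        best, count = prefixes.most_common(1)[0]
        if count / len(split) >= threshold:
            result.append(best[level])
        else:
            # No agreement here: this level and all deeper ones are ambiguous.
            result.extend([fill] * (depth - level))
            break
    return ";".join(result)


# Hypothetical cluster: 9 members agree at the last level, 1 disagrees.
members = ["Bacteria;Firmicutes;Bacillus"] * 9 + ["Bacteria;Firmicutes;Listeria"]
consensus_label = collapse_taxonomy(members, threshold=1.0)
majority_label = collapse_taxonomy(members, threshold=0.9)
```

With the hypothetical cluster above, the consensus label stops at the second level while the majority label keeps the third, which is the difference the README describes.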

Correct, this will cause an error with classify-sklearn unless you set confidence=-1 (which will just choose the top hit, so it is not recommended).
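If you want to check a taxonomy file for an uneven number of levels before training, something like the following rough sketch could help. It assumes a headerless, tab-separated file with the ID in column 1 and a semicolon-separated taxonomy in column 2; the function name `rank_depth_counts` and the example lines are made up for illustration.

```python
from collections import Counter


def rank_depth_counts(lines):
    """Count how many taxonomy entries have each number of levels."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        _seq_id, taxonomy = line.split("\t", 1)
        # Ignore empty fields, e.g. from a trailing ';'.
        ranks = [r for r in taxonomy.split(";") if r.strip()]
        counts[len(ranks)] += 1
    return counts


# Hypothetical lines; for a real file, pass the open file handle instead.
example = [
    "seq1\tD_0__Bacteria;D_1__Firmicutes;D_2__Bacilli\n",
    "seq2\tD_0__Eukaryota;D_1__SAR;D_2__Alveolata;D_3__Ciliophora\n",
    "seq3\tD_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria\n",
]
depths = rank_depth_counts(example)
```

A result with more than one key (here 3 and 4 levels) means the file has uneven depth, which is the situation described for the "all_levels" files.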

I hope that helps.


Hi @Nicholas_Bokulich,

Thank you so much for your quick response and explanation. I previously saw in another thread that you mentioned that we need to use 7_levels instead of all_levels. At that time, I thought you meant "taxonomy_7_levels", and I used this file to train the 18S classifier. I think I need to use the consensus file to re-train my classifier based on your suggestion.

I opened the two files ("consensus_taxonomy_7_levels" and "taxonomy_7_levels"). Please see the image below:


These two files have the same number of rows (55145); however, the order of the names in the first column is different. For the same name in column 1, the content of the second column looks the same. Just from the files, it is difficult to see what the exact differences between the two files are.

I have follow-up questions regarding the classifier:
1) Regarding method selection: if I really want to get the all_levels taxonomy, I know that I need to use another method. Among the available methods in QIIME 2 that can handle this, classify-consensus-blast and classify-consensus-vsearch, which one do you recommend for 18S data?
For example, if I want to use the classify-consensus-vsearch method, which files should I prepare? I have the following files available:

According to the documentation, the following files are needed:


The required ARTIFACT should be a qza file, but I only have the "fna" and "txt" files obtained from the SILVA database. How can I use these files to make the ARTIFACTs needed for the command?

2) Regarding the classify-sklearn method: I have one dataset with 10 fastq files. The sequences in 5 of the files run from 5' to 3'. The sequences in the other 5 samples run from 3' to 5'. I used the classify-sklearn method to do taxonomic classification on the output generated by DADA2 (containing all 10 samples). Only 5 of the samples have high-resolution taxonomy. The other 5 were only assigned to the kingdom Bacteria (even though the number of sequence reads in these samples is not low, >10000). Do you think this is because classify-sklearn is confused by the orientation of the sequences in the samples? My friends used QIIME 1 to analyze this dataset and no such issue happened; all 10 files got high taxonomic resolution. Can you think of any reason for this?
Can you please also help me solve this problem if I still want to use QIIME 2? Do you think it is necessary to flip the sequences with some tool (if available; I don't know which tools can do this) so that all of them run from 5' to 3'?

Thank you so much for your patience and your kind help!!!


The two files are different. Look for "Ambiguous_taxa" labels in the consensus taxonomy and compare to taxonomy_7_levels... the ambiguous taxa occur in the consensus taxonomy because there is no consensus at that level, but the full label is listed in taxonomy_7_levels because (presumably) it is the taxonomy of the cluster rep seq.
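Because the rows are in a different order in the two files, comparing them by eye is hard. A small sketch like this one matches rows by ID and reports only the labels that differ. It assumes both files are headerless and tab-separated (ID, then taxonomy); the helper names and the example IDs are hypothetical.

```python
def parse_taxonomy(lines):
    """Map sequence ID -> taxonomy label from 'ID<TAB>taxonomy' lines."""
    taxonomy = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        seq_id, label = line.split("\t", 1)
        taxonomy[seq_id] = label.strip()
    return taxonomy


def differing_labels(raw_lines, consensus_lines):
    """Return (ID, raw label, consensus label) for IDs whose labels differ."""
    raw = parse_taxonomy(raw_lines)
    consensus = parse_taxonomy(consensus_lines)
    shared = sorted(raw.keys() & consensus.keys())
    return [(i, raw[i], consensus[i]) for i in shared if raw[i] != consensus[i]]


# Hypothetical rows, deliberately listed in a different order in each file.
raw_rows = [
    "id2\tD_0__Bacteria;D_1__Firmicutes;D_2__Bacilli\n",
    "id1\tD_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria\n",
]
consensus_rows = [
    "id1\tD_0__Bacteria;D_1__Proteobacteria;Ambiguous_taxa\n",
    "id2\tD_0__Bacteria;D_1__Firmicutes;D_2__Bacilli\n",
]
changed = differing_labels(raw_rows, consensus_rows)
```

For real files you would pass two open file handles instead of the example lists; the rows returned should be the ones where the consensus file carries "Ambiguous_taxa".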

"all levels" presents problems for really any classifier, since the taxonomy becomes a knotty mess (and this is not an issue specific to q2-feature-classifier either). The BLAST and VSEARCH-based classifiers will have the same issue unless you use maxaccepts=1, which is still going to grab the top hit (like classify-sklearn confidence=-1), so it is sub-optimal.

The best solution is really to use the 7-level taxonomy if you can...

You need to import to QIIME 2 — see the feature classifier tutorial on qiime2.org for specific examples.

Dear oh dear — mixed orientation is bad news and not just for taxonomy analysis. dada2 is effectively going to duplicate all ASVs, because the reverse complement of any ASV is a new ASV. Make sense? That's bad news for all analyses, especially if the samples are stratified by orientation.
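To illustrate the duplication problem, here is a toy Python sketch (not what dada2 does internally). It counts unique sequences by exact match, and then again after collapsing each sequence with its reverse complement; the sequences and function names are invented for the example.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_features(seqs):
    """Return (unique exact sequences, unique orientation-independent sequences)."""
    exact = set(seqs)
    # Treat a sequence and its reverse complement as the same feature by
    # always keeping the lexicographically smaller of the two.
    canonical = {min(s, reverse_complement(s)) for s in seqs}
    return len(exact), len(canonical)


# The same organism read in both orientations, plus one unrelated sequence.
forward = "ACGGTTAC"
reads = [forward, reverse_complement(forward), "TTGACCGA"]
n_exact, n_canonical = count_features(reads)
```

Here exact matching sees three features where there are really two, which is the kind of inflation mixed orientations cause.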

Yes — classify-sklearn looks at the first 100 or so seqs to decide the orientation, and classifies based on that. Your mixed orientations leave it confused :confused:.

You have a few solutions. Fortunately, it sounds like your samples are stratified by orientation (e.g., sample 1 is all in forward orientation and sample 2 is all reverse). So you could:

  1. [BEST] reverse the orientation of any reads in the reverse orientation and proceed (starting with dada2).
  2. classify your samples in two sets, separated based on read orientation

But perhaps I misunderstand and all samples are in mixed orientations, in which case use classify-consensus-vsearch, which can already handle mixed-orientation reads.

Yes! VSEARCH comes pre-installed with QIIME 2 and has a method to reverse read orientations. This will only be useful if your read orientation is stratified by sample, not if all samples are in mixed orientations.


Hi @Nicholas_Bokulich,

Thank you for your detailed explanation. Now I have a better understanding of my questions.
1.

After you gave me this specific example, I was able to understand the differences completely. It looks like the 16S pre-trained classifiers provided on the QIIME 2 data resources website use the "taxonomy_7_levels" file instead of "consensus_taxonomy_7_levels", and I have used them for my 16S taxonomy classification. To be consistent, I might keep using my old 18S classifier, which was trained using "taxonomy_7_levels". Does that make sense to you?


2.

I am working on both 16S and 18S data. I am satisfied with the 16S results using the "7_levels" files. But for 18S I am really not sure how the taxonomy of eukaryotes works and what the best way to present the results is. It has so many taxonomic ranks compared to bacteria. So I thought putting all of the results at the same level might be better for interpretation? I want to create a phyloseq object, which requires specifying the name of each taxonomic rank. If I use the 7 levels, I do not know what names I should give if I go further than level 3 for the 18S data.

3.

Several months ago I followed the classifier tutorial to train my naive Bayes classifier for 18S data. After refreshing my memory, I remembered how to import the files as qza files.
4.

That makes sense to me. DADA2 took an extra long time to process all 10 samples due to the orientation problem. I understand that other downstream analyses like beta diversity can also be affected, since the distance matrices might not be correct under these conditions.
You mentioned that I can either split my samples into two sets and run classify-sklearn separately, or use classify-consensus-vsearch, to solve this issue. However, I still cannot run the beta diversity analysis if I do not correct the orientation problem, right?
5.

Do you mean the vsearch plugin for clustering and dereplicating sequences? Are you suggesting that I use this method instead of DADA2? If I use the vsearch method for clustering, do I still need to be concerned about downstream taxonomic classification and beta diversity issues due to the orientation problem? For example, can I use classify-sklearn to assign taxonomy after using the vsearch method?
6.

If I still want to use DADA2 for denoising, as you recommend, I need to reverse the orientation of the reads in the raw fastq files. But how can I do that for a fastq file? Can you please let me know which tools I can use to change the orientation of the reads in a fastq file?
Thank you so much for taking the time to help me :smiley:

Yes, by all means be consistent in your own protocols.

Sorry, not sure. I am not a eukaryote taxonomy expert. You may want to consult SILVA directly (or just google the names at that level to determine what taxonomic rank each one is).

You could run it, but the results would be awful if your samples are stratified by read orientation. If all samples have mixed orientations, you should be fine (though the # of unique features will be wrong, this will impact all samples evenly), but if some are in forward and some in reverse orientation it will be chaos.

No, not q2-vsearch, and no, I do not mean that you should use OTU clustering instead of denoising. VSEARCH is an independent software package that does all sorts of useful things! QIIME 2 wraps VSEARCH to do some things and to build more complicated procedures (e.g., q2-feature-classifier's classify-consensus-vsearch uses vsearch for database searching but then does other stuff on top of that to create a new method that is not available in VSEARCH on its own). But VSEARCH has so much more to offer; see here for more details and a manual about using VSEARCH.

See the VSEARCH manual. You want to use the vsearch --fastx_revcomp command to reverse complement all reads in the samples that have reads in the reverse orientation. Do this on the fastq data before importing to QIIME 2, and first you should make sure that your samples are stratified by orientation, not all in mixed orientations.
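For reference, this is roughly what reverse complementing a fastq record involves. The sketch below is plain Python for illustration only, not VSEARCH's implementation, and it assumes simple four-line fastq records with A/C/G/T/N bases. The key detail is that the quality string has to be reversed along with the sequence so each score stays attached to its base.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp_fastq_record(header, seq, plus, qual):
    """Reverse complement one fastq record; quality scores are reversed too."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ: " + header)
    return header, seq.translate(_COMPLEMENT)[::-1], plus, qual[::-1]


def revcomp_fastq(lines):
    """Reverse complement every record in a list of fastq lines."""
    records = [line.rstrip("\n") for line in lines]
    if len(records) % 4 != 0:
        raise ValueError("expected four lines per fastq record")
    out = []
    for i in range(0, len(records), 4):
        out.extend(revcomp_fastq_record(*records[i:i + 4]))
    return out


# Hypothetical single-record input.
flipped = revcomp_fastq(["@read1\n", "AACG\n", "+\n", "IIH#\n"])
```

In practice the VSEARCH command is the simpler route; a sketch like this is mainly useful for spot-checking a few records after the conversion.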

Good luck!


Good morning @Nicholas_Bokulich,

Thank you so much for your reply. It answers all my questions. I will first give the VSEARCH software a try.

Have a nice day!

