I am trying to understand how QIIME 2 gets from my DADA2 features to taxonomy labels using the taxa plugin. For example, when I test my mock community I get 36 features in taxonomy.qzv, but my taxa-bar-plot.qzv shows only 25 taxonomy labels. I tried to read about this in the forum and the taxa docs, but I'm not sure I fully understand the "collapsing" process. I also tried clustering the features into OTUs with two different methods (de novo and open-reference) from this tutorial: Clustering sequences into OTUs using q2-vsearch (QIIME 2 2019.1.0 documentation),
to try to get the same results as in the taxa-bar-plots. I got identical results in the CSV files,
but my feature table shrank from 36 to 31 (with 99% OTUs) or to 29 (with 97% OTUs).
So I would love to know how it works!
My second question is a bit trickier, but I think it might help others as well.
When I ran DADA2 in the denoising step, I raised the maximum expected errors setting to 5 (default 2) because I was losing 30-50% of my reads; now I lose only 5-10%. I just want to know whether this is "legal" and whether I can continue with my analysis.
You have 36 unique sequences in your data, and so the taxonomyDevee5_Trim6.qzv visualization contains 36 entries. However, even though the sequences are unique, some have the same taxonomy classification: there are only 25 unique taxonomy classifications, and that is what you see in the barplot.
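As a rough illustration of what "collapsing" means here, the sketch below sums the counts of all features that share a taxonomy label at a chosen level. It is not the q2-taxa implementation; the function name `collapse_by_taxonomy`, the toy feature IDs, the three-rank labels, and the counts are all made up for the example.

```python
from collections import defaultdict


def collapse_by_taxonomy(table, taxonomy, level):
    """Sum per-sample counts of features sharing a label at `level`.

    table:    {feature_id: {sample_id: count}}
    taxonomy: {feature_id: "rank1;rank2;..."}
    level:    number of ranks to keep (1 = first rank only)
    """
    collapsed = defaultdict(lambda: defaultdict(int))
    for feature_id, sample_counts in table.items():
        ranks = taxonomy[feature_id].split(";")
        label = ";".join(ranks[:level])  # truncate the label to the chosen level
        for sample_id, count in sample_counts.items():
            collapsed[label][sample_id] += count
    return {label: dict(counts) for label, counts in collapsed.items()}


# Hypothetical toy data: three unique sequences, two with the same label.
table = {
    "seq_a": {"mock1": 120, "mock2": 80},
    "seq_b": {"mock1": 30, "mock2": 10},
    "seq_c": {"mock1": 50, "mock2": 60},
}
taxonomy = {
    "seq_a": "k__Fungi;g__Saccharomyces;s__cerevisiae",
    "seq_b": "k__Fungi;g__Saccharomyces;s__cerevisiae",
    "seq_c": "k__Fungi;g__Candida;s__albicans",
}

by_species = collapse_by_taxonomy(table, taxonomy, level=3)
by_kingdom = collapse_by_taxonomy(table, taxonomy, level=1)
print(len(table), "features ->", len(by_species), "labels at level 3")
print(len(table), "features ->", len(by_kingdom), "label at level 1")
```

No sequence similarity threshold is involved in this step: two features are merged only because their label strings match at the chosen level.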
ITS is hypervariable, and even an individual cell will have multiple copies that may even be different from one another! So redundant taxonomy labels are never a surprise, and least of all for ITS data.
Taxonomy assignment and OTU clustering are only tangentially related, and a 97% cutoff does not necessarily define species, especially not for fungi.
In the future please ask distinct questions in a separate topic; it makes it easier for others to follow that discussion.
Sure, this is "legal", but you should state it clearly when reporting your results, e.g., in a publication. Reviewers may want some justification. Effectively, you are accepting more erroneous reads and telling DADA2 to clean them up for you. DADA2 will oblige, though to my mind the more error you permit, the greater the chance that DADA2 will make mistakes.
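To make the trade-off concrete, here is a minimal sketch of an expected-errors filter, assuming the setting under discussion is DADA2's maximum expected errors threshold. The expected number of errors in a read is the sum of its per-base error probabilities, 10^(-Q/10). The function names and the three toy reads are hypothetical.

```python
def expected_errors(phred_scores):
    """Expected number of errors in a read: sum of 10^(-Q/10) over its bases."""
    return sum(10 ** (-q / 10) for q in phred_scores)


def passes_filter(phred_scores, max_ee):
    """A read is kept when its expected errors do not exceed max_ee."""
    return expected_errors(phred_scores) <= max_ee


# Hypothetical reads, described only by their per-base quality scores.
reads = {
    "clean": [30] * 200,                  # about 0.2 expected errors
    "noisy_tail": [30] * 150 + [12] * 50,  # about 3.3 expected errors
    "poor": [30] * 100 + [10] * 100,       # about 10.1 expected errors
}

for max_ee in (2, 5):
    kept = [name for name, quals in reads.items() if passes_filter(quals, max_ee)]
    print(f"max_ee={max_ee}: kept {kept}")
```

Raising the threshold from 2 to 5 rescues the read with the noisy tail, which is exactly the kind of read the denoiser then has to correct.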
Yes, I understand that only 25 unique taxonomy labels fit the 36 features, and this is great.
But what is the similarity threshold? Is it 99% like in OTU clustering, or is it based on a different kind of comparison? Where can I read all the details about “taxonomy labeling”?
You are really comparing apples and oranges; OTU clustering is not actually related to taxonomy (in spite of the acronym). You can read about the taxonomy classifiers used in QIIME 2 here.
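For contrast, OTU clustering only looks at sequence identity and never at labels. The sketch below is a toy greedy clustering, not what vsearch does internally: it assumes equal-length sequences, uses a simple position-by-position identity instead of a real alignment, and all names and sequences are invented.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("toy identity needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """seqs: list of (id, sequence), assumed sorted by decreasing abundance."""
    centroids = []  # (id, sequence) of each cluster's first member
    clusters = {}   # centroid id -> list of member ids
    for seq_id, seq in seqs:
        for centroid_id, centroid_seq in centroids:
            if identity(seq, centroid_seq) >= threshold:
                clusters[centroid_id].append(seq_id)
                break
        else:
            # No centroid is close enough: this sequence starts a new cluster.
            centroids.append((seq_id, seq))
            clusters[seq_id] = [seq_id]
    return clusters


def mutate(seq, positions):
    """Toy helper: substitute the base at each given position."""
    chars = list(seq)
    for i in positions:
        chars[i] = "C" if chars[i] == "A" else "A"
    return "".join(chars)


base = "ACGT" * 25  # 100 bases
seqs = [
    ("feat1", base),
    ("feat2", mutate(base, [50])),            # 99% identical to feat1
    ("feat3", mutate(base, [0, 1, 2])),       # 97% identical to feat1
    ("feat4", mutate(base, range(10, 20))),   # 90% identical to feat1
]

otus_99 = greedy_cluster(seqs, 0.99)
otus_97 = greedy_cluster(seqs, 0.97)
print(len(seqs), "features ->", len(otus_99), "OTUs at 99%")
print(len(seqs), "features ->", len(otus_97), "OTUs at 97%")
```

The feature count shrinks as the threshold loosens, but nothing in this procedure knows whether the merged sequences carry the same taxonomy label.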
After reading your article and the DADA2 documentation, I feel I understand taxonomy classification a bit better. On the same topic: I'm trying to understand why some taxonomy labels are stuck at the family or genus level (or even lower). When I take a sequence that got stuck and BLAST it, I can immediately see that the results match at the species level. Is it because of the classifier's confidence threshold, which won't “allow” the next taxonomy level through if it is not confident enough? If so, what are my options for improving my taxonomy labeling while avoiding overfitting?
As you surmised, that sequence cannot be confidently classified to a lower level. The reason is that other reference sequences are very similar (or identical) but have a different taxonomy label.
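A simplified sketch of why near-identical references stop a label at genus: if the classifier's probability is split between two species of the same genus, neither species clears the confidence threshold, but their sum at genus level does. This is an illustration only, not the q2-feature-classifier code; the function name, the 0.7 threshold, and the probabilities are assumptions chosen for the example.

```python
from collections import defaultdict


def classify_with_confidence(lineage_probs, threshold=0.7):
    """Walk down the ranks, summing the probability of lineages that share
    a prefix, and stop at the first rank where the best sum is < threshold.

    lineage_probs: {"rank1;rank2;...": probability}
    Returns (label, confidence); ("Unassigned", None) if nothing passes.
    """
    split = {tuple(l.split(";")): p for l, p in lineage_probs.items()}
    depth = max(len(ranks) for ranks in split)
    label, confidence = (), None
    for level in range(1, depth + 1):
        totals = defaultdict(float)
        for ranks, p in split.items():
            # Only consider lineages that extend the label accepted so far.
            if len(ranks) >= level and ranks[:level - 1] == label:
                totals[ranks[:level]] += p
        if not totals:
            break
        best, conf = max(totals.items(), key=lambda kv: kv[1])
        if conf < threshold:
            break
        label, confidence = best, conf
    return (";".join(label) if label else "Unassigned"), confidence


# Hypothetical probabilities for one query sequence.
probs = {
    "k__Fungi;g__Aspergillus;s__niger": 0.48,
    "k__Fungi;g__Aspergillus;s__tubingensis": 0.45,
    "k__Fungi;g__Penicillium;s__chrysogenum": 0.07,
}

label, confidence = classify_with_confidence(probs)
print(label, round(confidence, 2))
```

Here the best species reaches only 0.48, so the label stops at the genus, where the two species together account for 0.93.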
Of course: NCBI BLAST will always report the top hit(s), including their species ID, but that is just the top hit, not necessarily the correct ID. If you peruse the BLAST results, you will most likely find that several other top hits have equal or similar degrees of similarity with your query, and may have different species IDs. Relying on top-hit BLAST matches is not a wise approach for classifying the taxonomy of very short DNA sequences, because there are in fact many matching or similarly close hits.
The reference database could be one issue, but that is difficult to resolve!
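One way to see the problem with top-hit matching is to look at all hits that tie (or nearly tie) with the best one and keep only the part of the lineage they agree on. The sketch below does that for a hand-written hit list; it does not call BLAST, and the function name, the `tolerance` parameter, and the hits themselves are hypothetical.

```python
def consensus_of_top_hits(hits, tolerance=0.0):
    """Return the deepest lineage shared by all hits within `tolerance`
    percentage points of the best identity, plus how many hits were used.

    hits: list of (lineage "rank1;rank2;...", percent_identity)
    """
    best = max(identity for _, identity in hits)
    top = [lineage.split(";") for lineage, identity in hits
           if identity >= best - tolerance]
    shared = []
    for ranks_at_level in zip(*top):
        if len(set(ranks_at_level)) != 1:
            break  # the top hits disagree from this rank downwards
        shared.append(ranks_at_level[0])
    return ";".join(shared), len(top)


# Hypothetical hit table for one short query sequence.
hits = [
    ("k__Fungi;g__Fusarium;s__oxysporum", 100.0),
    ("k__Fungi;g__Fusarium;s__solani", 100.0),
    ("k__Fungi;g__Fusarium;s__oxysporum", 99.6),
    ("k__Fungi;g__Gibberella;s__zeae", 97.0),
]

first_hit = hits[0][0]
consensus, n_used = consensus_of_top_hits(hits)
print("first hit:", first_hit)
print("consensus of", n_used, "tied hits:", consensus)
```

The first hit names a species, but a second hit with the same identity names a different one, so the only defensible label from these hits is the genus.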
I am using the latest version of the UNITE developer database, as recommended on the official site for ITS classification. Retraining the classifier and adding weights sounds very interesting; I will definitely try to understand and implement this in my work!
Thank you so much Nicholas, you are super kind and helpful!
I see... in theory you should be able to assemble taxonomic weights for ITS, depending on what sample types you are using and if previous studies exist, e.g., in QIITA, or other information on the taxonomic composition of your samples. However, that is the weakness of this method: you need to have pre-existing information on what species you expect to discover. While pre-existing data are common for 16S data (e.g., on QIITA), fungal communities have not been as commonly explored for many sample types.
Yes, that is true. There are a few papers that analyzed fungal communities; maybe I can try using their findings as weights.
Would it be possible to take the taxonomy I got from my uniform classifier and assign weights according to my findings? I'm sorry if it's a stupid question, I haven’t read the entire documentation yet.
Lastly, will giving weights to, let's say, my top 5 most expected fungi increase the chances of classifying them better at the expense of other, rarer or unexpected fungi?
Good question. I would discourage that; effectively it would be overfitting (and besides, the uniform classifier is not performing well, so it would just propagate that issue).
That would be the outcome, yes, but I would discourage manually tweaking the weights: it is unknown how generating artificial weights would impact the classification results. You could wind up misclassifying other taxa that are present if they are artificially downweighted.
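The risk can be sketched with plain Bayes' rule, where the weights act as prior probabilities: posterior is proportional to likelihood times prior. This is a toy calculation under that assumption, not the classifier's actual code, and the taxa, likelihoods, and weights are invented.

```python
def posterior(likelihoods, priors):
    """Posterior per taxon, proportional to likelihood * prior."""
    unnormalized = {t: likelihoods[t] * priors[t] for t in likelihoods}
    total = sum(unnormalized.values())
    return {t: v / total for t, v in unnormalized.items()}


def best_taxon(probabilities):
    return max(probabilities, key=probabilities.get)


# Hypothetical read that truly comes from the unexpected taxon s__B:
# its sequence fits s__B slightly better than the expected taxon s__A.
likelihoods = {"s__A": 0.50, "s__B": 0.60, "s__C": 0.10}

uniform = {"s__A": 1 / 3, "s__B": 1 / 3, "s__C": 1 / 3}
hand_tuned = {"s__A": 0.80, "s__B": 0.10, "s__C": 0.10}  # s__A upweighted by hand

print("uniform weights   ->", best_taxon(posterior(likelihoods, uniform)))
print("hand-tuned weights ->", best_taxon(posterior(likelihoods, hand_tuned)))
```

With uniform weights the read is assigned to s__B; with the hand-tuned weights the same read is pulled to s__A, which is the misclassification of downweighted taxa described above.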