I am trying to replicate another research group's results from a 16S taxonomic analysis, but I keep getting completely different percentages from theirs.
I used version 132 of the SILVA database (from the archive), with the files/sequences corresponding to 99% identity, and q2-feature-classifier.
For the analysis, I ran the commands from the tutorial "Training feature classifiers with q2-feature-classifier" (QIIME 2 2021.8.0 documentation).
As reported in a previous topic (SILVA 16S database), I managed to obtain percentages identical to theirs using kraken2 with its specific SILVA 16S database. I therefore conclude that the data provided are correct and that I am still doing something wrong in QIIME 2.
This is a really broad, open-ended question that I'm not sure we can specifically answer, at least not with the current summary provided.
High-level, the workflow you've shared looks reasonable to me, and I don't see any immediate flags. I think if you're observing discrepancies with your colleagues you should take a step back and figure out all of the individual steps in the two workflows, and ensure that you're applying the same kinds of steps at the same points.
If you have any more specific questions we will be glad to take a shot at answering them.
However, I cannot understand how different databases can produce such large differences, or how selecting part of the same database can produce totally different results. Either the results vary arbitrarily depending on the database used, or there is an error in my procedure for creating the classifier.qza (which I think is MUCH more probable).
At this point my specific question is: what could I have done wrong in creating the classifier.qza?
I'm not sure I understand the question. Wouldn't you expect different databases to produce different results?
That question has been discussed extensively elsewhere on this forum; I encourage you to have a look around. In particular, you might want to have a look at the rescript plugin. If you have any additional or new questions, please open a new topic or topics.
With different databases, I would expect a deviation within an acceptable range (1-5%), not a microbial class going from 20% to 60%!
Are such large differences justified by a different database? This is why I suspect that I am making mistakes in creating the database.
I tried to contact the provider of the analyses, but their procedure and parameters remain top secret (not being able to replicate the results is unscientific!).
If the above procedure looks correct to you, can I consider it as valid as theirs? And are the different results justified by a different database, even if they are totally different?
Following your tutorials, I don't seem to have made mistakes. Are there any parameters or checks I missed that could have compromised the results?
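To put numbers on the size of the discrepancy described above, here is a minimal Python sketch that collapses two sets of read counts to class level and lists the classes whose relative abundance differs by more than a tolerance. Everything in it is an assumption made for illustration: the function names, the read counts, the SILVA 132 style `D_0__...;D_1__...;D_2__...` labels, and the reading of the 5% tolerance as percentage points. It is not part of QIIME 2 and does not reproduce either group's actual data.

```python
from collections import defaultdict


def class_level_profile(counts_by_taxon, level=2):
    """Collapse read counts to one rank and return percentages.

    `level` is the zero-based position in the semicolon-separated
    taxonomy string (assumed here: 0 = domain, 1 = phylum, 2 = class).
    """
    totals = defaultdict(float)
    for taxon, count in counts_by_taxon.items():
        ranks = [r.strip() for r in taxon.split(";")]
        # Taxonomy strings that stop above the requested rank are pooled.
        label = ranks[level] if len(ranks) > level else "Unassigned"
        totals[label] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no reads to summarise")
    return {label: 100.0 * n / grand_total for label, n in totals.items()}


def flag_discrepancies(profile_a, profile_b, tolerance=5.0):
    """Return {label: (a, b, a - b)} where |a - b| exceeds the tolerance.

    The tolerance is in percentage points (an assumption about what
    "1-5%" means in this thread).
    """
    flagged = {}
    for label in sorted(set(profile_a) | set(profile_b)):
        a = profile_a.get(label, 0.0)
        b = profile_b.get(label, 0.0)
        if abs(a - b) > tolerance:
            flagged[label] = (a, b, a - b)
    return flagged


# Hypothetical read counts, invented purely for illustration.
mine = {
    "D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli": 600,
    "D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia": 250,
    "D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria": 150,
}
theirs = {
    "D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli": 200,
    "D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia": 620,
    "D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria": 180,
}

flagged = flag_discrepancies(class_level_profile(mine), class_level_profile(theirs))
for label, (a, b, diff) in flagged.items():
    print(f"{label}: {a:.1f}% vs {b:.1f}% ({diff:+.1f} points)")
```

Running a comparison like this on both reports, collapsed to the same rank and with labels in the same format, at least shows which classes account for the gap before looking for the cause.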
To me that sounds like it has less to do with the database used, and more to do with the upstream processing steps applied to your actual dataset, like QA/QC, etc.
Please see this resource, courtesy of @jwdebelius: