I have a redundant 16S V4 reference taxonomy, which I used to create a classifier and run some data through my QIIME pipeline.
Then I removed the redundancy with the "rescript dereplicate" routine.
I used this command:
qiime rescript dereplicate \
--i-sequences gtdb_seqs_V4.qza \
--i-taxa gtdb_tax.qza \
--p-mode 'uniq' \
--o-dereplicated-sequences gtdb_seqs_V4_derep.qza \
I got an NR100 taxonomy with the same number of species as the original one, but with a unique V4 sequence for each taxon.
Then I ran my standard QIIME pipeline with a new classifier built on that NR100 taxonomy.
When I compared the results, the beta diversity between the old and new pipelines was:
- 0.12 for L6 level
- 0.30 for L7 level
That's quite a big difference for running the same data. Running the same pipeline twice (new/new or old/old) gives a distance of only 0.03 at both levels (due to random rarefaction).
So why are the results on the new dereplicated taxonomy so different from the old one? It looks like the classifier works differently. Is that the case, and why? I expected the results to be pretty much the same (like the 0.03 distance I mentioned above), and that redundancy would affect only the processing time, not the output.
Hi @biojack ,
Yes, this is sort of the point: dereplication will also impact classification (or alignment) results. The difference should not be very large, but dereplication reduces redundant hits that can weight classification (e.g., adjust the predicted probability of a given species just because there are many replicates in the database, not necessarily because that species is actually more likely to be observed).
What metric are you using? Is this in a QZV that you can share? I would recommend making something like a barplot as well to qualitatively compare, since the metric on its own might not be too informative for assessment.
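To make the weighting effect concrete, here is a toy sketch of a simple consensus vote over top hits. It is only an illustration under assumed inputs: the sequences, the labels `s__A`, `s__B`, `s__C`, and the helper `consensus_fractions` are hypothetical, and this is not how the QIIME 2 classifiers actually compute their confidence.

```python
from collections import Counter

def consensus_fractions(hits):
    """Return the fraction of hits supporting each taxon label."""
    counts = Counter(taxon for _seq, taxon in hits)
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}

# Hypothetical top hits for one query: (reference sequence, taxon label).
# Species A is present four times with an identical sequence.
redundant_hits = [
    ("ACGT", "s__A"), ("ACGT", "s__A"), ("ACGT", "s__A"), ("ACGT", "s__A"),
    ("ACGA", "s__B"),
    ("ACGC", "s__C"),
]

# Toy dereplication: keep each unique (sequence, taxon) pair once, in order.
dereplicated_hits = list(dict.fromkeys(redundant_hits))

before = consensus_fractions(redundant_hits)
after = consensus_fractions(dereplicated_hits)
print(f"support for s__A before: {before['s__A']:.2f}, after: {after['s__A']:.2f}")
```

In this toy case the support for species A drops once its replicates are removed, even though nothing about the query changed, so a call that previously passed a consensus threshold may fall back to a higher rank.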
Hi @Nicholas_Bokulich !
I'm using a beta-diversity metric for the comparison, Bray-Curtis to be precise.
In fact, I have two data frames (samples x OTUs) for the two approaches (old and new). I compare the same sample across the two datasets with this beta-diversity metric, then average the results over samples.
I'm using a Python library function call that looks like:
data_pair = beta_diversity("braycurtis", df_pair.values, ids=df_pair.index)
which returns a 2x2 matrix for the same sample in the two datasets.
Maybe later I can provide a visualization to show the point, but I need to think about what to visualize and how.
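The paired comparison described above can be sketched in plain Python as follows. This is a minimal sketch, not the actual script: the tables, sample IDs, taxon labels, and the helper names `bray_curtis` and `mean_paired_distance` are assumptions made up for illustration. The key detail is that both profiles must be aligned on the union of taxa before the distance is computed.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0

def mean_paired_distance(old, new):
    """Average Bray-Curtis between the same sample in two tables.

    `old` and `new` map sample ID -> {taxon: count}.
    """
    distances = []
    for sample in old.keys() & new.keys():
        # Align both profiles on the union of taxa; missing taxa count as 0.
        taxa = sorted(old[sample].keys() | new[sample].keys())
        u = [old[sample].get(t, 0) for t in taxa]
        v = [new[sample].get(t, 0) for t in taxa]
        distances.append(bray_curtis(u, v))
    return sum(distances) / len(distances)

# Hypothetical collapsed tables from the old and new classifiers.
old_table = {"S1": {"g__X": 60, "g__Y": 40}, "S2": {"g__X": 50, "g__Z": 50}}
new_table = {"S1": {"g__X": 50, "g__Y": 50}, "S2": {"g__X": 50, "g__Z": 50}}

mean_distance = mean_paired_distance(old_table, new_table)
print(mean_distance)
```

The value computed for each sample corresponds to the off-diagonal entry of the 2x2 matrix; the diagonal of a distance matrix is always zero.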
This is not an appropriate way to compare the effectiveness of different forms of a reference database to one another. That is, neither the various taxonomic groups nor their various rank levels (e.g. domain, phylum, ...) contain the same number or diversity of taxa.
For example, one genus may contain 100s of known species, and another genus may contain only 10 known species. Thus, when you collapse your features by taxonomy from species to genus, you are drastically underrepresenting the genus with 100s of species (and underestimating diversity). Going the other direction, you will inflate diversity going from genus to species. This fits with the results you are seeing when comparing L6 to L7.
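A small sketch of how the rank at which features are collapsed changes the distance: below, the same reads are split differently between two species of one genus by two hypothetical classifiers. The labels, counts, and helper names are assumptions for illustration only, and the label format `g__X;s__Y` is a simplification of real taxonomy strings.

```python
from collections import defaultdict

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two {taxon: count} profiles."""
    taxa = a.keys() | b.keys()
    numerator = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    denominator = sum(a.get(t, 0) + b.get(t, 0) for t in taxa)
    return numerator / denominator if denominator else 0.0

def collapse_to_genus(profile):
    """Sum species-level counts into their genus (label format 'g__X;s__Y')."""
    collapsed = defaultdict(int)
    for label, count in profile.items():
        genus = label.split(";")[0]
        collapsed[genus] += count
    return dict(collapsed)

# Hypothetical species-level (L7) profiles of one sample from two classifiers.
old_l7 = {"g__Bacteroides;s__A": 80, "g__Prevotella;s__P": 20}
new_l7 = {
    "g__Bacteroides;s__A": 60,
    "g__Bacteroides;s__B": 20,
    "g__Prevotella;s__P": 20,
}

l7_distance = bray_curtis(old_l7, new_l7)
l6_distance = bray_curtis(collapse_to_genus(old_l7), collapse_to_genus(new_l7))
print(f"L7: {l7_distance:.2f}, L6: {l6_distance:.2f}")
```

Here the species-level distance is nonzero while the genus-level distance vanishes, because the disagreement is entirely within one genus. Differences in species assignment therefore show up more strongly at L7 than at L6.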
This approach can also be heavily affected by the taxonomic nomenclature used. Some taxonomic rules may split taxa into different groups, and others will lump them together. Finally, in the last several years we have seen drastic changes to microbial taxonomy.
Thus, your approach heavily skews your estimate of diversity and is not reproducible, as the lumping, splitting, and renaming of taxa might be quite different a few years from now. This is why ASVs / ESVs are typically used to estimate diversity for microbial surveys: they will highlight slight differences in sequence even when the sequences share the same taxonomy.
I recommend using some of the evaluation tools in RESCRIPt which can help you evaluate the effectiveness of your classifiers. These outputs should help you decide which classifier might be appropriate for you. Links to the tutorials can be found here. I also recommend reading the associated RESCRIPt manuscript. We outline here how clustering, and by extension, dereplication can affect the reference databases.