Hello, I am trying to determine how richness changes as a function of O2 availability in a marine setting. However, I am unsure whether I should use observed features or Chao1 as my richness metric. I am also confused by the different stories these metrics are telling me -- why are they so different?
For example, let's look at how each metric correlates with O2 using qiime diversity alpha-correlation.
First off, the numbers on the y-axis in both plots refer to the number of taxa (ASVs) in each sample, right?
I think I understand how some samples in the Chao1 plot have higher values -- that means we missed some of the predicted diversity in our sampling, right?
Why do some of the samples in the Chao1 plot have such low values (e.g. 1 or 2)? And why do these samples not show up in the observed features analysis (e.g. Sample ID: MAR15_150_PF)? Is it really likely that these samples have only 1 taxon?
Anyway, any help in sifting through all this is much appreciated!
If you are generating ASVs using default settings, then using Chao1 will be misleading. Remember, many of the default approaches remove singletons. Chao1 is therefore not valid here, as there are few to no singletons to work with. If you'd like to make proper use of Chao1, you'd have to retain as many singletons as you can by adjusting your denoising options, which may come at the cost of retaining spurious sequences. I generally avoid this metric. However, others may know more about how robust Chao1 is to a lack of singletons in the data. Your mileage may vary.
Also, Chao1 is a richness estimator. That is, it assumes that rarely sampled species (i.e. singletons and doubletons) carry the most information about the number of species that were not observed. It assumes you have not sampled everything, and it tries to estimate what might actually be present, which is why it is called an estimator.
In contrast, "observed features" simply reports the number of features that were actually detected in each sample.
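To make the difference concrete, here is a minimal sketch in Python. It assumes the bias-corrected form of Chao1, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of singletons and doubletons. The function names and the count vectors are made up for illustration, and this is not meant to reproduce exactly what the plugin computes.

```python
def observed_features(counts):
    """Number of features with a non-zero count in one sample."""
    return sum(1 for c in counts if c > 0)


def chao1_bias_corrected(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    s_obs = observed_features(counts)
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical per-sample feature counts (illustrative only).
with_singletons = [10, 5, 2, 2, 1, 1, 1, 0]
# The same sample after the singleton features have been dropped.
without_singletons = [10, 5, 2, 2, 0, 0, 0, 0]

for name, counts in [("with singletons", with_singletons),
                     ("without singletons", without_singletons)]:
    print(name, observed_features(counts), chao1_bias_corrected(counts))
```

In this toy case the sample with singletons has 7 observed features and a Chao1 estimate of 8.0, so the estimator adds one "unseen" feature. Once the singletons are gone, the correction term is zero and Chao1 equals the observed count (4.0), so the estimator has nothing left to extrapolate from.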
@DBM, one of the other moderators reminded me of the following:
"dada2 does not throw away singleton reads. However, it does not infer amplicon sequence variants that are only supported by a single read".
I also forgot to mention that one potential way to retain more inferred per-sample ESV singletons is to apply --p-pooling-method pseudo --p-chimera-method pooled, because a read may exist as a singleton in a given sample but appear multiple times across the entire dataset. These options should help. Again, others may have additional thoughts and can correct me if I am wrong on this.
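The reasoning behind pooling can be illustrated with a toy example in Python. The sample names, sequence names, and read counts below are hypothetical, and this is only a sketch of the idea that a per-sample singleton may have support elsewhere in the dataset; it does not reproduce how DADA2 pooling works internally.

```python
from collections import Counter

# Hypothetical read counts per sequence in each sample (illustrative only).
samples = {
    "sample_A": {"seq1": 40, "seq2": 1, "seq3": 1},
    "sample_B": {"seq1": 25, "seq2": 6},
    "sample_C": {"seq1": 30, "seq2": 3, "seq4": 1},
}

# Total reads per sequence across the whole dataset.
pooled = Counter()
for reads in samples.values():
    pooled.update(reads)

# Sequences supported by exactly one read within a given sample.
singletons = [(sample_id, seq)
              for sample_id, reads in samples.items()
              for seq, n in reads.items() if n == 1]

# Per-sample singletons that are supported by more reads dataset-wide.
supported_elsewhere = [(sample_id, seq) for sample_id, seq in singletons
                       if pooled[seq] > 1]

print(singletons)
print(supported_elsewhere)
```

Here seq2 is a singleton in sample_A but has 10 reads across the dataset, so sharing information across samples gives it a chance of being kept in sample_A. seq3 and seq4 have only one read in total, so pooling does not add any support for them.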