observed OTUs and number of reads

Alex · December 17, 2019, 4:43pm

Hello guys,

Is there a general rule that the number of observed OTUs/ASVs is correlated with the total read count, i.e., that more reads yield more observed OTUs/ASVs?

For instance, the ggplot from Nicholas's tutorial shows a clear positive correlation. However, I wonder whether this trend holds for every dataset. My dataset suggests the opposite.

Nicholas_Bokulich · December 17, 2019, 5:14pm

Generally, yes: there is a positive correlation between sequencing depth and the number of unique sequences (which is why rarefying or another normalization is required before comparing samples). But this relationship is not an absolute rule, since the number of unique sequences depends on characteristics of the samples, not on sequencing depth alone.

Your data look suspicious in that regard... was the x-axis flipped? But the relationship you are seeing is not impossible, e.g., if the highest-richness samples happened to have the fewest reads and the lowest-richness samples the most reads.
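
To illustrate how a negative trend can arise, here is a toy Python simulation (not the data from this thread). Every number in it is an assumption made up for illustration: ten "rich" communities of 300 equally abundant taxa sequenced at 2,000 to 4,000 reads, and ten "poor" communities of 30 taxa sequenced at 15,000 to 20,000 reads. The helper names `simulate_sample` and `pearson` are hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def simulate_sample(n_taxa, n_reads, rng):
    """Draw n_reads reads from n_taxa equally abundant taxa; return observed richness."""
    return len({rng.randrange(n_taxa) for _ in range(n_reads)})


def pearson(xs, ys):
    """Pearson correlation coefficient, computed by hand to avoid extra dependencies."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)


# Assumed design: rich communities get few reads, poor communities get many.
samples = []
for _ in range(10):
    samples.append(("rich", 300, rng.randint(2_000, 4_000)))
    samples.append(("poor", 30, rng.randint(15_000, 20_000)))

reads = [n_reads for _, _, n_reads in samples]
richness = [simulate_sample(n_taxa, n_reads, rng) for _, n_taxa, n_reads in samples]

r = pearson(reads, richness)
print(f"Pearson r between total reads and observed richness: {r:.2f}")
```

In this setup the correlation comes out strongly negative, even though within each community more reads can never reduce the number of taxa that are there to be found. The sign of the overall trend is driven by which samples happened to be sequenced deeply.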

Peter_Kos · December 17, 2019, 9:13pm

There is indeed a kind of positive correlation.
With too few reads, you are unlikely to capture all the species present, so you will probably observe fewer species than the sample contains.
Turn this rule around and you will find that the more reads you have, the more likely you are to capture all the species present. But not more than those: beyond a certain read count, you should not see more species/ASVs/OTUs than are present in the sample.
Hence the alpha diversity rarefaction plot (which mimics various sequencing depths) reaches saturation. Beyond that point there is no such relationship: even if you double your sequencing depth, you will not find new OTUs. You have already seen all that were there.
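
The saturation described above can be sketched with a small Python simulation. This is only an illustration under made-up assumptions (one hypothetical sample with 50 taxa, uneven abundances, and 20,000 reads); it is not how QIIME 2 computes its rarefaction curves, and the function name `observed_richness` is hypothetical.

```python
import random


def observed_richness(reads, depth, rng):
    """Count distinct taxa in a random subsample of `depth` reads (without replacement)."""
    return len(set(rng.sample(reads, depth)))


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical sample: 50 taxa with uneven abundances, 20,000 reads in total.
n_taxa = 50
weights = [1.0 / (rank + 1) for rank in range(n_taxa)]  # a few dominant taxa, many rarer ones
reads = rng.choices(range(n_taxa), weights=weights, k=20_000)

# Subsample the same sample at increasing depths, as a rarefaction curve does.
depths = [10, 100, 1_000, 5_000, 10_000, 20_000]
curve = [observed_richness(reads, depth, rng) for depth in depths]

for depth, richness in zip(depths, curve):
    print(f"depth={depth:>6}  observed taxa={richness}")
```

At shallow depths the observed count is limited by the number of reads; at deeper depths it levels off at the number of taxa actually present in the sample and cannot exceed it, no matter how many more reads are drawn.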

Alex · December 18, 2019, 10:50am

Thanks for your reply.

No, the x-axis was not flipped. I used the following code:

library(ggplot2)

# One row per sample: total read count and observed richness
richness_df <- data.frame(
  total_reads = phyloseq::sample_sums(physeq),
  observed = phyloseq::estimate_richness(physeq, measures = "Observed")[, 1]
)

# Scatter plot with a linear trend line
ggplot(richness_df, aes(x = total_reads, y = observed)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "\nTotal Reads", y = "Observed Richness\n")

Now I think it is possible. The data I presented come from different sponge species, which can have different abundance/diversity patterns. In fact, they are classified into low-microbial-abundance (LMA) and high-microbial-abundance (HMA) sponges.

Nicholas_Bokulich · December 18, 2019, 5:19pm

Okay, that makes sense, though the shape of your plot is still quite surprising. Maybe you should color the plot by species to see whether the high-abundance species have the lower read depth, just to confirm your assumption here.

Maybe also do this the normal way in QIIME 2: as @Peter_Kos mentioned, use alpha rarefaction to see how read depth impacts observed richness. The qiime diversity alpha-rarefaction visualization will allow you to group samples by species so that you can see how read depth impacts richness in these two species separately.