Sequencing Depth in Alpha Rarefaction Curve and Feature Counts in Frequency Table (table.qzv) do not match?

Hello guys. Firstly, happy holidays! I hope everyone is well and safe.

I have this particular concern hoping someone can answer me. I'm kind of troubled when I came across comparing the sample with lowest feature count in my table.qzv (around 58,000 feature counts) but when I visualize the table through alpha rarefaction curve observed_features (using the command qiime diversity alpha-rarefaction where --p-max-depth is set at the maximum number of feature counts I have from the table.qzv which is around 100k feature counts) the sequencing depth shown by alpha rarefaction curve in this particular sample do not match with the feature count shown in table.qzv where this sample has 58,000 feature counts but in the rarefaction curve, the sample has sequencing depth at around 43,000 when I check the observed_features.csv file. I'm assuming here that the feature counts listed in table.qzv corresponds to the sequencing depth (the x-axis) in alpha-rarefaction curve. Is this normal? Are they computed differently? How can I explain the discrepancy?

This is the alpha rarefaction general command I performed:

qiime diversity alpha-rarefaction \
--i-table table.qza \
--p-max-depth INTEGER \
--m-metadata-file metadata.tsv \
--o-visualization alphararefaction.qzv

where INTEGER is the highest number of feature count in table.qzv

I am using and running QIIME2 v2021.8 in Ubuntu Oracle Virtual Box

Thank you!

P.S.
Edit: P.S. I tried putting --p-min-depth INTEGER where INTEGER is set at the lowest number of feature counts (which is 58000) and I can still see the sample. But if I increase the min-depth to from 58000 to 58,001 as the --p-min-depth INTEGER , the sample disappears from the list which make sense to me if I base the feature counts in table.qzv since it's the feature count (58,000) is below the cut-off value (58,001 in this example). I kind of don't understand how rarefaction works in this sense and I'm not stat savvy so I'm not sure if I completely understand the jargons in the documentation and some of the forum posts. Some tutorials I've seen do not seem to address this. Nonetheless, my samples are plateauing.

Another Edit: I tried setting the --p-max-depth INTEGER at 58000 and all samples appear. So now I'm kinda confused how sampling/sequence depth are calculated because if I view the 2nd line graph when I set the INTEGER at the highest number of feature counts, it tells me that I will exclude this sample if I exceed 50,000 but when I set the INTEGER at 58000, all samples still appear.

After playing around with the parameters, it seems that the sampling depth is dependent on the --p-steps INTEGER parameter wherein not all feature counts get accounted depending on the intervals when indicating the number of rarefaction depths (don't take my words for it, just a guess on how it works! :smiley: ). After increasing the INTEGER from default 10 to 100 (50 works, too!), it shows better depths and is reflective of my frequency table. Though the increase of steps also increases the computation time but the line curves look way better as it shows more distinctly where the curve ends and where the plateau starts.

1 Like

Hello @rmbn,

Good job! I think you pretty much figured this one out by yourself. :fireworks:

Here's the key section from the moving pictures tutorial that describes how alpha rarefaction plots work:

The visualization will have two plots. The top plot is an alpha rarefaction plot, and is primarily used to determine if the richness of the samples has been fully observed or sequenced. If the lines in the plot appear to “level out” (i.e., approach a slope of zero) at some sampling depth along the x-axis, that suggests that collecting additional sequences beyond that sampling depth would not be likely to result in the observation of additional features. If the lines in a plot don’t level out, this may be because the richness of the samples hasn’t been fully observed yet (because too few sequences were collected), or it could be an indicator that a lot of sequencing error remains in the data (which is being mistaken for novel diversity).

Here's how I would summarize it;

The more you look, the more you see. So of course you will see more unique microbes in a sample with 1 million reads than a sample with 1 thousand. But does that really mean the sample with 1m reads has more microbes in it?

If you randomly took 1,000 reads from each samples, you would be comparing them on equal footing (1k reads each, equal sampling depth, equal resolution, etc.) and that would be a fair comparison of their alpha diversity, :apple: to :apple:

Have we see it all, or should we keep looking for more?
But do we really need 1 million reads to observe the full diversity of microbes in a sample? Why not 500k? Is 50k good enough? Having only 1,000 reads in a sample sounds very small, but maybe that's enough to capture every microbe in a low diversity environment. :thinking:

For samples with only 1k reads, we can't know what microbes we are missing. But for samples with 1m, there is a way to simulate fewer reads and see how many microbes we would lose if we had less data! By randomly subsampling our 1m observations down to smaller numbers, like 500k, 50k, and 1k, we can see how many microbes we observe at these lower levels, and when we have 'enough' reads to see all the microbes there is to see in a sample.

I hope this helps! Happy holidays :christmas_tree: :dove: :clinking_glasses:

1 Like

Thanks, Colin! @colinbrislawn I didn't notice the discrepancy I found until I started putting the curves chart into my final manuscript and while describing the rarefaction curve, my lowest frequency in table.qzv was in 58000 while noticing the curve for that specific sample barely passing the 40k sequencing depth mark along x-axis. I thought I was using the wrong frequency table because I've had several filtering steps before moving in to the downstream analysis and was afraid I might have mixed them tables up (but I have pretty effective naming conventions with my files and artifacts so I know 90% it wasn't the problem! :D). It turns out the --p-steps INTEGER can help me get into these sampling depths that we cannot previously see in the graph. I think it was a blessing in disguise because the curve looks way so much better when increasing the steps and it depicts more clearly that all my samples are plateauing and that rarefying them at 58,000 reads will still yield basically a decent OTU coverage.

Happy holidays and thank you once again! <3

1 Like