Sanity check on the rarefaction curve

I have ~200 samples and made a rarefaction curve to check where to cut the sequencing depth. This is the chart generated:

I have done 16S analysis in the past (many moons ago) and I have never subsampled sequences as low as 1,000 reads. Is my data just crap? The observed feature counts seem low too.
I had to play around with the DADA2 options because a high percentage of the reads were being identified as chimeric, so I changed --p-min-fold-parent-over-abundance to 8. Could this be causing the low read-count cut-off point?
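
For context, the denoising step looked roughly like this (a sketch; the file names and truncation lengths are placeholders, but the min-fold option is the one I changed):

```bash
# Chimera removal was flagging a large fraction of reads, so the
# min-fold-parent-over-abundance threshold was raised above the default
# to make the chimera check less aggressive.
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-len-f 250 \
  --p-trunc-len-r 200 \
  --p-min-fold-parent-over-abundance 8 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```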

Any help would be greatly appreciated. I feel a bit lost

Hi @newberrf,

Note the graph is showing unique feature counts (i.e. ESVs), not sequence/read counts. That is, some samples may only have ~40 unique features, but each of those features was likely sequenced hundreds to thousands of times (frequency).

For example, look at your feature-table summary QZV file and compare the values for "Number of features" and "Total frequency".
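
If it helps, that summary is generated with something like this (the file names here are just placeholders):

```bash
qiime feature-table summarize \
  --i-table table.qza \
  --m-sample-metadata-file metadata.tsv \
  --o-visualization table-summary.qzv
# Open table-summary.qzv at https://view.qiime2.org and compare
# "Number of features" against "Total frequency".
```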

Thank you for the quick response. Would you mind clarifying something for me? I thought the sampling depth (on the x axis) was the sequencing depth, i.e. the read count?
With the min-fold parameter set to 8, the number of features was 6,359 and the total frequency was 7,432,275.
I am just confused about where to subsample my reads; 1,000 seems very low.

Yep, that is the total for all the data.

Your plot goes up to 10,000 reads on the X axis! Basically, this plot is saying that you are not detecting any novel ESVs as you increase the sampling depth. That is, increasing your subsampling depth above ~1,000 reads is not adding novelty, at least for the observed_features metric.

I thought the subsampling point was decided where the plot plateaus (as no new features can be detected). That's why I thought I should subsample at 1000 reads.

Thank you for explaining this clearly


Yes, it should be used to guide the subsampling decision for sure, but given that all of your samples reach 10,000 reads in this graph, there is no need to go that low: it is always advisable to keep as many reads in your analysis as you can. If you had a sample where the line stopped at 2,000 reads, and everything else stayed the same, then you would likely be okay to subsample to 2,000 reads without it being detrimental to the samples that have more reads (as you've captured as much diversity as you can at that depth and are not adding novelty), assuming you want to keep those samples. See the example plot here.
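
For what it's worth, once you've picked a depth, it gets applied downstream along these lines (a sketch; swap in your own depth, metadata, and file names):

```bash
# Every sample is subsampled (rarefied) to the chosen depth before the
# standard alpha/beta diversity metrics are computed.
qiime diversity core-metrics \
  --i-table table.qza \
  --p-sampling-depth 2000 \
  --m-metadata-file metadata.tsv \
  --output-dir core-metrics-results
```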

Again, make sure to look at all the diversity metrics for this, as you may flat-line here but not with another metric, e.g. shannon, etc. I'd also suggest re-making this plot to subsample all the way out to the sequencing depth of your deepest sample. You'll see the samples drop off, as in the plot linked above.
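
Re-making the plot is just a matter of raising the max depth and adding metrics, something like this (the max depth value below is a stand-in for whatever your deepest sample has; file names are placeholders):

```bash
# Set this to the read count of your deepest sample
# (you can read it off the feature-table summary).
MAX_DEPTH=10000

qiime diversity alpha-rarefaction \
  --i-table table.qza \
  --p-max-depth "$MAX_DEPTH" \
  --p-metrics observed_features \
  --p-metrics shannon \
  --m-metadata-file metadata.tsv \
  --o-visualization alpha-rarefaction.qzv
```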


Thank you. I had a couple of samples with a read count under 3,000, so I removed them. I extended the plot out to 45,000 and they all levelled out at about 5,000 reads.
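
In case it is useful to anyone else, the filtering step was just this (file names are placeholders):

```bash
# Drop samples with a total read count below 3,000 before re-making the plot.
qiime feature-table filter-samples \
  --i-table table.qza \
  --p-min-frequency 3000 \
  --o-filtered-table filtered-table.qza
```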
