Normalisation Seq Count

Dear Qiime2 Expert,

As far as I know, we need to rarefy our samples to an even depth when performing diversity analysis. What about other analyses, e.g. taxonomy analysis? Should I normalize my sequence table and frequency table right after the DADA2 step?

I have attached my rarefaction curve to give a clearer picture of the questions above:

This rarefaction curve was plotted using a depth value somewhere around the median. Is this plot suitable for publication, or should I rarefy every sample to an equal sequence count, e.g. 3,000?
My other concern is whether I should exclude samples with low sequence counts from the analysis even though their rarefaction curves reach a plateau (e.g. sfF4, the blue sample).



Hi @Benedict

If by taxonomy analysis you mean differential abundance tests that detect changes in individual taxa across groups or along a gradient, then tools like ANCOM, gneiss, ALDEx2, or Songbird operate on relative abundance data, so they do not require you to rarefy. However, that doesn’t mean you should include all samples. You should still filter out samples with very few reads. What counts as very few reads? That depends on your sample type. If you expect your samples to have high diversity (like soil or ocean samples), you will need more reads per sample to capture the full diversity and microbial signatures. If you expect low diversity (like wine samples), you could probably get away with fewer. For human and mouse samples I tend to avoid any samples with fewer than 3,000 reads. This is purely based on my own experience; I know others who set lower minimum thresholds (like 1-2k) and others who operate with much higher ones. Ultimately this depends on your experimental design, your question, and your sample type.

Rarefaction curves are an OK way to determine whether your sampling effort has been sufficient to detect the full diversity for a particular metric. For example, in your plot it does look like all your samples reach a saturation point in Faith’s PD, so one would be tempted to say they should all be included. This is the classical way of thinking, which has inherent flaws you can read about here.
You may have captured the full diversity when looking at Faith’s PD, but what does the plot look like when you switch it to Observed_otus? I expect you may not see those lines plateau as early, and you may then reconsider including all your samples. Again, it all depends on what you are trying to ask of your data. As for including it in a publication, I generally provide these as a supplementary figure; without proper grouping, coloring, and supporting stats, these plots don’t do too much.

Differential abundance tests also generally work better when the data is not so noisy. By that I mean first removing features that have low frequency across all samples or that occur in only a few samples. These thresholds are also something you may have to work out from your data. This Twitter thread has some good discussions about how to pick them, should you want to go down that rabbit hole.
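To make the two filtering criteria concrete, here is a minimal sketch in plain Python. The table, the feature and sample names, and the thresholds (`min_total`, `min_samples`) are all made up for illustration and are not recommendations; the function is hypothetical, not part of QIIME 2.

```python
def filter_features(table, min_total=10, min_samples=2):
    """Keep features that pass both a total-frequency and a prevalence cutoff.

    table: dict mapping feature ID -> dict mapping sample ID -> read count.
    min_total and min_samples are illustrative defaults, not recommendations.
    """
    kept = {}
    for feature, counts in table.items():
        total = sum(counts.values())                         # frequency across all samples
        prevalence = sum(1 for c in counts.values() if c > 0)  # samples where it occurs
        if total >= min_total and prevalence >= min_samples:
            kept[feature] = counts
    return kept


# Hypothetical toy table: three features across three samples.
table = {
    "featA": {"s1": 120, "s2": 80, "s3": 95},  # abundant and widespread
    "featB": {"s1": 0, "s2": 3, "s3": 0},      # low total frequency
    "featC": {"s1": 40, "s2": 0, "s3": 0},     # present in a single sample
}

filtered = filter_features(table, min_total=10, min_samples=2)
print(sorted(filtered))  # ['featA']
```

Within QIIME 2 itself, I believe the equivalent step is `qiime feature-table filter-features` with its `--p-min-frequency` and `--p-min-samples` options, but check the plugin help for your version.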


Dr. Mehrbod, thanks for your prompt reply. I just read the latest q2 protocol for Parkinson's Disease in QIIME 2 and realised that all samples were included in the rarefaction analysis, with the max-depth value set to the maximum feature count among all samples. Based on your experience, should I instead plot the curve using a lower depth (somewhere near the median, or close to the depth of the majority of samples) and leave out samples with very shallow sequencing depth?

Separately, I would like some clarification on the W value in ANCOM. I did read the description on the q2 forum, but it is still not clear to me. Please refer to the following example:

W=4 for f_Bdellovibrionaceae means that the sub-hypothesis was rejected 4 times, i.e. the ratios of Bdellovibrionaceae to 4 other families were detected as significantly different between treatments 1 and 2 (given that I'm comparing differential abundance between 2 groups). May I know which are the 4 other families?
Besides that, we are advised to be aware of ANCOM's limitations before applying it. How can we estimate whether the abundance difference is less than 25%, since I'm exploring unknown data where no mock data can be used for cross-validation?

Hi @Benedict,

Again, this really depends on what you are trying to do with this plot. Remember that the rarefaction curve does not actually rarefy your data for downstream use. To rarefy your table you will need to use qiime feature-table rarefy. The rarefaction curve (as demonstrated in the Parkinson’s tutorial) is used to determine a reasonable subsampling depth for the core-metrics-phylogenetic step. So you can include all of your samples in the rarefaction curve plot by setting the max depth to the maximum reads per sample in your table. Since no statistical test is being done on this plot, it doesn’t matter; it just takes a lot longer to calculate. If the plateau in your rarefaction curve begins at, say, 3,000 sequences and stays the same thereafter, it probably doesn’t matter where you cut it off as long as it’s above 3,000. But it also doesn’t hurt to show the full data.
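For intuition about what rarefying a table actually does (as opposed to plotting a rarefaction curve), here is a minimal sketch in plain Python: each sample is subsampled without replacement to a fixed depth, and samples with fewer reads than that depth are dropped. The `rarefy` function, the sample names, and the counts are hypothetical, and this is not the QIIME 2 implementation.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample one sample's reads without replacement down to `depth`.

    counts: dict mapping feature ID -> read count for a single sample.
    Returns a new dict of counts, or None if the sample has fewer reads
    than `depth` (such samples are dropped rather than padded).
    """
    if sum(counts.values()) < depth:
        return None
    # Expand counts into one entry per read, then draw `depth` reads.
    pool = [feature for feature, n in counts.items() for _ in range(n)]
    picked = random.Random(seed).sample(pool, depth)
    out = {}
    for feature in picked:
        out[feature] = out.get(feature, 0) + 1
    return out


# Hypothetical samples: one deep, one below the chosen depth.
samples = {
    "deep": {"featA": 5000, "featB": 3000, "featC": 20},
    "shallow": {"featA": 900, "featB": 100},
}

rarefied = {sid: rarefy(counts, depth=3000) for sid, counts in samples.items()}
rarefied = {sid: counts for sid, counts in rarefied.items() if counts is not None}

for sid, counts in rarefied.items():
    print(sid, sum(counts.values()))
```

Every retained sample ends up with exactly 3,000 reads, and the "shallow" sample disappears from the table. That is why the choice of depth is a trade-off between reads kept per sample and samples kept per group.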

As I mentioned before, excluding samples with very few sequences is probably a good idea for any analysis, but the threshold depends entirely on your data and the question you are trying to ask. Say you were to remove all samples with fewer than 3,000 sequences and all of a sudden were left with n=3 in a group; that is probably not going to be very helpful when it comes to statistical testing. If you instead set the threshold to, say, 2,000 and are able to retain n=7 in a group, that is probably a better way to proceed. But then you have to proceed with caution, as 2,000 reads/sample may or may not be sufficient to capture the full diversity you are looking for (e.g. weighted UniFrac or Shannon diversity). So there is a bit of a sweet spot you need to determine for each study. Note again that I am simply talking about ‘excluding’ low-read samples here using a filter, not rarefying per se. The choice of rarefying or using alternatives is up to you. For example, making a PCoA plot with DEICODE doesn’t require rarefying, but if you are going to use one of the beta diversity matrices in beta-diversity then you will want to rarefy. ANCOM also doesn’t require rarefying.
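One way to look for that sweet spot is to tabulate how many samples each group would retain at a few candidate thresholds. The sketch below does this in plain Python; the sample IDs, read depths, group labels, and candidate thresholds are all invented for illustration.

```python
def retained_per_group(depths, groups, threshold):
    """Count samples per group with at least `threshold` reads.

    depths: dict mapping sample ID -> total reads.
    groups: dict mapping sample ID -> group label.
    """
    counts = {}
    for sample_id, depth in depths.items():
        group = groups[sample_id]
        counts.setdefault(group, 0)  # keep groups that end up with zero samples
        if depth >= threshold:
            counts[group] += 1
    return counts


# Hypothetical read depths and metadata.
depths = {
    "c1": 5200, "c2": 4100, "c3": 3300, "c4": 2500,
    "t1": 6100, "t2": 2900, "t3": 2200, "t4": 1500,
}
groups = {sid: ("control" if sid.startswith("c") else "treated") for sid in depths}

for threshold in (1000, 2000, 3000):
    print(threshold, retained_per_group(depths, groups, threshold))
```

In this toy data a 3,000-read cutoff leaves a single treated sample, while 2,000 keeps three, which is the kind of trade-off described above.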

As for your second question about ANCOM, I’ll keep my answer brief since it is unrelated to the original inquiry here. In the future, please post a new question in a new thread; that way it is easier to find later and keeps things organized.

As for which 4 other families are involved: this is not something you can extract from QIIME 2 (nor from the R package, I believe), probably because it is not really useful information. Those ratios are simply used to determine the W value. If you are interested in the relationships between clades of microbes, then q2-gneiss is something you may want to consider.
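To illustrate how W is just a tally, here is a minimal sketch in plain Python. It assumes the pairwise decisions already exist: `rejected_pairs` is a hypothetical set of taxon pairs whose log-ratio differed significantly between groups. The statistical tests that produce those decisions are not shown, and the taxon names are placeholders.

```python
from itertools import combinations


def w_values(taxa, rejected_pairs):
    """Count, for each taxon, the pairwise sub-hypotheses that were rejected.

    rejected_pairs: set of frozensets {taxon_a, taxon_b}, assumed to come
    from tests on the log-ratio of the two taxa between groups (not shown).
    """
    w = {taxon: 0 for taxon in taxa}
    for a, b in combinations(taxa, 2):
        if frozenset((a, b)) in rejected_pairs:
            # A rejected ratio counts toward both taxa in the pair.
            w[a] += 1
            w[b] += 1
    return w


# Hypothetical example with six taxa.
taxa = ["tax1", "tax2", "tax3", "tax4", "tax5", "tax6"]
rejected_pairs = {
    frozenset(("tax1", "tax2")),
    frozenset(("tax1", "tax3")),
    frozenset(("tax1", "tax4")),
    frozenset(("tax1", "tax5")),
}

w = w_values(taxa, rejected_pairs)
print(w["tax1"])  # 4
```

With only six taxa a W of 4 looks large, but in a real table with hundreds of features a W of 4 is a very small fraction of the possible ratios.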

As for the 25% limitation: that is a great question. Unfortunately there is no programmatic way of knowing this within the QIIME 2 framework; it requires some prior understanding of your experiment. For example, are your samples all from the same source (say, comparing stool to stool), or are you comparing entirely different ecosystems, say ocean samples to hospital surfaces? It’s easy to assume the latter will have more than 25% differences, while the stool samples are probably going to be pretty similar to each other (unless the treatment between them has a large effect size).

Also, I wouldn’t really trust the results of your ANCOM output in this case. Low W values that are identified as significant are probably due to error; see this post for more info.


Thank you very much Dr. Mehrbod for your kind advice and explanation. It’s indeed helpful.

