Normalisation Seq Count

Mehrbod_Estaki · October 29, 2019, 7:18pm

Again, this really depends on what you are trying to do with this plot. Remember that the rarefaction curve does not actually rarefy your data for downstream use. To rarefy your table you need to run qiime feature-table rarefy. The rarefaction curve (as demonstrated in the Parkinson's tutorial) is used to choose a reasonable subsampling depth for the core-metrics-phylogenetic step. So you can include all of your samples in the rarefaction curve plot by setting the max depth to the maximum number of reads per sample in your table. Since no statistical test is being done on this plot, the choice doesn't matter much; a higher max depth just takes a lot longer to calculate. If the plateau in your rarefaction curve begins at, say, 3,000 sequences and stays flat thereafter, it probably doesn't matter where you cut it off as long as it's above 3,000. But it also doesn't hurt to show the full data.
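To make the idea of a plateau concrete, here is a minimal Python sketch of what a rarefaction curve computes for a single sample: subsample reads without replacement at several depths and count the observed features at each. This is only an illustration, not QIIME 2's implementation; the function names and the toy feature counts are made up for the example.

```python
import random


def rarefy_counts(counts, depth, rng):
    """Subsample `depth` reads without replacement from {feature: count}.

    Returns None when the sample has fewer reads than `depth`, which is
    the situation where a sample drops out at that depth.
    """
    # Expand the counts into one entry per read.
    reads = [feature for feature, n in counts.items() for _ in range(n)]
    if depth > len(reads):
        return None
    picked = rng.sample(reads, depth)
    rarefied = {}
    for feature in picked:
        rarefied[feature] = rarefied.get(feature, 0) + 1
    return rarefied


def observed_features_curve(counts, depths, iterations=10, seed=0):
    """Mean number of observed features at each depth (None if too deep)."""
    rng = random.Random(seed)
    curve = {}
    for depth in depths:
        values = []
        for _ in range(iterations):
            sub = rarefy_counts(counts, depth, rng)
            if sub is None:
                break
            values.append(len(sub))
        curve[depth] = sum(values) / len(values) if values else None
    return curve


# Toy sample (hypothetical numbers): 1,000 reads spread over 5 features.
sample = {"f1": 500, "f2": 300, "f3": 150, "f4": 40, "f5": 10}
curve = observed_features_curve(sample, [10, 100, 500, 1000, 2000])
for depth, mean_observed in curve.items():
    print(depth, mean_observed)
```

Once the mean stops increasing with depth, going deeper adds nothing, which is why the exact cutoff above the plateau matters little. Depths beyond the sample's total read count return None, mirroring how shallow samples are lost when the sampling depth is set too high.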

As I mentioned before, excluding samples with very few sequences is probably a good idea for any analysis, but the threshold for that depends entirely on your data and the question you are trying to ask. Say you were to remove all samples with fewer than 3,000 sequences and all of a sudden were left with n=3 in a group; that is probably not going to be very helpful when it comes to statistical testing. But if you were to set the threshold to, say, 2,000 and were then able to retain n=7 in a group, that is probably a better way to proceed. You would still have to proceed with caution, though, as 2,000 reads/sample may or may not be sufficient to capture the full diversity you are looking for (e.g. with weighted UniFrac or Shannon diversity). So there is a bit of a sweet spot you need to determine for each study. Note again that I am simply talking about 'excluding' low-sequence samples here using a filter, not rarefying per se. The choice of rarefying or using alternatives is up to you. For example, making a PCoA plot with DEICODE doesn't require rarefying, but if you are going to use one of the beta diversity metrics in beta-diversity then you will want to rarefy. ANCOM also doesn't require rarefying.
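One way to look for that sweet spot is to tabulate how many samples per group survive each candidate threshold. The sketch below does this in plain Python; the sample IDs, read counts, and group labels are invented for illustration, and in practice you would take them from your feature table summary and metadata.

```python
from collections import Counter


def retained_per_group(read_counts, groups, min_reads):
    """Count samples per group that have at least `min_reads` reads."""
    kept = Counter()
    for sample_id, n_reads in read_counts.items():
        if n_reads >= min_reads:
            kept[groups[sample_id]] += 1
    # Report every group, including those with zero retained samples.
    return {g: kept.get(g, 0) for g in sorted(set(groups.values()))}


# Hypothetical reads per sample.
read_counts = {
    "c1": 5200, "c2": 4100, "c3": 3300, "c4": 2800,
    "c5": 2500, "c6": 2200, "c7": 2050, "c8": 900,
    "t1": 6000, "t2": 4800, "t3": 3900, "t4": 3500,
    "t5": 3100, "t6": 2600, "t7": 2100,
}
# Hypothetical grouping: IDs starting with "c" are controls.
groups = {
    sample_id: ("control" if sample_id.startswith("c") else "treated")
    for sample_id in read_counts
}

for threshold in (1000, 2000, 3000):
    print(threshold, retained_per_group(read_counts, groups, threshold))
```

A table like this shows at a glance where raising the threshold starts to cost you too many samples in one group. It says nothing about whether the lower depth captures enough diversity, so it is best read alongside the rarefaction curve.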

As for your second question about ANCOM, I'll keep my answer brief since it is unrelated to the original inquiry here. In the future, please post a new question in its own thread; that way it is easier to find later and things stay organized.

This is not something you can extract from QIIME 2 (nor from the R package, I believe), probably because it is not really useful information. Those ratios are simply used to determine the W value. If you are interested in the relationships between clades of microbes, then q2-gneiss is something you may want to consider.

That is a great question. Unfortunately, there is no programmatic way of knowing this within the QIIME 2 framework; it requires some prior understanding of your experiment. For example, are your samples all from the same source (say, comparing stool to stool), or are you comparing entirely different ecosystems, say ocean samples to hospital surfaces? It's easy to assume the latter will have more than 25% differences, while the stool samples are probably going to be pretty similar to each other (unless the treatment between them has a large effect size).

Also, I wouldn't really trust your ANCOM output in this case; low W values that are identified as significant are probably due to error. See this post for more info.