How to correct for read depth when using ANCOM-BC

Any advice on how to correct for a read-depth covariate in QIIME 2 ANCOM-BC? From PERMANOVA I saw that the number of reads per sample has a significant effect, but since it is a continuous variable I cannot add it here. Would making groups of "shallow", "average", and "deep" be enough, or is there other advice for this problem?
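In case it helps, one concrete way to make such groups is tertile binning of the depth variable. This is only a sketch of the idea in plain Python; the read counts and the `depth_group` helper are made up for illustration, not from the original question:

```python
# Sketch: bin a continuous read-depth variable into tertiles so it can be
# used as a categorical covariate ("shallow" / "average" / "deep").
# The read counts below are illustrative only.
n_reads = [12_000, 48_000, 95_000, 150_000, 300_000, 510_000]

def depth_group(x, values):
    """Assign x to a tertile of `values`."""
    s = sorted(values)
    lo, hi = s[len(s) // 3], s[2 * len(s) // 3]
    if x < lo:
        return "shallow"
    elif x < hi:
        return "average"
    return "deep"

groups = [depth_group(x, n_reads) for x in n_reads]
print(groups)  # → ['shallow', 'shallow', 'average', 'average', 'deep', 'deep']
```

Whether three bins capture the depth effect well enough is a separate question, of course; quartiles or a finer split work the same way.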

Thanks!


Hi @rahel_park ,

Just out of curiosity, which beta diversity metric are you testing with PERMANOVA? Is it a metric computed from a rarefied table (e.g. Bray-Curtis, Jaccard)? I ask because it seems strange to me to use depth as metadata when testing a metric computed from a table in which all samples have equal depth.

Cheers!

Sergio


Hi @salias,
For PERMANOVA I used both Bray-Curtis and Aitchison. With Bray-Curtis distance, F was 3.8 and the p-value 0.001; with Aitchison, F was 1.20 and the p-value 0.106. I have a separate continuous variable for read number, so it's not calculated from the table but is in the metadata table (the mapped reads from MetaPhlAn).

Here's the distribution of reads:

Though I'm now doubting whether I need to take read depth into account with ANCOM-BC at all, as it might already do so inherently with its sample bias correction - could that be the case? (Which is not the case in MaAsLin2, to my understanding, and therefore I should still start my variable list with Nreads there.)

Thank you in advance!

It looks like sample depth has no significant effect on Aitchison distances (which, unlike Bray-Curtis, are built from a non-rarefied table). Again, I am skeptical about treating sample depth as a possible covariate to explain the differences in sample diversity, mainly because it does not seem to me to be a covariate per se; I think it would be like using the very name of the sample as a covariate. I think covariates have to be either biological (e.g. temperature, type of treatment) or, if they are technical, have a direct explanation (e.g. DNA extraction method, library preparation method). And even from the latter I think it would be complicated to draw conclusions.

Anyway, I am just a novice in this type of analysis. Maybe using sample depths has a value that I am not seeing. If this is the case, I am sure that a senior member of the forum can advise you much better than me.

You don't need to rarefy prior to ANCOM-BC because it already estimates the unknown sampling fractions and corrects the bias induced by their differences among samples.

I'm afraid I cannot help you with this one - I'm not familiar with MaAsLin2.


Hi @salias,

Thank you for the reply.
I didn't know that Bray-Curtis distances are typically built from a rarefied table - interesting information (I used microViz::dist_calc for both).
For me it does make sense to treat sample depth as a technical covariate. If a sample has far fewer reads than another, it should have fewer rare species present.
However, then I would expect some correlation between the number of reads and the number of species observed, but this is not really the case, just a slight tendency:
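For what it's worth, one quick way to put a number on that tendency is a plain correlation between the two metadata columns. This is only a sketch with made-up values; the real check would use the actual read counts and observed-species counts from the metadata:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative numbers only, not the poster's data:
n_reads = [1.2e7, 0.8e7, 2.0e7, 1.5e7, 0.9e7]
observed_species = [210, 190, 230, 205, 200]
r = pearson_r(n_reads, observed_species)
print(round(r, 2))
```

A Spearman (rank) correlation would be more robust to the skewed depth distribution, but the idea is the same.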

So, let's conclude that for ANCOM-BC there's no need to correct for read depth.

Maybe you have comments on another topic that I started on ANCOM-BC?

Hello again!

Technically, you don't have to rarefy for any diversity metric, but since diversity metrics are very sensitive to sequencing depth, we should normalize in some way.

The most common way is rarefying: subsample frequencies from all samples without replacement so that the sum of frequencies in each sample equals a sampling depth we set. To choose that sampling depth, we typically plot the alpha rarefaction curves and inspect them to find a tradeoff between "most of the curves are plateauing" and "we don't lose too many samples". I'm aware there are more normalization methods apart from rarefying, but I recommend it since it is the most common procedure and the one I use in my research¹.
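The subsampling step described above can be sketched like this - a toy illustration of the idea only, not the QIIME 2 implementation (feature names and counts are made up):

```python
import random

def rarefy(counts, depth, seed=42):
    """Subsample one sample's feature counts down to `depth` reads
    without replacement. `counts` maps feature -> read count."""
    # Expand counts into one entry per read, then sample reads.
    pool = [f for f, c in counts.items() for _ in range(c)]
    if len(pool) < depth:
        raise ValueError("sample has fewer reads than the rarefaction depth")
    rng = random.Random(seed)
    out = {}
    for f in rng.sample(pool, depth):  # without replacement
        out[f] = out.get(f, 0) + 1
    return out

sample = {"ASV1": 500, "ASV2": 300, "ASV3": 5}
rare = rarefy(sample, depth=100)
print(sum(rare.values()))  # → 100
```

Note how a rare feature like the hypothetical "ASV3" can drop out entirely after subsampling - which is exactly why samples below the chosen depth are discarded rather than rarefied.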

Exactly! And that's why we rarefy prior to most diversity metrics.

Does this plot include all your samples? Here I would plot alpha rarefaction curves per sample instead. If you see something similar to this in those curves, it would mean that with 1.0e+07 reads you have already captured most of the diversity present in your samples (i.e., the curves start to plateau), so you could look for a good point to set the sequencing depth threshold and rarefy to that depth. Then you can generate the Bray-Curtis distance matrix from that table (Aitchison and the differential abundance method ANCOM-BC do not need rarefaction, but you can try the rarefied table anyway if you wish).

See ².

Best wishes,

Sergio

--

¹ As everything in science (and life), rarefaction is not without controversy. In this regard, @jwdebelius shared a really interesting reference in a previous post:


² I did see that post (like all the others on the forum). I did not answer because I have only used the QIIME 2 implementation of ANCOM-BC and I'm not familiar with the others :smiling_face_with_tear:. Your post is already assigned to a member of the QIIME 2 team, so you should receive an answer soon. Please note that many people take their vacations in August, so it is normal for the answer to your question to be delayed a little. From the QIIME 2 Community Code of Conduct:

Be patient

The moderators of this forum are the QIIME 2 developers. We strive to reply to questions within one working day (i.e., Monday through Friday, not including holidays). Remember that when you post a question to the QIIME 2 Forum, you're asking for someone's time to help you with software that they're giving to you for free. Please be patient while waiting for a reply, and don't cross-post your question.

And, in my case, I'm not even a moderator or a member of the QIIME 2 team :grimacing:


Hej @salias and @rahel_park,

Hopefully it's okay if I QIIME in! (:slightly_smiling_face:)

Everything @salias has said is true! I just wanted to comment on one thing. I think about rarefaction like buttons on men's suits:

Beta: Sometimes
Alpha: Always
DA: Never

You need to rarefy before you calculate things involving richness or unweighted metrics because you're looking at presence and absence, and depth is associated with presence/absence. Rarefaction introduces the zeros you need to help deal with differences in depth. I've had discussions with colleagues over drinks (:tea:) about whether it's the "right" solution for depth differences, but at least according to that recent Pat Schloss paper it seems expeditious. (SRS is an interesting alternative, if that's your direction.)

Bray-Curtis distance doesn't care as much about rarefaction, because Bray-Curtis emphasizes abundant organisms, which are less sensitive to depth. I still tend to recommend it (although there are other schools of thought), but it's not going to suffer if you don't rarefy. Jaccard or UniFrac distances, which look at the presence/absence of organisms, will amplify differences in depth, though!

For compositional metrics and tests like ANCOM-BC, zeros are actually a big :elephant: problem! For Aitchison (and I think ANCOM), we add a pseudocount as a constant, so all the zeros become 1s, all the 1s become 2s, etc., because you can't take a log or a ratio of 0. The problem is that on the log scale, going from 1 to 2 is the same size of step as going from 32 to 64. So we get a heavier weighting toward rare features, which is how you end up being sensitive to depth.
(FWIW, and unpublished, but I find DEICODE to be a nice alternative to Aitchison distance that's less sensitive to depth.)
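To make the log-scale point concrete, here is a tiny numerical check (plain Python, nothing ANCOM-BC-specific):

```python
import math

# On a log2 scale, a doubling is a fixed-size step no matter where it starts:
step_rare = math.log2(2) - math.log2(1)        # count 1 -> 2 (a pseudocounted zero vs. a real singleton)
step_abundant = math.log2(64) - math.log2(32)  # count 32 -> 64
print(step_rare, step_abundant)  # → 1.0 1.0

# The same one-read difference barely moves an abundant feature:
print(round(math.log2(1001) - math.log2(1000), 4))  # → 0.0014
```

So a pseudocount turning a zero into a one produces a full doubling-sized step for rare features, while perturbing abundant features almost not at all.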

Rarefaction introduces more zeros into the data, so you end up having more issues with your Aitchison distance!

I tend to think about Aitchison more with filtering (how can I get rid of rare organisms to decrease my zeros), but you may have different thoughts/constraints.

Hope this helps and adds to the awesome job @salias is doing! Thank you

Best,
Justine


Hi @salias and @jwdebelius ,
Thank you both for your comments!
Indeed, I had the impression as well that for distance calculations it's not always obligatory to normalize. Running the rarefaction curves takes some time, so meanwhile I just normalized the table to relative abundance, then ran Bray-Curtis and PERMANOVA to check which variables have a significant effect. Now the number-of-reads variable gets F 1.0385467 and p-value 0.354 (compared to F 3.8 and p-value 0.001 with counts). (Nreads is a variable in the metadata: the MetaPhlAn mapped reads.) Meanwhile, the values for the rest of the variables didn't change much.
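For reference, the normalization and distance involved in that check can be sketched like this (made-up counts; the two toy samples have nearly identical composition but roughly 10× different depth, so their Bray-Curtis distance on relative abundances is tiny):

```python
def rel_abund(counts):
    """Convert raw counts to relative abundances (proportions)."""
    total = sum(counts)
    return [c / total for c in counts]

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den

# Same composition, ~10x different sequencing depth (illustrative counts):
s1 = rel_abund([500, 300, 5])
s2 = rel_abund([50, 30, 1])
print(round(bray_curtis(s1, s2), 3))  # → 0.006
```

On raw counts the same pair would look very different, which matches the drop in F for the depth variable after normalizing.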
For ANCOM-BC the only filtering is based on prevalence, but I will look into more filtering steps and also try out DEICODE.

My note about the other topic that I opened was not about rushing someone to answer - I just mentioned it in case you had some more insight ;). I didn't expect that you read all the posts on the forum! In general, I find that questions here are answered really fast and thoroughly - great work! This is one of the main reasons I like QIIME: the active forum with a lot of helpful information.

