How does qiime feature-table rarefy work?

kilimandjaro · June 27, 2018, 11:52am

Hi,
I would like more information about qiime feature-table rarefy.
I found a description here: https://docs.qiime2.org/2018.6/plugins/available/feature-table/rarefy/?highlight=rarefy
It describes what qiime feature-table rarefy does as:
“Subsample frequencies from all samples without replacement so that the sum of frequencies in each sample is equal to sampling-depth”

Do you know whether the subsample is the result of one or several random draws among features (for samples whose number of features is greater than the sampling depth)?

Thanks for your help,

Have a nice day

wasade · June 27, 2018, 5:04pm

Dear @kilimandjaro,

Thank you for your question! It is a single random draw. Let’s work through an example though.

Say we have a single sample. In our study, let’s say we have six total organisms/OTUs/ASVs/etc. One way we could represent this single sample would be as a vector, where the positions in the vector correspond to an organism, and the values correspond to the number of sequences in that sample that correspond to a given organism. We’ll call this a “count vector.” For example, our sample could be [0, 0, 3, 0, 2, 1] which could be interpreted as having zero sequences for the first two organisms, three sequences for the third, zero for the fourth, two for the fifth and one for the sixth.

What we’re doing in the subsample is to first “expand” the count vector so every single sequence is represented, subsample the sequences, and “compress” it back to a count vector. We can represent the sequences by their index position, which is great because it lets us work with just numbers, which computers handle well.

If we expand the above vector, we would get [2, 2, 2, 4, 4, 5]. What we’ve done is repeated each index position by the number of counts (n.b. we’re using 0-based indexing where the first element in the count vector is the zeroth index). One property here is that the length of this expanded vector is equal to the sum of the count vector.
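To make the expansion step concrete, here is a minimal NumPy sketch. It is only an illustration of the idea described above, not the optimized code in biom; the variable names are my own.

```python
import numpy as np

# Count vector for one sample: position = organism, value = number of sequences.
counts = np.array([0, 0, 3, 0, 2, 1])

# Repeat each (0-based) index position by its count.
expanded = np.repeat(np.arange(counts.size), counts)

print(expanded.tolist())  # [2, 2, 2, 4, 4, 5]

# The expanded vector has one entry per sequence in the sample.
assert expanded.size == counts.sum()
```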

We then randomly permute this expanded vector. One possible permutation we could get is [4, 2, 2, 4, 5, 2]. The subsample itself is to simply take the first n values of a permuted vector. In other words, if we subsampled this permuted vector to get four counts, we would get [4, 2, 2, 4].
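A rough sketch of the permute-and-truncate step, again in plain NumPy rather than the actual biom code. The seed and the names `rng`, `permuted` and `subsample` are arbitrary choices for this illustration, and the exact permutation you get will depend on the seed.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only so the sketch is repeatable

expanded = np.array([2, 2, 2, 4, 4, 5])
n = 4  # the sampling depth

# Shuffle every sequence once, then keep the first n entries.
# This is a single draw without replacement.
permuted = rng.permutation(expanded)
subsample = permuted[:n]

print(permuted.tolist())
print(subsample.tolist())
```

Because the subsample is a prefix of a permutation, no sequence can be picked twice, which is what "without replacement" means here.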

The last trick is to transform the subsampled expanded vector back into a count vector. In the above example, our subsampled count vector would be [0, 0, 2, 0, 2, 0]. If we compare this back to our input count vector, we can see that two of the three sequences associated with the third organism are retained, all of the sequences from the fifth are retained, and zero of the sequences from the sixth are retained.
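The compression step, and the three steps put together, could be sketched like this. The function name `rarefy_vector` is hypothetical, and raising an error when the sample has fewer sequences than the requested depth is a choice made for this sketch, not necessarily how QIIME 2 or biom handle that case.

```python
import numpy as np

# Compress: count how often each index occurs, padding out to six organisms.
subsample = np.array([4, 2, 2, 4])
print(np.bincount(subsample, minlength=6).tolist())  # [0, 0, 2, 0, 2, 0]


def rarefy_vector(counts, depth, rng):
    """Subsample one count vector to `depth` sequences without replacement.

    Illustrative only: expand, permute, truncate, compress.
    """
    counts = np.asarray(counts)
    if depth > counts.sum():
        # Assumption of this sketch: refuse samples that are too shallow.
        raise ValueError("sample has fewer sequences than the sampling depth")
    expanded = np.repeat(np.arange(counts.size), counts)
    picked = rng.permutation(expanded)[:depth]
    return np.bincount(picked, minlength=counts.size)


rng = np.random.default_rng(0)  # arbitrary seed
result = rarefy_vector([0, 0, 3, 0, 2, 1], 4, rng)
print(result.tolist())  # sums to 4; the exact values depend on the seed
```

Whatever the seed, the output sums to the sampling depth and no organism ends up with more sequences than it had in the input.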

The code used in the biom.Table object is optimized, so it is a bit cryptic, but you can find that code here. It’s cryptic in part because we’re using a sparse vector and the code is expressed in Cython.

Best,
Daniel

kilimandjaro · June 28, 2018, 3:07pm

Thank you. I didn’t expect such a quick and complete answer!

I talked with a biostatistics researcher in my lab about the choice of a single draw. We concluded that it is convenient for the user, because running more draws takes more time, and that the result should not change much given the amount of data…

Best,

wasade · June 28, 2018, 6:32pm

You’re welcome! You may also be interested in this paper by Dr. Sophie Weiss benchmarking different normalization strategies.

Best,
Daniel

system · July 30, 2018, 12:32am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.