Dear @kilimandjaro,

Thank you for your question! It is a single random draw. Let’s work through an example though.

Say we have a single sample. In our study, let’s say we have six total organisms/OTUs/ASVs/etc. One way we could represent this single sample would be as a vector, where each position in the vector corresponds to an organism, and each value is the number of sequences in that sample assigned to that organism. We’ll call this a “count vector.” For example, our sample could be `[0, 0, 3, 0, 2, 1]`, which could be interpreted as having zero sequences for the first two organisms, three sequences for the third, zero for the fourth, two for the fifth and one for the sixth.

What we’re doing in the subsample is first “expanding” the count vector so every single sequence is represented, then subsampling the sequences, and finally “compressing” the result back to a count vector. We can represent the sequences by their index position, which is convenient because it lets us work with just numbers, which computers handle well.

If we expand the above vector, we would get `[2, 2, 2, 4, 4, 5]`. What we’ve done is repeat each index position as many times as its count (n.b. we’re using 0-based indexing, where the first element in the count vector is at index zero). One property here is that the length of this expanded vector is equal to the sum of the count vector.
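As a minimal sketch of the expansion step, assuming a dense NumPy array rather than the sparse representation biom actually uses (the variable names here are only for illustration):

```python
import numpy as np

# The example count vector: one entry per organism.
counts = np.array([0, 0, 3, 0, 2, 1])

# Repeat each 0-based index position as many times as its count.
expanded = np.repeat(np.arange(counts.size), counts)

print(expanded.tolist())  # [2, 2, 2, 4, 4, 5]

# The expanded vector has one entry per sequence in the sample.
assert expanded.size == counts.sum()
```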

We then randomly permute this expanded vector. One possible permutation we could get is `[4, 2, 2, 4, 5, 2]`. The subsample itself is simply the first `n` values of the permuted vector. In other words, if we subsampled this permuted vector to get four counts, we would get `[4, 2, 2, 4]`.
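The permute-then-truncate step could be sketched like this, again with plain NumPy. The seed is arbitrary and only there to make the run repeatable, so the permutation you get will generally differ from the one in the example:

```python
import numpy as np

# Arbitrary seed, chosen only so the sketch is reproducible.
rng = np.random.default_rng(42)

expanded = np.array([2, 2, 2, 4, 4, 5])
n = 4  # number of sequences to keep

# Shuffle a copy of the expanded vector, then keep the first n entries.
permuted = rng.permutation(expanded)
subsampled = permuted[:n]

print(permuted.tolist())
print(subsampled.tolist())
```

Because each entry of the expanded vector appears exactly once in the permutation, this is sampling without replacement.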

The last trick is to transform the subsampled expanded vector back into a count vector. In the above example, our subsampled count vector would be `[0, 0, 2, 0, 2, 0]`. If we compare this back to our input count vector, we can see that two of the three sequences associated with the third organism are retained, all of the sequences from the fifth are retained, and zero of the sequences from the sixth are retained.
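Putting the three steps together, here is a rough dense-NumPy sketch. To be clear, this is not the biom implementation; `subsample_counts` is a hypothetical helper name I made up to illustrate the idea:

```python
import numpy as np

def subsample_counts(counts, n, rng=None):
    """Hypothetical helper: subsample a count vector without replacement."""
    counts = np.asarray(counts)
    if n > counts.sum():
        raise ValueError("cannot keep more sequences than the sample has")
    rng = np.random.default_rng(rng)  # accepts None, a seed, or a Generator

    # Expand: one entry per sequence, labelled by its organism index.
    expanded = np.repeat(np.arange(counts.size), counts)
    # Subsample: permute, then keep the first n entries.
    picked = rng.permutation(expanded)[:n]
    # Compress: tally how often each index survived.
    return np.bincount(picked, minlength=counts.size)

# Compressing the subsample from the worked example:
print(np.bincount([4, 2, 2, 4], minlength=6).tolist())  # [0, 0, 2, 0, 2, 0]

# A full run on the example count vector (result depends on the seed):
result = subsample_counts([0, 0, 3, 0, 2, 1], 4, rng=0)
print(result.tolist())
```

Whatever the random draw, the result sums to `n`, and no organism can end up with more sequences than it started with.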

The code used in the `biom.Table` object is optimized, so it is a bit cryptic, but you can find that code here. It’s cryptic in part because we’re using a sparse vector and the code is expressed in Cython.

Best,

Daniel