Predefined size of subsample with genome sampler

mesti90 · September 9, 2020, 2:39pm

I would like to subsample a large set of genomes with genome-sampler.
For the subsampling, I currently use the following pipeline:
qiime genome-sampler sample-longitudinal \
  --i-context-seqs filtered-context-seqs.qza \
  --m-dates-file context-metadata.tsv \
  --m-dates-column date \
  --o-selection date-selection.qza

qiime genome-sampler sample-diversity \
  --i-context-seqs filtered-context-seqs.qza \
  --p-percent-id 0.9995 \
  --o-selection diversity-selection.qza

qiime genome-sampler sample-neighbors \
  --i-focal-seqs filtered-focal-seqs.qza \
  --i-context-seqs filtered-context-seqs.qza \
  --m-locale-file context-metadata.tsv \
  --m-locale-column location \
  --p-percent-id 0.9999 \
  --p-samples-per-cluster 3 \
  --o-selection neighbor-selection.qza

What I am able to do is change the percent-id in sample-diversity and check how many sequences remain after subsampling. My goal is to tell genome-sampler to choose a specific number of sequences (e.g. 700), or at least to set an upper bound on that number. Is it possible to do so?

gregcaporaso · September 9, 2020, 7:52pm

Hi @mesti90,
Thanks for your interest in genome-sampler! At the moment we don't have a way to provide an upper bound for the number of sequences that result from subsampling. In the past I've done the same thing that you're describing to hit a target number of sequences. This could definitely be a useful feature to add in the future, but it would take some re-working of how we generate the subsamples. I added an issue for this on our issue tracker here.

Greg
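
The manual tuning discussed in this thread (adjust the percent-id, re-run, count the result) could be automated as a binary search. The following is only a rough sketch, not a genome-sampler feature. `count_selected` and `find_percent_id` are hypothetical names, and `count_selected` is a placeholder that you would have to implement yourself (run sample-diversity at the given percent-id and print how many sequences were selected). The sketch also assumes that the number of selected sequences does not decrease as percent-id increases, which seems likely for clustering but is not confirmed here.

```shell
#!/usr/bin/env bash

# Hypothetical placeholder: run your subsampling at the given percent-id
# (passed as $1, e.g. 0.9995) and print the number of selected sequences.
# The counting step is not shown because it depends on your own pipeline.
count_selected() {
  local percent_id="$1"
  echo "count_selected is a placeholder; implement it (called with $percent_id)" >&2
  return 1
}

# Binary search for the highest percent-id whose count stays <= target.
# Percent-id bounds are integers in units of 0.0001 (9995 means 0.9995),
# so both bounds must lie between 0 and 9999.
# Usage: find_percent_id TARGET LOW HIGH
find_percent_id() {
  local target="$1" lo="$2" hi="$3" best="" mid n
  while (( lo <= hi )); do
    mid=$(( (lo + hi) / 2 ))
    n=$(count_selected "$(printf '0.%04d' "$mid")") || return 1
    if (( n <= target )); then
      best="$mid"          # within the bound, try a higher percent-id
      lo=$(( mid + 1 ))
    else
      hi=$(( mid - 1 ))    # too many sequences, try a lower percent-id
    fi
  done
  # Print the result, or return non-zero if no value satisfied the bound.
  [[ -n "$best" ]] && printf '0.%04d\n' "$best"
}
```

Once `count_selected` is implemented, `find_percent_id 700 9900 9999` would search percent-id values from 0.9900 to 0.9999 for the highest one that yields at most 700 sequences. Each step re-runs the subsampling, so a narrow starting range keeps the number of runs small.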