gathering a random subset of reference sequences

Is there any functionality to randomly subset a set of reference sequences? For instance, qiime taxa filter-seqs is great when I want to retain (or exclude) some taxonomic group within a reference database, but what if I want to select just a random subset of references?

Maybe this is more of a feature request for RESCRIPt @SoilRotifer, @Nicholas_Bokulich, and @thermokarst? For now I'm resorting to exporting the sequences as a .fasta file, using seqkit sample to perform the subsampling, collecting the sampled feature IDs in a text file, and filtering the original dataset with qiime feature-table filter-seqs.
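For reference, the ID-collection step in that workaround could be sketched roughly as below. This is only an illustration: it swaps seqkit sample for Python's random.sample, and the function names, the seed, and the file paths are hypothetical. It assumes the exported FASTA headers start with the feature ID, and it writes a one-column TSV with a feature-id header that could then be passed to qiime feature-table filter-seqs via --m-metadata-file.

```python
import random


def sample_feature_ids(fasta_path, n, seed=42):
    """Return n randomly chosen record IDs from a FASTA file.

    Hypothetical helper: the ID is taken as the first whitespace-delimited
    token after '>' on each header line. A fixed seed makes the draw
    repeatable.
    """
    with open(fasta_path) as handle:
        ids = [
            line[1:].split()[0]
            for line in handle
            if line.startswith(">") and line[1:].strip()
        ]
    # random.sample draws without replacement and raises ValueError
    # if n is larger than the number of records.
    return random.Random(seed).sample(ids, n)


def write_id_metadata(ids, out_path):
    """Write the IDs as a one-column TSV with a 'feature-id' header."""
    with open(out_path, "w") as out:
        out.write("feature-id\n")
        out.writelines(f"{feature_id}\n" for feature_id in ids)
```

Calling sample_feature_ids("dna-sequences.fasta", 1000) and then write_id_metadata(ids, "keep-ids.tsv") would produce the ID list; both file names here are placeholders.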

Thanks for any creative thoughts if you know of a way to use existing tools within QIIME to avoid exporting/importing!


Hey @devonorourke - this is a great feature request (feel free to open a ticket on the qiime2/q2-feature-table GitHub repository). In the meantime, here is a SQL-based workaround, all in QIIME 2:

 qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --m-metadata-file rep-seqs.qza \
  --p-where "[Feature ID] IN (SELECT [Feature ID] FROM metadata ORDER BY RANDOM() LIMIT 10)" \
  --o-filtered-data filtered.qza

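Here we select 10 random sequences, but you can change the LIMIT value to match however many you need. Note that the sequence artifact is also passed as its own metadata file, which is what makes its feature IDs available to the query.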
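To see what the --p-where clause is doing, here is a minimal sketch that runs the same query against a toy SQLite table using Python's built-in sqlite3 module. The table contents and ID values are made up for illustration, and this only mimics the query; it is not how QIIME 2 itself is invoked.

```python
import sqlite3

# Toy stand-in for the "metadata" table that the WHERE clause is run against.
# The table and column names mirror the command above; the IDs are invented.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE metadata ("Feature ID" TEXT PRIMARY KEY)')
all_ids = [f"seq{i:03d}" for i in range(100)]
conn.executemany("INSERT INTO metadata VALUES (?)", [(i,) for i in all_ids])

# Same clause as in --p-where: the inner SELECT shuffles all IDs with
# ORDER BY RANDOM() and keeps the first 10; the outer IN keeps only those rows.
where = (
    "[Feature ID] IN "
    "(SELECT [Feature ID] FROM metadata ORDER BY RANDOM() LIMIT 10)"
)
rows = conn.execute(
    f"SELECT [Feature ID] FROM metadata WHERE {where}"
).fetchall()
subset = [row[0] for row in rows]

# The selected IDs differ from run to run.
print(len(subset), subset)
```

One design consequence worth keeping in mind: each run of the query draws a different subset, so it makes sense to keep the filtered artifact if the same subset is needed again.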


