gathering a random subset of reference sequences

Is there any functionality to subset randomly from a set of reference sequences? For instance, qiime taxa filter-seqs is great when I want to retain (or exclude) some taxonomic group within a reference database, but what if I want to select just a random subset of references?

Maybe this is more of a feature request for RESCRIPt, @SoilRotifer, @Nicholas_Bokulich, and @thermokarst? For now I’m resorting to exporting the sequences as a .fasta file, using seqkit sample to perform the subsampling, collecting those feature IDs in a text file, and filtering the original dataset with qiime feature-table filter-seqs.
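For anyone following along, the ID-collection step of that workaround can be sketched roughly as below. This is only an illustration, not the exact commands used here: it swaps seqkit sample for awk plus GNU shuf so it runs with standard tools, the function name sample_fasta_ids and all file names are made up, and it assumes each feature ID is the header text up to the first whitespace.

```shell
#!/bin/sh
# Hypothetical helper (not part of QIIME 2 or seqkit): draw N random record
# IDs from a FASTA file and write them as a one-column TSV.
# Assumes GNU shuf is on PATH and that each ID is the header text up to the
# first whitespace.
sample_fasta_ids() {
  fasta="$1"   # exported sequences, e.g. dna-sequences.fasta
  n="$2"       # number of IDs to keep
  out="$3"     # TSV file to write

  # "feature-id" is one of the ID column headers QIIME 2 metadata accepts.
  printf 'feature-id\n' > "$out"

  # Strip the leading ">" from each header line, then sample without replacement.
  awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$fasta" | shuf -n "$n" >> "$out"
}

# Example call (file names are placeholders):
#   sample_fasta_ids dna-sequences.fasta 10 ids.tsv
# The TSV could then be passed to the filtering step named above:
#   qiime feature-table filter-seqs \
#     --i-data ref-seqs.qza \
#     --m-metadata-file ids.tsv \
#     --o-filtered-data ref-seqs-subset.qza
```

The export and re-filter steps are left as comments because they depend on a working QIIME 2 environment.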

Thanks for any creative thoughts if you know of a way to use existing tools within QIIME to avoid exporting/importing!

Devon

Hey @devonorourke - this is a great feature request (feel free to open a ticket on the qiime2/q2-feature-table GitHub repository). In the meantime, here is a SQL-based workaround, all in QIIME 2:

qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --m-metadata-file rep-seqs.qza \
  --p-where "[Feature ID] IN (SELECT [Feature ID] FROM metadata ORDER BY RANDOM() LIMIT 10)" \
  --o-filtered-data filtered.qza

Here we select 10 random sequences, but you can change the LIMIT value to match whatever subset size you need.
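If you want to vary the subset size from a script, one option is to build the where clause from a variable. The sketch below is just a convenience wrapper around the same SQL; random_subset_where is a made-up helper name, not a QIIME 2 command. Note that SQLite's RANDOM() is not seeded here, so as far as I know each run will return a different subset.

```shell
#!/bin/sh
# Hypothetical helper: print the --p-where clause for a given subset size.
random_subset_where() {
  n="$1"
  # Accept only digits so nothing unexpected ends up in the SQL.
  case "$n" in
    ''|*[!0-9]*) echo "subset size must be a positive integer" >&2; return 1 ;;
  esac
  # Reject zero (including forms like 00).
  [ "$n" -gt 0 ] || { echo "subset size must be a positive integer" >&2; return 1; }
  printf '[Feature ID] IN (SELECT [Feature ID] FROM metadata ORDER BY RANDOM() LIMIT %s)\n' "$n"
}

# Example use with the command from this post (file names as in the post):
#   qiime feature-table filter-seqs \
#     --i-data rep-seqs.qza \
#     --m-metadata-file rep-seqs.qza \
#     --p-where "$(random_subset_where 500)" \
#     --o-filtered-data filtered.qza
```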


