Qiime 2 version of Qiime1 "pick_rep_set.py" for picking representative OTU sequences

Hi there,

I am trying to pick representative sequences from the OTU table I have created, so that I have one representative sequence per OTU cluster and a smaller OTU table that can be used to generate a phylogenetic tree. Currently I have >20,000 ASVs, which I have clustered into an OTU table using the command below:

qiime vsearch cluster-features-de-novo \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table table-dn-99.qza \
  --o-clustered-sequences rep-seqs-dn-99.qza

I have found a solution in the old Qiime1 material (pick_rep_set.py, "Pick representative set of sequences"), but I can't find a Qiime2 equivalent. Does this feature still exist in Qiime2? Similarly, the following step, where taxonomy is assigned, appears in the Qiime1 material (454 Overview Tutorial: de novo OTU picking and diversity analyses using 454 data) but I can't find it in Qiime2.

Thanks very much for your help.


Hi @Phoebe_Cunningham,

Welcome to the forum!

I believe what you are describing is in fact already produced by the second output in your command:

 --o-clustered-sequences rep-seqs-dn-99.qza

These are the representative sequences, which map each feature ID in the feature table to a sequence.

Generally in QIIME 2 we produce the table and rep-seqs at the same time, since you need to generate the same mapping to define either of them.

What we don't have is an equivalent to the old "OTU map", which would show how different individual sequences were binned to some representative sequence.
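
If you want to check this on your own files, here is a minimal sketch, assuming the output names from your command above (the .qzv file names are just placeholders I picked). It summarizes the clustered table and tabulates the representative sequences so you can compare the feature IDs:

```shell
# Output names taken from the clustering command above.
TABLE=table-dn-99.qza
REP_SEQS=rep-seqs-dn-99.qza

# Summarize the clustered table (number of features, per-feature frequencies).
qiime feature-table summarize \
  --i-table "$TABLE" \
  --o-visualization table-dn-99.qzv

# Tabulate the representative sequences (one row per feature ID).
qiime feature-table tabulate-seqs \
  --i-data "$REP_SEQS" \
  --o-visualization rep-seqs-dn-99.qzv
```

Both visualizations can be opened with qiime tools view. The number of features in the table should match the number of sequences in rep-seqs.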

Hi Evan,

Thanks for your response! That makes sense. I guess I thought the number of sequences would drop more once the OTUs were clustered and the representative sequences pulled out. My data went from 20,000 ASVs to 14,000, which is still far too many to include in a phylogeny.

Is there a way to filter the rep-seqs file so that only the most prevalent sequences, above a certain threshold, are kept? I have filtered the feature-table but can't find the same solution for rep-seqs. Equally, is it possible to filter by species, i.e. to find the most prevalent sequences within a species (rather than within a sample; I have multiple samples per species)?

Thanks so much again.

Hi @Phoebe_Cunningham, You can use qiime feature-table filter-seqs to do the filtering you're requesting. First, filter the feature table how you want it (e.g., with qiime feature-table filter-features), and then call qiime feature-table filter-seqs, providing the filtered feature table as the table input. That will filter the sequences to only those whose feature ID shows up in the feature table.
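
As a rough sketch of those two steps, assuming the clustered outputs from earlier in the thread (the thresholds and the filtered-* file names below are placeholders, not recommendations):

```shell
# Inputs: clustered outputs from earlier in the thread.
TABLE=table-dn-99.qza
REP_SEQS=rep-seqs-dn-99.qza
# Outputs: placeholder names, change as you like.
FILTERED_TABLE=filtered-table-dn-99.qza
FILTERED_SEQS=filtered-rep-seqs-dn-99.qza

# 1. Keep only features with a total frequency of at least 10
#    that are observed in at least 2 samples (example thresholds).
qiime feature-table filter-features \
  --i-table "$TABLE" \
  --p-min-frequency 10 \
  --p-min-samples 2 \
  --o-filtered-table "$FILTERED_TABLE"

# 2. Keep only the sequences whose feature ID is still in the filtered table.
qiime feature-table filter-seqs \
  --i-data "$REP_SEQS" \
  --i-table "$FILTERED_TABLE" \
  --o-filtered-data "$FILTERED_SEQS"
```

The filtered rep-seqs file should then contain one sequence per feature left in the filtered table, and can be used as the input for building the tree.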
