Clustering denovo - specifying input sequence processing

Hi there,

I am working with 16S amplicon data on faecal samples for gut microbiome composition analysis.

I have processed my samples using the workflow in the moving pictures tutorial:

  • Demultiplexed
  • DADA2 - denoising/merging/filtering - ASVs
  • Classification
  • Phylogenetic tree construction

I have then done the rest of my analysis in R using phyloseq, vegan etc. I am finding there are a few outliers and some abundant sequences that are not classified at genus level - but are coming up as dominant in some of my treatments.

My supervisor has recommended further clustering my ASVs into OTUs at 98% similarity. I am looking at using vsearch cluster-features-de-novo to cluster these sequences at a 98% similarity threshold.

I can’t seem to find any way to specify how the input sequences are processed and how centroids are determined - eg:

  1. User supplied order
  2. Pre-sorted based on length
  3. Abundance sorted

From the QIIME2 documents it seems the only thing you can adjust is the percent identity threshold (eg. 0.98).

Are there other ways you can adjust the clustering method to specify how input sequences are processed? I would want to do this based on the most abundant sequences, as these are likely significant in the population.

I tried to read through the forum and couldn’t seem to find any comments/discussion on this.



Hi @chantelle.reid, welcome to :qiime2:!

Have you checked this tutorial on OTU clustering?

You can start at this step with your DADA2 output. For example:

qiime vsearch cluster-features-de-novo \
  --i-table dada2-table.qza \
  --i-sequences dada2-rep-seqs.qza \
  --p-perc-identity 0.98 \
  --o-clustered-table dada2-table-dn-98.qza \
  --o-clustered-sequences dada2-rep-seqs-dn-98.qza

Let us know if this works.

Hi there!

Thanks for the prompt response :slight_smile:

I have used that tutorial and was able to successfully cluster my ASVs. However, I was just curious to know more about how this command determines centroids for clustering (whether it picks based on length, abundance etc) and if there was a way to alter the way this is determined.

In the vsearch literature it describes these 3 options:

  1. User supplied order
  2. Pre-sorted based on length
  3. Abundance sorted

I’m just curious to know what the default for determining centroids is within that command in QIIME2 (eg. cluster-features-de-novo)



Hi again,

I also just wanted to see if anybody knows a way you can go back to ASV level from your OTUs? Is there a way to track back and know which ASVs were mapped to which OTU?



Hi @chantelle.reid!

I believe it is sorted by abundance: behind the scenes, the cluster-features-de-novo command uses the --cluster_size vsearch flag.

As for tracking which ASVs were mapped to which OTU: not at this time.
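To illustrate why the processing order matters for centroid selection, here is a minimal Python sketch of greedy, abundance-sorted clustering. It is a simplification and not vsearch's actual implementation: the `identity` function is a toy position-by-position comparison that assumes equal-length sequences, each sequence joins the first centroid that meets the threshold, and the sequence IDs and abundances are made up for the example.

```python
def identity(a, b):
    """Toy identity: fraction of matching positions (assumes equal lengths)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, abundances, threshold):
    """Greedy clustering with sequences processed in decreasing abundance.

    seqs: {id: sequence}, abundances: {id: count}.
    Returns {centroid_id: [member ids, centroid first]}.
    """
    # Most abundant first; ties broken by id so the result is deterministic.
    order = sorted(seqs, key=lambda sid: (-abundances[sid], sid))
    clusters = {}
    for sid in order:
        for cid in clusters:  # centroids in the order they were created
            if identity(seqs[sid], seqs[cid]) >= threshold:
                clusters[cid].append(sid)
                break
        else:
            # No existing centroid is close enough: start a new cluster.
            clusters[sid] = [sid]
    return clusters


# Hypothetical toy data: A and B differ at 1 position, B and C at 1, A and C at 2.
seqs = {"A": "AAAAAAAAAA", "B": "AAAAAAAAAT", "C": "AAAAAAAATT"}

run1 = greedy_cluster(seqs, {"A": 100, "B": 10, "C": 50}, 0.9)
run2 = greedy_cluster(seqs, {"A": 10, "B": 100, "C": 5}, 0.9)
print(run1)
print(run2)
```

With the same sequences and threshold, the first set of abundances yields two clusters (centroids A and C) while the second yields a single cluster centred on B, which is why the sort order can change both the centroids and the number of OTUs.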

One more question for you @thermokarst or anyone else that may be able to assist

As there is currently no way in QIIME to know which ASVs were mapped to which OTUs I ran the same thing through vsearch and added the --uc output

Total input to vsearch was:

vsearch --cluster_size TFF-rep-seqs.fasta --id 0.98 --qmask none --centroids TFF-centroids.fasta --otutabout TFF-otu.txt --uc TFF-clusters.uc

Which returned 8982 clusters/OTUs

Whereas, when I used cluster-features-de-novo in QIIME:

qiime vsearch cluster-features-de-novo \
  --i-table TFF_table.qza \
  --i-sequences TFF-rep-seqs.qza \
  --p-perc-identity 0.98 \
  --o-clustered-table Clustering/TFF-table-dn-98.qza \
  --o-clustered-sequences Clustering/TFF-rep-seqs-dn-98.qza

I ended up with 8911 clusters/OTUs

I know these aren’t massively different, but they are different all the same. Can anybody tell me why they might be different and how I might address this?
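For anyone wanting to use the --uc file from the vsearch command above to see which ASVs ended up in which OTU, here is a minimal Python sketch. It assumes the standard tab-separated UC layout, where the first column is the record type (S for a centroid, H for a hit, C for a cluster summary), the ninth column is the query label and the tenth is the centroid label for hits. The function name `parse_uc` and the toy records are made up for illustration; on real data you would open TFF-clusters.uc instead of the in-memory example.

```python
import io
from collections import defaultdict


def parse_uc(handle):
    """Return {centroid_label: [member labels]} from a UC-format stream."""
    clusters = defaultdict(list)
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        record_type, query, target = fields[0], fields[8], fields[9]
        if record_type == "S":
            # The sequence seeds a new cluster and is its own centroid.
            clusters[query].append(query)
        elif record_type == "H":
            # The sequence was assigned to an existing centroid.
            clusters[target].append(query)
        # "C" records only summarise cluster sizes, so they are skipped.
    return dict(clusters)


# Hypothetical toy records; replace with open("TFF-clusters.uc") for real data.
example_uc = (
    "S\t0\t253\t*\t*\t*\t*\t*\tasv1\t*\n"
    "H\t0\t253\t98.8\t+\t0\t0\t253M\tasv2\tasv1\n"
    "S\t1\t252\t*\t*\t*\t*\t*\tasv3\t*\n"
    "C\t0\t2\t*\t*\t*\t*\t*\tasv1\t*\n"
    "C\t1\t1\t*\t*\t*\t*\t*\tasv3\t*\n"
)
mapping = parse_uc(io.StringIO(example_uc))
for centroid, members in mapping.items():
    print(centroid, len(members), ",".join(members))
```

Note that the labels are whatever appears in the FASTA headers, so any annotations on those headers will show up in the mapping as well.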



Ah no worries, thanks for the information @thermokarst! :slight_smile:

Hi! I’m currently attempting to cluster rep seqs from DADA2 output, and I’m wondering if the rep seqs need to be trimmed to equal length prior to clustering. I was reading on some of the other qiime2 material that sequences should be trimmed to equal length prior to clustering, so I’m just trying to make sure I’m proceeding appropriately.


Hi @rhizorick,

In my experience, I’ve not had issues with variable-length genes (e.g. ITS, trnL) when using DADA2. See the ITS workflow from the DADA2 developers.

From their tutorial:

Please ignore all the “Not all sequences were the same length.” messages in the next couple sections. We know they aren’t, and it’s OK!



Thanks for the speedy response, Mike!
