OTU cluster de-novo command does not affect sequence numbers?

I'm rerunning some experiments with qiime that I initially ran with an OTU clustering threshold of 97% last year. While looking at the output statistics I thought the number of OTUs I obtained was unusually high, so i went through the sequence reduction at each stage of the pipeline to try to determine the source.

I noticed that absolutely no sequence was lost in the clustering step, which to me seems to suggest either all sequences already differed by 97% after chimera removal or clustering was not carried out. I tried repeating with a 99% clustering threshold and got the same result.

I thought this might be an issue with my data (though ASV denoising worked just fine), so I repeated the commands using the data from the moving-pictures tutorial which I believe was also the dataset used to demonstrate OTU clustering at the time. I've attached the sequence # output I got at each stage of the clustering pipeline with the moving pictures data. There's no reduction in sequence numbers at either clustering threshold.

At this point I'm just very confused, so any insight is appreciated.

(qiime2-2020.11) qiime vsearch cluster-features-de-novo --i-table table_filtered.qza --i-sequences seqs_nonchimeras.qza --p-perc-identity 0.97 --o-clustered-table table.qza --o-clustered-sequences rep-seqs.qza --verbose

Running external command line application. This may print messages to stdout and/or stderr.

The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --cluster_size /var/folders/nz/vm9d7sc94sq4dr0ttzn04pqh0000gn/T/tmpleet4rs1 --id 0.97 --centroids /var/folders/nz/vm9d7sc94sq4dr0ttzn04pqh0000gn/T/q2-DNAFASTAFormat-u1e_aipj --uc /var/folders/nz/vm9d7sc94sq4dr0ttzn04pqh0000gn/T/tmpwmn1rx3g --qmask none --xsize --threads 1 --minseqlength 1 --fasta_width 0

vsearch v2.7.0_macos_x86_64, 16.0GB RAM, 4 cores

Reading file /var/folders/nz/vm9d7sc94sq4dr0ttzn04pqh0000gn/T/tmpleet4rs1 100%

22002000 nt in 144750 seqs, min 152, max 152, avg 152

Sorting by abundance 100%

Counting k-mers 100%

Clustering 100%

Sorting clusters 100%

Writing clusters 100%

Clusters: 42173 Size min 1, max 3611, avg 3.4

Singletons: 33905, 23.4% of seqs, 80.4% of clusters

Saved FeatureTable[Frequency] to: table.qza

Saved FeatureData[Sequence] to: rep-seqs.qza

Hi @elisabeth, great writeup!

I might be misreading the table, but aren't you showing that uchime, 97%, and 99% columns are all less than the dereplicated column?

This table appears to be showing us total counts of sequences - I think what you might actually be looking for though is the number of features observed (not the total count of those observations).

If I have 4 samples, and 3 features, a theoretical feature table might look like:

f1 f2 f3
s1 0 75 25
s2 10 20 80
s3 100 2 40
s4 1 0 0

s1 has 100 counts, s2 has 110 counts, s3 has 142 counts, and s4 has 1 count.

If I apply some hypothetical OTU clustering which clustered f1 and f2 together, the table would look like this:

otu_a otu_b
s1 75 25
s2 30 80
s3 102 40
s4 1 0

Notice though, the per sample counts are unchanged:
s1 has 100 counts, s2 has 110 counts, s3 has 142 counts, and s4 has 1 count.

I'll leave it as an exercise for you, but if you cluster f2 and f3 together, you'll get the same counts, and if you cluster f1, f2, and f3 together, same story.

It is worth noting that if you did this exercise with closed reference clustering, it would potentially be a different story, because some features would or would not be matched to the reference database.

So my suggestion to you is to go back to your spreadsheet and add a new column with "features" for each of the clustering methods you have used. You can get this information by running feature-table summarize on the tables output from each of your trials:

I think you are tabulating the "total frequency" data in your table, but perhaps you mean to be looking at "number of features."

I apologize if I misunderstood your question - please let me know!


Thanks so much, I’ll try this!

Another slightly more philosophical question that I’m still wondering about is: why are the numbers of ASVs so much lower than the number of OTUs, either at 97 or 99%? One would expect a priori that there would be fewer OTUs than ASVs, because ASVs do not have clusters, but in every dataset (including test datasets) I’ve run through both dada2 and vsearch there have been far more OTUs than ASVs. Is the filtering for sequences in the ASV pipeline that much more stringent?

Thanks again!

It might be helpful for you to think of ASVs as 100% clusters. Also, that sentiment might be neglecting to account for the fact that denoising methods like DADA2 and deblur operate primarily by correcting sequencing error.

It turns out that when that error is actually corrected, we see a lot fewer "unique sequences." One thing to keep in mind is that OTUs in microbiome bioinformatics have often traditionally been used as a mechanism to account for sequencing error, but grouping similar sequences together. With denoising methods we find that that kind of clustering isn't necessary, because the denoising methods produce error corrected data, generally of a very high caliber.

I think what you're seeing lines up very well with the literature and the general experiences we have seen others report here on the forum.

Hope that helps!


1 Like

Thanks for the explanation! I made some time yesterday to read through some more of the literature surrounding this as well and I think I’ve got an idea what’s going on. Thanks again for your help.