I'm rerunning some experiments with QIIME 2 that I initially ran last year with an OTU clustering threshold of 97%. While looking at the output statistics, I thought the number of OTUs I obtained was unusually high, so I went through the sequence reduction at each stage of the pipeline to try to determine the source.
I noticed that no sequences at all were lost in the clustering step, which to me suggests that either all sequences were already less than 97% identical to one another after chimera removal, or clustering was not carried out. I repeated the step with a 99% clustering threshold and got the same result.
I thought this might be an issue with my data (though ASV denoising worked just fine), so I repeated the commands using the data from the moving-pictures tutorial, which I believe was also the dataset used to demonstrate OTU clustering at the time. I've attached the sequence counts I got at each stage of the clustering pipeline with the moving-pictures data. There is no reduction in sequence counts at either clustering threshold.
At this point I'm just very confused, so any insight is appreciated.
Command:
(qiime2-2020.11) qiime vsearch cluster-features-de-novo --i-table table_filtered.qza --i-sequences seqs_nonchimeras.qza --p-perc-identity 0.97 --o-clustered-table table.qza --o-clustered-sequences rep-seqs.qza --verbose
Output:
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.
Command: vsearch --cluster_size /var/folders/nz/vm9d7sc94sq4dr0ttzn04pqh0000gn/T/tmpleet4rs1 --id 0.97 --centroids /var/folders/nz/vm9d7sc94sq4dr0ttzn04pqh0000gn/T/q2-DNAFASTAFormat-u1e_aipj --uc /var/folders/nz/vm9d7sc94sq4dr0ttzn04pqh0000gn/T/tmpwmn1rx3g --qmask none --xsize --threads 1 --minseqlength 1 --fasta_width 0
vsearch v2.7.0_macos_x86_64, 16.0GB RAM, 4 cores
Reading file /var/folders/nz/vm9d7sc94sq4dr0ttzn04pqh0000gn/T/tmpleet4rs1 100%
22002000 nt in 144750 seqs, min 152, max 152, avg 152
Sorting by abundance 100%
Counting k-mers 100%
Clustering 100%
Sorting clusters 100%
Writing clusters 100%
Clusters: 42173 Size min 1, max 3611, avg 3.4
Singletons: 33905, 23.4% of seqs, 80.4% of clusters
Saved FeatureTable[Frequency] to: table.qza
Saved FeatureData[Sequence] to: rep-seqs.qza
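
For anyone who wants to compare the counts reported in the verbose output above, here is a minimal sketch that pulls them out of the log text. It assumes the output has been saved or pasted as plain text. The function name `summarize_vsearch_log` and the dictionary keys are made up for illustration and are not part of QIIME 2 or vsearch. It only reads the numbers and makes no claim about whether "seqs" here means reads or unique features.

```python
import re


def summarize_vsearch_log(log_text):
    """Extract the counts vsearch prints for a --cluster_size run.

    Hypothetical helper: it only parses the text of the log and does not
    call vsearch or QIIME 2.
    """
    seqs = re.search(r"(\d+) nt in (\d+) seqs", log_text)
    clusters = re.search(r"Clusters: (\d+)", log_text)
    singletons = re.search(r"Singletons: (\d+)", log_text)
    if seqs is None or clusters is None:
        raise ValueError("could not find the 'seqs' or 'Clusters' lines in the log")

    n_seqs = int(seqs.group(2))
    n_clusters = int(clusters.group(1))
    return {
        "input_seqs": n_seqs,
        "clusters": n_clusters,
        "singletons": int(singletons.group(1)) if singletons else None,
        # Plain difference between the two reported numbers.
        "seqs_minus_clusters": n_seqs - n_clusters,
    }


# The three relevant lines copied from the output above.
log = """22002000 nt in 144750 seqs, min 152, max 152, avg 152
Clusters: 42173 Size min 1, max 3611, avg 3.4
Singletons: 33905, 23.4% of seqs, 80.4% of clusters"""

summary = summarize_vsearch_log(log)
print(summary)
```

The same function could be run on the log from the 99% run to put the two thresholds side by side.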