Thanks for the very interesting feedback @colinbrislawn,
I spent a bit of time trying to figure this out myself the last few days, and I tested three possible routes towards classifying a set of sequences. I started with a file from a pile of bat guano data ( as is customary for me…), which had been denoised in QIIME via DADA2. So, a fasta file of about 9,000 representative sequence features/ASVs (let’s call it
repSeqs.qza). Using that file, I then tried the following:
- Export that repSeqs.qza QIIME-formatted fasta file of my ASVs back into a standard fasta file. Then, run VSEARCH as a stand alone piece of software with
This produced a set of 5,483 clusters.
- I took the same
repSeqs.qza file, and ran
qiime vsearch cluster-features-de-novo. In addition to the sequence features, I also had to provide the related
repTable.qza file that contained the abundance information for each ASV.
This produced a set of 5,105 clusters.
In scratching my head why I recovered two different sets of clusters, it got me into digging a bit into the VSEARCH manual, as well as the forums - hey there again @colinbrislawn - you’re all over that forum too !
I then realized the difference was because when I was exporting the
repSeqs.qza file to run in VSEARCH by itself, I lost all the abundance information. So, after a bit of fasta reformatting in R, I converted the original exported fasta that looked like this:
and added those abundances in the fasta headers to look like this (note the addition of the
;size=... at the end of the header:
With that, I re-ran vsearch by itself, and sure enough, the clusters I get match those of the
qiime vsearch cluster-features-de-novo output - exactly the same 5,105 clusters. This leads me to believe if you take your QIIME-formatted fasta file, export it, and then try to run the same thing with VSEARCH by itself, you are going to generate differences because you lose the abundance information that QIIME provides via it’s feature-table.qza artifact.
One other test I performed: could I get similar results using RESCRIPt? I really wanted to get the handy functionality present in their dereplicate tool, because it would let me both cluster the sequences, as well as apply an LCA process with a particular extension where it ignores possibly empty taxa (the --p-mode ‘super’ argument). Unfortunately, the way it currently operates is that it doesn’t use any abundance information (you can’t upload a feature-table like in
qiime vsearch cluster-features-de-novo), so while you get a very cool way to classify your taxonomy, the clusters themselves are based on sequence length, not abundances. In short,
qiime rescript dereplicate will produce the same set of 5,483 clusters observed with
vsearch --clust_size, when using the fasta file exported directly from QIIME.
As with just about everything, @Nicholas_Bokulich was super helpful in clarifying why these differences arise, and it amounts to, it appears, the intended functionality of the program - see this thread.