Thanks for the very interesting feedback @colinbrislawn,
I spent a bit of time trying to figure this out myself the last few days, and I tested three possible routes towards classifying a set of sequences. I started with a file from a pile of bat guano data (as is customary for me...), which had been denoised in QIIME via DADA2. So, a fasta file of about 9,000 representative sequence features/ASVs (let's call it repSeqs.qza). Using that file, I then tried the following:
- Export that repSeqs.qza QIIME-formatted fasta file of my ASVs back into a standard fasta file. Then, run VSEARCH as a stand-alone piece of software with vsearch --cluster_size (see the sketch just after this list). This produced a set of 5,483 clusters.
- I took the same repSeqs.qza file and ran qiime vsearch cluster-features-de-novo (also sketched below). In addition to the sequence features, I also had to provide the related repTable.qza file that contained the abundance information for each ASV. This produced a set of 5,105 clusters.
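In case it's useful, here's roughly what those two routes look like as commands (a sketch only; the output file names and the 0.97 identity threshold are placeholders, not necessarily what I actually used):

# Route 1: export the ASV sequences and cluster with stand-alone VSEARCH
qiime tools export --input-path repSeqs.qza --output-path exported
vsearch --cluster_size exported/dna-sequences.fasta --id 0.97 \
    --centroids vsearch-centroids.fasta --uc vsearch-clusters.uc

# Route 2: cluster inside QIIME 2, which also takes the feature table
qiime vsearch cluster-features-de-novo \
    --i-sequences repSeqs.qza \
    --i-table repTable.qza \
    --p-perc-identity 0.97 \
    --o-clustered-sequences clustered-seqs.qza \
    --o-clustered-table clustered-table.qza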
While scratching my head over why I recovered two different sets of clusters, I got to digging a bit into the VSEARCH manual, as well as the forums - hey there again @colinbrislawn, you're all over that forum too!
I then realized the difference arose because when I exported the repSeqs.qza file to run in VSEARCH by itself, I lost all the abundance information. So, after a bit of fasta reformatting in R, I converted the original exported fasta that looked like this:
>0001497ccb3e484868f0c5d332e65cb3
AACATTATATTTTATTTTCGGAATTTGAGCAG...
>0002dd2a04ee401332d8d1e9ecb61057
AACATTATATTTTATTTTTGGAGCCTGATCAG...
>0016e4bbcf484b6b1cd8c75ce052cf3f
AACATTATATTTTATTTTTGGGACATTAGCTGG...
and added those abundances to the fasta headers to look like this (note the addition of ;size=... at the end of each header):
>0001497ccb3e484868f0c5d332e65cb3;size=14
AACATTATATTTTATTTTCGGAATTTGAGCAG...
>0002dd2a04ee401332d8d1e9ecb61057;size=15
AACATTATATTTTATTTTTGGAGCCTGATCAG...
>0016e4bbcf484b6b1cd8c75ce052cf3f;size=5
AACATTATATTTTATTTTTGGGACATTAGCTGG...
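For what it's worth, here's a rough sketch of how that annotation step could be done from the exported feature table (I actually did the reformatting in R, so treat the shell/awk below as an untested equivalent, with placeholder file names):

# Export the feature table and convert it to TSV
qiime tools export --input-path repTable.qza --output-path exported-table
biom convert -i exported-table/feature-table.biom -o feature-table.tsv --to-tsv

# Sum the counts across samples for each ASV (the first two lines are headers)
awk 'NR > 2 { s = 0; for (i = 2; i <= NF; i++) s += $i; print $1 "\t" s }' \
    feature-table.tsv > abundances.tsv

# Append ;size=N to each matching fasta header
awk 'NR == FNR { size[$1] = $2; next }
     /^>/ { id = substr($1, 2); print ">" id ";size=" size[id]; next }
     { print }' abundances.tsv exported/dna-sequences.fasta > dna-sequences_size.fasta

# Re-cluster the size-annotated fasta; --sizein makes vsearch use the annotations
vsearch --cluster_size dna-sequences_size.fasta --sizein --id 0.97 \
    --centroids vsearch-centroids-size.fasta --uc vsearch-clusters-size.uc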
With that, I re-ran vsearch by itself, and sure enough, the clusters I got matched those of the qiime vsearch cluster-features-de-novo output - exactly the same 5,105 clusters. This leads me to believe that if you take your QIIME-formatted fasta file, export it, and then try to run the same thing with VSEARCH by itself, you are going to generate differences, because you lose the abundance information that QIIME provides via its feature-table.qza artifact.
One other test I performed: could I get similar results using RESCRIPt? I really wanted the handy functionality present in its dereplicate tool, because it would let me both cluster the sequences and apply an LCA process with a particular extension that ignores possibly empty taxa (the --p-mode 'super' argument). Unfortunately, the way it currently operates, it doesn't use any abundance information (you can't supply a feature table like in qiime vsearch cluster-features-de-novo), so while you get a very cool way to classify your taxonomy, the clusters themselves are based on sequence length, not abundances. In short, qiime rescript dereplicate will produce the same set of 5,483 clusters observed with vsearch --cluster_size when using the fasta file exported directly from QIIME.
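For reference, the RESCRIPt run looks roughly like this (a sketch from memory - the taxonomy file name is a placeholder, and the parameter names are worth double-checking against qiime rescript dereplicate --help):

qiime rescript dereplicate \
    --i-sequences repSeqs.qza \
    --i-taxa taxonomy.qza \
    --p-mode 'super' \
    --p-perc-identity 0.97 \
    --o-dereplicated-sequences rescript-derep-seqs.qza \
    --o-dereplicated-taxa rescript-derep-taxa.qza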
As with just about everything, @Nicholas_Bokulich was super helpful in clarifying why these differences arise, and it appears to come down to the intended functionality of the program - see this thread.
Cheers