vsearch de novo output --uc option

chantelle.reid · September 25, 2020, 5:20am

Hi all,

I am currently working on a dataset where I've clustered my ASVs to 98% similarity. I've used cluster-features-de-novo to do this. However, in the event that I want to know which ASVs were clustered to which OTU there is currently no option to trace back. Is it possible at all to get the --uc output option added to this so that you can track back to ASVs?

Kind regards,

Chantelle

jcmcnch · October 3, 2020, 10:43pm

Hi Chantelle,

I ran into this issue too, and it's not currently possible AFAIK. The uc file is deleted automatically after clustering. I previously was able to avoid it being deleted by hacking the qiime2 file, but this doesn't seem to work any more (although you can try the workaround, see here: When clustering ASVs from deblur with VSEARCH, is it possible to know the membership of resulting clusters? - #7 by colinbrislawn).

I do want to fix this though (see here: Guidance modifying q2-vsearch to recover UC file) but have not yet gotten around to it yet and am not sure even if my coding knowledge is sufficient to do so... So that's where it stands right now to the best of my knowledge. I will try to take a look this coming week though to see if it's something I could even handle and get back to you!

Jesse

chantelle.reid · October 4, 2020, 10:26pm

Thanks for the feedback Jesse! No worries, it would be great to have as an option

I'll try the work around for now!

devonorourke · October 5, 2020, 12:28am

Not sure if you've already explored this option, but you should be able to run vsearch as a stand-alone piece of software outside of QIIME to generate the .uc file, while still maintaining your QIIME-based outputs. You'd just have to run the commands once in QIIME (to get the .qza files), and one with just vsearch (to get the .uc file).

Say you wanted to cluster at 98% id with QIIME:

qiime vsearch cluster-features-de-novo \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --p-perc-identity 0.98 \
  --o-clustered-table clust_98-table.qza \
  --o-clustered-sequences clust_98-seqs.qza

The docs on cluster-features-de-novo mention you're calling up this q2-vsearch Python script for clustering - it amounts to calling vsearch --cluster_size; the trick would be to make sure you match the same parameters you invoke in QIIME. I'd imagine the base parameters would look something like:

vsearch --cluster_size ./path/to/fasta/equivalent/of/rep-seqs.qza/dna-sequences.fasta \
  --id 0.98 \
  --strand plus \
  --centroids clust_98-seqs.fasta \
  --uc clust_98.uc \
  --log clust_98.log

Good luck!

chantelle.reid · October 5, 2020, 3:03am

@devonorourke I did end up running the same in vsearch. It just gave me a slightly different total sum of features identified (I'm assuming there would be slight variation each time you run it based on whatever centroids are picked).

I guess what I was trying to figure out now was how to export my data for vsearch to cluster and then import again into qiime for phylogenetic tree construction and classification. I realise that it's unlikely to differ much from the cluster-features-de-novo output but it will differ...

devonorourke · October 5, 2020, 2:05pm

Thanks @chantelle.reid,

Glad you were able to get what you needed. I bet there are folks here that can provide better insights regarding the reproducibility of your results; maybe @colinbrislawn?
I believe there is an option for -randseed within vsearch that might assist in making sure results line up between trials, but that argument doesn't appear to be invoked with the QIIME version.

My hunch is that the QIIME method and VSEARCH methods should produce the same results, and that the centroids are identified based on abundances of sequences. In the QIIME version, this information is available in the feature-table.qza file you supply, but with VSEARCH, those abundances aren't immediately available if you simply export the fasta file from the .qza file. I believe you should be able to get the same centroids identified provided that you've gathered those abundance values for each representative sequence and included them in the fasta header, then run vsearch with the --sizein argument.

Then again, you might not care about it if the differences are trivial!

Cheers

colinbrislawn · October 6, 2020, 4:38pm

Hello Chantelle and Devon,

It just gave me a slightly different total sum of features identified (I’m assuming there would be slight variation each time you run it based on whatever centroids are picked).

This is a really good question, and gives us a good chance to dive into how vsearch works.

Vsearch provides several similar algorithms for clustering sequences.
--cluster_fast | --cluster_size | --cluster_smallmem
The only differance between these commands is how they sort the input sequences. Fast sorts by length, size sorts by abundance, and smallmem does not sort at all.

Given an identical input file and search settings (--maxrejects, --maxaccepts, etc.) vsearch should produce identical results.

So why are your counts slightly different?

My first through would be to check the settings that Qiime uses verses the vsearch defaults. Here's the settings Qiime 2 uses for vsearch --cluster_size

github.com

qiime2/q2-vsearch/blob/dev/q2_vsearch/_cluster_features.py#L188-L197


      
                                   ) -> (biom.Table, DNAFASTAFormat):
          clustered_sequences = DNAFASTAFormat()
          with tempfile.NamedTemporaryFile() as fasta_with_sizes:
              with tempfile.NamedTemporaryFile() as out_uc:
                  _fasta_with_sizes(str(sequences), fasta_with_sizes.name, table)
                  cmd = ['vsearch',
                         '--cluster_size', fasta_with_sizes.name,
                         '--id', str(perc_identity),
                         '--centroids', str(clustered_sequences),
                         '--uc', out_uc.name,

Notice how qiime 2 uses no masking (--qmask none) but vsearch uses dust masking by default. How interesting!

Small changes in results could also result from updates to vsearch. Q2-vsearch uses vsearch 2.7.0 from 2018, so if you are using a newer release, results could change

I realise that it’s unlikely to differ much from the cluster-features-de-novo output but it will differ…

I agree! These small changes should still tell the same biological story and hopefully build a very similar phylogenetic tree.

Keep us posted!
Colin

devonorourke · October 6, 2020, 4:55pm

Thanks for the very interesting feedback @colinbrislawn,

I spent a bit of time trying to figure this out myself the last few days, and I tested three possible routes towards classifying a set of sequences. I started with a file from a pile of bat guano data ( as is customary for me...), which had been denoised in QIIME via DADA2. So, a fasta file of about 9,000 representative sequence features/ASVs (let's call it repSeqs.qza). Using that file, I then tried the following:

Export that repSeqs.qza QIIME-formatted fasta file of my ASVs back into a standard fasta file. Then, run VSEARCH as a stand alone piece of software with vsearch --clust_size.

This produced a set of 5,483 clusters.

I took the same repSeqs.qza file, and ran qiime vsearch cluster-features-de-novo. In addition to the sequence features, I also had to provide the related repTable.qza file that contained the abundance information for each ASV.

This produced a set of 5,105 clusters.

In scratching my head why I recovered two different sets of clusters, it got me into digging a bit into the VSEARCH manual, as well as the forums - hey there again @colinbrislawn - you're all over that forum too !

I then realized the difference was because when I was exporting the repSeqs.qza file to run in VSEARCH by itself, I lost all the abundance information. So, after a bit of fasta reformatting in R, I converted the original exported fasta that looked like this:

>0001497ccb3e484868f0c5d332e65cb3
AACATTATATTTTATTTTCGGAATTTGAGCAG...
>0002dd2a04ee401332d8d1e9ecb61057
AACATTATATTTTATTTTTGGAGCCTGATCAG...
>0016e4bbcf484b6b1cd8c75ce052cf3f
AACATTATATTTTATTTTTGGGACATTAGCTGG...

and added those abundances in the fasta headers to look like this (note the addition of the ;size=... at the end of the header:

>0001497ccb3e484868f0c5d332e65cb3;size=14
AACATTATATTTTATTTTCGGAATTTGAGCAG...
>0002dd2a04ee401332d8d1e9ecb61057;size=15
AACATTATATTTTATTTTTGGAGCCTGATCAG...
>0016e4bbcf484b6b1cd8c75ce052cf3f;size=5
AACATTATATTTTATTTTTGGGACATTAGCTGG...

With that, I re-ran vsearch by itself, and sure enough, the clusters I get match those of the qiime vsearch cluster-features-de-novo output - exactly the same 5,105 clusters. This leads me to believe if you take your QIIME-formatted fasta file, export it, and then try to run the same thing with VSEARCH by itself, you are going to generate differences because you lose the abundance information that QIIME provides via it's feature-table.qza artifact.

One other test I performed: could I get similar results using RESCRIPt? I really wanted to get the handy functionality present in their dereplicate tool, because it would let me both cluster the sequences, as well as apply an LCA process with a particular extension where it ignores possibly empty taxa (the --p-mode 'super' argument). Unfortunately, the way it currently operates is that it doesn't use any abundance information (you can't upload a feature-table like in qiime vsearch cluster-features-de-novo), so while you get a very cool way to classify your taxonomy, the clusters themselves are based on sequence length, not abundances. In short, qiime rescript dereplicate will produce the same set of 5,483 clusters observed with vsearch --clust_size, when using the fasta file exported directly from QIIME.

As with just about everything, @Nicholas_Bokulich was super helpful in clarifying why these differences arise, and it amounts to, it appears, the intended functionality of the program - see this thread.

Cheers

jcmcnch · October 9, 2020, 12:09am

Hi folks,

If you want to use the simple hacky workaround I alluded to before to preserve the UC file from qiime2, here's a way to do it (my previous instructions don't seem to work anymore and were probably not that clear anyway):

Follow the steps here to set up a qiime2 development conda environment.
Make sure you're in the environment - your shell should look something like this: (qiime2-dev) jesse@ubuntu-delltron:/testing $
Download my hacked* q2-vsearch env as follows: git clone git@github.com:jcmcnch/q2-vsearch.git, then enter the directory q2-vsearch.
Make and install the vsearch package: make dev && make test. You should get some warnings but no errors.
Now refresh the cache in your conda environment to make sure the updated q2-vsearch gets used when you do your clustering (NB: I only changed this for de novo clustering): qiime dev refresh-cache
Run de novo clustering in qiime2. Look in the text output for the location of the UC file and copy it somewhere to keep the info as illustrated in the image below
Optional - you can then parse the UC file using another script I've put together here to get membership info and a summary (i.e. which ASVs go into each centroid found in the qiime2 table output). The script has some basic documentation about how to run it which you can access by passing the --help flag.

*I only changed one line in one python file so the UC temp file doesn't get deleted. This workaround, although bush-league, will avoid you having to run VSEARCH outside of qiime2 which is pretty convenient in terms of data processing and will avoid the issues of comparability discussed above by Chantelle, Colin, and Devon.

chantelle.reid · October 14, 2020, 9:53am

Hi @jcmcnch,

This is great and I've been able to go all the way through until clustering. However, I can't seem to copy across the uc file output from the temporary files. Whenever I try to it just comes back with 'No such file or directory'. I then tried to access my temporary files using 'open $TMPDIR' to copy it across manually but there as no sign of the file in there?

I'm doing this all through the MacOS terminal. Do you know why it might be that I can't just use 'cp' in the terminal to move it across or find the file in the temporary folder?

I'll keep at it for now but if you have any suggestions it would be appreciated. I'll feel silly if it's something obvious....

Thanks,

Chantelle

jcmcnch · October 14, 2020, 4:55pm

Hi Chantelle,

I don't see a reason why it won't work on MacOS, cp should be the same with linux I think (but I haven't really used MacOS much so I'm not really sure). I did have the same problem as you before (which I guess is the default behavior of qiime - i.e. to delete the file). In my case, it was because the hacked script hadn't been updated in my qiime installation, so vsearch was still deleting the UC file. To fix it, what I did was run qiime dev refresh-cache. I know you've already done this but I guess it doesn't hurt to try again just in case it helps. Also worth double-checking that you're in the right environment by running which qiime which should show you're in the qiime2-dev environment. If neither of those two things point in the right direction then I have some other ideas about how to check.

HTH,

Jesse

difont10 · October 15, 2020, 1:35pm

Hi Jesse,

This is great -- thank you for posting the details instructions! I seem to be having issues at the 'make dev && make test' step. I am using a computing cluster so maybe my access to make direct changes is limited in this way. I get an error saying:

error: can't create or remove files in install directory

The following error occurred while trying to add or remove files in the
installation directory:

    [Errno 13] Permission denied: '/gpfs/runtime/opt/python/2.7.12/lib/python2.7/site-packages/test-easy-install-51319.write-test'

The installation directory you specified (via --install-dir, --prefix, or
the distutils default setting) was:

    /gpfs/runtime/opt/python/2.7.12/lib/python2.7/site-packages/

Perhaps your account does not have write access to this directory?  If the
installation directory is a system-owned directory, you may need to sign in
as the administrator or "root" account.  If you do not have administrative
access to this machine, you may wish to choose a different installation
directory, preferably one that is listed in your PYTHONPATH environment
variable.

So this tells me that it is an administrative issue that I can't make direct changes? Not sure if you have any thoughts on this matter but just wanted to check.

Thank you,
Diana

chantelle.reid · October 15, 2020, 10:20pm

@jcmcnch success! I misinterpreted your instructions and had switched back to using my regular qiime environment so it was deleting the temporary file. I'm going to attempt to parse the UC file (I have some significant OTUs that have come up in my study and I would like to know which ASVs/how many ASVs belong to it).

Really appreciate your help with this!

jcmcnch · October 15, 2020, 10:54pm

@chantelle.reid - fantastic, I'm glad this worked for you! It's something I wanted to do anyway for my own work, so it's good to see it could help someone else too. Let me know if the python script for parsing UC files gives you any issues too, would be happy to troubleshoot that as well.

chantelle.reid · October 16, 2020, 3:41am

@jcmcnch - I'll be honest I haven't used python much and I'm still pretty fresh to using the terminal

So far I've managed to download the script 'parse_VSEARCH_cluster_membership_from_UC_file.py' from github. I know you mention the script has documentation on how to run it (which can be viewed on github) but I am not familiar with using python scripts and it's not making a lot of sense to me.

My questions are:

Do I need to put this script into the same folder that contains my .uc file?
Do I also need to either add both of these to the same directory python runs from on my computer? In my case its located here: users/chantellereid/miniconda/bin and if I type in 'python' to my terminal it activates it
How do I then use/execute the script to process my .uc file to output with membership information/summary?

Thanks in advance!

Chantelle

jcmcnch · October 17, 2020, 1:49am

Hi Diana @difont10,

No problem, glad it is potentially useful to you. Yes, on the server you will be unable to access the system python unless you're an admin. But if you have been able to install the qiime2-dev environment, it should be possible to run the make steps in that environment. Then it will be accessing your local python not the system python as in the error you're getting.

HTH,
Jesse

jcmcnch · October 17, 2020, 2:00am

Hi @chantelle.reid - if you could follow my instructions above then this should be similarly easy!

It doesn't need to be in the same folder as the uc file but it is convenient to put it there.
You don't need the python script to be where python is installed.
Here's how I suggest you run it:

Put the python script in a folder with your UC file
In a terminal, you can run ./parse_VSEARCH_cluster_membership_from_UC_file.py --help to access the documentation. Some instructions are specific for our lab, you can ignore the bit about the hacked environment, that was the previous workaround. FYI, the first line of the script (#!/usr/bin/env python) means it'll access whichever python is in your current environment.
You can then run it as follows: ./parse_VSEARCH_cluster_membership_from_UC_file.py --input <your UC file goes here> --output_summary <name for your summary output> --output_lookup <a name for your lookup table>

Your output files should then be generated in the same directory, according to the name you provided.

If you want to know which ASV is in which OTU, you can just look in the summary output file - information on what the columns mean can be found the script documentation.

HTH,
Jesse

difont10 · October 19, 2020, 3:58pm

Hi Jesse,

Thank you for the help! I was able to get the uc file and parse it correctly from your .py script. Thanks a bunch, super helpful.
Best,
Diana

jcmcnch · October 19, 2020, 4:27pm

Hi Diana,

You're welcome! Glad it was useful to you.

Jesse

jcmcnch · August 17, 2022, 12:15am

Not as far as I'm aware @ntmarshall406. I did manage to recover the UC file in the hacky way that I detailed in another post (vsearch de novo output --uc option - #6 by devonorourke) but I don't think it's ever been implemented formally in the qiime2 options. I lack the programming expertise / time to really know how to integrate this change into the qiime2 workflow. But the hack does work last time I checked!

Jesse