Many questions about clustering in the qiime vsearch plugin

I am referring to the following tutorial and have several questions and problems with it:

Clustering sequences into OTUs using q2-vsearch β€” QIIME 2 2023.9.2 documentation

Q1: In the qiime vsearch plugin, the output obtained from the different clustering methods (de novo / closed-reference / open-reference) is very similar to the output obtained from the dada2 denoise plugin. Since vsearch clustering and dada2 denoising both group the data, is there any difference between the two plugins? Does dada2 use one of these clustering methods (de novo / closed-reference / open-reference)?

Q2: In the above tutorial, the reference-based clustering analysis uses a very small 85% OTUs dataset, but in actual practice we would surely operate on a larger, rawer rRNA reference. If I want to cluster against reference sequences at a different identity (say 99%), how should I proceed? Should I use the 2022.10.backbone.full-length.fna.qza file from this link?

Index of /greengenes_release/2022.10 (microbio.me)

Q3: I found someone mentioning a similar question in the forums.

how can i get a reference-seqs in closeed-reference clusttering - User Support - QIIME 2 Forum

Should we use marker gene reference databases when doing reference-based clustering? If I want to create a new artifact for reference clustering, can you point me to some tutorials?

Or, if what I want is to reproduce and learn how the gg2 data sources are constructed, which sources should I refer to?

P1: The data I'm going to work with is single-end reads with quality scores from dozens of samples. I imported them as an artifact in the SingleEndFastqManifestPhred33V2 format, via file paths in a manifest file. But I ran into a problem when processing them according to the vsearch tutorial.

qiime vsearch dereplicate-sequences \
  --i-sequences total.qza \
  --p-min-seq-length 50 \
  --p-min-unique-size 10 \
  --o-dereplicated-table table.qza \
  --o-dereplicated-sequences rep-seqs.qza

Plugin error from vsearch:
Mapping not provided for observation identifier: D1_10122. If this identifier should not be updated, pass strict=False.
Debug info has been saved to /tmp/qiime2-q2cli-err-62rjm8jn.log

I had to delete some parameters to get it to run.

qiime vsearch dereplicate-sequences \
  --i-sequences total.qza \
  --o-dereplicated-table table.qza \
  --o-dereplicated-sequences rep-seqs.qza

Why does this problem occur?

P2: When I do species composition analysis, my workflow is vsearch dereplicate-sequences → vsearch cluster-features-de-novo → feature-classifier classify-sklearn → taxa barplot. I'm not sure if it's OK to do this. But I found that in the feature-classifier classify-sklearn step, I can only use gg2's resources; I get an error when using SILVA's resources. Why is this? Am I doing something wrong?

time qiime feature-classifier classify-sklearn \
  --i-reads rep-seqs-dn-99.qza \
  --i-classifier silva-138-99-515-806-nb-classifier.qza \
  --p-n-jobs -1 \
  --p-confidence 0.85 \
  --o-classification taxo.qza

Plugin error from feature-classifier:
Could not pickle the task to send it to the workers.
Debug info has been saved to /tmp/qiime2-q2cli-err-12y_j9ih.log

I tried deleting some parameters, but that didn't solve the problem.

time qiime feature-classifier classify-sklearn \
  --i-reads rep-seqs-dn-99.qza \
  --i-classifier silva-138-99-515-806-nb-classifier.qza \
  --o-classification taxo.qza

Killed
real 2m40.337s
user 2m12.522s
sys 0m19.820s

These problems don't happen when I use gg-13-8-99-515-806-nb-classifier.qza.

Good afternoon KonradV!

I appreciate your detailed questions and dedication to learning about this process. There is a lot going on inside each of these steps.

Here, I want to share with you the academic history behind this pipeline:

Both DADA2 and vsearch make feature tables. These are also called 'contingency tables' and, for each sample, list the count of each feature (ASV or OTU) in that sample.
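For example, a toy feature table with made-up counts might look like this:

Feature-ID   Sample-A   Sample-B
ASV-1        120        0
ASV-2        5          83
ASV-3        0          42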

Over the last 10 years, this field has been moving away from OTUs toward ASVs. The feature table keeps the same structure, but how we define each feature is very different and serves different goals.

De novo, closed-reference, and open-reference are methods for making OTUs.
DADA2 makes ASVs. Its method is de novo (Latin: 'anew, from the beginning').

My recommendation is to use DADA2 instead of VSEARCH for two reasons:

  • I agree with the conclusions of Callahan et al. 2017, "Exact sequence variants should replace operational taxonomic units in marker-gene data analysis"; ASVs should replace OTUs.
  • You can make ASVs without a database. This decouples the feature creation step from the taxonomy annotation step, so you can then try multiple different databases.

If you want to use this pipeline, it would look like this:
dada2 → feature-classifier classify-sklearn → taxa barplot
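A minimal sketch of that pipeline for single-end reads; the file names and the trim/trunc values below are placeholders you would choose from your own demux quality plot:

# denoise single-end reads into ASVs
# (reads shorter than trunc-len are discarded)
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left 0 \
  --p-trunc-len 250 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats stats.qza

# assign taxonomy with a pre-trained classifier of your choice
qiime feature-classifier classify-sklearn \
  --i-reads rep-seqs.qza \
  --i-classifier classifier.qza \
  --o-classification taxonomy.qza

# visualize the composition of each sample
qiime taxa barplot \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --m-metadata-file metadata.tsv \
  --o-visualization taxa-barplot.qzv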

This is covered in great detail in the QIIME 2 conceptual overview figure.

On that same page, options for taxonomy assignment are discussed:
https://docs.qiime2.org/2023.9/tutorials/overview/#taxonomy-classification-and-taxonomic-analyses

This leads to your database questions, which we can discuss next.

Thank you for bringing these excellent questions to the forums! :qiime2:


Hi @KonradV,

In regards to this question:

That is a reasonable file to cluster against, and the correct one for Greengenes2 2022.10.
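A sketch of what closed-reference clustering against that backbone could look like at 99% identity (the rep-seqs/table inputs come from your dereplication step, and the output names are placeholders):

qiime vsearch cluster-features-closed-reference \
  --i-sequences rep-seqs.qza \
  --i-table table.qza \
  --i-reference-sequences 2022.10.backbone.full-length.fna.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table table-cr-99.qza \
  --o-clustered-sequences rep-seqs-cr-99.qza \
  --o-unmatched-sequences unmatched.qza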

All the best,
Daniel


P1: on dereplication and removing features

qiime vsearch dereplicate-sequences \
  --i-sequences total.qza \
  --p-min-seq-length 50 \
  --p-min-unique-size 10 \
  --o-dereplicated-table table.qza \
  --o-dereplicated-sequences rep-seqs.qza

Plugin error from vsearch:
Mapping not provided for observation identifier: D1_10122. If this identifier should not be updated, pass strict=False.
Debug info has been saved to /tmp/qiime2-q2cli-err-62rjm8jn.log

Dereplication is used to reduce data size before clustering.
Typically, this is a lossless process for features; all unique reads become unique features in the output table.
In this example, it's a lossy process; features are discarded if they are too short (<50 bp) or too rare (total count <10). This causes the error about missing feature identifiers.
Keeping all features fixes the error.
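Concretely, a sketch: dereplicate with no filters, then do the filtering afterwards with q2-feature-table (the min-frequency value mirrors your original command):

qiime vsearch dereplicate-sequences \
  --i-sequences total.qza \
  --o-dereplicated-table table.qza \
  --o-dereplicated-sequences rep-seqs.qza

# afterwards, drop features seen fewer than 10 times in total
qiime feature-table filter-features \
  --i-table table.qza \
  --p-min-frequency 10 \
  --o-filtered-table table-filtered.qza

# keep the representative sequences in sync with the filtered table
qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --i-table table-filtered.qza \
  --o-filtered-data rep-seqs-filtered.qza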

P2: pickling?? :cucumber:

This can happen when the device is out of space. Databases are pretty big, so make sure you have plenty of space on the computer or worker node!
To investigate more, can you post that log file? ("Debug info has been saved to...")
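It could also be memory: "Killed" is usually the Linux out-of-memory killer, and the SILVA classifier needs far more RAM than gg-13-8, especially with --p-n-jobs -1. If disk space checks out, here's a sketch that trades speed for a smaller memory footprint (the batch size is just an example value):

time qiime feature-classifier classify-sklearn \
  --i-reads rep-seqs-dn-99.qza \
  --i-classifier silva-138-99-515-806-nb-classifier.qza \
  --p-n-jobs 1 \
  --p-reads-per-batch 1000 \
  --o-classification taxo.qza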


I am using WSL2 on Windows 11 with Ubuntu 22.04. I am very sorry: I tried the find command and searched for the .log file in a Windows window, but could not find it. Can you suggest where qiime2 stores its .log files by default?


Oh... I know where the log file is, but I've lost it because so much time has passed. I'll bring it up again in another question!

I'm sorry I've made this post span so much time.
Back to P1. Do you mean that "qiime vsearch dereplicate-sequences" is not able to generate or process lossy data? Then why add such optional parameters to this plugin?

My results include some sequences that are quite short, and I don't want them to contaminate my downstream analyses. What plugin or command should I use to delete sequences that are neither abundant enough nor long enough?

For my own notes, here's your original P1

qiime vsearch dereplicate-sequences \
  --i-sequences total.qza \
  --p-min-seq-length 50 \
  --p-min-unique-size 10 \
  --o-dereplicated-table table.qza \
  --o-dereplicated-sequences rep-seqs.qza

Plugin error from vsearch:
Mapping not provided for observation identifier: D1_10122. If this identifier should not be updated, pass strict=False.
Debug info has been saved to /tmp/qiime2-q2cli-err-62rjm8jn.log

The earlier parts of the pipeline may be able to remove features, but filtering early can break later parts of the pipeline. (Maybe this is a bug in the QIIME 2 vsearch plugin?)

You can filter out short reads while running DADA2 or after denoising. No vsearch required!
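For example, after denoising you could drop short representative sequences with q2-feature-table; the 250 bp threshold below is only an example:

# a FeatureData[Sequence] artifact can serve as its own metadata,
# so a 'where' clause can filter on sequence length
qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --m-metadata-file rep-seqs.qza \
  --p-where 'length(sequence) >= 250' \
  --o-filtered-data rep-seqs-long.qza

# then keep only the remaining features in the table
qiime feature-table filter-features \
  --i-table table.qza \
  --m-metadata-file rep-seqs-long.qza \
  --o-filtered-table table-long.qza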

Hi, @colinbrislawn
Good Monday morning! I reran the command and got the log file. This is the log for the vsearch dereplicate-sequences error:

Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --derep_fulllength /tmp/q2-QIIME1DemuxDirFmt-xciz9v5f/seqs.fna --output /tmp/q2-DNAFASTAFormat-_ttqrajr --relabel_sha1 --relabel_keep --uc /tmp/tmpgyqxhhs5 --xsize --minseqlength 250 --minuniquesize 10 --fasta_width 0

vsearch v2.22.1_linux_x86_64, 31.1GB RAM, 32 cores
GitHub - torognes/vsearch: Versatile open-source tool for microbiome analysis

Dereplicating file /tmp/q2-QIIME1DemuxDirFmt-xciz9v5f/seqs.fna 100%
226391583 nt in 894291 seqs, min 250, max 433, avg 253
minseqlength 250: 98 sequences discarded.
Sorting 100%
30180 unique sequences, avg cluster 29.6, median 1, max 261758
Writing FASTA output file 100%
Writing uc file, first part 100%
Writing uc file, second part 100%
1690 uniques written, 28490 clusters discarded (94.4%)
Traceback (most recent call last):
File "/home/lisiyang/anaconda3/envs/qiime2-amplicon-2023.9/lib/python3.8/site-packages/q2cli/commands.py", line 520, in call
results = self._execute_action(
File "/home/lisiyang/anaconda3/envs/qiime2-amplicon-2023.9/lib/python3.8/site-packages/q2cli/commands.py", line 581, in _execute_action
results = action(**arguments)
File "", line 2, in dereplicate_sequences
File "/home/lisiyang/anaconda3/envs/qiime2-amplicon-2023.9/lib/python3.8/site-packages/qiime2/sdk/action.py", line 342, in bound_callable
outputs = self.callable_executor(
File "/home/lisiyang/anaconda3/envs/qiime2-amplicon-2023.9/lib/python3.8/site-packages/qiime2/sdk/action.py", line 566, in callable_executor
output_views = self._callable(**view_args)
File "/home/lisiyang/anaconda3/envs/qiime2-amplicon-2023.9/lib/python3.8/site-packages/q2_vsearch/_cluster_sequences.py", line 45, in dereplicate_sequences
table.update_ids(id_map=id_map, axis='observation')
File "/home/lisiyang/anaconda3/envs/qiime2-amplicon-2023.9/lib/python3.8/site-packages/biom/table.py", line 1410, in update_ids
raise TableException(
biom.exception.TableException: Mapping not provided for observation identifier: D1_10122. If this identifier should not be updated, pass strict=False.

Under the dada2 plugin, I think sequence length is controlled with the trunc and trim parameters. Are reads that don't reach the required length deleted? Also, I don't see a parameter under the dada2 plugin for filtering on the number of duplicate sequences.

That log file gives additional information about how the vsearch command filters data.
The vsearch part of this command finishes successfully; the second step then fails because the discarded features are missing from the ID map used to relabel the feature table.

That's right! After you import data, you can run DADA2, and the DADA2 pipeline takes care of filtering, joining, and dereplicating. With denoise-single, reads shorter than the trunc-len value are discarded, and there is no duplicate-count filter because DADA2's error model decides which rare sequences are real.

It looks like these reads have been joined. It is best to run DADA2 on reads before they have been joined.
