I am following the tutorial below and have several questions and problems with it:
Clustering sequences into OTUs using q2-vsearch (QIIME 2 2023.9.2 documentation)
Q1: In the q2-vsearch plugin, the output of each clustering mode (de novo, closed-reference, open-reference) looks very similar to the output of the dada2 denoising plugin. Since vsearch and dada2 both cluster the data, is there any difference between the two plugins? Does dada2 use one of these clustering methods (de novo, closed-reference, open-reference)?
Q2: In the tutorial above, the reference-based clustering uses a very small reference set (85_otus), whereas in practice we would certainly want to work with a more complete rRNA reference resource. If I want to cluster against reference sequences at a different identity level (say 99%), how should I proceed? Should I use the 2022.10.backbone.full-length.fna.qza file from this link?
Q3: I found someone mentioning a similar question in the forums.
how can i get a reference-seqs in closeed-reference clusttering - User Support - QIIME 2 Forum
Should we use the marker gene reference databases when doing reference-based clustering? If I want to create a new reference artifact for clustering myself, can you point me to some tutorials?
Alternatively, if what I want is to reproduce and learn how the gg2 data resources are constructed, which sources should I refer to?
P1: My data are single-end reads with quality scores from dozens of samples. I imported them as an artifact in SingleEndFastqManifestPhred33V2 format, using the file paths listed in a manifest file. However, I hit a problem when I process them according to the vsearch tutorial:
qiime vsearch dereplicate-sequences \
  --i-sequences total.qza \
  --p-min-seq-length 50 \
  --p-min-unique-size 10 \
  --o-dereplicated-table table.qza \
  --o-dereplicated-sequences rep-seqs.qza
Plugin error from vsearch:
Mapping not provided for observation identifier: D1_10122. If this identifier should not be updated, pass strict=False.
Debug info has been saved to /tmp/qiime2-q2cli-err-62rjm8jn.log
I had to remove the --p-min-seq-length and --p-min-unique-size parameters to get the command to run:
qiime vsearch dereplicate-sequences \
  --i-sequences total.qza \
  --o-dereplicated-table table.qza \
  --o-dereplicated-sequences rep-seqs.qza
Why does this problem occur?
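For context, the import step described in P1 corresponds roughly to the sketch below. The sample IDs, the file paths, and the name manifest.tsv are placeholders for illustration, not the actual data; only total.qza matches the commands above. The import is skipped when QIIME 2 is not available in the current environment.

```shell
# Build a single-end V2 manifest: tab-separated, with the two column names
# expected for single-end data. Sample IDs and paths are hypothetical.
printf 'sample-id\tabsolute-filepath\n' > manifest.tsv
printf 'sample-1\t%s/reads/sample-1.fastq.gz\n' "$PWD" >> manifest.tsv
printf 'sample-2\t%s/reads/sample-2.fastq.gz\n' "$PWD" >> manifest.tsv

# Import the reads listed in the manifest as a single-end artifact.
# Guarded so the snippet does nothing harmful when qiime is not installed.
if command -v qiime >/dev/null 2>&1; then
  qiime tools import \
    --type 'SampleData[SequencesWithQuality]' \
    --input-path manifest.tsv \
    --input-format SingleEndFastqManifestPhred33V2 \
    --output-path total.qza
fi
```

The manifest needs one row per sample, and the paths must be absolute.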
P2: For species composition analysis, my workflow is: vsearch dereplicate-sequences, vsearch cluster-features-de-novo, feature-classifier classify-sklearn, taxa barplot. I'm not sure whether this is a valid approach. I also found that in the feature-classifier classify-sklearn step I can only use gg2's resources; with silva's resources I get an error. Why is this? Am I doing something wrong?
time qiime feature-classifier classify-sklearn \
  --i-reads rep-seqs-dn-99.qza \
  --i-classifier silva-138-99-515-806-nb-classifier.qza \
  --p-n-jobs -1 \
  --p-confidence 0.85 \
  --o-classification taxo.qza
Plugin error from feature-classifier:
Could not pickle the task to send it to the workers.
Debug info has been saved to /tmp/qiime2-q2cli-err-12y_j9ih.log
I also tried removing the --p-n-jobs and --p-confidence parameters, but that did not solve the problem; this time the process was killed:
time qiime feature-classifier classify-sklearn \
  --i-reads rep-seqs-dn-99.qza \
  --i-classifier silva-138-99-515-806-nb-classifier.qza \
  --o-classification taxo.qza
Killed
real 2m40.337s
user 2m12.522s
sys 0m19.820s
These problems don't happen when I use gg-13-8-99-515-806-nb-classifier.qza.