Hi
My commands are as follows:
qiime tools import \
--input-path $2/seqs.fna \
--output-path $2/seqs.qza \
--type 'SampleData[Sequences]'
# import reference database gg13_8
qiime tools import \
--input-path gg_13_8_otus/rep_set/97_otus.fasta \
--output-path $2/97_otus.qza \
--type 'FeatureData[Sequence]'
# dereplication
mkdir -p $2/dereplicated
qiime vsearch dereplicate-sequences \
--i-sequences $2/seqs.qza \
--o-dereplicated-table $2/dereplicated/table.qza \
--o-dereplicated-sequences $2/dereplicated/rep-seqs.qza
# clustering 97% OTU
mkdir -p $2/clustered
# open reference
qiime vsearch cluster-features-open-reference \
--i-table $2/dereplicated/table.qza \
--i-sequences $2/dereplicated/rep-seqs.qza \
--i-reference-sequences $2/97_otus.qza \
--p-perc-identity 0.97 \
--o-clustered-table $2/clustered/table-or-97.qza \
--o-clustered-sequences $2/clustered/rep-seqs-or-97.qza \
--o-new-reference-sequences $2/clustered/new-ref-seqs-or-97.qza \
--p-threads 48
# remove chimera - open reference
# 1. run de novo chimera checking
qiime vsearch uchime-denovo \
--i-table $2/clustered/table-or-97.qza \
--i-sequences $2/clustered/rep-seqs-or-97.qza \
--output-dir $2/uchime-dn-out
# 2. visualize chimera result
qiime metadata tabulate \
--m-input-file $2/uchime-dn-out/stats.qza \
--o-visualization $2/uchime-dn-out/stats.qzv
# 3. Exclude chimeras but retain "borderline chimeras"
qiime feature-table filter-features \
--i-table $2/clustered/table-or-97.qza \
--m-metadata-file $2/uchime-dn-out/chimeras.qza \
--p-exclude-ids \
--o-filtered-table $2/uchime-dn-out/table-nonchimeric-w-borderline.qza
qiime feature-table filter-seqs \
--i-data $2/clustered/rep-seqs-or-97.qza \
--m-metadata-file $2/uchime-dn-out/chimeras.qza \
--p-exclude-ids \
--o-filtered-data $2/uchime-dn-out/rep-seqs-nonchimeric-w-borderline.qza
Q1: After these steps, with 159 samples, I end up with 74219 OTUs. Is that too many?
I found a similar question on the forum: Denoising vs OTU picking methods - #5 by Dchung. In that post the poster has 80000 OTUs, which is too high, so I am wondering whether the vsearch open-reference pipeline is reliable.
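For reference, the OTU count and per-sample depths can be read from a summary of the chimera-filtered table. This is a minimal sketch, assuming the output layout from the commands above. The function name `summarize_table`, the `QIIME` override variable, and the `.qzv` filename are placeholders made up for illustration.

```shell
# Sketch: summarize the chimera-filtered table to check the feature count.
# QIIME defaults to the real `qiime` CLI; the override exists only so the
# command can be previewed (e.g. QIIME=echo) without running anything.
summarize_table() {
    out_dir="$1"   # same directory that is passed as $2 in the commands above
    "${QIIME:-qiime}" feature-table summarize \
        --i-table "$out_dir/uchime-dn-out/table-nonchimeric-w-borderline.qza" \
        --o-visualization "$out_dir/uchime-dn-out/table-nonchimeric-w-borderline.qzv"
}
```

Calling `summarize_table "$2"` and opening the resulting `.qzv` with `qiime tools view` should show the number of features and the frequency per sample.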
Q2: Can I use the Greengenes trees for downstream diversity analysis?
I read a related forum question about how to build a phylogenetic tree after vsearch OTU clustering. It mentions that the Greengenes trees can be used directly if closed-reference clustering was used. In my case I used open-reference clustering, but I didn't use the de novo seqs. Can I still use the Greengenes trees directly?
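To make Q2 concrete, this is roughly how the Greengenes tree would be brought into QIIME 2, mirroring the reference import above. It is a sketch only: the path `gg_13_8_otus/trees/97_otus.tree` is an assumption about where the 97% tree sits in the gg_13_8 download, and `import_gg_tree`, the `QIIME` override, and the output filename are hypothetical. Whether this tree covers every OTU in an open-reference table is exactly the open question.

```shell
# Sketch: import the Greengenes 97% tree as a rooted phylogeny artifact.
# The input path is an assumed location inside the gg_13_8 download.
# QIIME defaults to the real `qiime` CLI and is overridable for a dry run.
import_gg_tree() {
    out_dir="$1"   # same directory that is passed as $2 in the commands above
    "${QIIME:-qiime}" tools import \
        --input-path gg_13_8_otus/trees/97_otus.tree \
        --output-path "$out_dir/97_otus_tree.qza" \
        --type 'Phylogeny[Rooted]'
}
```

With `QIIME=echo import_gg_tree "$2"` the command is only printed, which is a cheap way to check the paths before running the import.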
Thank you so much for helping!
Best,
Lu