Thank you @wasade and @colinbrislawn for your comments. I will definitely try DADA2 -> ASR, but I would still like to know what happened to my vsearch run.
@wasade, I made a mistake in counting the % recruitment. I reran with several parameter settings, and my updated stats are described below:
First, my data is here: (link removed) (730.3 KB). Here are my clustered_table.qzv (343.8 KB) and unmatched_sequences.qzv (1.5 MB) from my Greengenes 99_otus --p-perc-identity 1 run. My vsearch command is:
qiime vsearch cluster-features-closed-reference --i-sequences rep-seqss/rep-seqs_subbatch-1.trunc430_ee6.qza --i-table tables/table_subbatch-1.trunc430_ee6.qza --i-reference-sequences 99_otus.qza --p-perc-identity 1 --p-threads 0 --output-dir closedRef_forPICRUSt_subbatch-1
clustered_table.qzv reports that the number of features is 1383; I count this as the number of recruited features. I don't know how to count the unrecruited features directly. Instead, I estimated a theoretical maximum number of features by clustering against 99_otus with a very low --p-perc-identity of 0.8, so that all reads are clustered (i.e. unmatched_sequences.qza is empty), and subtracted the number of recruited features from this maximum. Please see this table (and a lazy graph) for the results of my benchmarking:
For example, the maximum number of OTUs for 99_otus is 3825 (trial 4). When running with --p-perc-identity 1, the number of unrecruited features is 3825 - 1383 = 2442, so the % recruitment is 1383/(1383+2442) = 36%.
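The arithmetic above can be written as a short Python sketch. The function name `percent_recruitment` is made up for illustration; the numbers are the ones quoted in this post.

```python
def percent_recruitment(recruited, max_features):
    """Percentage of the theoretical maximum number of features that was recruited."""
    return 100.0 * recruited / max_features


recruited = 1383      # features in clustered_table.qzv at --p-perc-identity 1
max_features = 3825   # features at --p-perc-identity 0.8 (trial 4), nothing unmatched

# Unrecruited features are inferred by subtraction, not counted directly.
unrecruited = max_features - recruited

print(f"unrecruited features: {unrecruited}")
print(f"% recruitment: {percent_recruitment(recruited, max_features):.0f}%")
```

This only reproduces the calculation; it assumes the 0.8 identity run really does capture every feature that could be recruited.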
The 16% I previously reported came from counting the unrecruited features differently: I exported unmatched_sequences.qzv to get a dna-sequences.fasta file (you can get this by viewing unmatched_sequences.qzv and downloading the FASTA from there), made sure it doesn't contain duplicate sequences, and then simply counted the sequences, which gave 7319. So the old % recruitment was 1383/(1383+7319) = 16%, but that is clearly wrong: the sum of recruited and unrecruited features is inconsistent across different --p-perc-identity values.
You may have noticed from the name of my rep-seqs file that I allowed --p-max-ee 6 for the reads. Would this have affected your estimation?
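For anyone who wants to repeat the duplicate check and the count on the downloaded FASTA, here is a minimal sketch in plain Python. The helper names `read_fasta` and `count_sequences` are hypothetical, and this is not necessarily how the original count was done.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file, joining wrapped lines."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_sequences(path):
    """Return (total records, distinct sequences), ignoring letter case."""
    seqs = [seq.upper() for _, seq in read_fasta(path)]
    return len(seqs), len(set(seqs))
```

Calling `count_sequences("dna-sequences.fasta")` returns two numbers; if they are equal, the file contains no duplicate sequences.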
I am using Greengenes 13_8.
In the vsearch command, the --i-reference-sequences input is from gg_13_8_otus/rep_set/99_otus.fasta.