Thanks a lot for clearing all that up for me - very useful! I would like to take this further, though…
I am experimenting with different filtering parameters to try to get the most biologically meaningful representation of the features in each sample, but I am struggling to decide which variables and values produce the best feature library. With DADA2 I get the following outputs:
With vsearch I get the following, under different parameter values (see table headings):
The taxonomy files are as follows:
DADA2 = 270-220-rep-seqs-taxonomy.qzv
Vsearch = paired-end-demux-trimmed-joined-filtered-q30-600-rep-seqs-uchime-ref-out-rep-seqs-nonchimeric-w-borderline-taxonomy.qzv
The associated commands are in the attached file (filtering commands.txt - please ignore file names, they are just different iterations of filtering)
The taxonomic classifications produced by the two command workflows (DADA2 vs vsearch) are dramatically different (see screenshots and files) and seem to contradict the filtering statistics. DADA2 appears to contain far MORE diversity (as in individual ASVs), whereas the vsearch files appear to contain only a relatively small number of OTU classifications considering how many sequences are retained through filtering. How can this be?

Any suggestions on how I should investigate this difference and what is driving it? Is it the filtering parameters or the clustering? Could a difference of this magnitude result just from the difference in clustering algorithm (sequence variation and an error-correction model vs a 98% similarity threshold)? Or do the vsearch files simply retain a huge number of replicates, error repeats, or noisy sequences that are all grouped under the same taxonomic affiliation/consensus sequence?

In that sense, it looks like the samples contain loads of trash, and DADA2 is simply removing that noise and retaining sequence diversity correctly relative to vsearch (where it looks like the filtering is done under the wrong parameters). However, the initial quality statistics show the data is of pretty good quality, and without examining it in detail I do not know how to answer these questions or figure out where or what is happening. Do you have any suggestions on how I can investigate this anomaly further, or how I should go about addressing this problem and finding a solution?
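One check I could imagine running to test the "many sequences collapsed under one taxonomy label" idea is to count, for each taxon, how many features carry that label and how many reads they hold in total, then compare the two workflows side by side. Below is only a rough sketch of that idea: the function name, feature IDs, taxon labels, and counts are all made up for illustration, and in practice the two dictionaries would have to be filled from the exported taxonomy and feature-table files.

```python
from collections import defaultdict


def summarize_by_taxon(taxonomy, counts):
    """Return {taxon: (number_of_features, total_reads)}.

    taxonomy: dict mapping feature ID -> taxon label
    counts:   dict mapping feature ID -> total read count across samples
    """
    summary = defaultdict(lambda: [0, 0])
    for feature_id, taxon in taxonomy.items():
        summary[taxon][0] += 1                          # one more feature with this label
        summary[taxon][1] += counts.get(feature_id, 0)  # add its reads (0 if missing)
    return {taxon: tuple(values) for taxon, values in summary.items()}


# Hypothetical toy data, NOT taken from my files.
taxonomy = {"f1": "g__Bacteroides", "f2": "g__Bacteroides", "f3": "g__Prevotella"}
counts = {"f1": 120, "f2": 30, "f3": 50}

summary = summarize_by_taxon(taxonomy, counts)
for taxon, (n_features, total_reads) in sorted(summary.items()):
    print(taxon, n_features, total_reads)
```

If the vsearch workflow showed very few features but very large read totals per taxon, while DADA2 showed many features with smaller totals, that would at least suggest where the two workflows diverge.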
Additionally, I ran the DADA2 ASV file through the vsearch open-reference clustering command to cross-check clustering granularity. The feature count drops from 4,944 to 4,249, but feature abundance is retained. I assume this is due to the different strategies for dealing with sequences within 2% similarity (using the 98% threshold), but does this help clarify what is happening with the filtering? I think this is quite a large reduction in ASVs, but is a reduction of this magnitude normal (i.e. would we expect about 700 of 4,944 sequences to be within 2% similarity of each other in microbiome samples like ours)? Does this indicate that the different filtering parameters, rather than the clustering, are generating the difference in taxonomies?
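To put a number on that reduction, here is the arithmetic as a small sketch. It uses only the two feature counts quoted above; the variable names are my own.

```python
# Feature counts quoted in this post.
dada2_features = 4944      # ASVs produced by DADA2
clustered_features = 4249  # features left after open-reference clustering at 98%

merged = dada2_features - clustered_features  # ASVs absorbed into another feature
fraction_merged = merged / dada2_features

print(f"Features merged by clustering: {merged}")
print(f"Fraction of ASVs merged: {fraction_merged:.1%}")
```

So the clustering step merges 695 ASVs, or roughly 14% of the DADA2 features.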
filtering commands.txt (5.7 KB)
All the files referenced are in this OneDrive link:
If you could give any advice, suggestions or comments at all that would be amazing - I am really lost here!
Thanks and have a great day guys,