I have a question regarding VSEARCH and how it detects chimeras as implemented in Q2. In DADA2 we can choose whether detection should be limited to each sample (consensus), or whether all samples should be pooled before identifying chimeric sequences. So what does VSEARCH do? Per sample? Per run? Per what-ever-is-in-your-qza-file?
Good questions. q2-vsearch has two different chimera filtering methods: reference-based (using vsearch’s uchime_ref method) and de novo (using vsearch’s uchime_denovo method).
For both of these, I believe chimera checking will occur on a what-ever-is-in-your-qza-file basis, since this is performed on sequences in a FeatureData[Sequence] artifact. A feature table is used as input to determine the frequency of each sequence, but as far as I can tell the sequences are still passed to vsearch all together.
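To make the "all together" point concrete, here is a minimal Python sketch of how per-sample counts could be collapsed into a single abundance per sequence before chimera checking. The table layout, the function names `pooled_sizes` and `fasta_with_sizes`, and the example IDs are all hypothetical; this illustrates the idea and is not q2-vsearch's actual code. The `;size=` header annotation is the convention vsearch uses for abundances.

```python
from collections import Counter


def pooled_sizes(table):
    """Sum each feature's counts over every sample in the table.

    `table` is a hypothetical stand-in for a feature table:
    {sample_id: {feature_id: count}}.
    """
    totals = Counter()
    for counts in table.values():
        totals.update(counts)  # adds counts per feature ID
    return dict(totals)


def fasta_with_sizes(seqs, sizes):
    """Yield FASTA records whose headers carry the pooled abundance."""
    for feature_id, seq in seqs.items():
        yield f">{feature_id};size={sizes.get(feature_id, 0)}\n{seq}\n"


# Two samples from different runs end up in one pooled abundance per feature.
table = {
    "run1-sampleA": {"f1": 90, "f2": 5},
    "run2-sampleB": {"f1": 10, "f3": 40},
}
seqs = {"f1": "ACGTACGT", "f2": "ACGTTTTT", "f3": "GGGGTTTT"}

sizes = pooled_sizes(table)
fasta = "".join(fasta_with_sizes(seqs, sizes))
```

Under that assumption, sample and run boundaries are gone by the time the sequences reach vsearch; only the pooled abundance of each sequence remains.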
For more details on what vsearch does with those sequences, see the vsearch docs.
I started wondering about this because we have several runs, each going through DADA2 separately. If we wish to use vsearch for further chimera detection, I am wondering what we risk by choosing the effortless solution: merging all runs and then running vsearch once on the large dataset (vsearch’s uchime_denovo), rather than repeating this step for each run (I think we can end up having at least 30). Do you think this could result in a high rate of false-positive chimeric sequences?
The vsearch devs have a recommendation about this! They call it an 'Open Question' but I'm pretty sure everyone does it at the study level, which I think is the effortless solution you mention.
Merging runs might also reduce the false-negative rate, because the additional coverage of sparse OTUs makes it more likely that the parents of a chimera will be in the database, so the child chimera can be detected and removed.
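Here is a toy Python sketch of that false-negative argument. It is a deliberately simplified stand-in, not vsearch's uchime_denovo algorithm: a sequence is flagged only if it is an exact prefix of one more-abundant sequence joined to an exact suffix of another. The function names, the `min_ratio` threshold, and the example sequences are all made up for illustration.

```python
from collections import Counter


def is_two_parent_chimera(query, parents):
    """True if `query` is an exact prefix of one parent plus a suffix of another."""
    for left in parents:
        for right in parents:
            if left == right:
                continue
            for cut in range(1, len(query)):
                if left.startswith(query[:cut]) and right.endswith(query[cut:]):
                    return True
    return False


def flag_chimeras(abundances, min_ratio=2.0):
    """Flag sequences explained by two parents at least `min_ratio` times more abundant."""
    flagged = set()
    for seq, size in abundances.items():
        parents = [p for p, n in abundances.items()
                   if p != seq and n >= min_ratio * size]
        if is_two_parent_chimera(seq, parents):
            flagged.add(seq)
    return flagged


PARENT_A, PARENT_B, CHIMERA = "AAAAAAAA", "CCCCCCCC", "AAAACCCC"

# Assumption for the toy: PARENT_B was too sparse in run 1 to survive denoising.
run1 = {PARENT_A: 100, CHIMERA: 5}
run2 = {PARENT_A: 20, PARENT_B: 80}

pooled = dict(Counter(run1) + Counter(run2))

per_run_flags = flag_chimeras(run1) | flag_chimeras(run2)
pooled_flags = flag_chimeras(pooled)
```

In this toy setup the per-run check misses the chimera because one of its parents is absent from run 1, while the pooled check finds both parents and flags it. It says nothing about the false-positive side of the question, which is the part the vsearch devs leave open.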
I'm not sure which is best... but it's a good question!