Default parameters on vsearch join-pairs

cmarotz · October 17, 2018, 11:51pm

Hi,

I have paired-end data that I want to join using vsearch so that I can run my samples through deblur.
By default, vsearch join-pairs joins reads that have a minimum overlap of 10 bp, with a maximum of 10 mismatches??

--p-minovlen INTEGER RANGE Minimum overlap length of forward and
reverse reads for joining. [default: 10]

--p-maxdiffs INTEGER RANGE Maximum number of mismatches in the
forward/reverse read overlap for joining.
[default: 10]

I hope that this is a typo and that --p-maxdiffs means the maximum percentage of mismatches in the forward/reverse overlap (i.e., 1 bp in a 10 bp overlap)?

Lastly, I have seen in the tutorial and previous forum posts that quality-filtering is done after joining reads, but would it not be better to quality-filter before joining?
Thank you for your input!

Nicholas_Bokulich · October 18, 2018, 1:09pm

Hi @cmarotz,
Great questions!

You are correct that those two defaults do not make much sense together. We use them because this method is a wrapper around VSEARCH, which sets those values as its own defaults for read joining. In general, we try to keep default settings intact when wrapping external tools, but I agree this one might require some modification! @gregcaporaso may have other thoughts about this.
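
To make the concern concrete, here is a minimal Python sketch of the naive reading of these two defaults. It is not VSEARCH's merging algorithm, and it ignores any additional restrictions VSEARCH applies. `max_mismatch_fraction` is a hypothetical helper that only shows what an absolute mismatch cap implies at different overlap lengths.

```python
def max_mismatch_fraction(overlap_len: int, maxdiffs: int) -> float:
    """Largest fraction of overlap positions that may mismatch under an absolute cap.

    Hypothetical helper for illustration only; not part of QIIME 2 or VSEARCH.
    """
    if overlap_len <= 0:
        raise ValueError("overlap_len must be positive")
    # An overlap cannot contain more mismatches than it has positions.
    return min(maxdiffs, overlap_len) / overlap_len


# Compare the default cap (maxdiffs=10) across a range of overlap lengths.
for overlap_len in (10, 20, 50, 100, 250):
    frac = max_mismatch_fraction(overlap_len, maxdiffs=10)
    print(f"overlap {overlap_len:>3} bp: up to {frac:.0%} of positions may mismatch")
```

Under this naive reading, a 10 bp overlap could mismatch at every position, while a 100 bp overlap tolerates at most 10%, which is why an absolute cap behaves so differently from a percentage cap on short overlaps.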

The VSEARCH manual appears to wave away the concern about these clashing parameters:

v2.6.0 released November 10th, 2017
Rewritten paired-end reads merger with improved accuracy. Decreased default value for
fastq_minovlen option from 16 to 10. The default value for the fastq_maxdiffs option is
increased from 5 to 10. There are now other more important restrictions that will avoid
merging reads that cannot be reliably aligned.

But it is not really clear what those restrictions are. You can read the manual to learn more.

This is a good example of why it's usually good not to just run commands with the default parameters — these settings are intended to be adjusted, and are not necessarily "optimal" for all situations.

If you follow that tutorial and use deblur or OTU clustering, then yes, quality filtering is done after joining. But if you use dada2, quality filtering/correction is done prior to joining in the dada2 denoising pipeline.

I agree with you, and prefer the dada2 approach to joining, since filtering on the raw reads just makes more sense to me. However, others prefer joining first since it performs its own kind of quality filtering — improving prediction of the overlap sequence by aligning two reads. E.g., see here and the linked paper for more description of how sequence (and quality) predictions are made in overlap regions using VSEARCH.
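
As a rough illustration of how joining can act as its own kind of quality filtering, here is a deliberately simplified Python sketch. It is not the formula VSEARCH uses. It assumes the two overlap segments are already aligned, equal in length, and in the same orientation, and it simply keeps the higher-quality base wherever the reads disagree. `consensus_overlap` is a hypothetical name.

```python
from typing import List, Tuple


def consensus_overlap(
    fwd_bases: str,
    fwd_quals: List[int],
    rev_bases: str,
    rev_quals: List[int],
) -> Tuple[str, int]:
    """Return (consensus sequence, mismatch count) for an aligned overlap.

    Simplified rule (an assumption, not VSEARCH's method): where the two
    reads disagree, keep the base with the higher Phred score.
    """
    lengths = {len(fwd_bases), len(fwd_quals), len(rev_bases), len(rev_quals)}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same length")

    merged = []
    mismatches = 0
    for fb, fq, rb, rq in zip(fwd_bases, fwd_quals, rev_bases, rev_quals):
        if fb == rb:
            merged.append(fb)
        else:
            mismatches += 1
            # Ties go to the forward read.
            merged.append(fb if fq >= rq else rb)
    return "".join(merged), mismatches


# The third position disagrees; the reverse read has the higher score there.
seq, n_diff = consensus_overlap("ACGT", [30, 30, 10, 30], "ACTT", [30, 30, 35, 30])
print(seq, n_diff)
```

A low-quality base in one read is replaced by the confident call from its mate, which is the sense in which joining first can correct errors that per-read filtering would only discard.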

I hope that helps!

cmarotz · October 18, 2018, 6:28pm

Thanks Nicholas!

This documentation is helpful. I am playing around with the joining parameters now.

Also, I think I found the answer to my second question: when I ran qiime quality-filter q-score on my paired-end sequences, the output was SampleData[SequencesWithQuality], which could not be used for joining since it's no longer 'paired-end'. Maybe it is hard to quality-filter paired-end reads while keeping them in order for joining, and that's why people quality-filter after joining?

Nicholas_Bokulich · October 18, 2018, 6:37pm

That may be correct, though it's not intractable (I did not write that action so don't know the rationale)... if QC removes only one read of a pair, you also need to drop its mate! So one possible rationale is that joining prior to QC allows a higher-quality sequence to "save" the lower-quality sequence so that the pair is retained.
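
A minimal Python sketch of that pair-synchronised filtering follows. It is not how any QIIME 2 action is implemented. The mean-quality rule, the threshold of 20, and the names `mean_quality` and `filter_pairs` are all assumptions chosen for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

# A read is (id, sequence, per-base Phred scores); reads are assumed non-empty.
Read = Tuple[str, str, List[int]]


def mean_quality(read: Read) -> float:
    """Mean Phred score of a read (hypothetical QC metric)."""
    _, _, quals = read
    return sum(quals) / len(quals)


def filter_pairs(
    pairs: Iterable[Tuple[Read, Read]], min_mean_q: float = 20.0
) -> Iterator[Tuple[Read, Read]]:
    """Yield only pairs in which BOTH mates pass QC.

    If either mate fails, the whole pair is dropped, so the forward and
    reverse reads stay in the same order for joining.
    """
    for fwd, rev in pairs:
        if mean_quality(fwd) >= min_mean_q and mean_quality(rev) >= min_mean_q:
            yield fwd, rev


pairs = [
    (("r1", "ACGT", [30, 30, 30, 30]), ("r1", "TTGA", [32, 31, 30, 29])),
    # Good forward read, poor reverse read: the whole pair is dropped.
    (("r2", "ACGA", [35, 35, 35, 35]), ("r2", "TCGA", [8, 9, 10, 7])),
]
kept = list(filter_pairs(pairs))
print([fwd[0] for fwd, _ in kept])  # ['r1']
```

Pair r2 shows the cost described above: its high-quality forward read is lost because its mate fails, whereas joining first would have given that forward read a chance to rescue the overlap.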

Also, note that qiime vsearch join-pairs exposes parameters for basic filtering (e.g., trim based on q-score). So it is possible to do some cursory QC prior to joining.
