Question regarding parameters used in qiime vsearch join-pairs

cookingrice · November 3, 2019, 1:47am

Dear all, I am unsure as to what are the 'best' values to use for the following parameters: minovlen, maxdiffs, maxee.

The default value for minovlen is 10 but I have seen people using 50 - what is the significance of a higher value? Does a higher value yield better quality? Also, does it make a difference if the paired-end reads are highly overlapping? E.g. the paired-end reads (250bp) for the V4 region (290bp) - why can't the value be 250?

For maxdiffs, does a higher value mean that the result is less accurate since I am taking more mismatches into account?

Finally, I don't really understand what maxee means - 'Maximum number of expected errors in the joined read to be retained.'

Thank you in advance for your help.

colinbrislawn · November 4, 2019, 2:06pm

Hello @cookingrice,

Welcome to the forums! These are great questions about read pairing.

minovlen tells the read-joining program (vsearch, in this case) to only accept overlaps that are at least this long. So if you expect to have 50 bp of overlap between your reads, you don't want vsearch to return an overlap of only 10 bp, as that's probably wrong.

Good catch! It totally does make a difference!
With EMP 16S V4 primers, the targeted region is about 250 bp long, so I expect ~50 bp of overlap from 150 bp reads and ~250 bp of overlap from 250 bp reads. I like to set minovlen to 30 bp and 220 bp respectively for these settings, just to give vsearch a little room in case of insertions or deletions.
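Here is a rough sketch of the arithmetic behind those overlap estimates. The function name and the simple formula are my own illustration, not something vsearch exposes, and it assumes both reads have the same length and cover the amplicon end to end.

```python
def expected_overlap(read_len: int, amplicon_len: int) -> int:
    """Approximate overlap (bp) between the forward and reverse read.

    Assumption: both reads are read_len long and the amplicon length is
    measured over the sequenced region. Real overlaps vary by a few bp.
    """
    overlap = 2 * read_len - amplicon_len
    # The overlap cannot be longer than a read or the amplicon itself,
    # and a negative value means the reads do not reach each other.
    return max(0, min(overlap, read_len, amplicon_len))


# ~250 bp V4 target, as described above
for read_len in (150, 250):
    print(read_len, "bp reads ->", expected_overlap(read_len, 250), "bp overlap")
```

For a 290 bp target with 250 bp reads, the same estimate gives 210 bp, which is why a minimum overlap of 250 would not work there.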

After an alignment over your minimum overlap length is found, the number of mismatches is checked, and if it's too high, the reads are dropped. So, technically yes, more mismatches means lower quality. But the reads have a chance to correct each other's errors, so quality is higher after pairing anyway. I usually increase maxdiffs.

(maxdiffs should also be adjusted based on overlap length. 10 differences in an overlap of 50 is a 20% mismatch rate, but 10 differences in an overlap of 250 is just 4%. If your overlaps are long, you should definitely increase maxdiffs!)
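To make that scaling concrete, here is a minimal sketch that converts between maxdiffs and a mismatch rate for a given overlap. Both helper names are hypothetical, and the "rate" here is simply differences divided by overlap length, not a value vsearch reports.

```python
def mismatch_rate(maxdiffs: int, overlap_len: int) -> float:
    """Fraction of overlap positions allowed to differ."""
    return maxdiffs / overlap_len


def maxdiffs_for_rate(rate: float, overlap_len: int) -> int:
    """Hypothetical helper: maxdiffs that gives roughly this rate."""
    return round(rate * overlap_len)


print(mismatch_rate(10, 50))    # 10 differences in a 50 bp overlap
print(mismatch_rate(10, 250))   # 10 differences in a 250 bp overlap

# Keeping the same 20% tolerance on a 250 bp overlap needs a larger maxdiffs.
print(maxdiffs_for_rate(0.20, 250))
```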

maxee is based on "expected errors", a quality metric introduced a few years ago to improve upon filtering by average Q score.
https://www.drive5.com/usearch/manual/exp_errs.html
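As a sketch of the idea on that page: the expected number of errors in a read is the sum of the per-base error probabilities, where a Phred score Q corresponds to an error probability of 10^(-Q/10). The function and the two example reads below are made up for illustration, and I'm assuming standard Phred scores.

```python
def expected_errors(phred_scores):
    """Sum of per-base error probabilities, P = 10 ** (-Q / 10)."""
    return sum(10 ** (-q / 10) for q in phred_scores)


def mean_q(phred_scores):
    """Plain average Q score, for comparison."""
    return sum(phred_scores) / len(phred_scores)


# Two made-up 100 bp reads
uniform = [30] * 100            # every base at Q30
mixed = [40] * 90 + [2] * 10    # mostly Q40, with ten very poor bases

for name, read in (("uniform", uniform), ("mixed", mixed)):
    print(name, "mean Q:", mean_q(read), "expected errors:", round(expected_errors(read), 2))
```

The mixed read has the higher average Q score but far more expected errors, because the handful of very poor bases dominates the sum. A maxee threshold would discard it while an average-Q filter would keep it.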

Let me know if this helped answer your questions. Welcome to the forums!

Colin

cookingrice · November 4, 2019, 8:53pm

Dear Colin, thank you so much for your help. I still have a few queries regarding your reply.

This tells the overlapping program, like vsearch, to only accept overlaps that are at least this long. So if you expect to have 50 bp of overlap between your reads, you don’t want vsearch to return you an overlap of only 10 bp, as that’s probably wrong.

Why would an overlap of 10bp be wrong? If we know for a fact that there is a 50bp overlap between reads, wouldn't setting an overlap of 10 be okay as the reads would still be accepted? It isn't strictly correct but surely there is no harm? (just curious)

Good catch! It totally does make a difference!
With EMP 16S V4 primers, the targeted region is about 250 bp long, so I expect ~50 bp of overlap from 150 bp reads and ~250 bp of overlap from 250 bp reads. I like to set minovlen to 30 bp and 220 bp respectively for these settings, just to give vsearch a little room in case of insertions or deletions.

So let's say for 250bp paired-end reads where the overlap length is ~250bp, is it acceptable to set it lower, i.e. 50? (It might be the same question as above so I apologise if it is)

(maxdiffs should also be adjusted based on overlap length. 10 differences in an overlap of 50 is a 20% mismatch rate, but 10 differences in an overlap of 250 is just 4%. If your overlaps are long, you should definitely increase maxdiffs!)

What would be a good value for the 'error rate'? Would 20% (50 differences in an overlap of 250) be reasonable?

Thanks again and I apologise for the trivial questions - I'm struggling to understand the concept of paired-end reads!

colinbrislawn · November 5, 2019, 3:02pm

Hello again,

Well said, this should not cause harm and vsearch usually gets the correct length anyway. Sometimes when working with 250 bp reads on V4, I expect near total overlap, and vsearch gives me 100% matching overlap of just 12 bases, but that only happens for a few reads.

Yes, that's acceptable. The mistakes I see with vsearch are very short overlaps, so a minimum of 50 probably behaves very much like a minimum of 200 for an expected overlap of 250. You could try it both ways and see!

Sure! I'm not sure if someone has benchmarked this either, but that seems OK to me.

You don't have to worry about setting this 'too high' and getting overlaps that don't match at all; vsearch does not pair reads that are very different, and it will report them as `alignment score too low, or score drop too high`.

I appreciate your detailed questions. This is a great way to approach bioinformatics analysis.

The more you know!

Colin

colinbrislawn · November 9, 2019, 6:39pm

The conversation continues here: Following from question about minovlen
