Looking for advice on poor merging results with vsearch

tamardigrade · October 11, 2023, 10:07pm

Hi,

I am having poor merging results with vsearch in QIIME. I tried a few things I found on this forum, but they didn't really change anything, so I figured I'd post to ask for advice. Any suggestions on how to boost merging results, or are my reads just crap?

My code:

qiime vsearch join-pairs --i-demultiplexed-seqs ITS_semifinal_trim.qza --p-allowmergestagger --verbose --p-threads 4 --p-maxdiffs 30 --p-truncqual --o-joined-sequences ITS_semifinal_merged.qza --output-dir ITSmerged

Results:

Merging reads 100%
1821 Pairs
22 Merged (1.2%)
1799 Not merged (98.8%)

Pairs that failed merging due to various reasons:
1330 too few kmers found on same diagonal
469 alignment score too low, or score drop to high

colinbrislawn · October 11, 2023, 10:42pm

Hello Tammy,

You are trying all the right things!
--p-allowmergestagger --p-maxdiffs bignumber are what I try first.

Based on the name, it looks like you sequenced part of the ITS.

What region did you target and what's the size range of this region?

How long is your Illumina sequencing?

tamardigrade · October 12, 2023, 1:48am

Hi,
Thanks for such a fast response!

targeted region: ITS2

the size range of this region: 267-511 bp amplicon with the primers I used (Figure 1 in the paper below)

Illumina sequencing length: 2 × 250 bp

Primers are from this paper: Taylor, D. L., Walters, W. A., Lennon, N. J., Bochicchio, J., Krohn, A., Caporaso, J. G., & Pennanen, T. (2016). Accurate estimation of fungal diversity and abundance through improved lineage-specific primers optimized for Illumina amplicon sequencing. Applied and Environmental Microbiology, 82(24), 7217-7226.

Let me know what other info you need!

colinbrislawn · October 12, 2023, 1:26pm

OK, here's your problem:

Expected overlap = Illumina reads - amplicon length
Expected overlap = each Illumina read x2 - amplicon length
Expected overlap = 500 - (267 to 511)
Expected overlap = 233 to -11

That is a huge range! In practice, it's less variable, but still!
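
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation, using the read length and amplicon range given in this thread; the function name `expected_overlap` is made up for the example.

```python
def expected_overlap(read_length, amplicon_length):
    """Expected overlap in bp between a forward and reverse read.

    Both reads are assumed to be read_length long. A negative result
    means a gap between the reads, so the pair cannot be joined.
    """
    return 2 * read_length - amplicon_length


read_length = 250              # 2 x 250 bp run, from this thread
shortest, longest = 267, 511   # ITS2 amplicon range, from this thread

overlap_short = expected_overlap(read_length, shortest)
overlap_long = expected_overlap(read_length, longest)
print(f"Expected overlap: {overlap_short} to {overlap_long} bp")
```

The shortest amplicons overlap almost completely, while the longest ones leave a gap.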

If most reads are shorter, trimming off the low-quality ends before merging may work best.

Can you post your quality scores after importing so we can take a look?

tamardigrade · October 12, 2023, 4:10pm

Here is a screenshot from before I removed 2 samples with 0 reads. The code to make a new quality score visual is taking so long! (Another sign of something off?)

Would I trim off the low quality ends with trimmomatic? Or what other software?

Please let me know if this is not what you need!

Many thanks,
Tammy

colinbrislawn · October 12, 2023, 6:24pm

Thanks for posting that, Tammy,

Uh... it's not just the ends that have low quality. The quality is highly variable throughout, which I guess is a common problem with variable length regions.

I like your idea of using trimmomatic (or vsearch itself) to cut off the ends of reads once their quality drops. This has to support variable-length trimming per read, because read length and quality also vary per read.
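
To make "variable-length trimming per read" concrete, here is a minimal Python sketch of a sliding-window quality cut. It is a simplified illustration in the spirit of window-based trimmers, not the actual algorithm used by trimmomatic or vsearch. The function name, the window size of 4, and the mean-quality threshold of 15 are all assumptions chosen for the example.

```python
def trim_by_quality(seq, quals, window=4, min_mean=15):
    """Cut a read at the first window whose mean quality is below min_mean.

    seq   -- the read as a string of bases
    quals -- per-base Phred scores as a list of ints, same length as seq
    Returns the trimmed (seq, quals). Reads shorter than the window
    are returned unchanged.
    """
    if len(seq) != len(quals):
        raise ValueError("seq and quals must have the same length")
    for start in range(len(quals) - window + 1):
        window_quals = quals[start:start + window]
        if sum(window_quals) / window < min_mean:
            # Keep everything before the first low-quality window.
            return seq[:start], quals[:start]
    return seq, quals


# Two toy reads of equal length whose quality drops at different positions.
read_a = ("ACGTACGTAC", [30, 30, 30, 30, 30, 30, 10, 10, 10, 10])
read_b = ("ACGTACGTAC", [30, 30, 10, 10, 10, 10, 10, 10, 10, 10])

trimmed_a, _ = trim_by_quality(*read_a)
trimmed_b, _ = trim_by_quality(*read_b)
print(len(trimmed_a), len(trimmed_b))
```

Each read ends up with its own length, which is the point: a single fixed truncation position would either keep bad bases on one read or throw away good bases on the other.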

Let me know what you try next.

tamardigrade · October 16, 2023, 8:27pm

Hi,

I tried trimmomatic and trimmed the reads down to 230 bp. I ran the following code:

qiime vsearch join-pairs --i-demultiplexed-seqs ITS_semifinal.qza --p-truncqual 12 --p-allowmergestagger --verbose --p-threads 4 --p-maxdiffs 30 --o-joined-sequences ITS_semifinal_merged.qza --output-dir ITSmerged

and got this output:

Merging reads 100%
56478 Pairs
13882 Merged (24.6%)
42596 Not merged (75.4%)

Pairs that failed merging due to various reasons:
33098 too few kmers found on same diagonal
9477 alignment score too low, or score drop to high
21 overlap too short

Here is the multiqc file of mean quality scores:

So trimming helped, but didn't finish the job.
Questions:

Would trimming more off help?
What other suggestions do you have?
Does the most common error, 'too few kmers found on same diagonal', mean that the reads just aren't similar enough to overlap?

colinbrislawn · October 16, 2023, 8:49pm

Thanks for the update. I'm glad some of the reads are merging.

Would trimming more off help?
Maybe!

What other suggestions do you have?
I'm not sure....

Does the most common error, 'too few kmers found on same diagonal', mean that the reads just aren't similar enough to overlap?
Yes. This is an explanation of why it failed: the read pair can't be joined because the two reads can't be aligned, and they can't be aligned because too few shared kmers were found. And this is expected sometimes. Remember:

An overlap of -11 is a gap of 11. You can't join when there's no overlap.

Choosing to join will functionally filter for short amplicons because only those will overlap.

Perhaps it's best to analyze this data set twice: once with unjoined reads, and again with joined reads. This will let you view the data with and without length bias.
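
The length bias can be sketched with the numbers from this thread. The snippet below estimates which amplicon lengths can still be joined for a given read length. The minimum overlap of 10 bp is an assumption made for illustration (check the minimum overlap setting of your vsearch version), and both function names are hypothetical.

```python
def can_merge(amplicon_length, read_length, min_overlap=10):
    """True if two reads of read_length overlap by at least min_overlap bp."""
    return 2 * read_length - amplicon_length >= min_overlap


def mergeable_range(min_amplicon, max_amplicon, read_length, min_overlap=10):
    """Return (shortest, longest) amplicon length that can still be joined,
    or None if no amplicon in the range can be joined."""
    longest_mergeable = min(max_amplicon, 2 * read_length - min_overlap)
    if longest_mergeable < min_amplicon:
        return None
    return (min_amplicon, longest_mergeable)


# ITS2 amplicon range from this thread: 267-511 bp.
untrimmed = mergeable_range(267, 511, read_length=250)  # full 2 x 250 reads
trimmed = mergeable_range(267, 511, read_length=230)    # reads cut to 230 bp
print(untrimmed, trimmed)
```

Under these assumptions, amplicons longer than about 490 bp never join with full-length reads, and trimming to 230 bp lowers that ceiling to about 450 bp. Trimming improves quality in the overlap but also tightens the filter against long amplicons.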

tamardigrade · October 17, 2023, 7:27pm

Hi,

Thanks for such fast responses to all my posts. It makes a difference!
As for the protocol with unjoined reads: would I just continue data analysis as usual, skipping the merge-pairs step and heading to the quality filtering step? I've been trying to look for protocols to follow and not having much luck.

Many thanks,
Tammy

colinbrislawn · October 19, 2023, 5:02pm

Yes. You could also go back to the importing step and import as single-end fastq files. Then you process this as if you had only one of the paired-end reads.
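
If the data were imported with a paired-end manifest file, one way to set up a single-end import is to derive a single-end manifest that keeps only the forward reads. The sketch below assumes a tab-separated manifest with the columns `sample-id`, `forward-absolute-filepath` and `reverse-absolute-filepath`, and writes the `sample-id` / `absolute-filepath` layout used for single-end manifests. The function name and the file paths are hypothetical, and this does not apply if you imported some other way.

```python
import csv
import io


def paired_to_single_manifest(paired_tsv_text, keep="forward"):
    """Convert paired-end manifest text into single-end manifest text.

    keep -- "forward" or "reverse": which read of each pair to keep.
    """
    column = f"{keep}-absolute-filepath"
    reader = csv.DictReader(io.StringIO(paired_tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample-id", "absolute-filepath"])
    for row in reader:
        writer.writerow([row["sample-id"], row[column]])
    return out.getvalue()


# Hypothetical one-sample manifest; the paths are placeholders.
paired = (
    "sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n"
    "S1\t/data/S1_R1.fastq.gz\t/data/S1_R2.fastq.gz\n"
)
single = paired_to_single_manifest(paired)
print(single)
```

The resulting text can be saved to a file and used for a single-end import, after which the usual single-end workflow applies.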

system · November 19, 2023, 11:03pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.