I am having poor merging results with vsearch in QIIME. After trying a few things I found on this forum that didn't really change anything, I figured I'd reach out for advice in a post. Any suggestions on how to boost merging results, or are my reads just crap?
Size range of this region: 267-511 bp amplicon with the primers I used (Figure 1 in the paper below)
Illumina sequencing length: 2 × 250 bp
Primers are from this paper: Taylor, D. L., Walters, W. A., Lennon, N. J., Bochicchio, J., Krohn, A., Caporaso, J. G., & Pennanen, T. (2016). Accurate estimation of fungal diversity and abundance through improved lineage-specific primers optimized for Illumina amplicon sequencing. Applied and Environmental Microbiology, 82(24), 7217-7226.
Here is a screenshot from before I removed 2 samples with 0 reads. The code to make a new quality score visual is taking so long! (Another sign of something off?)
Would I trim off the low quality ends with trimmomatic? Or what other software?
Uh... it's not just the ends that have low quality. The quality is highly variable throughout, which I guess is a common problem with variable length regions.
I like your idea of using trimmomatic (or vsearch itself) to cut off the ends of reads once their quality drops. This has to support variable-length trimming per read, because read length and quality also vary per read.
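To make "cut each read where its quality drops" concrete, here is a minimal sketch in plain Python. It is loosely modeled on the sliding-window idea behind Trimmomatic's SLIDINGWINDOW step, but it is not Trimmomatic's or vsearch's actual code. The function names, the window of 4 and the Q20 threshold are illustrative assumptions, not recommended settings, and Phred+33 quality encoding is assumed.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(c) - 33 for c in qual_string]


def trim_at_quality_drop(seq, quals, min_q=20, window=4):
    """Cut the read at the start of the first window whose mean quality < min_q.

    min_q and window are illustrative defaults (assumptions), not tuned values.
    Reads shorter than the window are returned unchanged.
    """
    for start in range(0, len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_q:
            return seq[:start], quals[:start]
    return seq, quals


# Toy read: six high-quality bases followed by four low-quality bases.
seq = "ACGTACGTAC"
quals = phred33("IIIIII####")
trimmed_seq, trimmed_quals = trim_at_quality_drop(seq, quals)
print(len(seq), "->", len(trimmed_seq))
```

Because the cut point is found per read, every read ends up with its own length, which is the behavior described above.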
Pairs that failed merging due to various reasons:
33098 too few kmers found on same diagonal
9477 alignment score too low, or score drop to high
21 overlap too short
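For a quick sense of proportion, the three counts above can be turned into shares of the failed pairs. This is only arithmetic on the numbers reported in the log. The total number of input pairs is not shown here, so these are shares of failures, not of all reads.

```python
# Counts copied from the vsearch output above.
failures = {
    "too few kmers found on same diagonal": 33098,
    "alignment score too low, or score drop to high": 9477,
    "overlap too short": 21,
}

total_failed = sum(failures.values())
shares = {reason: count / total_failed for reason, count in failures.items()}

print("failed pairs:", total_failed)
for reason, share in shares.items():
    print(f"{share:6.1%}  {reason}")
```

Roughly three quarters of the failed pairs fall under the kmer reason.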
Thanks for the update. I'm glad some of the reads are merging.
Would trimming more off help?
Maybe!
What other suggestions do you have?
I'm not sure....
Does the most common error, 'too few kmers found on same diagonal', mean that the reads just aren't similar enough to overlap?
Yes. This is an explanation of why it failed: the read pair can't join because it can't align because too few kmers were found. And this is expected sometimes. Remember:
An overlap of -11 is a gap of 11. You can't join when there's no overlap.
Choosing to join will functionally filter for short amplicons because only those will overlap.
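The overlap arithmetic behind these two points can be sketched as follows, using the 2 × 250 bp read length and the 267-511 bp amplicon range from the original post. The function name is hypothetical. The calculation assumes both reads are full length and the amplicon is at least as long as one read, and it ignores any extra bases lost to trimming.

```python
def expected_overlap(amplicon_len, read_len=250):
    """Overlap in bp between forward and reverse reads; negative means a gap.

    Assumes both reads are read_len long and amplicon_len >= read_len.
    """
    return 2 * read_len - amplicon_len


for amplicon_len in (267, 400, 511):
    overlap = expected_overlap(amplicon_len)
    if overlap > 0:
        status = f"overlap of {overlap} bp"
    else:
        status = f"gap of {-overlap} bp, cannot join"
    print(f"{amplicon_len} bp amplicon: {status}")

# Trimming shortens the reads, which shrinks the overlap further.
# 200 bp is an illustrative truncation length, not a recommendation.
print(expected_overlap(400, read_len=200))
```

The longest amplicons in this region leave no overlap even with untrimmed reads, and every base trimmed from either read pushes more of the length range into gap territory.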
Perhaps it's best to analyze this data set twice: once with unjoined reads, and again with joined reads. This will let you view the data with and without length bias.
Thanks for such a fast response to all my posts. It makes a difference!
As for the protocol with unjoined reads: would I just continue data analysis as usual, skipping the merge-pairs step and heading to the quality filtering step? I've been trying to look for protocols to follow and not having much luck.
Yes. You could also go back to the importing step and import as single-end fastq files. Then you process this as if you had only one of the paired-end reads.