Losing a large number of reads when using vsearch join-pairs

Johanna_Lisa_Bosch · May 10, 2022, 10:18pm

Hi there! I am running various sediment samples (Illumina 16S amplicon sequencing, V4V5 region) through the QIIME 2 pipeline. However, when I use vsearch join-pairs to join my paired-end reads, I lose a large share of them. For reference, here is a breakdown of my workflow for one sample:

before trimming = 79 489 reads
after trimming = 75 542 reads
after joining = 28 005 reads
after filtering = 28 004 reads
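To make the size of each drop explicit, here is a minimal Python sketch that turns the counts above into retention percentages. The function name `retention_table` is made up for this illustration; only the four read counts come from the post.

```python
# Read counts reported above for one sample.
counts = [
    ("before trimming", 79489),
    ("after trimming", 75542),
    ("after joining", 28005),
    ("after filtering", 28004),
]

def retention_table(counts):
    """Return (step, reads, % of previous step, % of input) for each step."""
    total = counts[0][1]
    previous = total
    rows = []
    for step, reads in counts:
        rows.append((step, reads, 100 * reads / previous, 100 * reads / total))
        previous = reads
    return rows

for step, reads, pct_prev, pct_total in retention_table(counts):
    print(f"{step:<16} {reads:>6}  {pct_prev:6.1f}% of previous  {pct_total:6.1f}% of input")
```

Trimming keeps about 95.0% of the reads, while joining keeps only about 37.1% of the trimmed reads, so nearly all of the loss happens at the joining step.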

Could anyone explain why this may be happening? I figured there may be a few reasons, but hopefully someone can help.

the_dummy · May 12, 2022, 11:24am

Hello @Johanna_Lisa_Bosch,

It looks like the forward and reverse reads are not overlapping well enough to be joined, so I'd suspect the quality at the ends of both the forward and reverse reads.

I think examining the QC report and filtering the reads accordingly could help retain more of them.
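As a rough way to reason about the overlap: when both reads span one amplicon, the number of shared bases is the sum of the two read lengths minus the amplicon length. The sketch below is only an illustration. The amplicon length of 375 is a hypothetical placeholder, not a measured value for this dataset, and the helper name `expected_overlap` is made up. The threshold of 10 is the `--p-minovlen` default shown in the help text quoted later in this thread.

```python
def expected_overlap(len_forward, len_reverse, amplicon_len):
    """Bases shared by both reads when they cover one amplicon end to end.

    A value of zero or less means the reads do not reach each other.
    """
    return len_forward + len_reverse - amplicon_len

# Hypothetical values for illustration only; substitute the real amplicon
# length (after primer removal) and the read lengths left after trimming.
AMPLICON_LEN = 375
MIN_OVERLAP = 10  # default of --p-minovlen in the quoted help text

for read_len in (250, 220, 190, 180):
    overlap = expected_overlap(read_len, read_len, AMPLICON_LEN)
    verdict = "enough" if overlap >= MIN_OVERLAP else "too short"
    print(f"2 x {read_len}: overlap {overlap:>4} -> {verdict}")
```

The same arithmetic applies if reads are shortened by quality truncation before joining: every base removed from a read end is a base removed from the possible overlap.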

Good luck,

Kaan

Johanna_Lisa_Bosch · May 13, 2022, 7:09pm

I'm using paired-end reads. Here is a breakdown of the first few steps of my workflow; the filtering step comes after the joining step, unless this is not the common practice?

I ran a QC and I know I would trim at a length of around 250 based on the Phred scores, but otherwise I am not sure which parameters I would adjust in any of the following commands to improve this, as most of my reads are lost after joining. Wouldn't the trimming stage only take away the adapter sequences?

Here is the general workflow of my first few steps before constructing ASVs:

import with qiime tools import
trim with command cutadapt trim-paired
join with command vsearch join-pairs
filter with command quality-filter q-score
construct ASVs...

And here are the parameters for the vsearch join-pairs command. I am unsure whether these are what I would have to adjust to get a higher read count. The trim command does not offer many parameter settings related to read length:

--p-truncqual INTEGER Range(0, None)
  Truncate sequences at the first base with the specified quality score value or lower. [optional]

--p-minlen INTEGER Range(0, None)
  Sequences shorter than minlen after truncation are discarded. [default: 1]

--p-maxns INTEGER Range(0, None)
  Sequences with more than maxns N characters are discarded. [optional]

--p-allowmergestagger / --p-no-allowmergestagger
  Allow joining of staggered read pairs. [default: False]

--p-minovlen INTEGER Range(0, None)
  Minimum overlap length of forward and reverse reads for joining. [default: 10]

--p-maxdiffs INTEGER Range(0, None)
  Maximum number of mismatches in the forward/reverse read overlap for joining. [default: 10]

--p-minmergelen INTEGER Range(0, None)
  Minimum length of the joined read to be retained. [optional]

--p-maxmergelen INTEGER Range(0, None)
  Maximum length of the joined read to be retained. [optional]

--p-maxee NUMBER Range(0.0, None)
  Maximum number of expected errors in the joined read to be retained. [optional]

--p-qmin INTEGER Range(-5, 2, inclusive_end=True)
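To illustrate what a minimum overlap and a mismatch limit mean in practice, here is a toy Python sketch of joining one read pair. It is not how vsearch merges reads (as far as I know the real tool scores the alignment and uses base qualities); `toy_join` and its ungapped, longest-overlap-first search are simplifying assumptions made only to show how parameters like `--p-minovlen` and `--p-maxdiffs` can cause a pair to be rejected.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def toy_join(forward, reverse, minovlen=10, maxdiffs=10):
    """Join a read pair by the longest acceptable ungapped overlap.

    Returns the joined sequence, or None when no overlap of at least
    `minovlen` bases has `maxdiffs` or fewer mismatches.
    """
    rev = reverse_complement(reverse)
    longest = min(len(forward), len(rev))
    # Try the longest overlap first and shrink until one is acceptable.
    for ovlen in range(longest, max(minovlen, 1) - 1, -1):
        tail, head = forward[-ovlen:], rev[:ovlen]
        diffs = sum(a != b for a, b in zip(tail, head))
        if diffs <= maxdiffs:
            return forward + rev[ovlen:]
    return None

# Made-up 30-base "amplicon" read from both ends with 20-base reads,
# so the true overlap is exactly 10 bases.
amplicon = "ACGTTGCAAGGCTTAACCGGATCGATTGCA"
forward = amplicon[:20]
reverse = reverse_complement(amplicon[10:])

print(toy_join(forward, reverse, minovlen=10, maxdiffs=0))  # joins
print(toy_join(forward, reverse, minovlen=12, maxdiffs=0))  # overlap too short
```

In this toy case the pair joins when 10 overlapping bases are enough and is discarded when 12 are required, which is the kind of loss to expect when reads barely reach each other.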

the_dummy · May 17, 2022, 7:20am

If your aim is to construct ASVs, I'd recommend using DADA2 straightaway. Also, it includes options for trimming and quality filtering, making it a good choice for your case.
