How to merge paired-end reads when the overlap is short

Dawud922 · February 3, 2020, 7:42pm

Dear experts
I have paired-end 16S rRNA sequencing data from the V4 region. The sequencing was done at 2 x 150 read length.

I used DADA2 to merge them, but very few sequences were left for downstream analysis. While troubleshooting, I found that with my 815F & 806R primers at 2x150 read length, the overlap region is only going to be 4 to 5 nucleotides. Is this the reason so few sequences are left after merging?

How should we merge R01 and R02 in my situation? I read some discussions on Google saying that the forward and reverse fastq files contain reads in matched order. Does anyone know what "reads in matched order" means? If the reads are matched during sequencing, can I just merge my R01 and R02 with mergePairs(..., justConcatenate=TRUE) in R? And can we do the same in QIIME?

Thanks

Mehrbod_Estaki · February 4, 2020, 12:13am

Hi @Dawud922,
There is a very recent discussion regarding the justConcatenate option you mentioned, which I advise against. Assuming you meant the 515f/806r primer set, you are right that the overlap with a 2x150 run is not sufficient for proper merging, which would explain why dada2 is failing to merge these.
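
To make the overlap arithmetic concrete, here is a minimal Python sketch. The function names are hypothetical, and the 295 nt amplicon length is only an illustrative value chosen to reproduce the roughly 5 nt overlap described above; substitute the length of your own amplicon, primers included if they are still on the reads. The threshold of 12 is the default minOverlap documented for DADA2's mergePairs.

```python
def expected_overlap(fwd_len: int, rev_len: int, amplicon_len: int) -> int:
    """Number of bases covered by both reads; a negative value means a gap."""
    return fwd_len + rev_len - amplicon_len


def can_merge(fwd_len: int, rev_len: int, amplicon_len: int,
              min_overlap: int = 12) -> bool:
    """True if the expected overlap reaches the required minimum.

    min_overlap=12 mirrors the documented default of DADA2's mergePairs.
    """
    return expected_overlap(fwd_len, rev_len, amplicon_len) >= min_overlap


# Illustrative amplicon length (an assumption, not a measured value).
AMPLICON_LEN = 295

print(expected_overlap(150, 150, AMPLICON_LEN))  # 5
print(can_merge(150, 150, AMPLICON_LEN))         # False
print(can_merge(250, 250, AMPLICON_LEN))         # True
```

Note that any quality truncation applied before merging shortens the reads and reduces the overlap further.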

You shouldn't... I would just discard your reverse reads and use the forward reads only moving forward.

Dawud922 · February 4, 2020, 6:41pm

Thank you Mehrbod for your suggestion. I'll go forward with just my R01 reads.
Just out of curiosity, are the R01 and R02 reads matched in order in the fastq files? It's still not clear in my mind how the sequencing machine reads out the data and writes the fastq files while sequencing.

Mehrbod_Estaki · February 4, 2020, 10:41pm

Hi @Dawud922,
Yes, the reads in your forward and reverse files (and barcode file, if you have one) are matched, meaning they are in the same order: record 1 in each of those files corresponds to the same sample/read.
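
If you want to verify this on your own files, the following is a rough Python sketch that compares read IDs record by record. It assumes plain 4-line FASTQ records and headers of the form "@id comment" (or the older "@id/1" and "@id/2" style); the function names and the tiny in-memory example are hypothetical.

```python
import io
from itertools import zip_longest


def read_ids(handle):
    """Yield the read ID from each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 != 0 or not line.strip():
            continue  # only header lines carry the ID
        name = line[1:].split()[0]          # drop '@' and any comment field
        if name.endswith(("/1", "/2")):     # older pair-suffix convention
            name = name[:-2]
        yield name


def first_mismatch(fwd_handle, rev_handle):
    """Return the 0-based index of the first unmatched record, or None."""
    pairs = zip_longest(read_ids(fwd_handle), read_ids(rev_handle))
    for idx, (fwd_id, rev_id) in enumerate(pairs):
        if fwd_id != rev_id:
            return idx
    return None


# Tiny made-up example; real files would be opened with open() or gzip.open(..., "rt").
r1 = "@read1 1:N:0:1\nACGT\n+\nIIII\n@read2 1:N:0:1\nGGCC\n+\nIIII\n"
r2 = "@read1 2:N:0:1\nTTTT\n+\nIIII\n@read2 2:N:0:1\nAAAA\n+\nIIII\n"
r2_shuffled = "@read2 2:N:0:1\nAAAA\n+\nIIII\n@read1 2:N:0:1\nTTTT\n+\nIIII\n"

print(first_mismatch(io.StringIO(r1), io.StringIO(r2)))           # None
print(first_mismatch(io.StringIO(r1), io.StringIO(r2_shuffled)))  # 0
```

A file with a different number of records also shows up as a mismatch, because the shorter file runs out of IDs first.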
Hope that clarifies this for you.

paul1 · January 20, 2021, 2:01am

I am thinking about using a protocol that uses paired-end reads on a region that is 700-900 bp in length, so there is no overlap. Since I would just be using the results to identify a species by comparing it to a reference, I don't see much risk in using justConcatenate=TRUE, since I am not interested in the actual length of the region. The ends could be truncated and the program would throw some NNNNs in the middle. Seems fine? Am I missing something?
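
For reference, this is roughly what the concatenation amounts to, as a minimal Python sketch rather than DADA2's actual code. The DADA2 documentation describes justConcatenate as joining the forward read and the reverse-complemented reverse read with a spacer of 10 Ns; the function names and the short sequences here are made up for illustration.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def just_concatenate(fwd: str, rev: str, spacer: str = "N" * 10) -> str:
    """Join forward read, spacer, and reverse-complemented reverse read."""
    return fwd + spacer + reverse_complement(rev)


print(just_concatenate("ACGT", "AACC"))  # ACGTNNNNNNNNNNGGTT
```

The spacer has a fixed length regardless of how many bases actually lie between the two reads, so the result is not a contiguous stretch of the reference.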

One thing is that the overlap gives some assurance that the two reads are truly in 'matched order', but I am not sure how often they are mismatched by the sequencer. Would there be other concerns to consider, given this context?