Joining paired-end reads using QIIME2, QIIME1.9 and usearch10

mol · July 10, 2018, 4:25pm

I used my 16S rRNA paired-end reads to compare methods for joining paired-end reads, including QIIME2 (vsearch), QIIME1.9 (fastq-join), QIIME1.9 (SeqPrep) and usearch10. All parameters were left at their defaults. The results are shown below:

           No. of raw reads   QIIME2(vsearch)   QIIME1.9(fastq-join)   QIIME1.9(SeqPrep)   usearch10
sample1    31487              7883              20566                  28966               1689
sample2    43406              11435             29377                  39957               2458
sample3    33070              8937              22612                  30736               2226

Why do the results vary so much among these methods? Which method should I choose to analyze my data?
Thanks!

Nicholas_Bokulich · July 10, 2018, 6:27pm

Probably because the default parameters and algorithms are very different between these methods.

The high variability makes me think something else is going on.

Did you perform any kind of read trimming or quality filtering prior to joining?

You should evaluate the joining, too, if you are concerned that they may be joining differently. Look at read length distributions and compare those to your expected amplicon length distribution (keep in mind that some level of length variation exists for 16S, so there is a distribution).
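One way to look at the read length distribution is to tabulate sequence lengths straight from the joined FASTQ file. This is a minimal sketch that assumes a standard four-line-per-record FASTQ; the function names and the `joined.fastq` path are hypothetical, not part of QIIME or vsearch.

```python
from collections import Counter
import gzip
import io


def read_lengths(handle):
    """Yield the sequence length of each record in a 4-line-per-record FASTQ stream."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record holds the sequence
            yield len(line.strip())


def length_histogram(handle):
    """Return a Counter mapping read length -> number of reads."""
    return Counter(read_lengths(handle))


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


if __name__ == "__main__":
    # Tiny in-memory example; swap in open_fastq("joined.fastq") (hypothetical path).
    demo = io.StringIO(
        "@r1\nACGTACGT\n+\nIIIIIIII\n"
        "@r2\nACGT\n+\nIIII\n"
        "@r3\nACGTACGT\n+\nIIIIIIII\n"
    )
    for length, count in sorted(length_histogram(demo).items()):
        print(length, count)
```

Running this on the output of each joining method and comparing the histograms to the expected amplicon length should show quickly whether one method produces joins of implausible length.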

You could also process all of these through QIIME2 to see if, e.g., you get more unclassified reads from the QIIME 1 methods (implying that these are bad joins). As an alternative, use vsearch or another global aligner to see how well these align to full-length 16S reads.

If length distribution and taxonomic abundances/alignment look okay, I'd use the method that gives me the most reads!

SoilRotifer · July 10, 2018, 6:50pm

To echo the points by @Nicholas_Bokulich, check out some of the older but related discussions in the QIIME 1 forum:

join_paired_ends.py help documentation
See this QIIME 1 forum thread: join_paired_ends.py questions
See this QIIME 1 forum thread: number of reads after joining

I think there are several other threads in the QIIME 1 forum. But these should help get you started. You can browse through mergepairs parameters for usearch. Also check out this post on why there may be differences between vsearch and usearch.

In brief, I try to trim off the low-quality tails prior to joining the reads. This reduces the number of mismatches; if there are too many, the merge will fail. This is one reason why DADA2 takes this approach, as outlined here. It is also generally a good idea to remove primers from each read prior to merging, especially if you get read-through, which can also cause mismatches during merging. In that case you may need to check whether both of your primers appear in each read separately.
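To make the tail trimming concrete, here is a rough sketch of the general idea: cut a read at the first base whose quality falls at or below a threshold. It assumes Phred+33 quality encoding, and `truncate_at_low_quality` is a hypothetical helper; it is not meant to reproduce the exact behaviour of vsearch, usearch or DADA2.

```python
def truncate_at_low_quality(seq, qual, min_q, offset=33):
    """Cut a read at the first base whose Phred score is <= min_q.

    seq and qual are equal-length strings; qual is ASCII-encoded Phred
    quality with the given offset (33 is an assumption here).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    for i, ch in enumerate(qual):
        if ord(ch) - offset <= min_q:
            # Keep only the bases before the first low-quality position.
            return seq[:i], qual[:i]
    return seq, qual


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
    print(truncate_at_low_quality("ACGTACGT", "IIII#III", min_q=5))
```

Note that a very high threshold cuts almost every read near its start, which leaves little or nothing to overlap with its mate.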

-Hope this helps!
-Cheers!

mol · July 11, 2018, 5:40pm

Thanks!
I tried changing the parameters so that they were the same across methods. QIIME1.9 (fastq-join and SeqPrep) and usearch10 gave similar results, but vsearch still differed markedly. I removed the primers from the data prior to joining, and did not do quality filtering.

mol · July 11, 2018, 5:40pm

Thank you!
I tested trimming off low-quality bases using the parameter '--p-truncqual' in QIIME2 and '-fastq_trunctail' in usearch10. The results are shown below:

Method                               sample1   sample2
raw data                             31487     43406
QIIME2(--p-truncqual=33)             2601      3864
QIIME2(--p-truncqual=20)             12135     17478
QIIME2(--p-truncqual=10)             20046     27966
QIIME2(--p-truncqual=5)              27291     38214
QIIME2(--p-truncqual=2)              27291     38214
QIIME2(--p-truncqual=1)              7883      11435
QIIME2(--p-truncqual=0)              7883      11435
QIIME2(default)                      7883      11435
usearch(-fastq_trunctail=30)         30187     41582
usearch(-fastq_trunctail=20)         28187     39220
usearch(-fastq_trunctail=10)         23324     33123
usearch(-fastq_trunctail=5)          20344     29222
usearch(-fastq_trunctail=default)    20344     29222
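The counts above are easier to compare as a percentage of raw reads retained. The short sketch below does that arithmetic using only the numbers from the table; the labels and function names are just illustrative.

```python
RAW = {"sample1": 31487, "sample2": 43406}

# (sample1, sample2) joined-read counts copied from the table above.
JOINED = {
    "QIIME2 --p-truncqual=33": (2601, 3864),
    "QIIME2 --p-truncqual=20": (12135, 17478),
    "QIIME2 --p-truncqual=10": (20046, 27966),
    "QIIME2 --p-truncqual=5": (27291, 38214),
    "QIIME2 --p-truncqual=2": (27291, 38214),
    "QIIME2 --p-truncqual=1": (7883, 11435),
    "QIIME2 --p-truncqual=0": (7883, 11435),
    "QIIME2 default": (7883, 11435),
    "usearch -fastq_trunctail=30": (30187, 41582),
    "usearch -fastq_trunctail=20": (28187, 39220),
    "usearch -fastq_trunctail=10": (23324, 33123),
    "usearch -fastq_trunctail=5": (20344, 29222),
    "usearch -fastq_trunctail default": (20344, 29222),
}


def percent_retained(joined, raw):
    """Joined reads as a percentage of raw read pairs."""
    return 100.0 * joined / raw


def retention_table():
    """Map each method to (sample1 %, sample2 %) retained."""
    raw_counts = tuple(RAW.values())  # insertion order: sample1, sample2
    return {
        method: tuple(percent_retained(j, r) for j, r in zip(counts, raw_counts))
        for method, counts in JOINED.items()
    }


if __name__ == "__main__":
    for method, (s1, s2) in retention_table().items():
        print(f"{method:34s} {s1:6.1f}% {s2:6.1f}%")
```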

QIIME2 and usearch showed different trends with increasing cutoff.
Is there something wrong with my process?

Nicholas_Bokulich · July 11, 2018, 6:38pm

The results are pretty similar in the Q=5-10 range. You did not test lower for usearch (default is obviously Q=5) but I'd expect similar.

Read yield drops off in the low range because, as @SoilRotifer described, there is too much bad sequence there, preventing suitable alignment.

Read yield drops off in the high range because you would be trimming off too much sequence, and the reads are unlikely to overlap at all since no overlapping tails remain!
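The high-range argument comes down to simple arithmetic: the expected overlap is the sum of the two (trimmed) read lengths minus the amplicon length. Here is a small sketch of that calculation; the read lengths, amplicon length and minimum overlap used in the example are hypothetical values chosen for illustration, not measurements from this dataset or documented tool defaults.

```python
def expected_overlap(fwd_len, rev_len, amplicon_len):
    """Bases shared by the forward and reverse reads; negative means a gap."""
    return fwd_len + rev_len - amplicon_len


def can_join(fwd_len, rev_len, amplicon_len, min_overlap=10):
    """True if the reads overlap by at least min_overlap bases.

    min_overlap=10 is only an illustrative value; check the minimum
    overlap setting of whichever merger you actually use.
    """
    return expected_overlap(fwd_len, rev_len, amplicon_len) >= min_overlap


if __name__ == "__main__":
    amplicon = 440  # hypothetical amplicon length
    for trimmed in (250, 230, 220, 200):  # hypothetical read lengths after trimming
        overlap = expected_overlap(trimmed, trimmed, amplicon)
        print(trimmed, overlap, can_join(trimmed, trimmed, amplicon))
```

Once trimming shortens the reads enough that the overlap falls below the merger's minimum, a correct join is no longer possible, so a tool that still reports a join at that point deserves a closer look.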

At higher Q values the usearch and vsearch trends do not correspond. This really comes down to differences between those algorithms (QIIME 2 is just wrapping vsearch here, not doing anything special), which as far as I know are supposed to be very similar (perhaps not for read joining, though).

Without seeing other evaluation evidence (length distribution, classification/alignment to reference) I would actually trust the vsearch yields more here, based on my explanation above. usearch might just be gluing together two reads that don't actually overlap — there may be different minimum overlap parameters or something along those lines.

As for whether something is wrong with your process: no, probably not. I think this still looks like different parameters and/or possibly differences in the algorithms.

In any case (again, lacking evals like read length and alignment to reference), joining with --p-truncqual in the Q=5-10 range looks pretty good! Looks like you've done a good evaluation to figure out what works best for your data!

SoilRotifer · July 11, 2018, 8:10pm

As @Nicholas_Bokulich mentioned, QIIME 2 is simply wrapping vsearch. You may have missed that I had updated my original post with this additional caveat:

This should help to answer your questions about usearch vs vsearch merging.

-Best wishes.

mol · July 12, 2018, 12:03pm

Thanks.
I have understood most of the points about the question. However, I used the same data and the same parameters to test QIIME2 (vsearch) and usearch. In QIIME2 (vsearch), as Q increased (except in the 0-1 range), the read yield dropped off. In usearch, as Q increased, the read yield increased. So I thought the only explanation is the algorithm?

Nicholas_Bokulich · July 12, 2018, 7:24pm

Probably yes. As I explained above (or at least predicted, since I have not seen read length distribution results to evaluate):

Nicholas_Bokulich:

Read yield drops off in the high range because you would be trimming off too much sequence, and the reads are unlikely to overlap at all since no overlapping tails remain!

At higher Q values the usearch and vsearch trends do not correspond. This really comes down to differences between those algorithms (QIIME 2 is just wrapping vsearch here, not doing anything special), which as far as I know are supposed to be very similar (perhaps not for read joining, though).

Without seeing other evaluation evidence (length distribution, classification/alignment to reference) I would actually trust the vsearch yields more here, based on my explanation above. usearch might just be gluing together two reads that don’t actually overlap — there may be different minimum overlap parameters or something along those lines.

So check out the length distributions after joining. I suspect usearch may be gluing together two non-overlapping reads. The vsearch results make more sense to me, because aggressive Q-score trimming will lead to shorter sequences and lower likelihood of successful joining.
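A quick way to put a number on this check is to compute the fraction of joined reads whose length falls outside the window you expect for your amplicon. This is only a sketch: `fraction_outside` is a hypothetical helper, and the 400-480 bp window and the counts in the example are made up, so substitute values for your own primers and data.

```python
from collections import Counter


def fraction_outside(length_counts, min_len, max_len):
    """Fraction of reads whose length is outside [min_len, max_len].

    length_counts maps read length -> number of reads.
    """
    total = sum(length_counts.values())
    if total == 0:
        return 0.0
    outside = sum(
        n for length, n in length_counts.items()
        if not min_len <= length <= max_len
    )
    return outside / total


if __name__ == "__main__":
    # Made-up example: 90 joins near the expected length, 10 much longer.
    counts = Counter({440: 90, 500: 10})
    print(fraction_outside(counts, 400, 480))
```

A large fraction of joins near the summed length of the two input reads would be consistent with reads being glued together with little or no real overlap.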
