ITS extraction after joining paired-end reads?

gabt · September 28, 2022, 8:06am

I am quite new to fungal amplicon analysis, and I have had many discussions about whether to use DADA2 (ASVs) or VSEARCH with an OTU-based approach to obtain the famous count table. As with most things in bioinformatics, there is no consensus and no gold standard, only "the way we do it is...". So the only thing one can do is run the same analysis with both ASVs and OTUs and hope the results are similar.

On top of this, I was told that, for the OTU route, joining R1 and R2 before extracting the ITS leads to better results. Recently, I posted about joining R1 and R2, but there are issues with the quality scores of the merged sequence. So I thought, why not simply join them instead? fastq_join does not overlap R1 and R2, so the scores are not affected, but it puts a padding sequence between the two reads, which, I suppose, lands right in the middle of the ITS sequence I want to extract. From vsearch's manual:

--fastq_join filename
Join paired-end sequence reads into one sequence and add a gap between them using a
padding sequence.

Question: will the padding affect the extracted ITS sequence and the subsequent clustering at 97%/99%/whatever threshold? What about the BLAST step?

colinbrislawn · October 10, 2022, 7:23pm

Yes, having discontiguous reads (with an NNN gap in the middle) will affect clustering, denoising, and taxonomy assignment. Some programs will handle this well; others will not, and could break if you have any Ns in your sequences at all. (The VSEARCH option --fastq_join is not even supported in QIIME 2 right now, though it may be added.)
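
To make the gap concrete, here is a rough Python sketch of what a joined (not merged) read looks like. It is only an illustration, not VSEARCH's actual code. The function names are made up, and the default padding of eight Ns with quality "I" is my assumption about the --join_padgap and --join_padgapq defaults, so check the manual for your version.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (N stays N)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def join_pair(r1_seq, r1_qual, r2_seq, r2_qual,
              padgap="NNNNNNNN", padgapq="IIIIIIII"):
    """Join R1 and R2 without overlapping them.

    Hypothetical helper: R2 is reverse-complemented and appended to R1,
    with a padding sequence in between. The padding defaults are an
    assumption, not something confirmed in this thread.
    """
    joined_seq = r1_seq + padgap + reverse_complement(r2_seq)
    # Quality scores of R2 are reversed to match the reverse complement.
    joined_qual = r1_qual + padgapq + r2_qual[::-1]
    return joined_seq, joined_qual


def count_ambiguous(seq):
    """Count bases that are not A, C, G or T (e.g. the N padding)."""
    return sum(1 for base in seq.upper() if base not in "ACGT")


# Toy reads, far shorter than real ITS amplicons.
seq, qual = join_pair("ACGT", "IIII", "AACC", "ABCD")
print(seq)
print(qual)
print(count_ambiguous(seq), "ambiguous bases in the joined read")
```

Every joined read carries that block of Ns, so any downstream tool that rejects or penalizes ambiguous bases will see it in every single sequence.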

You have run into a common problem: ITS reads can be hard to join due to the variable length of amplified ITS regions. One easy solution is to process the forward and reverse reads separately and compare the results.
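
One way to do the comparison, sketched in Python: collapse each count table to the same taxonomic level (genus, for example), convert to relative abundances, and compute a dissimilarity per sample between the forward-only and reverse-only results. The genus names and counts below are invented for illustration, and Bray-Curtis is just one reasonable choice of metric, not a recommendation specific to ITS.

```python
def relative_abundance(counts):
    """Convert a {taxon: count} dict into {taxon: proportion}."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {taxon: n / total for taxon, n in counts.items()}


def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two {taxon: abundance} dicts.

    0 means identical profiles, 1 means no taxa in common.
    """
    taxa = set(profile_a) | set(profile_b)
    diff = sum(abs(profile_a.get(t, 0.0) - profile_b.get(t, 0.0)) for t in taxa)
    total = sum(profile_a.get(t, 0.0) + profile_b.get(t, 0.0) for t in taxa)
    return diff / total


# Made-up genus-level counts for one sample, from the two separate analyses.
forward_counts = {"Aspergillus": 60, "Penicillium": 40}
reverse_counts = {"Aspergillus": 30, "Penicillium": 10, "Fusarium": 10}

fwd = relative_abundance(forward_counts)
rev = relative_abundance(reverse_counts)
print(f"Bray-Curtis (forward vs reverse): {bray_curtis(fwd, rev):.2f}")
```

If the per-sample dissimilarities are small and the same taxa dominate in both analyses, that is a decent sign the choice of read is not driving your conclusions.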

gabt · October 13, 2022, 8:08am

@colinbrislawn thank you for your reply. Your suggestion of analysing R1 and R2 separately is interesting. Somewhere else (I don't remember where) I found someone suggesting to use only the forward reads, but that felt like throwing away half of the information we have, so some people I talked to were not really convinced. I may try to do as you say, and with some luck get similar results.