Re: the question in the title, a few lines of evidence it doesn’t:
Attempting to merge the itsxpress output R1/R2 fastq files with vsearch results in 99% of reads failing with staggered merge.
Running ITSx on the individual R1/R2 itsxpress output files (after dereplication) results in 5.8S found in R2 for 62% of reads in my test data. (These were input in reverse so that R2 is actually the Illumina fwd read and is longer than R1, i.e., greater chance of read through).
L320 of the code is `return record1[start-1:], record2[r2start:]` (i.e., there’s no end index).
Seems like it may be as simple as `return record1[start-1:tlen-stop], record2[r2start:tlen-start]` if I’m understanding your code correctly?
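To make the read-through concrete, here’s a toy trace of the slice. The sequences are made up, and `start`, `stop`, and `tlen` are just the variable names discussed above, not values from the real code:

```python
# Toy sketch (invented sequences) showing why the open-ended slice
# keeps 3' read-through into the conserved region.

merged = "SSUSSU" + "ITSITSITS" + "58S58S"   # SSU | ITS | 5.8S layout
tlen = len(merged)                            # merged-fragment length: 21
start, stop = 7, 15                           # 1-based ITS coordinates

record1 = merged                              # fwd read with full read-through

print(record1[start - 1:])      # 'ITSITSITS58S58S'  <- 5.8S tail kept
print(record1[start - 1:stop])  # 'ITSITSITS'        <- tail removed
```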
I wanted to clarify your question a bit. I’m working on adding support for reverse primer sets based on your previous question. Does this new question pertain to those reversed sets or normal primer sets?
For normal primer sets the code on line 320 works fine; the `start` and `stop` positions you reference are precomputed by the `ItsPosition` class object, and that class handles the logic about whether the `start` variable and `stop` variable equal the `from` position from the HMM report. For reversed primer sets you do need a solution similar to the one you proposed: R1: `[start-1:]` and R2: `[tlen-stop:]`. I’m working on adding that, but it’s not as simple because it requires changing the interface of the `ItsPosition` class, which requires changing q2-itsxpress too. The good and bad of object-oriented programming I suppose: changing private methods is easy, breaking a public interface is harder.
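As a toy trace of those reversed-set slices (invented sequences; the reverse read is modeled as a plain reversal of the fragment, since complementation doesn’t change the index arithmetic, and this is not the real `ItsPosition` API):

```python
# Toy trace of the reversed-set slices quoted above.

merged = "SSUSSU" + "ITSITSITS" + "58S58S"   # SSU | ITS | 5.8S, invented
tlen = len(merged)                            # 21
start, stop = 7, 15                           # 1-based ITS coordinates

r1 = merged                                   # R1 with full read-through
r2 = merged[::-1]                             # R2 with full read-through

print(r1[start - 1:])    # keeps the ITS plus the 3' conserved tail
print(r2[tlen - stop:])  # drops R2's 5' conserved tail
```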
Thanks for your response @Adam_Rivers! Sorry for being unclear.
To rephrase the question:
How does itsxpress in `trim-pair-output-unmerged` mode handle cases where the forward or reverse read extends through the conserved sequence on the 3' end of the read? For example, if my forward read starts in 5.8S and reads through the LSU, will the LSU be trimmed in addition to the 5.8S?
I am using a dataset with reversed primers, but I don’t think this pertains to the specific primer set. The solution of inputting the reads in reverse order to the program seems to be working ok.
PS. Sorry I’m getting into the nitty gritty of your code here. I was going to try hacking it to use vsearch for merging instead of BBMerge and started looking at the indexing/math for the extraction step to see if that would require big changes…
No problem. Why are you trying to use vsearch?
Personal preference to some extent. I’m getting higher merge success (about 3-4% greater merge success) with vsearch (though I haven’t explored why), and I have a slight preference for vsearch just because it’s published/peer-reviewed, whereas (I think?) the BBtools suite still isn’t…
BBMerge is published now: https://doi.org/10.1371/journal.pone.0185056. It prioritizes accuracy and it’s fast, but it was designed with shotgun metagenomics in mind rather than amplicon sequencing, which means that sometimes the stringency is not needed since crosstalk is less of an issue, and for some low-quality datasets the merge rates can be lower than PEAR. I talked to the BBMerge developer, Brian Bushnell, about that, and he added a parameter called `maxmismatches` which can be increased to get the same merging percentages as PEAR on lower-quality data; setting it to 100 worked for me. I haven’t done a direct BBMerge-to-Vsearch comparison, but maybe I should, since it would decrease the number of dependencies in ITSxpress. I mostly went with BBMerge because of its speed and accuracy and the fact that I know and respect Brian. Vsearch is great too though.
I haven’t really worried about 3’ end trimming because any 3’ ends going past the selected ITS region and past the 5’ trimmed second read seem to be ignored by Dada2. If you have evidence they are not please let me know.
Re: staggered merge in DADA2, see L48 here and the discussion here. It seems like the default behavior in the DADA2 R implementation is to not trim staggered merges. One potential solution would be to set `trimOverhang = TRUE`. Not sure what the default behavior would be in the qiime2 implementation, but it doesn’t look like the option to trim overhang is exposed (`qiime dada2 denoise-paired --help`).
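For reference, here’s a toy sketch of what overhang trimming means for a staggered merge. This is an illustration of the idea only, not DADA2’s implementation; all names and sequences are made up:

```python
def trim_staggered(r1, r2rc, shift):
    """Toy overhang trim for a staggered merge (illustrative only).
    r2rc is the reverse-complemented R2, whose first base aligns
    `shift` bases *before* r1's first base (shift > 0 = staggered).
    The true fragment is the region covered by both reads, so both
    overhangs are cut, as trimOverhang=TRUE would do."""
    frag_len = len(r2rc) - shift      # bases of r2rc that overlap r1
    return r1[:frag_len]

# fragment "ACGTACGT" with 2 bp of read-through on each side
print(trim_staggered("ACGTACGTXY", "ZWACGTACGT", 2))   # 'ACGTACGT'
```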
EDIT: (cc’ @Adam_Rivers)
After digging into this a bit more, it seems like the `trimOverhang` option in DADA2 is not the best solution for staggered merges. According to the discussion I linked above and the DADA2 paper, leaving variable-length 3’ read-through into conserved regions (SSU, 5.8S, LSU) could affect the dereplication step during error profiling/denoising, which occurs before read merging. Therefore, sequences that have identical ITS regions may not dereplicate together during the denoising step, potentially affecting ASV assignment. I’m guessing this wouldn’t have a huge effect on the results, but it could depend on the dataset (e.g., differences in sequence quality that affect read-through length or consistency), which is not ideal.
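A tiny illustration of the dereplication point (made-up sequences): reads with identical ITS regions but different amounts of 3’ read-through count as separate unique sequences until the tails are trimmed.

```python
# Identical ITS, variable 3' read-through: dereplication sees 3 uniques
# before trimming and 1 after (sequences are invented for illustration).
from collections import Counter

its = "ITSITSITS"
untrimmed = [its + "58S", its + "58S58", its]   # variable 3' tails
trimmed = [r[:len(its)] for r in untrimmed]

print(len(Counter(untrimmed)))   # 3 unique sequences
print(len(Counter(trimmed)))     # 1 unique sequence
```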
I’m going to try setting `return record1[start-1:stop], record2[r2start:tlen-start]` in the itsxpress code and see where that gets me.
(I think I had the indexing a bit mixed up in my original question, i.e., it’s not necessary to account for merged fragment length with `return record1[start-1:tlen-stop]` when trimming the 3’ end of the fwd read; simply `return record1[start-1:stop]` should work?)
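Tracing the indices on a toy example (invented sequences; the reverse read is modeled as a plain reversal), the R1 slice checks out. One caveat: if `start` is 1-based, the R2 end index may need to be `tlen - (start - 1)` rather than `tlen - start`, though the real coordinate conventions in `ItsPosition` may differ:

```python
merged = "SSUSSU" + "ITSITSITS" + "58S58S"   # SSU | ITS | 5.8S, invented
tlen = len(merged)                            # 21
start, stop = 7, 15                           # 1-based ITS coordinates

r1 = merged
r2 = merged[::-1]

assert r1[start - 1:stop] == "ITSITSITS"
# Position p (1-based) on the fragment maps to index tlen - p on the
# reversed read, so the ITS region on R2 comes out as:
assert r2[tlen - stop:tlen - (start - 1)] == "ITSITSITS"[::-1]
```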
Hi @Adam_Rivers, I worked some samples through the dada2 pipeline in R after either using unmodified `itsxpress` to trim, or using a version of itsxpress modified to trim 3' ends as described above. I thought I would update here on some of the results in case it is useful for future potential itsxpress updates. Thanks again for this nice tool! I'll definitely use this as part of my future workflows.
It looks like there are small differences in the number of ASVs assigned based on the trimming strategy (272 vs. 276 ASVs for "standard" vs. 3' trimmed sequences, respectively, in 24 samples). There are also differences in the length of ASV sequences between the two methods. A few plots below.
1. Length of sequences after standard vs 3' trimming with itsxpress (1 sample as an example). These were input to dada2 pipeline.
The variation in sequence length between methods is in R2 (i.e., the read from the LSU primer in this case), whereas the majority of R1 reads were trimmed to about 90 bp by either method.
2. Number of unique sequences after dada2 dereplication, denoising (i.e., sequence variant assignment), merging for ASV assignment, and chimera removal across 24 samples.
At the derep and denoise steps there is variation in the number of unique sequences in the reverse read but not the forward. At the derep step the standard method generally has more unique sequences (likely due to read-through length variation), whereas at the denoise step the trend switches for some samples (due to inclusion of conserved sequence?) and is generally less pronounced. There is also a small amount of variation in the number of ASVs assigned between methods.
3. Length of unique ASV sequences. The standard method has a spike of ASVs at about 180 bp, whereas 3' trimming appears to redistribute those ASVs around ~150 bp with a much larger spread in length (note log10-scale y-axis).
To me this suggests variable amounts of the conserved region (5.8S in this case) are included in the final ASVs when 3' is not trimmed, which can affect downstream taxonomic assignment, or any additional sequence similarity clustering steps performed after dada2 ASV assignment. This would also impact reproducibility of dada2 ASVs between datasets (i.e., consistent variation in read-through length between studies or sequencing runs could result in different ASV sequences).