Hi.
I finished processing my data and was looking to cluster the OTUs by distance instead of glomming them by species. I noticed my representative sequences were different lengths, is that normal? And if so, why?
Best,
Anna
Hey @akknight216,
What method did you use for generating the sequence variants? To my knowledge, dada2 should have given identical-length SVs; however, I am much less familiar with deblur, which is the other option at this time.
I used dada2. The sequence lengths range from 250 to 469. It was V3-V4 sequencing with 300 bp paired-end reads.
Ok, your paired-end data would be the reason you’re seeing different lengths. Both directions are overlapped to create a joined read that becomes your new SV and the paired reads may not overlap in precisely the same position each time.
To add to what @ebolyen said: you are seeing different lengths because of the paired-end joining. While reads are trimmed to a specific length before joining, there is no guarantee that the overlapping region will span exactly the same number of base pairs in every instance, hence the variable length of the output sequences.
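The length arithmetic behind this can be sketched in a few lines of Python. This is only an illustration of how overlap affects joined length, not how dada2 merges reads internally; the function name `merged_length` and the example read lengths and overlaps are assumptions chosen for the example.

```python
def merged_length(fwd_len: int, rev_len: int, overlap: int) -> int:
    """Length of a joined read when the two reads share `overlap` bases."""
    if overlap < 0 or overlap > min(fwd_len, rev_len):
        raise ValueError("overlap must be between 0 and the shorter read length")
    # Overlapping bases are counted once in the joined read.
    return fwd_len + rev_len - overlap


# Hypothetical example: reads of identical length that overlap by different
# amounts produce joined reads of different lengths.
for overlap in (131, 150, 180):
    print(overlap, merged_length(300, 300, overlap))
```

Under these assumed numbers, two untrimmed 300 bp reads would need to overlap by 131 bases to give a 469 bp joined read; any trimming before joining changes that figure.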
I'm curious as to how you are seeing greater than 300bp reads in a 300bp region. From the DADA2 documentation:
large length differences in the 16S region, especially after merging, are typically caused by alternate priming from non-specific primers. If this is not controlled for, this can produce false variation in the output.
In your case it would likely be worth investigating the validity of the longer sequences, if you haven't done so already. Although the longer sequences may lead to false conclusions, they may also contain useful information, so it may not make sense to disregard them outright.
Thanks @John_Chase for noticing the range. I had misread it as 269 originally and thought it was a ~19 base discrepancy!
In the dada2 documentation, we advise people who want to avoid off-target priming to “cut a band in silico” by selecting amplicons within the expected length range, e.g. between lengths 245 and 260 for the common V4 primers. This is done with basic R commands there. Is there a simple way to do the same thing using a QIIME2 command?
Also, to follow up on @John_Chase's point: while we usually remove amplicons with off-target lengths, we have also sometimes seen interesting biology there. In one case we picked up a number of eukaryotic microbes with amplicons longer than the expected bacterial lengths, one of which was the causative organism for a disease relevant to our health condition of interest!
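For illustration, the "cut a band in silico" idea can be sketched in plain Python. This is not a QIIME2 command and not the R code from the dada2 documentation; the function name `cut_band`, the toy sequences, and the IDs are all hypothetical. The 245 to 260 window is the V4 example from above and would need adjusting for other primers. Off-target sequences are set aside rather than discarded, since they may still be worth a look.

```python
def cut_band(seqs, min_len, max_len):
    """Split sequences into those inside and outside [min_len, max_len]."""
    kept, set_aside = {}, {}
    for seq_id, seq in seqs.items():
        if min_len <= len(seq) <= max_len:
            kept[seq_id] = seq
        else:
            # Off-target length: keep separately for later inspection.
            set_aside[seq_id] = seq
    return kept, set_aside


# Toy representative sequences (hypothetical IDs and made-up content).
toy_seqs = {
    "sv1": "A" * 253,
    "sv2": "C" * 252,
    "sv3": "G" * 310,
}

kept, set_aside = cut_band(toy_seqs, 245, 260)
print(sorted(kept))       # IDs within the expected length range
print(sorted(set_aside))  # IDs with off-target lengths
```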
I don't think there is, but it seems like a method which would make sense on q2-quality-filter. @wasade, what do you think about having a method on there for basic length-based operations?