Hi guys! I've recently switched to a newer sequencing company offering a good price and good depth for amplicon sequencing on the Illumina pe2500. I'm starting a new batch of experiments and have been struggling with the choice of region to sequence. I opted to try v3v4 (338f - 806r): the extra length over v4 would be great if 2x250 is sufficient for most reads to merge. That was my biggest concern, but the company assured me it would be fine and that it's a standard region to use with the pe2500. I've just received the results today, and the quality looks surprisingly good, actually unbelievably good as far as I understand, which is why I'm here asking. Does this look normal?
This is the demux-summarize result of the fastq files with adapters trimmed. It looks like there's no need for hard trimming at all? In my previous experience with Illumina data, the quality plots usually look more like this:
Am I just outdated and worrying too much or does it look reasonable? Any suggestions are appreciated, thanks!
That first picture you posted looks artifactual (not real data). Could you also post the demultiplexed amplicons? I wonder if there is a problem with the quality scores, which should be stored within the reads.
Could you also let us know the commands you use to import the files?
You can try using the “cat” or “head” command to compare the headers of a fastq file from each company (these would be two independent files, not the same sample run).
Another way to troubleshoot may be to check whether the Illumina PE format has changed; maybe there’s been a change in where the quality score is reported.
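If “cat” or “head” is awkward on gzipped files, here is a rough Python sketch of the same header comparison. It assumes plain 4-line FASTQ records (no wrapped sequences); the function names are mine, not from any QIIME 2 tool.

```python
import gzip
from itertools import islice


def open_fastq(path):
    """Open a FASTQ file as text, handling .gz files transparently."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def first_records(handle, n=2):
    """Return the first n records as (header, sequence, plus, quality) tuples.

    Assumes every record is exactly 4 lines long.
    """
    records = []
    while len(records) < n:
        chunk = [line.rstrip("\r\n") for line in islice(handle, 4)]
        if len(chunk) < 4:  # end of file or truncated record
            break
        records.append(tuple(chunk))
    return records
```

Calling first_records(open_fastq(path)) on one file from each company and printing the tuples side by side should show whether the header layout or the quality line differs between the two runs.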
Hi @ben! Thanks for the quick response. That's what I wondered at first, but on a closer look it seems even weirder. The files are too large to attach, so I'll attach some screenshots instead. The data in question:
My old data from another company, which I consider normal:
I thought maybe Illumina had a technical breakthrough and solved the decreasing quality problem, but from your response maybe not? Lol, this looks very weird. I've now tried assigning taxonomy with the Greengenes database, and the results came out fine, with low unassigned levels and the features pretty much where I expected them to be, which makes this issue even weirder. If they doctored the results artificially, how did they overcome the decreasing quality issue? I didn't do any hard trimming, and the reads have high matches against the 99% gg database. This is so weird. Any suggestions?
It is weird. I'm not sure why there’s a long chain of “FFFFFF…” where your original files have “6BCC…”.
I wouldn’t allege that there’s anything nefarious, artificial, or doctored. But I do think that some of the quality information is missing, and we need to take a deeper dive into the structure of these files.
If you look into these files, the quality data is on the line following the “+” sign. Thus, the long chain of "FF"s is the quality score.
edit: I would probably just contact the sequencing core that you’re using and ask them about the quality scores, showing them the fastq records. While there is a chance that those quality scores are real, the lack of variation makes me suspicious (as it did you). I wonder if these were changed during zipping/transferring to make the files easier to transfer (the less variation, the better they compress).
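To put numbers on the lack of variation, here is a small sketch that decodes the quality line (the one after the “+”) and tallies how many distinct scores show up. It assumes the usual Phred+33 encoding; the function names are just illustrative.

```python
from collections import Counter


def phred33_scores(quality_line):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - 33 for ch in quality_line]


def quality_summary(quality_lines):
    """Count how often each Phred score occurs across several quality strings."""
    counts = Counter()
    for line in quality_lines:
        counts.update(phred33_scores(line.strip()))
    return dict(sorted(counts.items()))


print(phred33_scores("FFFF"))  # [37, 37, 37, 37]
print(phred33_scores("6BCC"))  # [21, 33, 34, 34]
```

Under that assumption, a file where every position is “F” has a single quality value (Q37) throughout, while the older “6BCC…” data mixes several values. A summary with only one or two distinct scores across a whole file would be worth showing to the sequencing core.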
I think the fastq data you showed may not be raw fastq sequence data, because the sequences generated by Illumina in PE250 mode are usually 251 bp long in each read. The forward reads you showed are obviously not all the same length, so the data may have been cleaned before being handed to you. Also, the Illumina HiSeq 2500 cannot produce sequences of such wonderful quality; even the newest platform, the NovaSeq 6000, cannot. So I suggest you contact your service company to figure out what happened.
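A quick way to check the read-length point is to tally sequence lengths across the file. This is only a sketch, assuming 4-line FASTQ records and a text-mode file handle (for example from gzip.open(path, "rt")); the function name is made up for illustration.

```python
from collections import Counter
from itertools import islice


def read_length_counts(handle):
    """Tally sequence lengths from an open FASTQ text handle (4 lines per record)."""
    counts = Counter()
    # The sequence is the 2nd line of every 4-line record.
    for seq in islice(handle, 1, None, 4):
        counts[len(seq.rstrip("\r\n"))] += 1
    return dict(sorted(counts.items()))
```

If the reads were untouched, nearly all of them should fall on a single length; a spread of lengths would suggest trimming or other cleaning happened before delivery.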