Summary generated after importing data

I have read a previous post by alexximalaya regarding a similar issue, but my data seems to have more variable read lengths than alexximalaya's.

So, I have this message for my forward reads:
The plot at position 143 was generated using a random sampling of 9999 out of 16595545 sequences without replacement. This position (143) is greater than the minimum sequence length observed during subsampling (142 bases). As a result, the plot at this position is not based on data from all of the sequences, so it should be interpreted with caution when compared to plots for other positions. Outlier quality scores are not shown in box plots for clarity.

And this for my reverse reads:
The plot at position 66 was generated using a random sampling of 9996 out of 16595545 sequences without replacement. This position (66) is greater than the minimum sequence length observed during subsampling (53 bases). As a result, the plot at this position is not based on data from all of the sequences, so it should be interpreted with caution when compared to plots for other positions. Outlier quality scores are not shown in box plots for clarity.

Basically, half of the positions in the forward reads are in red, while 2/3 of the positions in the reverse reads are in red. As I understand it, this message means that my sequences are not all the same length. So I am not sure how to choose the point at which to trim my sequences, and will the results still be reliable? Any advice from you guys?

Rather, it means that one or more forward reads are < 50% of full length, and one or more reverse reads are < 33% of full length, if I understand your description.

Reads should not be variable length unless they have been pre-trimmed/filtered. Some sequencing companies perform such trimming automatically, but it may still be possible to get the raw, full-length reads from them. I’d suggest getting in touch with them to get the raw reads if you can. It will take out a lot of the guesswork.

Do the trimming as you normally would. You will lose these shorter reads, but if they were trimmed so short they may be bad reads to begin with. Pay close attention to the denoising summaries to see how many reads are input and how many are lost at each step of denoising. Check out feature-table summarize on the output feature table to make sure you have enough sequences per sample. As long as you have enough reads output, results will still be reliable, since sequence errors (and hence reads filtered out) will be randomly distributed across reads.
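To get a feel for how many reads a given truncation point would cost you, here is a minimal Python sketch. It assumes that any read shorter than the truncation length is discarded, which is how I understand truncation to behave during denoising. The function name `fraction_retained` and the read lengths are made up for illustration; they are not taken from your data or from any QIIME 2 API.

```python
def fraction_retained(read_lengths, trunc_len):
    """Return the fraction of reads that are at least trunc_len bases long.

    Assumption: reads shorter than trunc_len are dropped entirely,
    and all other reads are kept (and cut down to trunc_len).
    """
    if not read_lengths:
        raise ValueError("read_lengths is empty")
    kept = sum(1 for length in read_lengths if length >= trunc_len)
    return kept / len(read_lengths)


# Hypothetical read lengths: a few pre-trimmed short reads, the rest full-length.
lengths = [53, 66, 120, 142, 143, 150, 150, 150, 150, 150]

for trunc_len in (100, 140, 150):
    share = fraction_retained(lengths, trunc_len)
    print(f"truncate at {trunc_len}: {share:.0%} of reads retained")
```

If only a small share of reads falls below your chosen truncation point, losing them should not matter much; if a large share does, that is another reason to ask for the raw reads.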

(This might contradict advice that I or others have given elsewhere, but this is my current thinking after dealing with a number of these issues. It’s worth trying, and we can try plan B if that does not work.)

I hope that helps!
