Visualizing Paired-end sequences

abhishake · September 12, 2017, 4:50pm

Hello,
I am having difficulty visualising my paired-end data. I think my sequences were imported properly. I used the following command:

qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path man_single.txt \
  --output-path OC01.qza \
  --source-format PairedEndFastqManifestPhred33

Then I ran demux summarize:

qiime demux summarize --i-data OC01.qza --o-visualization visualisation/OC01.qzv

Finally to visualize the sequences:

qiime tools view visualisation/OC01.qzv

This is where my problem occurs: in both the forward and the reverse read panels, a message like this appears just under the graph:

The plot at position 181 was generated using a random sampling of 9025 out of 1375565 sequences without replacement. This position (181) is greater than the minimum sequence length observed during subsampling (40 bases). As a result, the plot at this position is not based on data from all of the sequences, so it should be interpreted with caution when compared to plots for other positions. Outlier quality scores are not shown in box plots for clarity.

What I cannot understand is why the minimum sequence length observed during subsampling is 40 here.
My sequences are 300 bp long, and this message appears for both the forward and the reverse reads.
Nothing like this appears in either the Moving Pictures or the FMT tutorial.

Thank you,
Abhishake.

jairideout · September 12, 2017, 6:04pm

Hi @abhishake! When demux summarize randomly subsamples sequences to plot, it looks like it’s finding at least one forward or reverse sequence in your data that is 40bp in length (demux summarize doesn’t do any sequence truncation itself). Are you able to share your OC01.qza file with me (e.g. hosting it on Dropbox, Google Drive, etc.)? You can share the link via a direct message to me.
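To illustrate why the message flags position 181, here is a minimal Python sketch of the idea described in the plot caption. It is not the q2-demux implementation; the function name `subsample_coverage`, its parameters, and the toy read lengths are all assumptions made for illustration. It subsamples read lengths without replacement, finds the shortest read in the subsample, and counts how many subsampled reads still have a base at each position.

```python
import random


def subsample_coverage(lengths, n_subsample, seed=0):
    """Subsample read lengths without replacement and report coverage.

    Returns (min_length, coverage), where coverage[i] is the number of
    subsampled reads that have a base at 1-based position i + 1.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    sample = rng.sample(lengths, min(n_subsample, len(lengths)))
    min_length = min(sample)
    max_length = max(sample)
    coverage = [
        sum(1 for n in sample if n >= pos)
        for pos in range(1, max_length + 1)
    ]
    return min_length, coverage


# Toy data: 99 full-length reads and a single 40-base read.
toy_lengths = [300] * 99 + [40]
min_length, coverage = subsample_coverage(toy_lengths, n_subsample=100)

# Every position past min_length is backed by fewer reads than the
# positions before it, which is what the caption warns about.
print(min_length, coverage[min_length - 1], coverage[min_length])
```

In this toy case a single short read is enough to lower the observed minimum length, so every later position would carry the same kind of caution note.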

If that isn’t possible, I recommend using qiime tools export on your .qza and inspecting each of the exported per-sample FASTQ files to see if you can find any 40bp sequences. Thanks!
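One way to do that inspection is sketched below in Python. It assumes standard four-line FASTQ records and that the exported files end in `.fastq.gz`; the directory name `exported` and the helper names are hypothetical, not part of QIIME 2.

```python
import gzip
from collections import Counter
from pathlib import Path


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_length_counts(path):
    """Return a Counter mapping read length -> number of reads.

    Assumes four-line FASTQ records (sequences are not line-wrapped).
    """
    counts = Counter()
    with open_fastq(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each record is the sequence
                counts[len(line.strip())] += 1
    return counts


def summarize_directory(directory, pattern="*.fastq.gz"):
    """Print the shortest and longest read length for each matching file."""
    for fastq in sorted(Path(directory).glob(pattern)):
        counts = read_length_counts(fastq)
        if counts:
            total = sum(counts.values())
            print(f"{fastq.name}: min={min(counts)} max={max(counts)} reads={total}")


if __name__ == "__main__":
    # "exported" is a placeholder for wherever the export step wrote its files.
    summarize_directory("exported")
```

Any file whose reported minimum is far below the expected read length is worth a closer look, for example by printing the full `Counter` for that file.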

abhishake · September 13, 2017, 4:45pm

The link for the file is here:
https://drive.google.com/drive/folders/0B3HWOGmNhUBsbG9OdXZiSWNDM1E?usp=sharing

Following your suggestion, I looked at the length distribution of the sequences and found that there are indeed many sequences as short as 40bp.

Previously I would use join_paired_ends.py with --min_overlap 100, which would filter out these sequences. What is the right way to proceed now? My primary goal is to denoise my sequences using DADA2 or Deblur.
Or is there a way to import already-merged paired-end sequences?

jairideout · September 14, 2017, 12:06am

Thanks for the file and debugging this on your end! Since these are FASTQ files, I'd expect the sequences to all be roughly the same length (i.e. as produced by Illumina sequencers). How was the data generated, and do you know why the sequences have such large differences in length? I ask because this might point to a bigger problem with your data and I don't want to lead you down the wrong path!

You can use qiime dada2 denoise-paired to denoise and join your paired-end data. My understanding is that Deblur will only operate on the forward reads if you pass it paired-end data. Either way, it'd be good to sort out the sequence length issue I mentioned above before proceeding too far.

You can import sequences that have already been merged by importing them as single-end reads. However, DADA2 may not perform well with data that's already been joined by another program, so you'll want to keep your reads unjoined in that case. Deblur works fine with sequences that have already been joined.

Another option is to use QIIME 1 to do your read-joining and OTU picking, and then import the resulting feature table (.biom file) and representative sequences to continue your downstream analyses in QIIME 2.
