Strange demultiplexed sequence length summary

iuliachiciudean · June 29, 2020, 5:06pm

Hi,
I am new to QIIME 2. I believe I managed to import my demultiplexed paired-end data using:
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path /home/iulia/Ips-larvae-B/Ips.larvae.B \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path demux-paired-end-VB.qza
I then proceeded to join the reads (I thought this would be the next step) using:
qiime vsearch join-pairs \
  --i-demultiplexed-seqs demux-paired-end-VB.qza \
  --o-joined-sequences demux-joined-paired-end-VB.qza
Now I am getting a very strange demultiplexed sequence length summary, all in red:

What would you advise me to do?
Thank you.
Iulia

Mehrbod_Estaki · June 29, 2020, 6:48pm

Hi @iuliachiciudean,

This really depends on what you are planning to do with your data. If you are planning to use DADA2 to denoise your sequences, then you shouldn't merge your reads here; DADA2 will do that itself.
If you are planning to use Deblur to denoise your reads, or OTU clustering, then you are right to merge your reads first, but even then it is recommended that you run your sequences through some sort of quality filtering step first. See, for example, this example in the Moving Pictures tutorial.

As for what the red text means, check out this detailed explanation of the quality plots and let us know if you have any remaining questions. Thanks and happy :qiime2:'in.

iuliachiciudean · June 30, 2020, 3:05pm

Hi @Mehrbod_Estaki,
Got it! I now understand the reasoning for not merging the reads (I am planning to use DADA2). So I went on and generated a summary of the demultiplexing results.
demux-paired-end-VB.qzv (270.6 KB)
But I cannot seem to understand how to use the quality plot to choose the parameters for the `dada2 denoise` method.
Based on my plot, should I choose
--p-trunc-len-f 203
and
--p-trunc-len-r 227?
What about trimming? How do I choose the trimming values?
I read some forum topics on this, but I am still unsure about it.
Thank you.
Iulia

iuliachiciudean · June 30, 2020, 3:29pm

Oh, and two other questions:
1) Does the message "This position (228) is greater than the minimum sequence length observed during subsampling (2 bases)." mean that I have sequences that are only 2 bases long?

2) In the “Moving Pictures” tutorial, why is --p-trunc-len set to 120, when the median drops below 20 at position 124?

Sorry for all these questions, but I am trying to understand the logic of it.
Thx

Mehrbod_Estaki · July 1, 2020, 10:03am

Hi @iuliachiciudean,
Your demultiplexed summary shows that you have only a single sample. I don't think this is correct, right? You may need to revisit your previous steps to sort this out before moving forward.

You just need to do a bit more reading on the forum. This is probably one of our most frequently asked and answered questions. One example here, and another here. One note: the minimum overlap required for merging has been changed from 20 to 12 nt in DADA2.

In my first response above I shared a link that answers this in great detail, please have another look there.
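To make the warning concrete: it appears when the position you are looking at lies beyond the shortest read seen in the subsample, so a minimum of 2 bases does mean at least one subsampled read was only 2 bases long. Below is a minimal Python sketch, not part of QIIME 2, for checking this on a FASTQ file directly. It assumes a plain or gzipped FASTQ with exactly four lines per record; the function name `min_read_length` and the demo records are hypothetical and only for illustration.

```python
import gzip
import os
import tempfile


def min_read_length(fastq_path):
    """Return the length of the shortest read in a 4-line-per-record FASTQ file."""
    # Assumption: records are exactly four lines (header, sequence, '+', quality).
    opener = gzip.open if fastq_path.endswith(".gz") else open
    shortest = None
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # the sequence line of each record
                length = len(line.strip())
                if shortest is None or length < shortest:
                    shortest = length
    return shortest


# Tiny made-up FASTQ for illustration only (not data from this thread).
demo_records = "@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nAC\n+\nII\n"
with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
    tmp.write(demo_records)
    demo_path = tmp.name

shortest_demo = min_read_length(demo_path)
os.remove(demo_path)
print(shortest_demo)
```

A handful of very short reads is usually harmless, since reads shorter than the truncation length are discarded during denoising anyway.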

The difference between 124 and 120 is probably negligible, so I imagine the tutorial just uses 120 for simplicity. Much like the pirate code, the "median 20 cutoff point" is more of a guideline than a rule, so you can adjust it to suit your data.
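As a rough illustration of that guideline, here is a small Python sketch that finds the first position where the median quality drops below a threshold. It assumes you have already copied the per-position medians out of the quality plot into a list; the function name `first_position_below` and the values in `demo_medians` are made up for this example and are not taken from any real run.

```python
def first_position_below(median_quality, threshold=20):
    """Return the 1-based position where the median first drops below threshold.

    Returns None if the median never drops below the threshold.
    """
    for position, quality in enumerate(median_quality, start=1):
        if quality < threshold:
            return position
    return None


# Made-up per-position medians, for illustration only.
demo_medians = [34, 34, 33, 30, 27, 22, 19, 18]
suggested = first_position_below(demo_medians)
print(suggested)
```

The returned position is only a starting point for choosing a truncation length; rounding it down a little, as the tutorial seems to do, is a judgment call.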

But you should certainly try and figure out why you only have 1 sample first.

iuliachiciudean · July 3, 2020, 12:16pm

Hi @Mehrbod_Estaki,

Having one sample is not an error. I chose to work with only 1 until I can figure out QIIME.
Thank you very much for the examples you gave me. They were very useful.

So, my understanding goes like this:

First, trim if your reads contain primers or a bad 5' end.
Based on the quality scores, decide how far you can go with truncation.
Then go as low (in terms of read length) as you can with the truncation value, such that your reads still overlap (by at least 12 nt). This is because sequences shorter than your truncation value get discarded.
In short: truncate as low as you can, but keep the overlap.

Is this correct?
Thank you,
Iulia

Mehrbod_Estaki · July 4, 2020, 4:38am

Hi @iuliachiciudean,

I would recommend using a few more samples for a few reasons:

The DADA2 error model won't perform very well if you have too few sequences.
Most downstream diversity analyses you will want to perform require more than 1 sample, so you will basically get as far as DADA2 before you run into errors and cannot do your analyses.

As for the other items you listed, you are right on!
Happy :qiime2:in
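
The overlap part of that rule of thumb can be sketched as simple arithmetic. This is a minimal Python sketch, not QIIME 2 or DADA2 code: it assumes the forward and reverse reads together span an amplicon of known length, and the function names are hypothetical. The values 203 and 227 are the truncation lengths asked about earlier in the thread; the amplicon length of 400 is a placeholder, since the thread never states the real one.

```python
MIN_OVERLAP = 12  # minimum merge overlap in nt, as mentioned earlier in this thread


def expected_overlap(trunc_len_f, trunc_len_r, amplicon_length):
    """Number of bases shared by the truncated forward and reverse reads."""
    return trunc_len_f + trunc_len_r - amplicon_length


def reads_can_merge(trunc_len_f, trunc_len_r, amplicon_length, min_overlap=MIN_OVERLAP):
    """True if the truncated reads still overlap by at least min_overlap bases."""
    return expected_overlap(trunc_len_f, trunc_len_r, amplicon_length) >= min_overlap


# 400 is a hypothetical amplicon length, used only to make the example concrete.
overlap = expected_overlap(203, 227, 400)
print(overlap, reads_can_merge(203, 227, 400))
```

Because real amplicon lengths vary between organisms, it is probably wise to leave some margin above the minimum rather than truncating to exactly 12 nt of overlap.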
