Denoising of Paired End Reads

rahul · January 16, 2020, 11:38am

Hello!

I am am stuck with a problem. I am using dada2 for denoising of paired end Illumina reads of read length 150 bases each (both R1 and R2). The quality of the reads are good and consistent throughout the read length.

My first issue is as below:
When i am importing the reads all of the reads are getting imported. But when i am trying to vizualize the data post conversion, the quality column remains blank. However when i export the csv file the files have some quality scores.

My Second Issue is mentioned below;
I am using dada2 for denoising the reads I am only getting 90 reads post denoising. I am not able to understand why this is happening. It is very urgent for me to resolve.
I hereby paste the command that i used for denoising the reads. Please take a look
[The reads have Phred 33 encoding:]

qiime dada2 denoise-paired --i-demultiplexed-seqs paired-end-demux.qza --p-trim-left-f 1 --p-trim-left-r 1 --p-trunc-len-f 150 --p-trunc-len-r 150 --o-table table.qza --o-representative-sequences rep-seqs.qza --o-denoising-stats denoising-stats.qza --p-n-threads 14

Please help me in sorting this issue as soon as possible. I thank you in advance.

colinbrislawn · January 16, 2020, 3:23pm

Hello Rahul,

Would you be willing the post the quality score plots? These plots can help you set you --p-trim-left-r settings so you can keep more of your reads, and maybe we can offer a recommendation.

What region did you sequence? How long do you expect the overlap between the pairs reads to be?

Colin

rahul · January 17, 2020, 6:43am

Dear Collin,

I thank you for your reply to my problem.

I hereby attach the screenshots of the quality score plot for your reference.

Since it is Illumina data the quality scores seems to be good enough with an average at 37.

For the first part of the query: Not being able to view the quality scores

The quality scores are getting loaded once I zoom or click onto the plot. Post this, the quality scores are being shown in the browser. However if I download the csv file it has all the quality values. Maybe it is taking some time to load the quality scores is what i am guessing.

To your question

"What region did you sequence?":

We have amplified and sequenced the V3 and V4 regions for the analysis.

How long do you expect the overlap between the pairs reads to be?
There will be no overlap as we have amplied the entire V3 and V4 and sequenced it. The product would be of around 450 bases

Below are the screenshots of the quality scores of the sequences:

Mehrbod_Estaki · January 17, 2020, 7:54am

Hi @rahul,
With apologies to @colinbrislawn, I'm just jumping in here since you marked this as an urgent issue.
The V3V4 region as you indicated is ~450 and requires ~~at least 150 bp overlap~~ a 2x300 run for it to successfully merge. Since you have only sequenced 2x150 bp length reads there is no overlap so dada2 (or any other tool) will fail to merge them. You can still use this data but you'll want to discard your reverse or forward reads and use one of them only.
Your quality plots also look a bit odd to me, have these been pre-processed in any way prior to importing into Qiime2? You'll ideally want to import raw FASTQ files without any prior processing. Try rerunning just your forward reads and see how that changes the output of DADA2.

rahul · January 17, 2020, 8:48am

Hello @Mehrbod_Estaki,

I thank you for your reply and explaining why the tool dada2 is not working properly.
As far as pre-processing is concerned, I have not done any pre-processing. These are raw reads itself.

Thank You once again.

Thank You for your time and effort @colinbrislawn. Thank you once again.

rahul · January 17, 2020, 10:29am

Dear @colinbrislawn,

Please share your method of how to use the

> --p-trim-left-r and --p-trunc-len-f
parameters to obtain highest number of reads after importing and denoising for Paired-end Illumina reads.

I tried with an Illumina Mi-Seq data for denoising, which gave different outcomes varying from as low as 22 reads for 85K reads as rep-seqs to 500 reads for the same data as rep-seqs to which I believe should be much more as I merged the reads using some other tool and found a R1-R2 joining rate as high as 78% which amounts to around 67K reads

Please tell me what and how can i use this denoising to get representative sequences set as high as possible. The SRX id for the sequences choosen is SRX7571462.

Thanking you in advance.

Best Regards,
Rahul Yadav

colinbrislawn · January 17, 2020, 2:52pm

Good morning Rahul,

I'm glad you are making progress. Thanks for jumping in, @Mehrbod_Estaki

I wanted to check on this:

This tells me that those other tools are wrong; your reads do not overlap and so should not be merged at all!

Well, the goal is to set these four settings (trim and trunc, f and r) so dada 2 can join the maximum number of reads... but that doesn't make sense here; your reads should be joining at all!

One option to consider is just processing your forward reads using dada2 denoise-single
https://docs.qiime2.org/2019.10/plugins/available/dada2/denoise-single/

This does not need reads to join at all and user your high-quality forwards reads.

Let us know what you think,
Colin

rahul · February 4, 2020, 6:08am

Good Morning Colin,

I got my sample resequenced and I have been trying to analyse the my sample. I performed the denoising using dada2 and found that now my reads have a 95% of overlapping sequnces. The sequences were merged and I got the representative sequences, stats and table as output.

I checked the read length distribution of the sequences and found out that there is a huge number below the length of 200. I want to filter the reads below 200 and take the reads that are suppose of length 200, 230, 240, 260, and 280. In which way should I set the denosing parameters in order to get that kind of read length.

Kindly guide me through this.

I thank you in advance for your help and suggestions.

Thanking You.

Best Regards,
Rahul Yadav

rahul · February 4, 2020, 12:28pm

Dear Colin,
I would like you to elaborate on how can i generate fixed length sequences after denoising for the paired-end reads.

For example: I need the reads of 200bp length, 220bp, 240bp and so on.

Kindly guide me through.

I thank you in advance.

Best Regards,
Rahul Yadav

colinbrislawn · February 4, 2020, 6:41pm

Hello again Rahul,

Are you still sequencing the V3-V4 region? How long are your Illumina reads?

I ask because it let's us calculate how long we expect our reads to be.

We also need to find out why your reads are of varying length. Are you sequencing a varying length region?

I'm a little surprised by your results, and want to make sure I understand them before continuing.

Colin

rahul · February 5, 2020, 5:16am

Dear Colin,

I am thankful for your reply. I hereby answer all your querries below.

Q #. Are you still sequencing the V3-V4 region? How long are your Illumina reads?

ANS:- Yes we have amplified and sequenced the V3-V4 region only. And the product length is around 450 bases itself. But what we did was, we fragmented the library into 250 bases fragments. And we sequenced it for 150 bases so that we get the overlap of around 50 bases. So technically reads of V3-V4 region from same species, without adapters and primer should cover upto 300 bases

Q #. Are you sequencing a varying length region
ANS:- It might be of varying length.

NOTE: I used the parameters as below for the denoising:
qiime dada2 denoise-paired --i-demultiplexed-seqs demux.qza --p-trim-left-f 0 --p-trim-left-r 0 --p-trunc-len-f 150 --p-trunc-len-r 150 --o-table table.qza --o-representative-sequences rep-seqs.qza --o-denoising-stats denoising-stats.qza

Please help me understand how can we fix the denoise representative sequence length to fixed.

I thank you in advance.

Thaning You.

Best Regards,
Rahul Yadav

colinbrislawn · February 5, 2020, 5:07pm

Hello Rahul,

This is a very interesting approach! It's like you have shotgun sequenced your amplicon!

I've never seen data like this, and I'm not sure how well it would work with the normal Qiime 2 pipelines. Keep in mind that most pipelines assume every read is from the same region, so variable length covering different regions might be an issue.

Let's check with @benjjneb and see if dada2 supports 'shotgun-amplicons'

benjjneb · February 5, 2020, 5:53pm

No, not really. DADA2 is OK with variable lengths in the reads (although sensitivity is improved when length is the same throughout) but it is making a rather strong assumption that the reads all begin in the same place. If I understand correctly, that assumption is being violated here as the 250bp fragments are randomly located within the V3V4 region.

I would not recommend DADA2 on this type of data, or any ASV method probably. My recommendation would be to merge and then perform taxonomic classification directly on the merged reads.

rahul · February 7, 2020, 5:21am

Dear Colin,

I thank you for your reply. I think if we are fragmenting the library in such a way that the V3 and V4 regions with some overlapping. So i think if we merge the R1 and R2 reads, we can still analyze the data.
Please revert back with thoughts.

Thanking You.

Best Regards,
Rahul Yadav

rahul · February 7, 2020, 5:39am

Dear Ben,

I am extremely thankful for your suggestions. Actually, I was also thinking of the same. Previously I did for the same for my data but since the merge rate of the forward and reverse reads were very less, I couldn't analyse it.
This time I have merged and analysed the data but I am not very sure of how reliable the analysis is.

Please share some of your insights on this.

I had a 450 bp amplicon, which was fragmented into 250 bp fragments and then sequenced.

I merged the R1 and R2 reads, which i then analysed using qiime2 pipeline.

My Question:

I just want to how much I can rely on this protocol.

I am getting pretty good number of OTUs. But my doubt is, is it possible to get false positives. And by false positives I mean mixed sequences getting annotated against the green genes database.

Please share your views.

Thanking You.

Best Regards,
Rahul Yadav

rahul · February 7, 2020, 8:31am

Dear Colin,

I would like to take the count of sequences participating in the formation of a cluster. Please suggest me the possible ways that qiime uses to keep track of the read counts of the reads that take part in a cluster formation.

Also I would like to filter out the clusters with a count of less than 10. Please explain how can i filter the
clusters with repseq count more than a certain thresh-hold.

Thanking You in advance

Best Regards,
Rahul Yadav

colinbrislawn · February 7, 2020, 3:11pm

Good morning Rahul,

Once you have made a frequency table, the count of each cluster is the number of sequences inside of it!

Use this plugin, but be careful! Filtering using an arbitrary threshold is hard to defend (why use 10, and not 5 or 20?)
https://docs.qiime2.org/2019.10/plugins/available/feature-table/filter-features/

Colin