The value of mean length after dada2 is too different from the expected value

Hello everyone,

I'm a beginner of qiime2, and I'm practicing processing a batch of bacteria data. But after dada2, there are some problems. First of all, I import the data optimized and spliced by sequencing company as single ended data.

qiime tools import --type "SampleData[SequencesWithQuality]" --input-format SingleEndFastqManifestPhred33V2 --input-path manifest.txt --output-path demux_seqs.qza

demux_seqs.qzv (290.0 KB)

Due to the optimized data and the high quality of the sequence displayed in the quality interaction diagram, I ran the following code to do dada2 without cutting.

time qiime dada2 denoise-single --i-demultiplexed-seqs demux_seqs.qza --p-trunc-len 0 --o-table table.qza --o-representative-sequences rep_set.qza --o-denoising-stats stats.qza --p-n-threads 0
table.qzv (457.6 KB) rep-seqs.qzv (334.6 KB) stats.qzv (1.2 MB)

The above is the visualization file I got after running. But the mean length is 184, and most of the sequences are less than 200. This is far from the effective length reported by sequencing company, which is about 440.

Question:

  1. Why does this happen?

  2. Is it because there is something wrong with the parameter setting when I use dada2? What is the appropriate parameter?

  3. If I ignore this problem and continue with the follow-up analysis, is the analysis result useful?

I look forward to your answer and sincere thanks.

Hello @LiyingXie,

If you are using paired end Illumina reads, I'm willing to bet that 'optimized and spliced' means the reads were joined using a program like vsearch or PEAR.

While this is ok... DADA 2 works even better with the raw, unmerged reads! DADA2 trains error profiles and performes error correction before merging, so using reads directly from the MiSeq is best.

If your data was not sequenced on the Illumina platform or you can't get the raw data, please let me know


Hopefully this will fix the reads becoming much shorter after denoising with DADA2!

Colin

Hi Colin,

Thank you very much for your reply. I think I see what you mean. Dada2 is not suitable for denoising the trimmed sequence, and my question is here. Is my understanding correct?

This batch of data is sequenced on Illumina platform, and I can also obtain rawdata.

In fact, when I got the data, I first imported rawdata as a two terminal sequence. My data is bacterial data, and the primer is 338f_ 806R. I run the following code to import the data and get the demux.qzv file.

time qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-path manifest.txt --output-path demux.qza --input-format PairedEndFastqManifestPhred33V2

demux.qzv (309.4 KB)

Then, the samples are trimmed and denoised according to the interactive quality plot and primer information. Run the following code and get three visualization files.

time qiime dada2 denoise-paired --i-demultiplexed-seqs demux.qza --p-trim-left-f 0 --p-trim-left-r 0 --p-trunc-len-f 275 --p-trunc-len-r 205 --o-table table.qza --o-representative-sequences dada2-rep-seqs.qza --o-denoising-stats stats.qza --p-n-threads 0
dada2-rep-seqs.qzv (704.0 KB) stats.qzv (1.2 MB) table.qzv (515.7 KB)

Next, I run the following code to convert the feature table into a file format that can be viewed.

But I found two problems:

  1. The quality of my sequence is very high, and the data reaching Q20 is more than 97%, but the data filtered in stats.qzv file is about 84%. The amount of data is much worse.

  2. After converting feature-table.biom file into feature-table.txt file, I check it, but I find that an OTU only appears in one sample, and the most frequent one is 182. This is certainly not true.

I'm very upset and don't know the cause of these problems, which is beyond the ability of a beginner. I've been listening to this for days. Do you know what causes this phenomenon? How can I solve it?

Looking forward to your reply. Sincerely.

1 Like

This is excelent! Having the raw data is very important and gives us all the clues we need to solve this problem!

So 806 reverse - 338 forward means your expected amplicon is 468 bp long. (806-338=468)
And your reads are 300 bp each, so your expected overlap of reads is 132 bp. (300*2-468=132)

With high quality data, 132 base pairs is more than enough to join. We are off to a good start!

However, when we look at the quality of the reads, we can see that the quality starts very high, but decreases, especially on the Reverse Read. (This is common on Illumina, especially on 300 paired end runs.)

trunc-len-f 275 --p-trunc-len-r 205

I like the truncation settings you chose, because it will keep the best parts of your reads and removes the low quality ends. :+1:

Unfortunately, this will also reduce the area that is able to join because the area of overlap is smaller.

Before trimming: 468 bp amplicon: 300 bp forward + 300 bp reverse = 132 bp overlap
After trimming: 468 bp amplicon: 275 bp forward + 205 bp reverse = 12 bp overlap!


Blue line shows truncation point
Red line shows area of overlap

Yes

In the DADA2 plugin for Qiime 2, the minimum overlap needed to join is 20 bp. Because these trimming settings make an overlap of only 12 bp, we would not expect any reads to join.

Edit: The dada2 minimum overlap is now 12 bp by default! This means your reads to have enough length to join, if there is no mismatches between forward and reverse after the denoising process.

So while your overlap length should be enough, it looks like the quality is low enough in that region that many reads are unable to join.


We have two good options to fix this problem!

One option is to increase your truncation lengths.
Using --trunc-len-f 280 --p-trunc-len-r 220
would give you 32 bp of overlap, which should be enough to join. Try it and let me know how it goes!

Edit: Now that I know your read overlap is already long enough and reads have trouble joining, I worry that a longer overlap would not solve this problem. It's still worth trying, but you might have better luck with this second option:

A second option is to import only your forward read and use denoise-single. By choosing not to use the reverse read, you avoid any problems with joining and the drop in quality near the end of the the reverse read. I think that having a small amount of high quality data :sparkling_heart: is better than having a large amount of low quality, noisy data :broken_heart:.

We are always here to help you, and answer any questions you have.

Let me know what you try next, and how it works on your data set,
Colin

Hello, Colin :wave:

Today, I tried the two solutions you suggested, but unfortunately, the effect is not ideal. :frowning_face:

I ran the command and trimmed the sequence to 280 and 220, respectively. Then I looked at the output visualization file.rep-seqs-280.qzv (1.3 MB) stats-280.qzv (1.2 MB) table-280.qzv (640.5 KB)

On the contrary, the number of sequences passed is lower, so I lose too many sequences. One OTU still appears in only one sample. Although the frequency of one OTU has increased, it is far from the normal value.

So I tried the second method. Import forward sequence as single ended data.demux_seqs.qzv (287.9 KB)

I ran the following command,

time qiime dada2 denoise-single --i-demultiplexed-seqs demux_seqs.qza --p-trunc-len 260 --o-table table-260.qza --o-representative-sequences rep_set-260.qza --o-denoising-stats stats-260.qza --p-n-threads 0

I looked at the visualization file.rep-seqs-260.qzv (1.7 MB) stats-260.qzv (1.2 MB) table-260.qzv (910.3 KB) . I found that the passing rate of the sequence is very high, which is very good. But the number of OTUs is obviously wrong, which is too high. According to the report provided by sequencing company, the number of OTUs is about 2000. Moreover, OTU has the same error as before, that is, an OTU only appears in one sample, and the frequency is very low.

Question:

  1. If we give up the low quality of the two terminal data, is the subsequent diversity analysis still of biological significance? Will abandoned data have a great impact on the overall analysis?

  2. Why do OTU problems recur? Data import seems to be normal, so the problem is very likely to appear in dada2. Why? Could it be that there is something wrong with my rawdata, similar to some abiotic sequences?

  3. Is the similarity of OTU clustering 97% in dada2? I suspect that many OTUs are the same now, but they are not clustered together for some reasons. Is this suspicion reasonable?

  4. What should I do next? Is there any other solution? :sob: :sob: :sob: :broken_heart: :broken_heart: :broken_heart:

Looking forward to your reply. Sincerely. :pray: :pray: :pray:

Hello again @LiyingXie,

Just as I had feared. Because the region overlapping is low quality, a longer overlap does not cause more reads to join...

OK, using only the forward read fixed the problems with quality and joining! :+1:

That is a problem, but luckily I have found the cause :point_down:

It looks like your reads start with some random bases, then an adapter, then the true read. I'm willing to bet that these 6 bp are barcodes, causing your ASVs to seperate by sample.

(This paper lists actcctacgggaggcagcag is a 16S primer.)

If you trim off the barcode and adapter, your ASVs should appear across samples!

Try running with --p-trim-left 26 and see how it goes!


Yes. Removing low quality data should make your diversity analysis more accurate! :sparkles:

Those 6 starting basepairs and adapters look pretty abiotic to me! :robot: :face_with_monocle:

Even better: the ASVs created by DADA2 could be 100% unique, with as little as 1 bp difference between them! The DADA2 paper explains this more. :bookmark:

Keep in touch,
Colin

2 Likes

Hello, Colin :grinning:

I tried to run this parameter and it turned out to be a pleasant surprise. The result is normal. It's so exciting. Thank you a thousand times.

But I still have some problems. Can we use qiime cutadapt to solve the problem of primer in rawdata? I think the effect is the same, and it doesn't have to look for primer in the sequence, which seems more convenient.

So I made an attempt with Paired End Sequences.

qiime cutadapt trim-paired --i-demultiplexed-sequences demux.qza --p-adapter-f ACTCCTACGGGAGGCAGCA --p-adapter-r GGACTACTAGGGTTTCTAAT --o-trimmed-sequences trimmed-demux-paired.qza --verbosetrimmed-demux-paired.qzv (316.9 KB)

Everything in the visualization file looks normal. I went on to dada2, but at this step I had a problem.

time qiime dada2 denoise-paired --i-demultiplexed-seqs trimmed-demux-paired.qza --p-trim-left-f 0 --p-trim-left-r 0--p-trunc-len-f 270--p-trunc-len-r 210--o-table table-cutadapt.qza --o-representative-sequences dada2-rep-seqs-cutadapt.qza --o-denoising-stats stats-cutadapt.qza --p-n-threads 0
dada2-rep-seqs-cutadapt.qzv (201.1 KB)

After dada2, all my sequences were filtered out. It's incredible.

Question:

  1. Is this my cutting length selection problem? However, according to the quality interaction diagram in the trimmed demux paid file, I think my parameter selection is reasonable.

  2. After qiime cutadapt processing, I found that the compressed file of forward and reverse sequence is very small, only about 1000KB. It used to be about 10000 KB. Why?

Thanks for the help. :pray:

Hello again,

It sounds like cutadapt might not be working well... I don't know a lot about cutadapt so I'm not sure how to fix this.

It looks like the quality in the area of overlap is still to low to join properly. Because this relates to the quality of your reads, I'm not sure there is a perfect way to fix this.

Based on what we have tried so far, it sounds like
qiime dada2 denoise-single --p-trim-left 26...
might be the best path forward.

Your welcome! :partying_face:

I know it disappointing to not use your reverse reads, but given the quality of this sequencing run, I think that's the best way forward.

Colin

1 Like

good pun! :drum: :drum: :qiime2:

@LiyingXie - you might want to try using the "front" parameter (instead of the "adapter" parameter) when running q2-cutadapt - the command you shared above appears to be trimming from the 3' end of your reads, but I think maybe you mean to be trimming from the 5' end. Let us know how it goes!

3 Likes

Hello, Matthew :wave:

I'm sorry it took so long to get back to you.

I tried the front parameter, but the result is still not very good. I ran the following command:

qiime cutadapt trim-paired --i-demultiplexed-sequences demux.qza --p-front-f ACTCCTACGGGAGGCAGCA --p-front-r GGACTACTAGGGTTTCTAAT --o-trimmed-sequences trimmed-demux-paired-1.qza --verbose

time qiime dada2 denoise-paired --i-demultiplexed-seqs demux.qza --p-trim-left-f 0 --p-trim-left-r 0 --p-trunc-len-f 270 --p-trunc-len-r 210 --o-table table-front.qza --o-representative-sequences dada2-rep-seqs-front.qza --o-denoising-stats stats-front.qza --p-n-threads 0

dada2-rep-seqs-front.qzv (717.7 KB) table-front.qzv (518.4 KB) stats-front.qzv (1.2 MB) trimmed-demux-paired-1.qzv (314.5 KB)

After using cutadapt, I think the primers have been removed. And the length of sequencing in the file trimmed-demux-paired-1.qzv is shorter, which may confirm my idea. But there are still primers in the sequence shown in dada2-rep-seqs-front.qzv file.

Question:

  1. Is there a problem with the command I run?

  2. If cutadapt doesn't work, why is the length of my sequence reduced instead of 300bp?

Looking forward to your reply. :pray:

1 Like

What did the output log say? cutadapt is very good at telling you what it did and did not do while searching for adapters - we need to read that log to understand what happened.

1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.