Data loss during DADA2

Hi,

I am having issues with the denoising step: I am left with very few sequences after DADA2 (e.g. out of approx. 560,000 sequences I am left with under 100!). The command also runs for only a couple of seconds before producing the artefacts, instead of running for 10-15 minutes like it did in the Moving Pictures tutorial. This is the command I used:

qiime dada2 denoise-single \
  --i-demultiplexed-seqs Forward.qza \
  --p-trim-left 0 \
  --p-trunc-len 204 \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza

I am using the single-end manifest format with Phred 33 quality scores. I should say that I ran into this issue a couple of days ago as well, and I managed to get around it by changing all the directions in the manifest from reverse to forward, although I don't know how that could affect the denoising. However, I tried it again now and it didn't work... so now I don't really know what to do to get around this problem.
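For reference, my manifest follows the single-end Phred 33 layout with one row per file (the sample IDs and paths below are made up for illustration, not my real ones):

sample-id,absolute-filepath,direction
sample-1,/path/to/sample-1.fastq.gz,forward
sample-2,/path/to/sample-2.fastq.gz,forward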

Mapping-file.tsv (2.1 KB)
Manifest.csv (4.3 KB)
quality-plot.qzv (282.9 KB)

Thanks.

This data has been pre-filtered, and almost all of your sequences are shorter than 204 nts. That is why almost all of the data is lost: --p-trunc-len 204 truncates reads to 204 nts and also discards any read shorter than 204 nts. You'll need to choose a shorter truncation length, or turn off truncation altogether.
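If you want to confirm the length and quality distribution yourself, you can regenerate the summary from your imported artifact (assuming it is named Forward.qza, as in your command):

qiime demux summarize \
  --i-data Forward.qza \
  --o-visualization quality-plot.qzv

The interactive quality plot in that visualization shows how many reads remain at each position, which is where the read-length variation shows up.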

Hi,

Thanks for the help!

From the Moving Pictures tutorial I understood that you need to choose the parameter based on the quality scores, i.e. where the quality starts to drop. Is it safer then to always choose a truncation length that is within the blue range in the quality plot (in this case 168 or smaller), even when the quality score is still high?

Also, wouldn't turning off truncation mean that sequences with really low quality scores at their ends are kept as they are? I thought the aim was to remove the low-quality ends?

The tutorials work from raw data, whereas your data has already been filtered, trimmed, and rarefied. In the case of Illumina, the raw reads are all the same length, while in your data the preprocessing introduced length variation, so you might have to do things slightly differently.

The reads you are working with were apparently trimmed to be as short as 168 nts in whatever pre-processing was performed. So, if you don't want to throw away reads for being too short, the longest --p-trunc-len you can use is 168.

If you turn off truncation with --p-trunc-len 0, you will want to use the --p-max-ee filter, which tosses every read with more expected errors than the threshold you pick; that will still get rid of the low-quality sequences.
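For intuition, the expected errors of a read are the sum of the per-base error probabilities implied by its Phred scores:

EE = sum over bases of 10^(-Q_i / 10)

So, for example, a 150-nt read at a constant Q30 (error probability 0.001 per base) has EE = 150 x 0.001 = 0.15, comfortably under a threshold of 2, while a read whose tail collapses to Q10 (0.1 per base) accumulates expected errors quickly and gets discarded.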

Great, thanks, that makes sense! I obtained the data as it is from my supervisor, so I was not aware of what had already been done to the sequences.

One more basic question - for the --p-max-ee filter, how can one choose a sensible value? I have seen that the default is 2, but I would imagine not many sequences will get through in that situation?

You'll probably get the vast majority of your sequences through with a max-ee of 2. If so, then I wouldn't recommend raising it.

But running the data and checking at the end is the way to know for sure!
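One way to check, depending on your QIIME 2 version, is to add a denoising-stats output to the denoise-single call (e.g. --o-denoising-stats stats.qza; the file name here is just an example) and then tabulate it, which reports how many reads survived filtering and denoising per sample:

qiime metadata tabulate \
  --m-input-file stats.qza \
  --o-visualization stats.qzv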

So I ran the command:

qiime dada2 denoise-single \
  --i-demultiplexed-seqs Forward.qza \
  --p-trim-left 0 \
  --p-trunc-len 0 \
  --p-max-ee 2 \
  --o-representative-sequences rep-seqs2.qza \
  --o-table table2.qza

and it ran for 1-2 hours, but I think it worked well! I was left with 6,828 representative sequences out of 569,996 input sequences, which seems about right?

Thanks so much for the help!

The number of ASVs you get out of different environments varies dramatically, so it's hard to say whether a given ASV count is right or not. That number is certainly within the range expected for commonly measured environments.
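If you want a closer look at what you ended up with, the usual follow-up (as in the Moving Pictures tutorial) is to summarize the feature table and the representative sequences:

qiime feature-table summarize \
  --i-table table2.qza \
  --o-visualization table2.qzv

qiime feature-table tabulate-seqs \
  --i-data rep-seqs2.qza \
  --o-visualization rep-seqs2.qzv

The table summary shows per-sample sequence counts, which makes it easy to spot samples that lost a disproportionate number of reads.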
