I used dada2 to denoise my paired-end reads in qiime2-2019.7 and in R (dada2 version 1.12.1), and got 3533 and 2535 features in the raw ASV tables, respectively. The parameters I changed in R were pool=TRUE for sample inference and method="pooled" for chimera removal. I was expecting to see more features in the ASV table denoised with pool=TRUE, since singleton reads are retained when pooling. While there's a big difference in the number of features, the total numbers of sequences after denoising are quite similar: 7,717,544 and 7,540,610, respectively.
I noticed that dada2 in R uses 1e8 bases for error learning, whereas qiime2-2019.7 uses 1 million reads, but I doubt this is the cause. Could someone explain why I'm seeing such a big difference in the number of unique features between qiime2 and R?
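For reference, the pooled workflow in R was along these lines (a simplified sketch with placeholder file/object names, not my exact script):

library(dada2)

# filtFs/filtRs are the quality-filtered forward/reverse fastq files
# Error learning: learnErrors() samples up to 1e8 bases by default in R,
# while q2-dada2 learns from 1 million reads (--p-n-reads-learn)
errF <- learnErrors(filtFs, nbases = 1e8, multithread = TRUE)
errR <- learnErrors(filtRs, nbases = 1e8, multithread = TRUE)

derepFs <- derepFastq(filtFs)
derepRs <- derepFastq(filtRs)

# Pooled sample inference retains singletons that per-sample inference discards
dadaFs <- dada(derepFs, err = errF, pool = TRUE, multithread = TRUE)
dadaRs <- dada(derepRs, err = errR, pool = TRUE, multithread = TRUE)

mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs)
seqtab <- makeSequenceTable(mergers)

# Chimera removal across the pooled table
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "pooled", multithread = TRUE)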
This is a great question, and one I’m not super confident in answering, but something that might be informative is the respective denoising stats from both analyses. This will tell us if many features are removed at chimera checking or somewhere else.
A naive guess would be that by pooling your data, the more uncertain and “unique” ASVs were collapsed together, since they were identified as part of the same expected distribution across samples. But it does seem very odd that the process removed 1000 features (it may just be that chimera checking works better after pooling, but we’d need the denoising report/table to know more).
#Summarize sequence data and visualize read quality
qiime demux summarize \
--i-data data/qiime2/demux.qza \
--o-visualization data/qiime2/demux.qzv
#Sequence quality control and feature table construction based on sequence quality
qiime dada2 denoise-paired \
--i-demultiplexed-seqs data/qiime2/demux.qza \
--p-trim-left-f 20 \
--p-trim-left-r 18 \
--p-trunc-len-f 290 \
--p-trunc-len-r 248 \
--p-n-threads 16 \
--o-table data/qiime2/table.qza \
--o-representative-sequences data/qiime2/rep-seqs.qza \
--o-denoising-stats data/qiime2/stats.qza \
--verbose
#Quality check: number of sequences kept at each step of the dada2 pipeline
qiime metadata tabulate \
--m-input-file data/qiime2/stats.qza \
--o-visualization data/qiime2/stats.qzv
#Overview of raw feature-table
qiime feature-table summarize \
--i-table data/qiime2/table.qza \
--m-sample-metadata-file data/metadata.tsv \
--o-visualization data/qiime2/table.qzv
The RMarkdown report is wonderful. Looking at the specifics, I don’t see anything terribly different between the two report tables; we’re up or down a few hundred or thousand reads, and neither analysis stands out overall.
One thing that does catch my eye:
In the R report, I see that the dimensions of the table are as follows:
## [1] 80 7590
Which is pretty different from the 2,535 we see after importing. Is it at all possible that something has gone wrong in between these steps? Not that the difference between 3,533 and 7,590 is any easier to explain, but something seems strange here.
Hi @ebolyen, I also noted the exceptionally high number of unique features before chimera removal. I reran the analyses and found that the number of unique features dropped from 7590 to 2535 during chimera removal. However, I didn't see a significant drop in the number of sequences per sample (5% on average).
dim(seqtab)
[1] 80 7590
dim(seqtab.nochim)
[1] 80 2535
For the amplicon sizes, I did have the same doubt. The observed amplicon sizes are within the expected size ranges based on the in silico evaluation.
Hi @yanxianl,
I just wanted to throw in 2 quick comments. 1) I really enjoyed your rmarkdown file, I wish everyone used and published these! I am working on a comprehensive version of one atm which I think you might enjoy, and I’ll share it when it is complete. 2) If I had to guess, I would think the main reason for the unexpectedly low number of features is that with pooled chimera removal DADA2 can detect chimeras more accurately, and so is appropriately removing more features that really are chimeras, compared to the q2-dada2 default, which does not pool. If you have doubts, you can always BLAST the removed sequences and see what they actually were, or try increasing minFoldParentOverAbundance to higher values, starting at, say, 8. That will likely retain more of those features.
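To make the second suggestion concrete, something along these lines should work (just a sketch; seqtab/seqtab.nochim refer to the objects in your output, and 8 is only a starting point):

# Relax the parent-abundance criterion so fewer borderline ASVs are called chimeric
seqtab.nochim8 <- removeBimeraDenovo(seqtab, method = "pooled",
                                     minFoldParentOverAbundance = 8,
                                     multithread = TRUE)

# Pull out the sequences that were flagged as chimeric so they can be BLASTed
flagged <- setdiff(colnames(seqtab), colnames(seqtab.nochim))
uniquesToFasta(getUniques(seqtab)[flagged], fout = "flagged_chimeras.fasta")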
I agree with @Mehrbod_Estaki that the difference here is probably driven largely by the chimera removal method, and is probably playing out mostly amongst low frequency sequences.
Pooled sample inference and pooled chimera removal versus independent sample inference and consensus chimera removal is a significant shift. It won’t (in my testing) change much in terms of gross community composition, but it has differences especially amongst rare and infrequent taxa.
I tried to replicate the qiime2-dada2 analyses in R by setting dada(pool=FALSE, selfConsist = TRUE, ...) and removeBimeraDenovo(method="consensus", ...), and I got a similar number of unique features (3642).
dim(seqtab)
[1] 80 6470
dim(seqtab.nochim)
[1] 80 3642
sum(seqtab.nochim)/sum(seqtab)
[1] 0.975
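The calls I used to mimic the q2-dada2 defaults looked roughly like this (a sketch; errF/errR and the derep objects are the same as in the pooled run above):

# Independent (per-sample) inference with self-consistent error refinement
dadaFs.ind <- dada(derepFs, err = errF, selfConsist = TRUE, pool = FALSE, multithread = TRUE)
dadaRs.ind <- dada(derepRs, err = errR, selfConsist = TRUE, pool = FALSE, multithread = TRUE)

mergers.ind <- mergePairs(dadaFs.ind, derepFs, dadaRs.ind, derepRs)
seqtab <- makeSequenceTable(mergers.ind)

# Consensus chimera removal, as in the q2-dada2 default
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE)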
Indeed, setting pool=TRUE resulted in more features, but a much higher proportion of these features was removed during chimera detection under the pooled mode. That's why I ended up with fewer unique features.