Interpretation of summary statistics - Deblur

ptalebic · May 5, 2020, 5:35pm

Hi,

I run Deblur on my data (single-end reads generated using Illumina Miseq) and then created two artifacts containing summary statistics using this block of commands:

qiime metadata tabulate
--m-input-file demux-filter-stats.qza
--o-visualization demux-filter-stats.qzv
qiime deblur visualize-stats
--i-deblur-stats deblur-stats.qza
--o-visualization deblur-stats.qzv

This is the results:

My questions:

1- Is it weird to get ZERO for reads-too-short-after-truncation and reads-exceeding-maximum-ambiguous-bases? How about reads-truncated? Could it be because my data had been quality filtered before importing into QIIME2?

2- What is the most important column of the second table (second screenshot) that I need to take into account before proceeding with the next step of my analysis? What is reads-hit-reference? what is reference here?

3- What is reads dereplication?

Sorry for asking basic questions and thank you for creating this amazing forum that is helping me through my first project.

Mehrbod_Estaki · May 6, 2020, 7:00am

Hi @ptalebic,
Check out this previous post that may give some more detailed insights regarding deblur stats meanings. Some of your other questions below:

Nope. Just depends on what your deblur input was. From the looks of it you didn't do any truncating so none of the reads were truncated and thus the 0 s.

This first visualizer is really only describing # of reads you had (total-input-reads), how many were retained after truncating/length filtering (total-retaind-reads), and the other 3 columns explain what happened in the other filtering steps.

The column reads-hit-reference is going to show the total # of reads you retained following deblur. This should match the number of reads per sample you see if you were to summarize your feature-table with feature-table summarize.

This refers to the positive filter Deblur uses which is the greengenes database ( 88% clustered OTUs by default) with some very permissive inclusion criteria (65% identitty, with 50% coverage). Basically if your reads don't look anything like something in this database then they will be tossed.

Basically if you have 1000 reads that are identical, it makes sense to only denoise one of those reads and apply the results to the rest, instead of doing it 1000 times which is time-consuming, computationally expensive, and redundant. Dereplication is what gives you the rep-seqs.qza which are "representative sequences".

Your questions are great! We're happy to help. I also would recommend searching through the forum for key words in your questions, it is likely they have been asked before. This forum has mounds of useful info buried in it!
Best

ptalebic · May 6, 2020, 2:28pm

Thank you for your reply.

ptalebic · May 6, 2020, 2:29pm

But I truncated my reads. This is what I did:

qiime deblur denoise-16S
--i-demultiplexed-seqs demux-filtered.qza
--p-trim-length 220
--o-representative-sequences rep-seqs-deblur.qza
--o-table table-deblur.qza
--p-sample-stats
--o-stats deblur-stats.qza

ptalebic · May 6, 2020, 2:32pm

Is it possible to change the settings and use SILVA? If yes, how? There wasn't any parameters in the documentation.

ptalebic · May 6, 2020, 2:33pm

Can I change the settings for these too?

ptalebic · May 6, 2020, 2:35pm

So what Deblur basically does is the same as closed reference OTU picking process?

ptalebic · May 6, 2020, 2:36pm

What if those reads are identical due to some error?

ptalebic · May 6, 2020, 2:46pm

Please correct me if I am wrong. What Deblur does is to pick one sequence and try to find a match in the database and the threshold is 65% for identity and 50% for coverage? So what is 88%?

Mehrbod_Estaki · May 6, 2020, 3:43pm

Hi @ptalebic,

And how long are your reads actually? Could you upload your demux-summary.qzv please.

Sure, if you want to use your own positive-filter, you'll want to use the deblur-denoise-other plugin instead, see here for its documentations. Though I don't think this is necessary. SILVA is a massive database and it would just be a huge amount of time and computation effort for no real gain as a positive-filter here. Besides, all your reads have hit the reference database, so what will you gain from using SILVA instead?

Hmm, not sure where you are getting that. But no, it is not closed-reference OTU picking. I recommend reading the deblur paper for details. The greengenes sequences are used as a very permissive positive filter to basically toss away any reads that look nothing like 16S reads. The denoising is done independent of the database and you are still getting ASVs (or subOTUs as the deblur developers call them), but these are not OTUs.

They are still going to be treated the same. This process is not unique to Deblur. All bioinformatic tools use this approach because it is logical and saves a lot of time.

No, it is not looking for a match. It is just making sure that the read looks close enough to a 16S sequence to retain or not. Once it passes this positive-filter test, it is not connected to that database in any other way.
So the greengenes database is clustered at various different % identities, 99, 97, 88 etc. For Deblur positive filtering the 88% is used, please have a look at the link I provided earlier for benchmarks tests as why these parameters were used. So, any query sequences needs to be at least 65% similar and have 50% coverage compared to a sequence in the reference sequences in order for it to be approved as a 16S read.

ptalebic · May 6, 2020, 3:52pm

Here is the file:
demux-summary.qzv (290.7 KB)

ptalebic · May 6, 2020, 4:05pm

oh, thanks for the great explanation I will have a look at the link.

ptalebic · May 7, 2020, 12:39pm

ptalebic · May 7, 2020, 12:45pm

I think 220 is not a good choice for truncation length. Is that right? but I wonder why there was no truncation. If my reads were truncated at 220, I would have lost so many reads. Is that right ?

ptalebic · May 7, 2020, 12:48pm

I also read this: Feature Table showing Zero Truncated Reads After Quality Control

In this topic Nicholas_Bokulich suggests the reads might have been trimmed.

Mehrbod_Estaki · May 8, 2020, 3:11am

220 sounds like a pretty reasonable trim point to me. Remember that Deblur doesn't use the quality scores for error model building (unlike DADA2) so where you trim doesn't really depend on on the quality score. If you are having problems with not retaining enough reads (which I don't see as a problem with your run atm) you can trim your reads to say 150. Deblur will then retain much more reads than if you were to trim at 220.

As for why the stats summary is showing 0s, I'm not sure to be honest. Let's see if @Nicholas_Bokulich has any insights, if not then we can ping in one of the Deblur developers for further calrification.

wasade · May 8, 2020, 7:19pm

The reads-truncated information comes from q2-quality-filter, not from q2-deblur, and refers to truncating from the sliding window quality filter.

ptalebic · May 8, 2020, 8:17pm

Thank you so much for your reply.