Quality control by deblur and reference database classifier

Hello there,
I have some questions about Deblur and reference databases for classifiers. Could any expert give me some guidance? Thank you very much!
(1) The feature-classifier training tutorial recommends using the more information-rich reference sequences clustered at 99% sequence similarity, but the description of the q2-deblur denoise-16S method says its default reference database is the 88% sequence-similarity set from Greengenes 13_8. Is 88% sequence similarity appropriate for quality control?
(2) According to the description of denoise-16S, only forward reads are supported. Is it correct to input joined paired-end reads to Deblur for quality control?
(3) For 16S, if I want to use the SILVA database because Greengenes has not been updated since 2013, should I use deblur denoise-other for quality control?
(4) The pre-trained classifiers provided on the QIIME 2 forum come in a full-length version and a 515-806 region version. Does this mean that if I amplified the V4 region with the 515/806 primer set, I should use the 515-806 region version?
(5) q2-feature-classifier provides six methods. How can I determine which one is best for classification?
I am a new QIIME 2 user; sorry for so many questions.

@Nicholas_Bokulich or @BenKaehler - can you take the feature classifier questions?

@wasade - can you take the deblur questions?

@zhang - I am pinging some folks above to see if they are available to provide some assistance. Stay tuned!

@thermokarst I really appreciate that.

Hi @zhang,

Deblur answers below!

(1) Deblur uses a coarse reference to discard any read that does not look putatively like the amplicon target. This step is intended only to remove artifacts.

(2) Deblur is agnostic to whether or not the reads are joined.

(3) It’s unlikely that switching the reference to SILVA would change the results.

Best,
Daniel

Hi @wasade,

Thank you very much. Your reply is very helpful and resolved most of my confusion, but one thing is still not clear to me. Is there really no difference in the results between the default 88% sequence-similarity reference and a 99% sequence-similarity reference?
I tried deblur denoise-other with a reference database at 99% sequence similarity, and it had not finished after more than 12 h of running. With deblur denoise-16S and the default reference database, the results came out within 2 h. So I wonder whether run time is the only reason to choose a relatively coarse reference. If I want to use denoise-other to analyze ITS data, which reference sequence identity should I choose? Are there any standards for choosing an appropriate reference sequence identity when using Deblur for quality control?

Thanks,
Zhang

Hi @zhang,

The reason we used 88% similarity was to reduce computation. A sequence only needs to recruit to the reference at around 60% sequence identity, which is very permissive, so it’s not clear to me that there is a strong benefit to using the 99% set. The intent is to retain sequences that are putatively of the target amplicon type.
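
A toy sketch of why a coarse reference can be enough under a permissive threshold. This is not Deblur's actual filtering code: the similarity measure (`difflib.SequenceMatcher.ratio`) is only a stand-in for alignment-based percent identity, the function names are hypothetical, and the sequences are made-up strings rather than real 16S data.

```python
from difflib import SequenceMatcher

def best_identity(read, references):
    """Highest similarity ratio (0..1) between a read and any reference sequence."""
    return max(
        SequenceMatcher(None, read, ref, autojunk=False).ratio()
        for ref in references
    )

def positive_filter(reads, references, min_identity=0.60):
    """Keep reads that resemble at least one reference at or above min_identity."""
    return [r for r in reads if best_identity(r, references) >= min_identity]

# Hypothetical toy data: a "coarse" set with one representative and a
# "fine" set that adds a close variant of the same sequence.
coarse_refs = ["ACGTACGTACGTACGTACGT"]
fine_refs = coarse_refs + ["ACGTACGTACGTACGTACGA"]

# One read close to the target and one unrelated artifact-like read.
reads = ["ACGTACGTACGTACGTACCA", "TTTTTTTTTTTTTTTTTTTT"]

kept_coarse = positive_filter(reads, coarse_refs)
kept_fine = positive_filter(reads, fine_refs)
print(kept_coarse == kept_fine)
```

In this toy case both reference sets keep the target-like read and drop the unrelated one, because the threshold is far below the similarity between close variants. The finer set only adds more comparisons.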

For ITS, I believe users have had success using OTUs from UNITE.
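
A minimal sketch of how such a run could be assembled, assuming the UNITE OTUs have already been imported as a `FeatureData[Sequence]` artifact. The file names, the `build_denoise_other_cmd` helper, and the trim length of 150 are placeholders for illustration, not recommendations; the option names are those of `qiime deblur denoise-other`.

```python
import shlex

def build_denoise_other_cmd(demux_qza, reference_qza, trim_length, out_prefix="deblur"):
    """Assemble a `qiime deblur denoise-other` command line as a list of arguments."""
    return [
        "qiime", "deblur", "denoise-other",
        "--i-demultiplexed-seqs", demux_qza,     # quality-filtered demultiplexed reads
        "--i-reference-seqs", reference_qza,     # reference used for positive filtering
        "--p-trim-length", str(trim_length),     # placeholder value, choose per dataset
        "--o-representative-sequences", f"{out_prefix}-rep-seqs.qza",
        "--o-table", f"{out_prefix}-table.qza",
        "--o-stats", f"{out_prefix}-stats.qza",
    ]

# Hypothetical artifact names; replace with your own files.
cmd = build_denoise_other_cmd("its-demux.qza", "unite-otus.qza", trim_length=150)
print(shlex.join(cmd))
```

The printed string can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` in an environment where QIIME 2 is installed.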

Best,
Daniel

Hi @wasade,

Sorry for the delayed reply. Your answer really makes sense; thank you very much.

Best,
Zhang
