Release 12 enables chimera checking with uchime-ref or uchime-denovo. Which method is more accurate, i.e., removes chimeric sequences more effectively? Also, my workflow is:
1. Dereplication
2. Open-reference clustering
3. Filtering of low-abundance features
4. Classification
My sequence data is pre-quality-checked, so I don't use DADA2 or Deblur at this point. Where would I insert the chimera-checking step into the pipeline? And is there any preference between the two uchime methods given my workflow?
You can run uchime-ref at any time, so it's probably best to do it late in your pipeline, when you have fewer features to check, say after step 3.
Uchime-denovo requires size (abundance) annotations, so you have to run it after step 1 (dereplication adds those annotations). I have seen people run uchime-denovo before clustering, after clustering, or both! Greg recommends running it after clustering.
The uchime de novo algorithm is slow, so running it after clustering saves some time. (Actually, it's pretty fast but not easily parallelizable, so it just seems slow!) Running it before clustering may improve accuracy, because more potential parent reads are available to explain and detect low-abundance chimeras.
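For the run-after-clustering option, a minimal sketch might look like the following, assuming your clustered feature table and representative sequences are named table.qza and rep-seqs.qza (those names are placeholders for your own files):

```bash
# Flag chimeras de novo after open-reference clustering (step 2),
# using the feature table to supply the abundance (size) information.
qiime vsearch uchime-denovo \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --output-dir uchime-dn-out

# Drop the flagged chimeras from the table and the representative
# sequences before moving on to filtering and classification.
qiime feature-table filter-features \
  --i-table table.qza \
  --m-metadata-file uchime-dn-out/nonchimeras.qza \
  --o-filtered-table table-nonchimeric.qza

qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --m-metadata-file uchime-dn-out/nonchimeras.qza \
  --o-filtered-data rep-seqs-nonchimeric.qza
```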
Hi all, and thank you @colinbrislawn for your post - that was very helpful.
I ran the de novo chimera check as done in the tutorial, and I can see some problems in the filtered output sequences. Here are my numbers:
- Before chimera checking: 70500 sequences
- Intermediate file "chimeras.qza": 52500 sequences
- Output non-chimeric, without borderline chimeras: 17500 sequences
- Output non-chimeric, including borderline chimeras: 18000 sequences
Obviously, the chimera check removes too many sequences with the default configuration. So I have two questions: What could be the reason that so many of my sequences are declared chimeric under the defaults? And which parameters (--p-dn, --p-mindiffs, --p-mindiv, --p-minh, --p-xn) can I tweak to soften the chimera filtering and retain a higher proportion of my sequences?
I'm not sure whether chimera checking is removing too many sequences or too few. Are you using a known sample to verify your measurements, or do you have some sort of positive control? While you could increase --p-minh to keep more of your reads and flag fewer chimeras, it's hard to pick the right setting without knowing the correct answer.
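If you do want to experiment, a softer run might look like this sketch (the 0.5 value is arbitrary, chosen only for illustration; I believe the default --p-minh is 0.28):

```bash
# Re-run the de novo check with a higher minimum chimera score (minh).
# Raising minh makes the detector more conservative, so fewer features
# are flagged as chimeric and more of your sequences are retained.
qiime vsearch uchime-denovo \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --p-minh 0.5 \
  --output-dir uchime-dn-minh-out
```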
Hi @steff1088,
I am not sure that a mock community could be constructed with a "known" chimera fraction — that would need to be empirically identified, e.g., through the process you are using. I expect that @colinbrislawn meant instead that you can use a mock community to test how chimera checking impacts the replication of the expected composition, e.g., with q2-quality-control.
If you do not already have a mock community, you could grab one from mockrobiota to test.
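With a mock community in hand, the comparison described above could look something like this (file names are placeholders; both inputs are relative-frequency feature tables, one for the expected mock composition and one for your observed, chimera-filtered composition):

```bash
# Compare the mock community's expected composition to the composition
# observed after chimera filtering, using q2-quality-control.
qiime quality-control evaluate-composition \
  --i-expected-features mock-expected.qza \
  --i-observed-features mock-observed.qza \
  --o-visualization mock-eval.qzv
```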
I agree with Colin: I am not sure how you know that chimera filtering is removing too many sequences with the default configuration. Losing ~75% of your seqs does sound high, but we don't really know the right answer, and the literature has many reports of chimeras being very common in 16S rRNA gene sequencing data sets. For perspective, DADA2 often results in similar or greater loss, much of it to chimeric seqs. So ~75% is not too hard to swallow, in my opinion.
You could check out the vsearch docs to see what recommendations they have for tweaking parameters, but I am not really aware of any benchmarks for this. I'd recommend either sticking with the defaults and accepting the high sequence loss, or using a mock community to optimize this process on your own (which would be a lot of additional work if that's not what you're already doing!).
Is there any recommendation for which database to use with uchime-ref for 16S primers targeting bacteria and archaea? In the literature I have seen the SILVA Gold database (used by ChimeraSlayer) and a database from the Broad Institute, part of their Microbiome Utilities package.
You could also use the full database you use for taxonomy assignment. The developer of uchime used to recommend small, high-quality databases. Then he recommended large, complete databases. Now he recommends using a de novo filter based on your current data set (something like uchime-denovo).
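If you go the reference-based route, a sketch of the call might look like this, where silva-seqs.qza stands in for whatever FeatureData[Sequence] reference you choose:

```bash
# Reference-based chimera check against the same database used for
# taxonomy assignment; the outputs mirror those of uchime-denovo.
qiime vsearch uchime-ref \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --i-reference-sequences silva-seqs.qza \
  --p-threads 4 \
  --output-dir uchime-ref-out
```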