Release 12 enables chimera checking with uchime-ref or uchime-denovo. Which method is more accurate, i.e., removes chimeric sequences more effectively? Also, my workflow is:
1. Dereplication
2. Open-reference clustering
3. Filtering of low-abundance features
4. Classification
My sequence data is pre-quality-checked, so I don't use DADA2 or Deblur at this point. Where would I insert the chimera-checking step into the pipeline? And is there any preference between the two uchime methods given my workflow?
You can run uchime-ref at any time, so it's probably best to do it late in your pipeline, when you have fewer features to check, say after step 3.
Uchime-denovo requires size (abundance) annotations, so you have to run it after step 1 (dereplication adds those annotations). I have seen people run uchime-denovo before clustering, after clustering, or both! Greg recommends running it after clustering.
The uchime de novo algorithm is slow, so running it after clustering saves some time. (Actually, it's pretty fast but not easily parallelizable, so it just seems slow!) Running it before clustering may improve accuracy, because more potential parent reads are available to explain and detect low-abundance chimeras.
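For the run-after-clustering option, a minimal sketch might look like the following, assuming your clustered feature table and representative sequences are named table.qza and rep-seqs.qza (those names are placeholders for your own files):

```bash
# Flag chimeras de novo after open-reference clustering (step 2),
# using the feature table to supply the abundance (size) information.
qiime vsearch uchime-denovo \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --output-dir uchime-dn-out

# Drop the flagged chimeras from the table and the representative
# sequences before moving on to filtering and classification.
qiime feature-table filter-features \
  --i-table table.qza \
  --m-metadata-file uchime-dn-out/nonchimeras.qza \
  --o-filtered-table table-nonchimeric.qza

qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --m-metadata-file uchime-dn-out/nonchimeras.qza \
  --o-filtered-data rep-seqs-nonchimeric.qza
```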
Hi all, and thank you @colinbrislawn for your post - that was very helpful.
I ran the de novo chimera check as done in the tutorial, and I can see some problems in the filtered output sequences. Here are my numbers:
- Before chimera checking: 70500 sequences
- Intermediate file "chimeras.qza": 52500 sequences
- Output non-chimeric, without borderline chimeras: 17500 sequences
- Output non-chimeric, including borderline chimeras: 18000 sequences
Obviously, the chimera check removes too many sequences with the default configuration. So I have two questions: What could be the reason that so many of my sequences are declared chimeric under the defaults? And which parameters (--p-dn, --p-mindiffs, --p-mindiv, --p-minh, --p-xn) can I tweak to soften the chimera filtering and retain a higher proportion of my sequences?
I'm not sure whether chimera checking is removing too many sequences or too few. Are you using a known sample to verify your measurements, or do you have some sort of positive control? While you could increase --p-minh to keep more of your reads and flag fewer chimeras, it's hard to pick the right setting without knowing the correct answer.
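If you do want to experiment, a softer run might look like this sketch (the 0.5 value is arbitrary, chosen only for illustration; I believe the default --p-minh is 0.28):

```bash
# Re-run the de novo check with a higher minimum chimera score (minh).
# Raising minh makes the detector more conservative, so fewer features
# are flagged as chimeric and more of your sequences are retained.
qiime vsearch uchime-denovo \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --p-minh 0.5 \
  --output-dir uchime-dn-minh-out
```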
Hi @steff1088,
I am not sure that a mock community could be constructed with a "known" chimera fraction — that would need to be empirically identified, e.g., through the process you are using. I expect that @colinbrislawn meant instead that you can use a mock community to test how chimera checking impacts the replication of the expected composition, e.g., with q2-quality-control.
If you do not already have a mock community, you could grab one from mockrobiota to test.
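With a mock community in hand, the comparison described above could look something like this (file names are placeholders; both inputs are relative-frequency feature tables, one for the expected mock composition and one for your observed, chimera-filtered composition):

```bash
# Compare the mock community's expected composition to the composition
# observed after chimera filtering, using q2-quality-control.
qiime quality-control evaluate-composition \
  --i-expected-features mock-expected.qza \
  --i-observed-features mock-observed.qza \
  --o-visualization mock-eval.qzv
```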
I agree with Colin: I am not sure how you know that chimera filtering is removing too many sequences with the default configuration. Losing ~75% of your seqs does sound high, but we don't really know the right answer, and the literature has many reports of chimeras being very common in 16S rRNA gene sequencing data sets. For perspective, DADA2 often results in similar or greater loss, much of it to chimeric seqs. So ~75% is not too hard to swallow, in my opinion.
You could check out the vsearch docs to see what recommendations they have for tweaking parameters, but I am not really aware of any benchmarks for this. I'd recommend either sticking with the defaults and accepting the high sequence loss, or using a mock community to optimize this process on your own (which would be a lot of additional work if that's not what you're already doing!).
Is there any recommendation for which database to use with uchime-ref for 16S primers targeting bacteria and archaea? In the literature I have seen the SILVA Gold database (used by ChimeraSlayer) and a database from the Broad Institute, part of their Microbiome Utilities package.
You could also use the full database you use for taxonomy assignment. The developer of uchime used to recommend small, high-quality databases. Then he recommended large, complete databases. Now he recommends using a de novo filter based on your current data set (something like uchime-denovo).
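If you go the reference-based route, a sketch of the call might look like this, where silva-seqs.qza stands in for whatever FeatureData[Sequence] reference you choose:

```bash
# Reference-based chimera check against the same database used for
# taxonomy assignment; the outputs mirror those of uchime-denovo.
qiime vsearch uchime-ref \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --i-reference-sequences silva-seqs.qza \
  --p-threads 4 \
  --output-dir uchime-ref-out
```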