Hi everyone,
I would like to know how you usually deal with chimeras in 16S data, given that Kraken2 does not explicitly remove them (and, as far as I understand, it might actually handle them relatively well).
Here’s the dilemma I’m having with my colleagues: should we remove chimeras or not? Personally, I think it might not be strictly necessary because Kraken2 can classify them, but my colleagues argue that chimera removal is still required.
The problem is that if I remove chimeras, I need to work with ASVs (via DADA2), but then I can’t directly use Bracken afterwards for abundance re-estimation. This makes the pipeline inconsistent for relative abundance estimates.
My current workflow is basically:
cutadapt → bbduk → bowtie2 → qiime2 → dada2 → kraken2 → krona → (bracken? → krona?)
Do you think chimera removal is still necessary in this case? How do you usually handle this when combining ASV pipelines with Kraken2/Bracken?
I'm not an authority on chimera removal, but I think it makes sense to remove them as it's common practice with amplicon analysis. This makes your results more comparable to other studies!
(You are doing 16S analysis, right? If I may be nosy, what region are you sequencing? What is that bowtie2 step doing?)
Answering your question, yes! I’m working on 16S analysis, more precisely in the V3–V4 region. Since I’m working with both human and mouse samples, I’m using Bowtie2 for decontamination.
I was wondering about chimeric sequences because, since Kraken2 can handle chimeras, I’m not sure if it makes sense to remove them. My understanding is that when a k-mer doesn’t match with anything, or it simply skips a taxonomy level, it will be discarded, as you can see in the figure.
I did remove the chimeras from my samples, and I ended up with only a few ASVs. That made me wonder: if I don’t really need to remove chimeras from my samples, then I could just work with the post-processed FASTQ files (adapters removed, low-quality sequences, etc.).
I asked the same question in another forum, and someone raised the same point. So, at their suggestion, I decided to include BBMap (for decontamination) and BBDuk (for primer and PhiX removal) in my pipeline. This way, I avoid redundant steps, since BBMap handles decontamination and BBDuk removes primers and PhiX in a single pipeline.
I am curious about your comments that kraken2 can somehow handle chimeras. I see no mention of this ability in the publications, in the manual, in the GitHub repo for kraken2, or anywhere else. I would be curious if you have more info, e.g., a benchmark testing kraken2 on different chimeric seqs.
That could be the case for non-target sequences that are missing from the database, or for spurious errors in a sequence read. But chimeras are not going to be total nonsense, which is the big danger — they are chimerized from two (or more) true biological molecules, which will likely be represented (or similar enough to something that is represented) in the database, and which could be mixed in roughly equal proportions. So the k-mer profile of a chimera will not look like the tree in the example that you shared (where there is one k-mer that maps to another branch in the tree, the case you would have from a spurious error in a sequence), rather the k-mers could map at similar proportions to two or more branches, leading to shallow classification. So I don’t think that kraken2 will necessarily handle any chimera that you throw at it, some QC should still be done upstream.
And why could you not use bracken after removing chimeras?
You may need to adjust the parameters, e.g., the abundance threshold or other criteria for deciding if a seq is a chimera.
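To illustrate why the abundance criterion matters, here is a toy sketch of a de novo bimera check. It is not DADA2's or vsearch's actual algorithm: it only flags exact two-parent chimeras among equal-length sequences, and the function names, the `min_fold` argument, and the sequences and counts are all made up for illustration. The idea it borrows is that a sequence may only count as a "parent" if it is sufficiently more abundant than the candidate chimera, which is what `--p-min-fold-parent-over-abundance` controls in q2-dada2.

```python
def is_exact_bimera(query, parents):
    """True if query equals a prefix of one parent joined to the suffix of another.

    Toy assumption: all sequences have the same length and no mismatches are allowed.
    """
    n = len(query)
    for left in parents:
        for right in parents:
            if left == right or len(left) != n or len(right) != n:
                continue
            # Try every breakpoint strictly inside the sequence.
            for bp in range(1, n):
                if query[:bp] == left[:bp] and query[bp:] == right[bp:]:
                    return True
    return False


def flag_bimeras(abundance, min_fold=1.0):
    """Return the set of sequences flagged as bimeras.

    abundance: dict mapping sequence -> read count (hypothetical data).
    min_fold:  a parent must be more than min_fold times as abundant as the query.
    """
    flagged = set()
    for seq, count in abundance.items():
        parents = [p for p, c in abundance.items()
                   if p != seq and c > min_fold * count]
        if is_exact_bimera(seq, parents):
            flagged.add(seq)
    return flagged


# Made-up example: two "real" sequences and one chimera of the two.
abundance = {"AAAAAAAA": 100, "CCCCCCCC": 80, "AAAACCCC": 30}
lenient = flag_bimeras(abundance, min_fold=1.0)
strict = flag_bimeras(abundance, min_fold=3.0)
print(lenient, strict)
```

In this toy, raising `min_fold` shrinks the pool of eligible parents, so fewer sequences get flagged. That is the general reason the threshold is worth checking when chimera filtering leaves very few ASVs.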
Just adding my 2 cents here: if you want to stick to your pipeline, you could use vsearch to remove chimeric sequences and then work with kraken/bracken. The vsearch option does not return ASVs, just the subset of sequences that are non-chimeric. You can detect chimeras either by comparing against a known reference database or de novo.
The help text is:
This is from an older version of qiime but should still be possible with the latest, as far as I am aware (please, anyone, correct me if I am wrong).
Other than that, if I have amplicon sequences I usually stick with sklearn for taxonomy assignment, while if I am working with a fragmented library (metagenomics datasets) I would go with the kraken2/bracken pipeline, which is now conveniently wrapped into the qiime moshpit distribution.
You’re right that Kraken2 doesn’t mention chimera removal. My comment was an inference (honestly, something off the top of my head) based on how its classifier works: when a read contains k-mers from multiple taxa (as chimeras do), Kraken2 aggregates the k-mer votes and follows the LCA rule, which tends to push the call up to a higher rank rather than to a species. This behavior is described in the papers/docs; e.g., “If the k-mers yield multiple IDs, then Kraken computes the subtree of all the species that it found, and outputs a taxonomy label corresponding to the path in the tree with the most k-mers.” (Salzberg & Wood, 2021).
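To make the voting rule concrete, here is a toy sketch of root-to-leaf path scoring. It is a simplification and not Kraken2's implementation: the tiny taxonomy, the k-mer hit counts, and the function names are invented, ties are resolved by taking the LCA of the tied taxa (my understanding of the published description), and the optional `confidence` argument is only loosely modeled on Kraken2's `--confidence` option.

```python
# Hypothetical mini-taxonomy: child -> parent.
PARENT = {
    "root": None,
    "Bacteria": "root",
    "Firmicutes": "Bacteria",
    "Lactobacillus": "Firmicutes",
    "L_acidophilus": "Lactobacillus",
    "Bacteroidetes": "Bacteria",
    "Bacteroides": "Bacteroidetes",
    "B_fragilis": "Bacteroides",
}


def lineage(taxon):
    """Path from taxon up to the root, taxon first."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def lca(taxa):
    """Lowest common ancestor of a non-empty list of taxa."""
    common = set(lineage(taxa[0]))
    for t in taxa[1:]:
        common &= set(lineage(t))
    # The first shared node on the way up from taxa[0] is the deepest one.
    return next(node for node in lineage(taxa[0]) if node in common)


def classify(hits, confidence=0.0):
    """hits: dict taxon -> number of k-mers assigned to that taxon."""
    total = sum(hits.values())
    # Score each hit taxon by summing the hits along its path to the root.
    scores = {t: sum(hits.get(a, 0) for a in lineage(t)) for t in hits}
    best = max(scores.values())
    call = lca([t for t, s in scores.items() if s == best])
    # Walk up until the clade under the call holds enough of the k-mers.
    while call != "root":
        clade = sum(n for t, n in hits.items() if call in lineage(t))
        if clade / total >= confidence:
            break
        call = PARENT[call]
    return call


clean = classify({"L_acidophilus": 28, "Lactobacillus": 2})
even_chimera = classify({"L_acidophilus": 15, "B_fragilis": 15})
skewed_chimera = classify({"L_acidophilus": 18, "B_fragilis": 12})
skewed_strict = classify({"L_acidophilus": 18, "B_fragilis": 12}, confidence=0.7)
print(clean, even_chimera, skewed_chimera, skewed_strict)
```

In this toy, an evenly split chimera is pushed up to the shared ancestor, but an unevenly split one is still called at species level for the dominant parent unless a confidence threshold is applied. So, at least under these assumptions, a chimera does not always end up as a shallow call.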
Even in Jennifer Lu’s paper, there isn’t a chimera-removal step (Lu & Salzberg, Microbiome, 2020). Could it be because of mock community data? I’ve read other benchmarking papers, but none of them use, or at least explicitly report, any chimera removal (Odom et al., Scientific Reports, 2023; Wright et al., Microbial Genomics, 2023). The only one I’ve found that actually uses chimera removal is “Improving Species Level-taxonomic Assignment from 16S rRNA Sequencing Technologies” (Bars-Cortina et al., 2023).
Since a chimera is a mix of real biological sequences, Kraken2 won’t “fix” it, it will just end up classifying higher up the tree. But would that necessarily be a problem? Like you said, the outcome would likely just be a shallower classification. If I remove the chimera upstream, I’m removing the entire read instead of keeping it as a less specific assignment.
About Bracken, the reason I asked is that since Bracken redistributes reads, once I transform my data into ASVs I end up with far fewer reads left to calculate relative abundance. That’s why I’ve been questioning whether it makes sense to do chimera removal at all. Because if I decide to remove them, I basically need to rely on DADA2 to do it. I’ll share my parameters and results later so you can see what’s happening.
Thanks a lot!!! This will definitely help! I’ll go ahead and try exactly what you suggested with VSEARCH before running Kraken2/Bracken. Thank you again!!
I suppose it’s just a question of whether you want to take out the garbage upstream (do chimera filtering first) or downstream (remove reads with shallow classification, which could be chimera, short reads, off-target or other junk, or potentially real signal that is not classifying well for some reason). Personally I would do the chimera filtering to help diagnose just a bit… but with appropriate filtering you would probably get more or less the same result in the end anyway.
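For the downstream route, one possible starting point is the per-read k-mer field in Kraken2's standard output (the fifth column, e.g. `562:13 561:4 A:31 0:1 562:3`). The sketch below only parses that field and reports how much of the classified k-mer signal sits on the single most common taxid. The function names are hypothetical, and the fraction is a crude diagnostic rather than a chimera test: it ignores the taxonomy, so a parent genus and its child species count as different taxids.

```python
from collections import Counter


def kmer_support(lca_field):
    """Count k-mers per taxid from a Kraken2 LCA-mapping string.

    Skips ambiguous k-mers ("A"), unclassified k-mers ("0"),
    and the "|:|" token that separates mates in paired-end output.
    """
    counts = Counter()
    for token in lca_field.split():
        taxid, _, n = token.partition(":")
        if taxid in ("A", "0", "|"):
            continue
        counts[taxid] += int(n)
    return counts


def top_fraction(lca_field):
    """Fraction of classified k-mers assigned to the most common taxid."""
    counts = kmer_support(lca_field)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return counts.most_common(1)[0][1] / total


single = top_fraction("562:13 561:4 A:31 0:1 562:3")
paired = top_fraction("562:5 |:| 1301:5")
print(single, paired)
```

Reads with a low fraction could then be inspected together with their assigned rank, to see whether they look like chimeras, off-target sequences, or real signal that is simply classifying poorly.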
I suggest looking into the --p-min-fold-parent-over-abundance parameter… there are a few topics on this forum that discuss it. Sometimes this needs to be adjusted to avoid over-zealous
Thanks a lot for the feedback, this will really help me improve my pipeline. I really appreciate it, this has been bothering me for a few months already, so your answer means a lot.