Allowed Primer-read mismatch in extract-reads?

jwdebelius · November 27, 2019, 4:13pm

I'm probably missing something obvious, but is there information somewhere about the primer/read mismatch allowed in qiime feature-classifier extract-reads? It isn't an exposed parameter, but it feels like it should be.
Context is that I'm trying to use it to benchmark something else, which sets its threshold at an error rate of 2 nt. I'm getting reads extracted from feature-classifier that have primer errors of 4 nt after accounting for degeneracy.

Thanks!
Justine

Mehrbod_Estaki · November 27, 2019, 11:06pm

Hey @jwdebelius,
This isn't a straight answer by any means but maybe some guidance. From the cutter description:

        # `sequence` may contain degenerates. These will usually be N
        # characters, which SSW will score as zero. Although undocumented, SSW
        # will treat other degenerate characters as a mismatch. We acknowledge
        # that this approach is a heuristic to finding an optimal alignment and
        # may be revisited in the future if there's an aligner that explicitly
        # handles degenerates.

Looks like SSW is used for extraction, which has a default value of 2, but that is a weighted score and not an exact number of mismatches. I went down this road a while back, couldn't figure out what those scores were, and finally decided to use cutadapt to extract reads when I had degenerate primers. Cutadapt, on the other hand, does handle degenerates and doesn't count them as mismatches (from what I can tell, anyway). It might be good to have that as an additional plugin in QIIME 2; it seems to me that cutadapt would be superior anyway.

jwdebelius · November 28, 2019, 10:11am

Thanks. I'm trying to match the feature-extraction capacity. (I don't feel like I can make the code base reliant on QIIME, and therefore need a QIIME-independent solution.) It didn't feel like a cutadapt-applicable solution, which maybe makes it a different group here, but eh? My naive regex-based alignment allows for degenerates but has a strict definition of "aligned" in terms of the number of matched base pairs, which is why I asked.
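
For readers following along, a degenerate-aware mismatch count of the kind described here might look roughly like the sketch below. It is not the implementation discussed in this thread: the names `IUPAC`, `count_mismatches` and `find_sites` are hypothetical, and it assumes uppercase DNA, ungapped matching, and a fixed mismatch limit per primer.

```python
# IUPAC nucleotide codes mapped to the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(primer, site):
    """Count positions where the site base is not permitted by the primer code."""
    if len(primer) != len(site):
        raise ValueError("primer and site must have the same length")
    return sum(base not in IUPAC[code] for code, base in zip(primer, site))


def find_sites(primer, sequence, max_mismatches=2):
    """Return (start, mismatches) for each ungapped window within the limit."""
    k = len(primer)
    hits = []
    for start in range(len(sequence) - k + 1):
        n = count_mismatches(primer, sequence[start:start + k])
        if n <= max_mismatches:
            hits.append((start, n))
    return hits


if __name__ == "__main__":
    # Toy primer with two degenerate positions (Y and N), toy target sequence.
    print(find_sites("ACGYN", "TTACGTAGG"))
```

Because a degenerate code only counts as a mismatch when the observed base falls outside its set, a threshold like "2 nt" here is a strict per-primer count, which is a different criterion from a weighted alignment score.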

And, thank you for the references, I will read more!

Best,
Justine

BenKaehler · November 29, 2019, 4:23am

Hi @jwdebelius, @Mehrbod_Estaki,

extract-reads handles degenerates correctly. The local aligner from skbio that we use has some ... interesting behaviour regarding degenerates, but from memory we worked hard to work around them.

The answer to your question is in --p-identity:

  --p-identity NUMBER     minimum combined primer match identity threshold.
                                                                [default: 0.8]

So the mismatch threshold is defined as a fraction of the combined lengths of both primers, and is applied to the mismatches accumulated across the primers.

I am not saying that there is a biological justification for that behaviour, but that is what it is.
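
To make the arithmetic concrete, here is a minimal sketch of a combined identity check as described above. It is an illustration only, not the plugin's code: the function names are hypothetical, the primer lengths (19 and 20 nt) are made-up example values, and it assumes identity is simply matched positions divided by the combined primer length.

```python
def combined_identity(fwd_mismatches, rev_mismatches, fwd_len, rev_len):
    """Fraction of matching positions across both primers taken together."""
    total = fwd_len + rev_len
    return (total - fwd_mismatches - rev_mismatches) / total


def passes_identity(fwd_mismatches, rev_mismatches, fwd_len, rev_len,
                    identity=0.8):
    """True if the combined match fraction meets the identity threshold."""
    return combined_identity(
        fwd_mismatches, rev_mismatches, fwd_len, rev_len) >= identity


if __name__ == "__main__":
    # Hypothetical primer lengths: 19 nt forward, 20 nt reverse (39 combined).
    for fwd_mm, rev_mm in [(4, 0), (4, 3), (4, 4)]:
        ok = passes_identity(fwd_mm, rev_mm, 19, 20)
        print(fwd_mm, rev_mm, ok)
```

Under these assumptions, 4 mismatches on one primer and none on the other gives 35/39, which clears 0.8, so such a read would be kept even though a per-primer limit of 2 nt would reject it. With 39 combined positions, up to 7 accumulated mismatches pass and 8 do not.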

I have been intending for some time to revisit this method. Last time I tried, my favourite in-silico PCR simulator was ipcress. Who knows, this summer I may get around to implementing a wrapper for it.

Cheers,
Ben

jwdebelius · November 29, 2019, 8:57am

Hi @BenKaehler,

Thank you so much! This was what I was looking for and explains the discrepancy between my (admittedly not great) implementation and the QIIME version.

And I'll check out ipcress; that may be an easier solution than my own implementation.

Best,
Justine

BenKaehler · November 29, 2019, 7:51pm

Thanks @jwdebelius! If you do try out ipcress I would be interested to learn your impressions.

Cheers,
Ben