Training classifier using RESCRIPt and enlarging database with 'extract-seq-segments' for single-end sequences

SoilRotifer · January 31, 2024, 2:43pm

Thank you for your detailed post! I think you are off to a great start!

As I mentioned in the post you linked:

Basically, you are getting leaky length extractions as you iterate.

I also reference this here, that is:

... the initial PCR primer pair extracts only the amplicon region, the later iterations could expand your reference pool to sequences that contain a longer portion of your marker gene...

As I suggested in the tutorial and in these other threads, you can start with a different similarity threshold and change it as you iterate. That is, you can:

... consider starting with a 90% cutoff, then increase by 2-5 % after each iteration, e.g. 90, 95, 97, ...
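To make the idea of a rising cutoff concrete, here is a minimal Python sketch that only builds the list of thresholds, one per iteration. The function name `threshold_schedule` and its parameters are hypothetical and not part of RESCRIPt or QIIME 2; the start, step, and cap values are just the example numbers quoted above.

```python
def threshold_schedule(start=0.90, step=0.05, cap=0.97, iterations=3):
    """Return one similarity threshold per iteration.

    Hypothetical helper for illustration only: thresholds rise by
    `step` each iteration and never exceed `cap`.
    """
    schedule = []
    for i in range(iterations):
        value = min(start + i * step, cap)
        # Round to avoid floating-point noise such as 0.9500000000000001.
        schedule.append(round(value, 2))
    return schedule


if __name__ == "__main__":
    for iteration, threshold in enumerate(threshold_schedule(), start=1):
        print(f"iteration {iteration}: similarity threshold = {threshold}")
```

Each value would then be passed as the similarity threshold for the corresponding extraction run, with the output of one run serving as the reference segments for the next.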

In my experience this has helped to cut down on the "length extension creep" that occurs with this approach. That is, you might extract a slightly longer sequence segment from your initial sequence pool. That sequence segment is retained and then used to extract other, potentially longer, sequence segments in the next iteration. This means you can then extract an even longer segment in the following iteration, and so on. So, it is probably better to perform more iterations at a higher similarity threshold. Then, at the end, you can remove any sequences that are longer than your expected target. If you are only adding ~10% more length, I think you should be fine. But this will vary from marker gene to marker gene.
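The final clean-up step can be sketched in plain Python as follows. This is only an illustration of the logic, not how RESCRIPt does it: the function `filter_by_length`, the in-memory dictionary of sequences, and the 10% default tolerance are all assumptions made for the example, and the right tolerance will depend on your marker gene.

```python
def filter_by_length(seqs, expected_len, tolerance=0.10):
    """Keep sequences no longer than expected_len * (1 + tolerance).

    Hypothetical helper: `seqs` maps sequence IDs to sequence strings,
    and `expected_len` is the amplicon length you expect for your primers.
    """
    max_len = expected_len * (1 + tolerance)
    return {seq_id: seq for seq_id, seq in seqs.items() if len(seq) <= max_len}


if __name__ == "__main__":
    # Toy sequences standing in for extracted segments.
    extracted = {
        "on_target": "A" * 250,
        "slightly_long": "A" * 260,
        "crept_too_far": "A" * 300,
    }
    kept = filter_by_length(extracted, expected_len=250)
    print(sorted(kept))
```

With an expected length of 250 bases, anything beyond roughly 275 bases is dropped, which removes the segments that crept longer over successive iterations while keeping the modest extensions.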

I hope this helps! Please do keep us posted!