right-trim option in extract-reads for paired-end data?

Sean_McKenzie · March 24, 2020, 8:37pm

Hello, no idea if this is the appropriate way to make a feature request, but as far as I can tell there is no way to trim bases from the right side of the extracted reads when using extract-reads from feature-classifier, and that seems problematic. --p-trunc-len can't be used for paired-end data because the merged inserts vary in length. So right now, if someone has trimmed bases from the left of the reverse read (that is, the right end of the amplicon), the reads extracted by feature-classifier will be sub-optimal. A --p-trim-right option that removes X nucleotides from the right side of each extracted read would be really helpful.
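
To make the request concrete, the behaviour being asked for might be sketched as below. This is only an illustration in plain Python, not QIIME 2 code: `trim_right` is a hypothetical helper, the sequences are made up, and the sketch assumes the extracted reads are available as ordinary strings.

```python
def trim_right(seq: str, n: int) -> str:
    """Remove n bases from the right (3') end of an extracted read.

    Hypothetical helper for illustration; not a QIIME 2 function.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return seq
    # If n covers the whole read, nothing is left.
    return seq[:-n] if n < len(seq) else ""


# Toy extracted reference reads of different lengths (made-up sequences).
extracted = {
    "ref1": "ACGTACGTACGTAC",
    "ref2": "ACGTACGTAC",
}

# Trim 4 bases from the right of every read, whatever its length.
trimmed = {name: trim_right(seq, 4) for name, seq in extracted.items()}
print(trimmed)
```

Unlike truncating to a fixed length, this removes the same number of bases from each read, so reads that started at different lengths stay at different lengths.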

Nicholas_Bokulich · March 24, 2020, 8:48pm

Welcome to the forum @Sean_McKenzie!

It is! Thanks for posting your feature request here. We prefer having FRs reported on the forum so that we can triage and file in the correct repository.

That's a good point. I have opened a ticket for this here; contributions are welcome!

I think we did not add this feature previously because the trim-left options in dada2 are usually used to trim away primers at the start of the read, which extract-reads already does. Even if trim-left is used to trim off some more of the reverse read (e.g., due to low quality at the start of the reverse read), having a "tail" in the extracted reference reads that extends beyond the query would not have a massive impact on performance. Trimmed classifiers (generally) appear to give only a small performance boost over full-length classifiers, so a few extra bases in the trimmed reference relative to the query are likely to make an even smaller difference. Still, I agree with you that there could be an effect and that this would be important to expose (especially for methods like the classify-consensus-vsearch exact-match option).
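
As a rough illustration of why exact matching is the most sensitive case, consider the toy comparison below. Plain string equality stands in for a full-length exact match here; that is a simplifying assumption for the sketch, not a description of how vsearch is implemented, and both sequences are made up.

```python
# Made-up query (e.g. a trimmed ASV) and a reference read with a 3-base tail.
query = "ACGTACGTAC"
reference = "ACGTACGTACGGA"

# A full-length exact comparison fails because of the extra tail bases.
exact_before = query == reference

# Once the tail is removed from the reference, the two are identical.
tail = len(reference) - len(query)
exact_after = query == reference[:len(reference) - tail]

print(exact_before, exact_after)
```

A similarity-based classifier would tolerate those three bases far better than a strict equality test does, which is why the effect is expected to be small outside the exact-match case.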

Thanks!

Sean_McKenzie · March 24, 2020, 9:06pm

Thanks for the quick response @Nicholas_Bokulich! It's quite reassuring to know that a few extra trailing bases in the reference reads vs. the query reads shouldn't impact the classification. I don't have any sort of instinct for this yet, as I've only just started using a pipeline with a naive Bayes classifier.
Thanks again!
Sean