Automated sequence truncation algorithm

Hello Qiime 2 community

I developed an algorithm for truncating sequences based on a user-specified length and mean quality threshold, originally for Sanger sequences. I've always been annoyed (and concerned) with how manual of a process it is to set sequence truncation parameters for pre-processing fastq sequences, like when using DADA2. As a result, I've been considering developing a tool to truncate fastqs using this algorithm. Do any of you think this type of tool would be useful? Could you see this becoming a part of the QIIME ecosystem?

Thanks for your advice!
Michael

2 Likes

Hi @michaelsilverstein ,
Sounds interesting! Just to clarify, could you let us know how this is different from the functionality available in q2-quality-filter and the q-score threshold parameter in q2-dada2? These allow automatic truncation based on q-score (as well as arbitrary length truncation).

q2-dada2 will truncate at the first instance of a low-quality q-score, and q2-quality-filter also allows a sliding window to trim when there is a stretch of low-quality bases (in addition to/instead of just trimming at the first base below threshold).

Sounds like you're using a mean quality threshold... so if that's sufficiently different from what's already available, sounds like either a good method to add to q2-quality-filter, or a stand-alone plugin (depends on what you want, and specifics on dependencies/maintenance etc).

2 Likes

Hey Nicholas, thanks for the quick response. Yes, the algorithm I have developed differs from available methods as far as I'm aware. The user provides a mean (not min) quality score and the algorithm identifies the best range of sequence to keep.

Sounds different enough!

So does this filter sequences based on mean q-score?

or trim at the site where average q-score drops below a threshold (so all sequences are trimmed at the same site)? which sounds much more interesting.

q2-quality-filter could maybe also be tweaked so that the sliding window parameter is tweaked to look at average score across the window.

1 Like

Great! Let me develop it into a working product and get back to you (in a few weeks probably).

Beta version complete

I've just developed the beta version here: GitHub - michaelsilverstein/mineer: A sensible sequence trimming algorithm! I provide an explanation of the algorithm in the README and a tutorial to test the functionality. I've built out a few unittests and things seem to working as expected so far - feel free to give it a try.

Handling read pairs where only one read passes

I am running into an issue on the best way to handle read pairs where only one read passes quality control.
The pipeline works by identifying positions to truncate all forward and reverse reads at and only reads that pass the user-defined mean q-score and sequence length thresholds are kept. In the test case I am running, many forward reads pass, but many reverse reads do not, and as a result many read pairs are dropped. This doesn't seem like the best approach since many of these pairs could probably be saved after merging. One alternative could be what dada2 does: truncates the sequences to the specified lengths and only rejects those that are too short. Any thoughts?

1 Like

Thanks for sharing! I look forward to giving it a try!

This sounds like a reasonable "last ditch" approach.

Great, thanks @Nicholas_Bokulich! I am working on testing this method against other commonly used strategies in preparation for a publication.

Are there are any considerations for incorporating this as a QIIME plugin? Currently, inputs are specified by providing a directory and file suffixes (like _1.fastq) and truncated files are written to a user specified directory.

I've added an argument to the pipeline so the user can choose how to filter readpairs. The options are either 1) remove read pairs where either read has q-score < threshold, 2) remove read pairs where only when both have q-score < threshold, and 3) don't filter out any readpairs based on q-score, just for length < truncation length.

Yes — you would need to consider if existing QIIME 2 types fit the inputs/outputs to your method. For example:

A QIIME 2 plugin would take a single sequence artifact as input (different demultiplexed fastq formats? e.g., SampleData[SequencesWithQuality | PairedEndSequencesWithQuality]). So no need to specify a suffix or directory path, only an understanding of the directory structure inherent to the underlying formats. These formats can admittedly be quite complex... you can check out this function in q2-quality-control to see one way to parse either single-end or paired-end data:

Likewise, the user would not specify a directory, rather an output artifact filepath.

Flexibility is good!

This sounds useful and quite exciting... plugin development documentation and tutorials can be found here: QIIME 2 Developer Documentation — QIIME 2 Developer Documentation 2020.2 documentation

But let us know if you have any specific questions or run into any issues when developing your plugin, we are happy to help!

And when you have a functioning plugin, you can document it in the QIIME 2 library for easy discovery by others (writing a tutorial on this forum is also very welcome).

1 Like