I developed an algorithm for truncating sequences based on a user-specified length and mean quality threshold, originally for Sanger sequences. I've always been annoyed (and concerned) by how manual a process it is to set sequence truncation parameters when pre-processing fastq sequences, e.g. with DADA2. As a result, I've been considering developing a tool to truncate fastqs using this algorithm. Do any of you think this type of tool would be useful? Could you see this becoming a part of the QIIME ecosystem?
Hi @michaelsilverstein ,
Sounds interesting! Just to clarify, could you let us know how this is different from the functionality available in q2-quality-filter and the q-score threshold parameter in q2-dada2? These allow automatic truncation based on q-score (as well as arbitrary length truncation).
q2-dada2 will truncate at the first instance of a low-quality q-score, and q2-quality-filter also allows a sliding window to trim when there is a stretch of low-quality bases (in addition to/instead of just trimming at the first base below threshold).
Sounds like you're using a mean quality threshold... so if that's sufficiently different from what's already available, sounds like either a good method to add to q2-quality-filter, or a stand-alone plugin (depends on what you want, and specifics on dependencies/maintenance etc).
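To make the contrast concrete, the two existing behaviors described above could be sketched roughly like this. This is a simplification, not the actual implementation of either tool (the real tools operate on fastq records and expose more parameters), and the exact threshold semantics (e.g. `<` vs `<=`) may differ from the real plugins:

```python
def truncate_at_first_low_q(quals, threshold):
    """Roughly how q2-dada2-style truncation behaves: cut the read
    at the first base whose q-score falls below the threshold."""
    for i, q in enumerate(quals):
        if q < threshold:
            return quals[:i]
    return quals

def truncate_sliding_window(quals, threshold, window=4):
    """Roughly the q2-quality-filter sliding-window idea: cut the
    read where a window's mean q-score first drops below threshold."""
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < threshold:
            return quals[:i]
    return quals

quals = [38, 37, 36, 12, 35, 10, 9, 8]
print(truncate_at_first_low_q(quals, 20))  # [38, 37, 36]
print(truncate_sliding_window(quals, 20))  # [38, 37, 36]
```

Note how the sliding window tolerates an isolated low-quality base (the 12 here would not trigger truncation on its own if the surrounding bases were high quality), whereas the first-low-base rule cuts immediately.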
Hey Nicholas, thanks for the quick response. Yes, the algorithm I have developed differs from available methods as far as I'm aware. The user provides a mean (not min) quality score and the algorithm identifies the best range of sequence to keep.
I am running into an issue on the best way to handle read pairs where only one read passes quality control.
The pipeline works by identifying positions at which to truncate all forward and all reverse reads, and only reads that pass the user-defined mean q-score and sequence-length thresholds are kept. In the test case I am running, many forward reads pass but many reverse reads do not, so many read pairs are dropped. This doesn't seem like the best approach, since many of these pairs could probably be saved after merging. One alternative would be what DADA2 does: truncate the sequences to the specified lengths and reject only those that are too short. Any thoughts?
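The DADA2-style alternative mentioned above is simple to state in code. This is a loose sketch of the idea (fixed-length truncation with rejection of short reads), not DADA2's actual implementation:

```python
def truncate_fixed(seq, quals, trunc_len):
    """Sketch of the DADA2-style alternative: truncate every read to
    trunc_len, and reject (return None for) reads shorter than that."""
    if len(seq) < trunc_len:
        return None  # too short to reach trunc_len: read is dropped
    return seq[:trunc_len], quals[:trunc_len]

print(truncate_fixed("ACGTACGT", [30] * 8, 5))  # ('ACGTA', [30, 30, 30, 30, 30])
print(truncate_fixed("ACG", [30] * 3, 5))       # None
```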
Great, thanks @Nicholas_Bokulich! I am working on testing this method against other commonly used strategies in preparation for a publication.
Are there any considerations for incorporating this as a QIIME plugin? Currently, inputs are specified by providing a directory and file suffixes (like _1.fastq), and truncated files are written to a user-specified directory.
I've added an argument to the pipeline so the user can choose how to filter read pairs. The options are: 1) remove read pairs where either read has a q-score below the threshold, 2) remove read pairs only when both reads have q-scores below the threshold, and 3) don't filter any read pairs based on q-score, only on length < truncation length.
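The three pair-filtering options described above could be sketched like this. The policy names (`"both"`, `"either"`, `"none"`) are placeholders I've invented for illustration, not the pipeline's actual argument values:

```python
def keep_pair(fwd_quals, rev_quals, threshold, policy):
    """Sketch of three read-pair filtering policies:
    'both'   - keep the pair only if BOTH reads meet the mean-q
               threshold (drop if either read fails; option 1)
    'either' - keep the pair if at least one read meets it
               (drop only when both fail; option 2)
    'none'   - never drop pairs on q-score; length filtering is
               assumed to happen elsewhere (option 3)"""
    fwd_ok = sum(fwd_quals) / len(fwd_quals) >= threshold
    rev_ok = sum(rev_quals) / len(rev_quals) >= threshold
    if policy == "both":
        return fwd_ok and rev_ok
    if policy == "either":
        return fwd_ok or rev_ok
    return True  # 'none': q-score never drops a pair

# A pair with a good forward read but a poor reverse read:
print(keep_pair([30] * 5, [10] * 5, 20, "both"))    # False
print(keep_pair([30] * 5, [10] * 5, 20, "either"))  # True
print(keep_pair([30] * 5, [10] * 5, 20, "none"))    # True
```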
Yes, you would need to consider whether existing QIIME 2 types fit the inputs/outputs of your method. For example:
A QIIME 2 plugin would take a single sequence artifact as input, covering the different demultiplexed fastq formats (e.g., SampleData[SequencesWithQuality | PairedEndSequencesWithQuality]). So there is no need to specify a suffix or directory path, only an understanding of the directory structure inherent to the underlying formats. These formats can admittedly be quite complex... you can check out this function in q2-quality-control to see one way to parse either single-end or paired-end data:
Likewise, the user would not specify an output directory; rather, they would provide an output artifact filepath.