I have been using QIIME 2 for a few months now. For my own analyses I like to run commands manually and change parameters as needed, but my supervisor has asked me to automate the whole process. One of the things I found challenging was predicting the trimming parameters for the qiime dada2 denoise-paired command. Normally I look at the demux.qzv file to select the trimming parameters manually, and I go through a few different values before finalizing them. Since that is not possible here, I was wondering if there is a way to automatically predict the trimming parameters --p-trunc-len-f and --p-trunc-len-r.
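For context, these are the parameters I mean in the denoising call (artifact names and truncation values are placeholders):

```bash
# The two truncation parameters I would like to predict automatically
# (placeholder artifact names and example values)
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 200 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
```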
I am aware of a tool called Figaro that can automatically predict trimming parameters, but I am not sure whether there is a QIIME 2 plugin for it. It also seems the developer hasn't made any releases recently (latest release: September 13, 2020), so I don't know whether it's still worth looking into.
I've been there! A few people have built more automated pipelines around QIIME 1 and QIIME 2 over the years. Here's one I made in college (Brislawn 2014).
Choosing the 'domain' of this pipeline (what you want to automate vs. what you want to control manually) is important. Perhaps you can start by automating the parts of the process in which user inputs don't change much, then later add parts like DADA2, where the user must choose trimming settings.
You have probably done this already, but if not, consider reviewing the other pipelines folks have made. It's always good to do a lit review!
Let us know if you have more questions along the way. For example, we can help you make a q2-figaro plugin!
I would like to add one option: your automated approach could allow executing either the whole pipeline or only up to a certain step. You can either run everything and then check whether the parameters you added manually were okay, or run automatically until, e.g., obtaining demux.qzv, then update your config file with the parameters you decide on, and finally run the rest of the pipeline. It's not completely automatic, but I think spending a bit of time manually checking the parameters and restarting the pipeline from there beats running the whole pipeline on guessed parameters and then having to repeat everything.
This is the approach I'm using in my (still under development) ITS pipeline:
Since the pipeline runs on Snakemake, I can specify up to which step I want to run it with the --until flag (described here):
--until, -U
Runs the pipeline until it reaches the specified rules or files. Only runs jobs that are dependencies of the specified rule or files, does not run sibling DAGs.
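In practice that looks something like this (the rule name here is hypothetical, just to illustrate the idea):

```bash
# Run only up to the rule that produces demux.qzv (hypothetical rule name),
# inspect the visualization, update the config, then resume the full run.
snakemake --cores 4 --until demux_summary
# ...inspect demux.qzv and record the truncation lengths in config.yaml...
snakemake --cores 4
```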
Multiple people, multiple opinions... To add to the excellent opinions and options by @colinbrislawn and @salias, I'll throw in my 2 cents worth.
I think this is potentially really dangerous, especially if you're trying to combine the output of multiple DADA2 runs! Remember that ASVs are externally valid based on their exact sequence. So, if the trimming parameters for one run say truncate at 150 and for another run say truncate at 125, you're going to get different ASVs (and the runs will separate). If you're planning to combine across runs, you need consistent (not necessarily automated) parameters.
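As an illustration, per-run results can only be merged sensibly when every run used the same truncation lengths (hypothetical artifact names):

```bash
# Merge per-run feature tables and representative sequences; this only
# makes sense if both runs were denoised with identical parameters
qiime feature-table merge \
  --i-tables run1-table.qza \
  --i-tables run2-table.qza \
  --o-merged-table merged-table.qza

qiime feature-table merge-seqs \
  --i-data run1-rep-seqs.qza \
  --i-data run2-rep-seqs.qza \
  --o-merged-data merged-rep-seqs.qza
```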
I've seen automated solutions that pick a hard-coded "best" based on several sequencing runs with the same primers/sequencer, and then check the results closely as part of an integrated pipeline.
So, I think Figaro could be a great solution, but I also think there's a heck of a lot of value in looking at your specific data and figuring out what it says.
(There's a frustrated commentary somewhere in this that I need to finish writing.)
Thank you for replying and sharing your work. I'll definitely look into it.
I understand the value of manually checking the demux.qzv summary to avoid re-running the whole process, and I did that for some time. However, my supervisor is asking for a completely automated pipeline that predicts trimming parameters automatically, because the pipeline will be tested (and maybe used in the future) by someone with no experience in this field. From my experience, without manually checking the quality and overlap, it is difficult to achieve near-optimal results after denoising.
But then again, it makes me wonder how applications like the Illumina BaseSpace 16S Metagenomics app do this all automatically.
Thanks for your reply and the resources you provided.
I am developing this pipeline to analyze 16S data, with the goal of producing ASVs, 97% OTUs, their corresponding taxonomic files, and a phylogenetic tree using the IQ-TREE ultrafast bootstrap plugin.
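For context, the clustering and tree-building steps I'm referring to look roughly like this (artifact names are placeholders):

```bash
# De novo cluster the ASVs into 97% OTUs (placeholder artifact names)
qiime vsearch cluster-features-de-novo \
  --i-sequences rep-seqs.qza \
  --i-table table.qza \
  --p-perc-identity 0.97 \
  --o-clustered-table table-97.qza \
  --o-clustered-sequences rep-seqs-97.qza

# Build a tree from a masked alignment with IQ-TREE ultrafast bootstrap
qiime phylogeny iqtree-ultrafast-bootstrap \
  --i-alignment masked-aligned-rep-seqs.qza \
  --o-tree iqtree-ufboot-tree.qza
```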
I have automated parts of my workflow that don't require frequent changes. The only remaining part is the prediction of trimming parameters. Almost every other parameter is set to its default value.
Currently, when the user initiates the pipeline on the command line, it prompts them to provide trimming parameters. This avoids the need for the user to stop and examine the demux.qzv files before proceeding to denoise using DADA2. However, this means the user has to run FastQC on the samples and decide the trimming parameters beforehand so the rest of the pipeline can run uninterrupted.
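Roughly, the prompt works like this (a simplified sketch, not my exact pipeline code):

```bash
# Simplified sketch of the upfront prompt: collect the truncation
# lengths once so the rest of the pipeline runs unattended.
read -r -p "Forward truncation length (--p-trunc-len-f): " TRUNC_LEN_F
read -r -p "Reverse truncation length (--p-trunc-len-r): " TRUNC_LEN_R
# ...the values are then passed to qiime dada2 denoise-paired later on.
```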
The issue my supervisor raised is that if we have to run 50 different samples, it's not very efficient to run FastQC for each one. Apologies if this sounds like a silly question, but if the samples are vastly different, wouldn't it make more sense to create separate trees for each one?
Regarding Figaro, I had some issues getting it to work initially, but I was able to resolve them by modifying the source code slightly.
For Figaro to work (please correct me if I'm wrong), I believe all the reads should be of the same length. Wouldn't that cause a problem if there's real biological variation? Since the majority of reads in the sample dataset I used were 251 bases long, I trimmed them to that length and then ran Figaro on those reads. I then used the values obtained from Figaro as trimming parameters for the untrimmed reads. I haven't used this tool before, so I am not sure if this is the right way to do this.
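Concretely, what I did was roughly the following; the vsearch step is just one way to do the length trimming, and the file names (and the Figaro flags, as I understand them from its README) are placeholders:

```bash
# Truncate all reads to a uniform 251 bp before running Figaro
# (placeholder file names; reads shorter than 251 bp are discarded)
vsearch --fastq_filter forward_reads.fastq \
        --fastq_trunclen 251 \
        --fastqout forward_trimmed.fastq

# Run Figaro on the length-trimmed reads: -a is the expected amplicon
# length, -f and -r are the forward/reverse primer lengths
python figaro.py -i trimmed_reads_dir -o figaro_output -a 450 -f 17 -r 21
```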
Currently, my pipeline processes each sample individually, running the entire workflow on each. So, if I am providing different trimming parameters for each, it's probably not a good idea to combine the results.
Originally, the pipeline processed all samples together. However, due to concerns raised by my supervisor about the variability among samples, I adjusted it to run on each sample separately.
I'll definitely need to chat with my supervisor about this.
Okay cool! It sounds like you have made good progress already!
As you have discovered here, those trimming parameters must be chosen for each Illumina run, because quality varies from run to run. This means you must already know about the run quality before starting a pipeline. It's a catch-22... unless you can inspect results and change settings mid-pipeline.
Vsearch's fastq_stats command can generate these quality statistics directly, right after importing the files.
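For example (the file name is a placeholder):

```bash
# Summarize read quality and length distributions for a fastq file
vsearch --fastq_stats forward_reads.fastq --log forward_stats.log
```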
Okay, this is intriguing. I have built pipelines that are reliable enough to be run 'blind' at my last job, and this depended on a LIMS system with good reporting and a team of bioinformatics folks to help maintain it.
If your PI does not have a plan for long-term bioinformatics support, they should consider that now. I do this sort of work professionally, and I would be happy to learn more about your project and submit a bid. There are a few other consultants on the forums who may also be interested in working with your team.
I agree with Colin's sentiments about long term maintenance and making plans to have someone to work with your pipeline in the future. I think your current pipeline has some issues that need to be addressed first.
If you're building a pipeline, I assume that at the end of the day, your goal is to be able to determine something about a comparable ecosystem. Even if your downstream partner can't do bioinformatics, hopefully they're doing biostatistics. However, you need a pipeline that's going to facilitate that, which means minimizing variation due to sample processing, including sample bioinformatics. I may be closed-minded here, but I tend to assume that we've catalogued most things we want to catalogue and are getting close to the point where we can use microbiome data to start testing some kind of hypothesis.
If your plan is cataloguing, feel free to ignore what I have to say. If your plan is analysis, you need consistency.
DADA2 includes a step that trains an error model on the data, and this step is meant to be evaluated on a per-sequencing-run basis. So, if you have 50 samples in your sequencing run, it wants to learn from the data in all 50 samples. If you're running one sample at a time, you may be overfitting your data pretty badly, as well as getting different trimming parameters! The external validity of ASVs is also based on using the same parameters. So, you have data denoised on an overfit, per-sample basis that can't be combined, which means it can't be used for many of the statistical analyses you want to do... and you've wasted a bunch of computational resources processing data that won't actually be useful.
I also have no idea how you write a useful methods section for an automated pipeline which runs per-sample. ...If you're dropping out your bioinformatician, you probably don't. Colin is probably prepared for me to climb up on my soapbox, so I'll resist because I promised no soapboxes.
Apologies for the delayed response. Thank you both for your input on this. I had a long discussion with my supervisor (not my PhD supervisor, I should've mentioned) about this, and I was able to convince him that performing the analysis on each sample individually isn't ideal. I analyzed a single sample and then all samples together, and showed the differences in results for that particular sample between the two methods.
There's a knowledge gap here, as not many in my team are familiar with 16S amplicon analysis (or tools like qiime2/dada2). However, we're making progress in understanding the best practices.
I plan to implement Figaro in the future to fully automate this pipeline, and I am currently looking at other workflows to see how they have done this.
Thanks again for your input on this. If there’s anything else you think I should consider while attempting to automate this workflow, please let me know.