Snakemake pipelines for generating taxonomy classifiers

SoilRotifer · February 21, 2024, 4:58pm

I started learning how to construct Snakemake pipelines this past week. There is still much for me to learn, but I was able to put together a Snakefile that will:

download the 16S rRNA reference database from the Genome Taxonomy Database (GTDB),
extract a couple of variable regions (only V4 and V3V4 for now; I'll add others),
dereplicate the sequence and taxonomy data for the full-length and variable regions,
train naive Bayes taxonomy classifiers for all of them.

As outlined via this DAG:
[Image: dl-extract-derep-train]
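To make the shape of that DAG concrete, here is a minimal Python sketch of the four steps as a dependency graph. It is not the actual Snakefile: the file names, the `REGIONS` labels, and the `build_dag` helper are all hypothetical, and it only orders the targets rather than running anything.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical region labels; "full" stands for the full-length sequences.
REGIONS = ["full", "V4", "V3V4"]

def build_dag(regions):
    """Map each (hypothetical) output file to the files it depends on."""
    # Download step: no upstream dependencies.
    dag = {"gtdb/seqs.qza": set(), "gtdb/tax.qza": set()}
    for r in regions:
        if r == "full":
            # Full-length data skips the region-extraction step.
            src = "gtdb/seqs.qza"
        else:
            # Extract step: one extracted file per variable region.
            src = f"{r}/extracted.qza"
            dag[src] = {"gtdb/seqs.qza"}
        # Dereplicate step: needs sequences and taxonomy.
        dag[f"{r}/derep_seqs.qza"] = {src, "gtdb/tax.qza"}
        dag[f"{r}/derep_tax.qza"] = {src, "gtdb/tax.qza"}
        # Train step: one classifier per region.
        dag[f"{r}/classifier.qza"] = {
            f"{r}/derep_seqs.qza",
            f"{r}/derep_tax.qza",
        }
    return dag

dag = build_dag(REGIONS)
# static_order() yields every file after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
for target in order:
    print(target)
```

Adding another region to `REGIONS` adds one extract, one dereplicate, and one train branch, which is roughly what a wildcard does in a Snakemake rule.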

Would this be useful to aid in generating premade classifiers for each QIIME 2 release? Is anyone interested in helping me streamline these Snakemake pipelines? I figure these pipelines would be useful to upload to the RESCRIPt repo too.

Nicholas_Bokulich · February 21, 2024, 6:32pm

Hey @SoilRotifer,
@lina-kim and @cmatzoros have been making similar pipelines with Nextflow. We had planned to set up a RESCRIPt pipeline as well, but if you want to collaborate with us on this part, that would be great.

I have wanted to do this for a long time with RESCRIPt to replace the current classifier training workflows.

SoilRotifer · February 21, 2024, 7:00pm

I'd be happy to collaborate on this!

I have a few upcoming projects that will require both Snakemake and Nextflow, so I figured I'd start teaching myself via RESCRIPt.

Anyway, I figure that I can start putting together the Snakemake pipelines for our most common use cases, i.e. SILVA, UNITE, GTDB, etc., then progress to the more complicated pipelines that pull from GenBank, use extract-seq-segments, etc.

Should I start a separate GitHub repo for these for now? Or do you have a place in mind? I can also start a Dropbox folder to share.

lina-kim · February 22, 2024, 8:38pm

What a great plan, thanks @SoilRotifer for getting started! Yes, it would play so nicely with the Nextflow workflow @cmatzoros and I have been building.

A separate GitHub repo is a good start; we currently have our own (hopefully modular) structure in place. I'll follow up with more details via DM.

colinbrislawn · February 23, 2024, 1:52pm

Good morning!

I've been working on one of these too!

On what hardware do you run these? I've used both institutional HPC and cloud-native options.

Where do you distribute results? I've been using GitHub releases, though I would prefer an official Q2 location like Data resources — QIIME 2 2024.2.0 documentation

Viewing this as a CI/CD platform, I'm interested in the package manager part, which I don't think we have yet...

SoilRotifer · February 23, 2024, 3:34pm

That is cool @colinbrislawn!

I might be sending a bunch of snakemake questions your way. Although I have a few of these pipelines producing output, they could certainly be cleaned up a bit and be more generalizable. Thinking about how to structure filenames and wildcards is surely a mental exercise!

Mike_Stevenson · July 25, 2024, 6:14pm

Hi @SoilRotifer

Snakemake sounds intriguing. I have been putting together a workflow for some data and I have ended up with a bash script which encompasses every step after QC (which is done separately).

What are the advantages of using Snakemake vs compiling your own script?

Thanks.

SoilRotifer · July 25, 2024, 6:20pm

Hi @Mike_Stevenson, I've abandoned Snakemake and am now using Nextflow. I have a private GitHub repo that constructs RDP, GTDB, and SILVA classifiers for the full-length sequences and several amplicon regions. It works; I just need to better organize the outputs. I'll be adding UNITE soon.

These workflow managers automate many steps for you with minimal code. They also make runs quite reproducible, with support for conda environments and job submission (e.g. Slurm), which I hope to implement at some point.
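One concrete difference from a linear bash script is that a workflow manager can skip steps whose outputs are already up to date. Below is a minimal Python sketch of that idea using file modification times. It is a simplified illustration, not how Snakemake or Nextflow are implemented, and `needs_rerun` is a hypothetical helper name.

```python
import os

def needs_rerun(output, inputs):
    """Return True if `output` is missing or older than any of `inputs`.

    Simplified assumption: only modification times are compared. Real
    workflow managers can also react to changed code or parameters.
    """
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(path) > out_mtime for path in inputs)
```

A bash script would rerun every step each time unless you wrote this kind of check by hand for every output.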