Snakemake pipelines for generating taxonomy classifiers

I started learning how to construct Snakemake pipelines this past week. There is still much for me to learn, but I was able to put together a Snakefile that will:

  1. download the 16S rRNA reference database from the Genome Taxonomy Database (GTDB),
  2. extract a couple of variable regions (only V4 and V3V4 for now; I'll add others),
  3. dereplicate the sequence and taxonomy data for the full-length and variable regions,
  4. train a naive Bayes taxonomy classifier for each of them.

As outlined via this DAG:
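In text form, the dependencies between those four steps look roughly like the sketch below. This is plain Python using the standard-library `graphlib` module, not the actual Snakefile, and the step names (`download_gtdb`, `extract_V4`, and so on) are labels made up for illustration. It assumes the full-length sequences skip the extraction step and go straight to dereplication.

```python
from graphlib import TopologicalSorter

# Regions taken from the list above; "full-length" needs no extraction step.
REGIONS = ["full-length", "V4", "V3V4"]


def build_dag(regions):
    """Map each (hypothetical) step name to the set of steps it depends on."""
    dag = {"download_gtdb": set()}
    for region in regions:
        if region == "full-length":
            upstream = "download_gtdb"
        else:
            extract = f"extract_{region}"
            dag[extract] = {"download_gtdb"}
            upstream = extract
        dag[f"dereplicate_{region}"] = {upstream}
        dag[f"train_classifier_{region}"] = {f"dereplicate_{region}"}
    return dag


dag = build_dag(REGIONS)

# One valid execution order: every step appears after the steps it depends on.
order = list(TopologicalSorter(dag).static_order())
for step in order:
    print(step)
```

Snakemake works this ordering out on its own from the input and output files of each rule, so nothing like this needs to be written by hand in a real pipeline; the sketch only makes the shape of the graph explicit.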

Would this be useful to aid in generating premade classifiers for each QIIME 2 release? Is anyone interested in helping me streamline these Snakemake pipelines? I figure these pipelines would be useful to upload to the RESCRIPt repo too.


Hey @SoilRotifer ,
@lina-kim and @cmatzoros have been making similar pipelines with Nextflow. We had planned to set up a RESCRIPt pipeline as well, but if you want to collaborate with us on this part, that would be great.

I have wanted to do this for a long time with RESCRIPt to replace the current classifier training workflows. :raised_hands:


I'd be happy to collaborate on this!

I have a few upcoming projects that will require leveraging both Snakemake and Nextflow. So, I figured I'd start teaching myself via RESCRIPt. :slight_smile:

Anyway, I figure I can start by putting together the Snakemake pipelines for our most common use cases (SILVA, UNITE, GTDB, etc.), then progress to the more complicated pipelines that pull from GenBank, use extract-seq-segments, etc.

Should I start a separate GitHub repo for these for now? Or do you have a place in mind? I could also start a Dropbox folder to share.


What a great plan, thanks @SoilRotifer for getting started! Yes, it would play so nicely with the Nextflow workflow @cmatzoros and I have been building.

A separate GitHub repo is a good start; we currently have our own (hopefully modular) structure in place. I'll follow up with more details via DM.


Good morning!

I've been working on one of these too!

On what hardware do you run these? I've used both institutional HPC and cloud-native options.

Where do you distribute results? I've been using GitHub releases, though I would prefer an official Q2 location like the Data resources page in the QIIME 2 2023.9.2 documentation.

Viewing this as a CI/CD platform, I'm interested in the package manager part, which I don't think we have yet...


That is cool @colinbrislawn!

I might be sending a bunch of Snakemake questions your way. Although I have a few of these pipelines producing output, they could certainly be cleaned up a bit and made more generalizable. Thinking about how to structure filenames and wildcards is surely a mental exercise!
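To make the filename and wildcard question concrete, here is a minimal sketch of the idea in plain Python. The path pattern and the `expand_targets` helper are hypothetical and chosen only for illustration; inside a Snakefile, Snakemake's own `expand()` helper does this kind of combination.

```python
from itertools import product

# Hypothetical output layout: {db} and {region} play the role of wildcards.
PATTERN = "results/{db}/{region}/classifier.qza"


def expand_targets(pattern, **wildcards):
    """Fill the pattern with every combination of the given wildcard values."""
    keys = list(wildcards)
    value_lists = [wildcards[key] for key in keys]
    return [
        pattern.format(**dict(zip(keys, combo)))
        for combo in product(*value_lists)
    ]


targets = expand_targets(
    PATTERN,
    db=["gtdb"],
    region=["full-length", "V4", "V3V4"],
)
for target in targets:
    print(target)
```

Putting each wildcard in its own directory level is only one possible choice; it tends to keep wildcard values unambiguous, since a value like V3V4 cannot be confused with part of a neighbouring field in the filename.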
