Snakemake pipelines for generating taxonomy classifiers

I started learning how to construct Snakemake pipelines this past week. There is still much for me to learn, but I was able to put together a Snakefile that will:

  1. download the 16S rRNA reference database from the Genome Taxonomy Database (GTDB),
  2. extract a couple of variable regions (only V4 and V3V4 for now; I'll add others),
  3. dereplicate the sequence and taxonomy data for the full-length and variable regions,
  4. train naive Bayes taxonomy classifiers for all of them.

As outlined in this DAG:
[Image: DAG of the dl-extract-derep-train workflow]
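
The four steps listed above form a small dependency graph. As a rough sketch (not the actual Snakefile), the same structure can be written in plain Python with the standard-library `graphlib` module. The step names such as `download_gtdb` and `derep_V4` are hypothetical labels chosen for illustration, and the full-length classifier is assumed to skip the extraction step.

```python
from graphlib import TopologicalSorter

# Full-length plus the two variable regions mentioned in the post.
REGIONS = ["full-length", "V4", "V3V4"]


def build_dag(regions):
    """Map each step name to the set of steps it depends on.

    Step names are hypothetical; they only mirror the four stages
    (download, extract, dereplicate, train) described in the post.
    """
    dag = {"download_gtdb": set()}
    for region in regions:
        if region == "full-length":
            # Assumption: full-length data goes straight to dereplication.
            source = "download_gtdb"
        else:
            source = f"extract_{region}"
            dag[source] = {"download_gtdb"}
        dag[f"derep_{region}"] = {source}
        dag[f"train_{region}"] = {f"derep_{region}"}
    return dag


def execution_order(dag):
    """Return one valid order in which the steps could be run."""
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    for step in execution_order(build_dag(REGIONS)):
        print(step)
```

A workflow manager does this ordering for you: you declare each step's inputs and outputs, and it works out what has to run first.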

Would this be useful to aid in generating premade classifiers for each QIIME 2 release? Is anyone interested in helping me streamline these Snakemake pipelines? I figure these pipelines would be useful to upload to the RESCRIPt repo too.


Hey @SoilRotifer,
@lina-kim and @cmatzoros have been making similar pipelines with Nextflow. We had planned to set up a RESCRIPt pipeline as well, but if you want to collaborate with us on this part, that would be great.

I have wanted to do this for a long time with RESCRIPt to replace the current classifier training workflows. :raised_hands:


I'd be happy to collaborate on this!

I have a few upcoming projects that will require both Snakemake and Nextflow. So, I figured I'd start teaching myself via RESCRIPt. :slight_smile:

Anyway, I figure that I can start putting together the Snakemake pipelines for our most common use cases (SILVA, UNITE, GTDB, etc.), then progress to the more complicated pipelines that pull from GenBank, use extract-seq-segments, etc.

Should I start a separate GitHub repo for these, for now? Or do you have a place in mind? Or I can start a Dropbox folder to share too...


What a great plan, thanks @SoilRotifer for getting started! Yes, it would play so nicely with the Nextflow workflow @cmatzoros and I have been building.

A separate GitHub repo is a good start; we currently have our own (hopefully modular) structure in place. I'll follow up with more details via DM.


Good morning!

I've been working on one of these too!

On what hardware do you run these? I've used both institutional HPC and cloud-native options.

Where do you distribute results? I've been using GitHub releases, though I would prefer an official Q2 location like the Data resources page of the QIIME 2 2023.9.2 documentation.

Viewing this as a CI/CD platform, I'm interested in the package manager part, which I don't think we have yet...


That is cool @colinbrislawn!

I might be sending a bunch of Snakemake questions your way. Although I have a few of these pipelines producing output, they could certainly be cleaned up a bit and made more generalizable. Thinking about how to structure filenames and wildcards is surely a mental exercise!
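
To make the filename and wildcard exercise a bit more concrete, here is a small plain-Python sketch of the two directions involved: expanding a pattern into every target filename, and parsing a filename back into its wildcard values. The layout `{db}/{region}/classifier.qza` and the helper names are hypothetical, not taken from any of the pipelines in this thread. Restricting each wildcard to a single path segment is a simplification; Snakemake's own wildcards match more liberally unless constrained.

```python
import itertools
import re
from string import Formatter

# Hypothetical output layout, chosen only for illustration.
PATTERN = "{db}/{region}/classifier.qza"


def expand(pattern, **values):
    """Fill the pattern with every combination of the given wildcard values."""
    keys = list(values)
    combos = itertools.product(*(values[key] for key in keys))
    return [pattern.format(**dict(zip(keys, combo))) for combo in combos]


def match_wildcards(pattern, path):
    """Recover wildcard values from a path, or return None if it does not fit."""
    parts = []
    for literal, field, _spec, _conv in Formatter().parse(pattern):
        parts.append(re.escape(literal))
        if field is not None:
            # Simplification: a wildcard may not span a "/" here.
            parts.append(f"(?P<{field}>[^/]+)")
    match = re.fullmatch("".join(parts), path)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for target in expand(PATTERN, db=["gtdb", "silva"], region=["V4", "V3V4"]):
        print(target, match_wildcards(PATTERN, target))
```

Putting the database and region in separate directory levels keeps the parse unambiguous, which is most of the battle when names like `V3V4` could otherwise run into neighbouring fields.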


Hi @SoilRotifer

Snakemake sounds intriguing. I have been putting together a workflow for some data and I have ended up with a bash script which encompasses every step after QC (which is done separately).

What are the advantages of using Snakemake vs compiling your own script?

Thanks.

Hi @Mike_Stevenson, I've abandoned Snakemake and am now using Nextflow. I have a private GitHub repo that constructs RDP, GTDB, and SILVA classifiers for the full-length sequences and several amplicon regions. It works; I just need to better organize the outputs. I'll be adding UNITE soon.

These workflow managers automate many steps for you with minimal code. They can also be quite reproducible when combined with conda environments and job submission to schedulers (e.g. Slurm); I hope to implement these at some point.
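
One example of a step these tools handle for you, and that a hand-written bash script usually has to do manually, is deciding whether a step needs to run again. The sketch below is a simplified, timestamp-only version of that check in plain Python. The function name `needs_rerun` is hypothetical, and real workflow managers may consider more than file modification times.

```python
import os


def needs_rerun(inputs, outputs):
    """Return True if any output is missing or older than the newest input.

    A simplified, timestamp-only check; treat it as an illustration of the
    idea rather than how any particular workflow manager decides.
    """
    if not outputs or not all(os.path.exists(path) for path in outputs):
        return True
    newest_input = max((os.path.getmtime(path) for path in inputs), default=0.0)
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    return newest_input > oldest_output
```

With a check like this in front of every step, rerunning the whole workflow after changing one reference database only repeats the steps downstream of that change.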
