I started learning how to construct Snakemake pipelines this past week. There is still much for me to learn, but I was able to put together a Snakefile that will:
download the 16S rRNA reference database from the Genome Taxonomy Database (GTDB),
extract a couple of variable regions (only V4 and V3V4 for now; I'll add others),
dereplicate the sequence and taxonomy data for the full-length and variable regions, and
train naive Bayes taxonomy classifiers for all of them.
As outlined via this DAG:
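To give a concrete sense of the rules behind that DAG, here is a stripped-down sketch of the chain for one region. The file names, directory layout, and primer pair (515F/806R) are placeholders for illustration, not the exact contents of my Snakefile, and the `get-gtdb-data` options may differ by RESCRIPt version:

```
# Stripped-down sketch: download GTDB, extract a region, dereplicate, train.
# Paths and primers are placeholders for illustration.

REGIONS = ["V4"]

rule all:
    input:
        expand("classifiers/gtdb-{region}-nb-classifier.qza", region=REGIONS)

rule get_gtdb:
    output:
        seqs="raw/gtdb-seqs.qza",
        taxa="raw/gtdb-taxonomy.qza"
    shell:
        # Check `qiime rescript get-gtdb-data --help` for the exact options
        # available in your RESCRIPt version.
        "qiime rescript get-gtdb-data "
        "--o-gtdb-sequences {output.seqs} "
        "--o-gtdb-taxonomy {output.taxa}"

rule extract_region:
    input:
        seqs="raw/gtdb-seqs.qza"
    output:
        "amplicons/gtdb-{region}-seqs.qza"
    params:
        fwd="GTGYCAGCMGCCGCGGTAA",   # 515F
        rev="GGACTACNVGGGTWTCTAAT"   # 806R
    shell:
        "qiime feature-classifier extract-reads "
        "--i-sequences {input.seqs} "
        "--p-f-primer {params.fwd} --p-r-primer {params.rev} "
        "--o-reads {output}"

rule dereplicate:
    input:
        seqs="amplicons/gtdb-{region}-seqs.qza",
        taxa="raw/gtdb-taxonomy.qza"
    output:
        seqs="derep/gtdb-{region}-seqs.qza",
        taxa="derep/gtdb-{region}-taxa.qza"
    shell:
        "qiime rescript dereplicate "
        "--i-sequences {input.seqs} --i-taxa {input.taxa} "
        "--p-mode uniq "
        "--o-dereplicated-sequences {output.seqs} "
        "--o-dereplicated-taxa {output.taxa}"

rule train_classifier:
    input:
        seqs="derep/gtdb-{region}-seqs.qza",
        taxa="derep/gtdb-{region}-taxa.qza"
    output:
        "classifiers/gtdb-{region}-nb-classifier.qza"
    shell:
        "qiime feature-classifier fit-classifier-naive-bayes "
        "--i-reference-reads {input.seqs} "
        "--i-reference-taxonomy {input.taxa} "
        "--o-classifier {output}"
```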
Would this be something useful to aid in generating premade classifiers for each QIIME 2 release? Is anyone interested in helping me streamline these Snakemake pipelines? I figure these pipelines would be useful to upload to the RESCRIPt repo too.
Hey @SoilRotifer, @lina-kim and @cmatzoros have been making similar pipelines with Nextflow. We had planned to set up a RESCRIPt pipeline as well, but if you want to collaborate with us on this part, that would be great.
I have wanted to do this for a long time with RESCRIPt to replace the current classifier training workflows.
I have a few upcoming projects that will require leveraging both Snakemake and Nextflow, so I figured I'd start teaching myself via RESCRIPt.
Anyway, I figure I can start putting together the Snakemake pipelines for our most common use cases (SILVA, UNITE, GTDB, etc.), then progress to the more complicated pipelines that leverage pulling from GenBank, extract-seq-segments, and so on.
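For the multi-database version, my current thinking is to push the database- and region-specific details into a config file so one set of rules covers everything. Roughly something like this; the config keys and values are just a placeholder sketch, nothing is settled:

```
# Hypothetical top of a generalized Snakefile; config keys are placeholders.
configfile: "config.yaml"

DATABASES = config["databases"]   # e.g. ["gtdb", "silva", "unite"]
REGIONS = config["regions"]       # e.g. {"V4": {"f_primer": ..., "r_primer": ...}, ...}

rule all:
    input:
        expand("classifiers/{db}-{region}-nb-classifier.qza",
               db=DATABASES, region=REGIONS)
```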
Should I start a separate GitHub repo for these, for now? Or do you have a place in mind? Or I could start a Dropbox to share, too...
What a great plan, thanks @SoilRotifer for getting started! Yes, it would play so nicely with the Nextflow workflow @cmatzoros and I have been building.
A separate GitHub repo is a good start; we currently have our own (hopefully modular) structure in place. I'll follow up with more details via DM.
I might be sending a bunch of Snakemake questions your way. Although I have a few of these pipelines producing output, they could certainly be cleaned up a bit and made more generalizable. Thinking about how to structure filenames and wildcards is surely a mental exercise!
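For what it's worth, one thing that has helped me with the filename/wildcard puzzle is constraining the wildcards up front, so patterns like {db}-{region} parse predictably. The values below are placeholders rather than our actual naming scheme:

```
# Placeholder wildcard constraints; adjust to the real naming scheme.
wildcard_constraints:
    db="gtdb|silva|unite",
    region="V4|V3V4|full-length"
```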
Snakemake sounds intriguing. I have been putting together a workflow for some data and have ended up with a bash script that encompasses every step after QC (which is done separately).
What are the advantages of using Snakemake vs compiling your own script?
Hi @Mike_Stevenson, I've abandoned Snakemake and am now using Nextflow. I have a private GitHub repo that constructs RDP, GTDB, and SILVA classifiers for the full-length and several amplicon regions. It works; I just need to better organize the outputs. I'll be adding UNITE soon.
These workflows automate many steps for you with minimal code. They are also quite reproducible, with support for conda environments and job submission systems (e.g. Slurm), which I hope to implement at some point.
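To make that concrete in Snakemake terms (since that's what you asked about; Nextflow has equivalent conda/container and executor settings), the reproducibility hooks look roughly like this per rule. This is just a sketch with placeholder paths, not my actual workflow:

```
# Sketch of per-rule reproducibility/scheduling hooks. Run with
# `snakemake --use-conda`, plus a cluster/executor profile for Slurm
# (the exact flags depend on your Snakemake version).
rule train_classifier:
    input:
        seqs="derep/gtdb-{region}-seqs.qza",
        taxa="derep/gtdb-{region}-taxa.qza"
    output:
        "classifiers/gtdb-{region}-nb-classifier.qza"
    conda:
        "envs/qiime2.yaml"   # pinned QIIME 2 + RESCRIPt environment (placeholder path)
    threads: 4
    resources:
        mem_mb=32000
    shell:
        "qiime feature-classifier fit-classifier-naive-bayes "
        "--i-reference-reads {input.seqs} "
        "--i-reference-taxonomy {input.taxa} "
        "--o-classifier {output}"
```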