QIIME2 snakemake workflow tutorial - 18S/16S tag-sequencing

Tutorial: Run qiime2 for 18S or 16S tag-sequencing using snakemake

Sarah Hu

Summary:

This tutorial was written with qiime2-2019.4 for 18S tag-sequencing (but can be adopted for any metabarcoding. Below tutorial uses an 18S dataset to go from raw fastq files to a complete ASV tutorial. I will continue to update this tutorial and add to it on the GitHub repos listed below.

  1. Tag-sequencing pipeline to generate ASVs - https://github.com/shu251/tagseq-qiime2-snakemake - input raw fastq sequences, write manifest file (in R), performs QC (fastqc + multiqc), initial quality trimming (Trimmomatic), and runs through qiime2 (dada2 pipeline to generate ASVs). Then assigns taxonomy with an assigned reference DB and generates a ready to analyze output table (in R). Fully automated!!!

  2. Generate reference database as q2 artifacts - https://github.com/shu251/db-build-microeuks - to create a qiime2 compatible PR2 database for 18S tag-sequencing, subsets to the hyper variable region of specific amplicon (provide primer sequences), and trains a Naive Bayes classier, and then outputs test files to make sure all the above was successful. Output database q2 artifacts are ready to use in the tagseq snakemake

Background: Snakemake is a workflow manager that allows you to write reproducible and scalable (=robust!) computational pipelines :snake:. I wrote 2 (listed above) that, together, make up my qiime2 pipeline. I primarily run this for 18S tag-sequencing and use the Protist Ribosomal 2 - PR2 database. I use the V4 hypervariable region amplicon sequencing, but you can swap in any primers. Also, snakemake runs with SLURM, so if you’re running on HPC with slurm, you can have snakemake take care submitting the jobs.

Quick start: Adopt this pipeline for your specific tag-sequencing project: (1) clone repo, (2) edit the config.yaml file for your needs (i.e., paths to repo and raw data, primer sequences, parameters for DADA2). (3) try it with the test data to ensure everything works.


Full Tutorial:

last updated August 2019 - S. Hu
Run QIIME2 using snakemake. Requires input of raw .fastq reads. Below describes steps to set up environment and run a test dataset. This Snakemake pipeline can be easily scaled up to larger datasets and includes scripts to submit Snakemake jobs to slurm.

Pipeline

  1. Generate manifest.txt file from a directory containing all raw .fastq files in R
  2. Read in raw .fastq files and perform fastqc, Trimmomatic, and repeat fastqc on newly trimmed reads (snakemake)
  3. Modify manifest.txt file so input files are the trimmed .fastq reads
  4. Import all trimmed reads as QIIME2 artifact
  5. Remove primers with cutadapt
  6. Run DADA2, which performs additional quality trimming, filtering, and denoising steps. Also removes chimeric sequences. Finally, determines Amplicon Sequence Variants
  7. Assigns taxonomy using QIIME2 feature classifer
  8. Generates ASV count and taxonomy tables
  9. Compiles ASV count + taxonomy table in R

Before starting

1. Set up working directory & conda environment

Create and launch a conda environment to run this pipeline.

git clone https://github.com/shu251/tagseq-qiime2-snakemake.git
cd tagseq-qiime2-snakemake

# Create conda environment
conda env create --name snake-tagseq --file envs/snake.yaml 

# Enter environment
source activate snake-tagseq

# Check versions and correct environment set up
snakemake --version
## Output should be: 5.5.2

1.1 Place all raw fastq files in one place

Two options to use test data:

  • Download test data from Hu et al. 2018 using SRA explorer and search for Bioproject: PRJNA393172. Copy the output of bash script to a new file in tagseq-qiime2-snakemake/raw_data/ and execute to download.
  • Use provided bash scripts to download full or subset of sequences (this script was generated using the SRA explorer)
## To use provided bash script:
# migrate to raw_data directory
cd raw_data

# Download subset of sequences (n = 10)
bash download-subset.sh

# or download the full set:
# bash download.sh

# Exit raw_data directory
cd ..

If using your own fastq files

  • Place them in their own raw_data directory (see step to update config.yaml to tell snakemake where these are located).
  • Make sure they are labeled so the last numbers/letters in the file names designate the read pair. Input fastq files should be labeled either: Sample01_treatment1_R1.fastq.gz or Sample01_treatment1_1.fastq.gz

2. Create required files for input into Snakefile

The write-manifest.R script will input the list of fastq files it finds in raw_data/ to generate a QIIME2 specific manifest file (manifest.txt) and a text file with the SRR ID list for input into snakemake and a corresponding list of sample names. Ensure you are in an environment to run an R script. Follow these directions for more information. To execute this script, enter an R environment or if already enabled, use Rscript (see below).

Run the R code

# Run R script (make sure you are in an R-enabled environment)
Rscript write-manifest.R

Output files

  • manifest.txt: this is the manifest file you are looking for. However, the file path leads to the raw fastq reads before trimming. We will fix this later in the Snakefile. Alternatively$

  • SampleList.txt: simply a list of the SRR IDs and the sample names you actually get from the SraRunInfo.csv file you can download from SRA/NCBI.

  • Run below with dataset downloaded from SRA, or on your own paired end fastq files

  • Generate a .csv file to link SRR IDs (or raw fastq sequence names) to actual sample names (if applicable)

Why do I need this manifest.txt & SampleList.txt file?
The SRR fastq files I downloaded directly from SRA do not have actual sample names. So after searching the NCBI/SRA database here, I need to download a file that will link these uninformative SRRXXX IDs to actual sample names. One way to do this is to click on the ‘Send to:’ drop down menu on the upper right of the webpage and click File. Then select ‘RunInfo’ format and create the file. This will download a .csv file that has sample information and all the other metadata the sequence submitter provided to SRA. This information can be used for the sample-id column in the manifest.txt file which is required to run QIIME2. To run this on your own data, either make your own manifest.txt (see below) or download this .csv file in the same way. Alternatively, you can modify the write-manifest.R code to import a table that links your SRR IDs to sample names that you specify.

Make your own manifest.txt
Format your own txt file (using a text editor or a CSV file) exactly this way. Every line must be comma separated, with sample-id, followed by the path of fastq file, followed by either “forward” or “reverse”.

sample-id,absolute-filepath,direction
sample1,$PWD/raw_seqs_dir/Test01_full_L001_R1.fastq.gz,forward
sample1,$PWD/raw_seqs_dir/Test01_full_L001_R2.fastq.gz,reverse
sample2,$PWD/raw_seqs_dir/Test02_full_L001_R1.fastq.gz,forward
sample2,$PWD/raw_seqs_dir/Test02_full_L001_R2.fastq.gz,reverse
  • Replace $PWD with your path
  • The fastq files can be gziped.
  • List all of your fastq files.
  • Save the file and name it “manifest.txt”.

3. Modify your config file to run snakemake

Now, take a look at config.yaml. Below is a breakdown of the parameters you need to revise in your config file. Edit this file to fit your computer set-up.
This config.yaml file is set up to run with the test sequences downloaded from above:

proj_name: Diel_18S Replace this with your project name. This will be what the final ASV qiime2 artifact and ASV table is named.
raw_data: /vortexfs1/omics/huber/shu/tagseq-qiime2-snakemake/raw_data Point config file to location of your raw fastq reads, in this case a directory called ‘raw_data’
scratch: /vortexfs1/scratch/sarahhu Change this to where you want all your outputs to be written. Since I am working on a HPC, I write all of this to scratch and move it later.
manifest: manifest.txt Input of manifest file that you need to provide, generated using R script or created by you.
sample_names: SampleList.txt Sample list output from the R script used above
manifest-trimmed: manifest-trimmed.txt Final manifest file, the Snakemake pipeline will create this file, as the first few steps generate the trimmed fastq files which we ACTUALLY want to use in the QIIME2 part, so the Snakemake file is written to input your manifest file and modify the filepath

The config file also includes parameters you set when generating ASVs.
primerF: CCAGCASCYGCGGTAATTCC Forward primer sequence, in this case I’m using the V4 hypervariable region
primerR: ACTTTCGTTCTTGATYRA Reverse primer sequence

Modify primer removal step with error and required overlap

primer_err: 0.4
primer_overlap: 3

For the DADA2 step to determine ASVs, you also need to specify the forward and reverse read truncation

truncation_err: 2
truncation_len-f: 200
truncation_len-r: 200

#Truncate sequences at the 3' end when sequence quality may decrease
--p-trunc-len-f #forward read truncation
--p-trunc-len-r #reverse read truncation

#Trim forward and reverse reads at 5' end based on quality
--p-trim-left-f #forward read trim
--p-trim-left-r #reverse read trim

quality_err: 2
training: 1000 #should be set higher for a non-test dataset
chimera: pooled
#Quality threshold for above trimming
--p-trunc-q

#Any reads with expected error higher than this value will be removed
--p-max-ee

#Choice of chimera removal - Choices('pooled', 'none', 'consensus')
--p-chimera-method

#Number of reads to consider in training set for error model
--p-n-reads-learn # defaul is 1 million

Ahead of running this pipeline, prepare a database so the reference sequences can be assigned a taxonomy.
You can use this pipeline to do this.
In the config.yaml file, change this line to direct the Snakefile to where your database is stored.

## Path to your database for assigning taxonomy
database: /vortexfs1/omics/huber/shu/db/pr2-db/V4-pr2_4.11.1-classifier.qza

4. Test snakemake dry run & execute full pipeline

snakemake -np
# output should be all green and display no errors

To run the full pipeline make sure you enable the --use-conda flag. This is because snakemake uses the conda environments stored in envs/ to execute rules.

snakemake --use-conda

5. Run on HPC with SLURM

Read about executing snakemake on a cluster and another explanation on how to execute with a submit script can be found here.
Review the submit scripts available in submitscripts. Files in this directory include another config file called cluster.yaml, and two scripts to submit your snakemake pipeline to the cluster with and without the dry run option.
First, open and modify the cluster.yaml to fit your machine. Then test run using the provided submit scripts.

# Make sure you are still in the snake-tagseq conda environment
bash submitscripts/submit-slurm-dry.sh

Outputs should all be green and will report how many jobs and rules snakemake plans to run. This will allow you to evaluate any error and syntax messages.

Once ready, use the submit-slurm.sh script to submit the jobs to slurm. Run with screen, tmux, or nohup.

bash submitscripts/submit-slurm.sh
# This will print jobs submitted as it runs. Monitor these jobs using ```squeue```

6. Output from pipeline

  • ASV table: [PROJ]-asv-table.tsv includes samples by columns and ASVs by row. Values represent number of sequences per ASV (row) in a given sample (column).
  • Assigned taxonomy: tax_dir/taxonomy.tsv represents the full taxonomic name for each ASV. The same ASV identifer (string of letters and numbers under ‘Feature ID’) as the asv-table.

To combine these table, you can run the R script from the ../../qiime2/asv/ directory.

# Migrate to directory with .qza and .tsv outputs from snakemake pipeline
cd ../../../qiime2/asv/

# Ensure R is enabled
# conda activate r_3.5.1 # to activate the R conda environment

# Path to R script
Rscript /vortexfs1/omics/huber/shu/tagseq-qiime2-snakemake/make-asv-table.R
  • CountTable-wtax-DATE.txt - ASV table with counts per sample and the taxonomic identities
  • Output_stats.txt - quick stats on results, including how many sequences per sample, and how many ASVs with only 1 or 2 sequences are in final results

Find an introduction to R for processing ASV or OTU tables here.

7. qiime2 visualization option

QIIME2 offers away to visualize the data types (artifact files) using an interative viewer. An output directory of these .qzv files will be created at the end of the Snakefile run. These files can be brought locally and drag and dropped to here. In the above QC steps, whenever a .qza file was worked on, you had the option to run this:

# ../../qiime2/asv/ #In this directory

# Activate a qiime2 environment (see instructions on QIIME2 website)
conda activate qiime2-2019.4

# Insert any of the .qza artifact files generated
qiime demux summarize --i-data PROJECT-STEP.qza --o-visualization PROJECT-STEP.qzv

  • Köster, Johannes and Rahmann, Sven. “Snakemake - A scalable bioinformatics workflow engine”. Bioinformatics 2012. https://doi.org/10.1093/bioinformatics/bts480.
  • Bolyen E. et al. 2019. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology 37: 852–857. https://doi.org/10.1038/s41587-019-0209-9
  • Wingett SW, Andrews S. FastQ Screen: A tool for multi-genome mapping and quality control. F1000Res. 2018 Aug 24 [revised 2018 Jan 1];7:1338. doi: 10.12688/f1000research.15931.2. eCollection - https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  • Ewels P., et al. 2016. MultiQC: summarize analysis results for multiple tools and samples in a single report. https://doi.org/10.1093/bioinformatics/btw354.
  • Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.
  • Martin, M., 2011. Cutadapt Removes Adapter Sequences From High-Throughput Sequencing Reads. https://doi.org/10.14806/ej.17.1.200.
  • Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJ, Holmes SP. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Methods. 2016;13(7):581–583. doi:10.1038/nmeth.3869
  • R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/

7 Likes

I’ve recently updated this pipeline to do either OTU clustering or ASV determination. :sunglasses:

1 Like

Hi @shu251,

Thanks for sharing your tutorial. I’m just getting familar with Snakemake and I have a question about setting the truncation parameters for DADA2.

I’m not particularly comfortable specifying this without checking the interactive quality score of the reads first. Is it possible to start the pipeline, pause before DADA2, modify the config file, then resume the pipeline?

Thanks for your help.
Laura

1 Like

Hi Laura,
Good question - Maybe I can add a snakemake option to run all the way through QC, just to check all the sequences. For now, my suggestions:

  • Edit the submit script if using slurm or directly add --until to your snakemake command. If you did this to the step right before DADA2, snakemake should stop when it reaches the rm_primers rule. Once you resume the pipeline with the original command, snakemake should detect that the files before DADA2 have already been made and will go straight to the dada2 step.
  • An alternative is to make a copy of the ASV snakemake and delete the lines you don’t want. Then pick up with generating the .qzv file after the remove primer step. See the ‘Snakefile-ASV’ file (https://github.com/shu251/tagseq-qiime2-snakemake)

Hope that is helpful! I will try to add that at some point to the submit script options - it would be helpful for QCing.

Sarah

2 Likes