Pipeline for processing metatranscriptomic data?

Hi everybody,

I'm trying to process some metatranscriptomic data. I first extracted the 18S rRNA reads using SortMeRNA and the PR2 database, and then unmerged the paired-end reads. Importing works fine, but denoising with dada2 runs for 3+ days and eventually times out on our HPC. I've run the same data set using dada2 in R and it works, but with similar time constraints, so I am not sure whether what I am doing is the most efficient approach. Read quality also looks fine when I check it.

Below are the current steps I am taking in QIIME 2. If anybody has recommendations on things to change, please let me know! I have read that it is possible to skip denoising altogether and proceed via vsearch, but that eventually gives me issues down the line as well.

Thanks in advance!

(HPC configuration for reference)
#SBATCH --time=03-00:00:00 ## time format is DD-HH:MM:SS
#SBATCH --nodes=1
#SBATCH --cpus-per-task=36
#SBATCH --mem=100G ## max amount of memory per node you require

#import reads
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path PE \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path PE.qza

#view
qiime demux summarize \
  --i-data PE.qza \
  --o-visualization SSU-single-demux.qzv

##Denoise
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs PE.qza \
  --o-table PE-demuxtable.qza \
  --o-representative-sequences PE-rep-seqs.qza \
  --p-trunc-len-f 120 \
  --p-trunc-len-r 120 \
  --o-denoising-stats PE-DADA2-stats.qza \
  --p-n-threads 36

Hi @syrenawhitner,

This is a great question. In short: you are running an invalid procedure and this might explain the runtime issue.

dada2 is designed for amplicon sequences and assumes that all reads represent the same amplicon (i.e., were generated with the same PCR primers). dada2 also assumes that reads are not merged. Both are essential for dada2's error model, which requires intact quality profiles and reads that cover the same region (so that they can be meaningfully error-corrected and dereplicated).
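To make the "same region" requirement concrete, here is a minimal Python sketch (not dada2 itself, and using made-up data). It draws error-free reads from a random 1800 nt sequence standing in for an rRNA gene, once with a fixed start position (amplicon-like) and once with random start positions (metatranscriptome-like), and then dereplicates both sets. The function name `simulate_reads`, the read length, and all numbers are assumptions chosen for illustration.

```python
import random
from collections import Counter


def simulate_reads(reference, read_len, n_reads, fixed_start=None, seed=0):
    """Draw error-free reads of length read_len from reference.

    fixed_start=<int> mimics an amplicon: every read starts at the primer site.
    fixed_start=None mimics shotgun fragments: each read starts at a random position.
    """
    rng = random.Random(seed)
    last_start = len(reference) - read_len
    reads = []
    for _ in range(n_reads):
        start = fixed_start if fixed_start is not None else rng.randint(0, last_start)
        reads.append(reference[start:start + read_len])
    return reads


# Hypothetical data: a random 1800 nt sequence standing in for an rRNA gene.
ref_rng = random.Random(42)
reference = "".join(ref_rng.choice("ACGT") for _ in range(1800))

# Dereplication: count how often each exact sequence occurs.
amplicon = Counter(simulate_reads(reference, 120, 1000, fixed_start=500))
shotgun = Counter(simulate_reads(reference, 120, 1000))

print("amplicon: unique =", len(amplicon), "max abundance =", max(amplicon.values()))
print("shotgun:  unique =", len(shotgun), "max abundance =", max(shotgun.values()))
```

In this toy setup the amplicon-like reads collapse into a single abundant sequence, while the randomly placed fragments stay spread over many low-abundance sequences. A denoiser that relies on abundant, overlapping sequences has very little to work with in the second case, which is one plausible reason for long runtimes.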

So you should not use dada2 to denoise metatranscriptome reads. Going through your steps:

Extracting the rRNA reads with SortMeRNA and PR2 is fantastic! But it will still give you read fragments from different parts of the 18S, so they are not valid input to dada2. It will allow you to process these reads otherwise with QIIME 2, though (see below).

The same goes for the paired-end reads you processed after extraction: these should also not be run through dada2.

As for dada2 completing in R: this might be a version issue or something similar. I can't explain the disparity, but in both QIIME 2 and R it is still an invalid process.

Solution: for rRNA fragments derived from metatranscriptome reads (18S in your case), I recommend instead using closed-reference OTU clustering to cluster them against reference sequences.
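In QIIME 2 this is done with `qiime vsearch cluster-features-closed-reference`. To illustrate what closed-reference clustering does conceptually, here is a toy Python sketch. It is not how vsearch works internally: it uses ungapped matching only, and the helper names (`best_identity`, `closed_reference_cluster`, `substitute`), the reference IDs, and the 97% threshold are assumptions for illustration. Each read is assigned to its best-matching reference if the identity meets the threshold, and is set aside as unmatched otherwise.

```python
import random
from collections import Counter


def best_identity(read, ref):
    """Highest ungapped identity of read against any same-length window of ref."""
    n = len(read)
    best = 0.0
    for start in range(len(ref) - n + 1):
        matches = sum(a == b for a, b in zip(read, ref[start:start + n]))
        best = max(best, matches / n)
    return best


def closed_reference_cluster(reads, references, min_identity=0.97):
    """Count reads per best-matching reference; collect reads below the threshold."""
    table = Counter()
    unmatched = []
    for read in reads:
        ref_id, identity = max(
            ((rid, best_identity(read, seq)) for rid, seq in references.items()),
            key=lambda pair: pair[1],
        )
        if identity >= min_identity:
            table[ref_id] += 1
        else:
            unmatched.append(read)
    return table, unmatched


def substitute(seq, pos):
    """Return seq with the base at pos replaced by a different base."""
    new_base = "A" if seq[pos] != "A" else "C"
    return seq[:pos] + new_base + seq[pos + 1:]


# Hypothetical data: two random 300 nt "reference" sequences and four 50 nt reads.
rng = random.Random(1)
references = {
    "ref_A": "".join(rng.choice("ACGT") for _ in range(300)),
    "ref_B": "".join(rng.choice("ACGT") for _ in range(300)),
}
exact_a = references["ref_A"][10:60]                   # exact fragment of ref_A
mutated = substitute(references["ref_A"][200:250], 25)  # ref_A fragment, one mismatch
exact_b = references["ref_B"][100:150]                 # exact fragment of ref_B
random_read = "".join(rng.choice("ACGT") for _ in range(50))  # unrelated read

table, unmatched = closed_reference_cluster(
    [exact_a, mutated, exact_b, random_read], references
)
print(dict(table))
print(len(unmatched), "unmatched read(s)")
```

The point of the sketch is that fragments from different positions of the same gene (here `exact_a` and `mutated`) end up in the same reference feature, which is what makes this approach suitable for fragments that do not cover a shared region.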

Would that work for your use case?