I have a bit of a conundrum, and I’d really appreciate any input / rampant speculation / recommendations / potential solutions. I have been provided with sequencing data from an experiment: demultiplexed 16S amplicon data (V1-V3), with one pair of fastq files per sample and amplicons in mixed orientations. (The library prep and sequencing were performed by a service provider some time ago.) Forward primers are present in the sequencing data. (Read lengths range between 275 and 301 bp, with the majority at 301, 293, and 291 bp. I realize this is an oddity for Illumina data and assume some adapter removal/trimming has been performed by the provider.)
Further, some of the forward primers are preceded by an “AG.”
What would be the most reasonable way of orienting these reads appropriately so that I can then proceed with a typical QIIME2 workflow, using DADA2 to denoise?
In reading the manual for DADA2, it seems that orient.fwd in the filterAndTrim function nearly does what I need. However, I see two problems: 1) the forward primer is degenerate and the DADA2 function “only allows unambiguous nucleotides,” and 2) there are those pesky “AG” dinucleotides preceding the primer in some reads.
What a pickle! Let me guess, Mr. DNA?
QIIME 2 has mixed-orientation read support forthcoming (perhaps next month’s release, no guarantees), but that may not help you since your data are already demuxed (Q2 will handle mixed-orientation reads at demux).
The good news is that your primers are still attached, so you may be able to hack together a solution with q2-cutadapt.
No promises this will work — you may need to massage it a bit and pray to the Q2 gods but give this a spin:
1. Concatenate the forward (F) and reverse (R) reads twice, in swapped orders: cat sample1_F.fastq sample1_R.fastq > sample1_FR.fastq and cat sample1_R.fastq sample1_F.fastq > sample1_RF.fastq
2. Import these concatenated files to QIIME 2 using the manifest format import. Pretend that the “FR” files are your forward reads and the “RF” files are your reverse reads.
3. Use qiime cutadapt trim-paired --p-discard-untrimmed to remove your primers (and any upstream adapters).
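If you have many samples, the concatenation and the manifest can be scripted. Here is a rough Python sketch of the first two steps. The function names and the output layout are made up for illustration, and the manifest header assumes the tab-separated PairedEndFastqManifestPhred33V2 layout used by recent QIIME 2 releases (older releases used a different CSV layout), so check it against your installed version.

```python
import shutil
from pathlib import Path


def concat(parts, dest):
    """Byte-wise concatenation, equivalent to `cat part1 part2 > dest`.

    Works for plain and gzipped FASTQ alike (as `cat` does).
    """
    with open(dest, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def build_swapped_inputs(samples, out_dir):
    """Write <id>_FR / <id>_RF files plus a manifest; return the manifest path.

    samples: dict mapping sample id -> (forward_path, reverse_path).
    Hypothetical helper, not part of QIIME 2.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumed header for the V2 paired-end manifest format.
    rows = ["sample-id\tforward-absolute-filepath\treverse-absolute-filepath"]
    for sample_id, (fwd, rev) in samples.items():
        suffix = "".join(Path(fwd).suffixes)  # keeps ".fastq" or ".fastq.gz"
        fr = out_dir / f"{sample_id}_FR{suffix}"
        rf = out_dir / f"{sample_id}_RF{suffix}"
        concat([fwd, rev], fr)  # pretend "forward" file
        concat([rev, fwd], rf)  # pretend "reverse" file
        rows.append(f"{sample_id}\t{fr.resolve()}\t{rf.resolve()}")
    manifest = out_dir / "manifest.tsv"
    manifest.write_text("\n".join(rows) + "\n")
    return manifest
```

Because both files are built from the same two inputs in opposite order, record i in the FR file is always paired with its mate in the RF file. Each sequence ID still appears twice per file, so the duplicate-ID caveat still applies.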
q2-cutadapt will discard any reads that are untrimmed, in other words any that do not contain your forward/reverse primers. The result should be that the correct orientation reads are retained and the reverse orientation reads are dropped… that is, if I am not just totally mixed around which is totally possible. You may also just get an error on input because you have duplicate sequence IDs, and if that happens I suggest using cutadapt directly (not in QIIME 2) to perform step 3 before importing to QIIME 2.
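To make the orientation logic concrete, here is a minimal Python sketch of "keep the pair only if a read starts with the forward primer, allowing a short spacer." It is an illustration, not what cutadapt does internally: it requires an exact match (cutadapt tolerates some mismatches), and the primer in the usage note is the widely used 27F sequence as a stand-in, since the actual primers were not given in this thread. All function names are hypothetical.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes, so a
# degenerate primer (e.g. one containing M = A or C) can be matched.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer, max_spacer=2):
    """Compile a pattern: primer at the read start, after 0..max_spacer bases."""
    body = "".join(IUPAC[base] for base in primer.upper())
    return re.compile(rf"^[ACGTN]{{0,{max_spacer}}}?{body}", re.IGNORECASE)


def orient_pair(read1, read2, fwd_pattern):
    """Return (forward_read, reverse_read), or None if neither read
    starts with the forward primer (the pair would be discarded)."""
    if fwd_pattern.match(read1):
        return read1, read2
    if fwd_pattern.match(read2):
        return read2, read1
    return None
```

For example, with `pat = primer_regex("AGAGTTTGATCMTGGCTCAG")`, a read that begins with "AG" followed by the primer still matches, and `orient_pair` swaps the pair when the primer sits on the second read. The concatenation hack gets the same effect without swapping: each pair is presented in both orders, and only the order with the primers where cutadapt expects them survives.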
Once that successfully runs, do a few things:
look at the outputs to confirm that trimming occurred. You would expect to yield 50% fewer sequences after (q2-)cutadapt, so make sure that happened!
if all looks okay, proceed to your denoising pipeline of choice
carefully examine the output results, especially taxonomy classification — any quirks in those data (e.g., almost precisely 50% unclassified reads after taxonomy classification) could be indicators that this hack created an awful monster.
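If you end up running cutadapt outside QIIME 2 (or export the trimmed reads), the 50% check from the list above is easy to script. A rough sketch, assuming standard four-line FASTQ records; the function names and the 10% tolerance are arbitrary choices for illustration. Inside QIIME 2, the per-sample counts from qiime demux summarize should tell you the same thing.

```python
import gzip


def count_fastq_records(path):
    """Count records in a FASTQ file, assuming 4 lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def retained_fraction(before_path, after_path):
    """Fraction of reads that survived trimming (0.0 if the input was empty)."""
    before = count_fastq_records(before_path)
    after = count_fastq_records(after_path)
    return after / before if before else 0.0


def looks_like_half(fraction, tolerance=0.1):
    """True if roughly half the reads were kept, as the hack predicts."""
    return abs(fraction - 0.5) <= tolerance
```

A retained fraction far from 0.5 (say, near 0 or near 1) would suggest the primers were not found as expected, or that the orientations are not evenly mixed in the first place.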
The “AG” is probably a spacer sequence; no worries, it will come off with the primer.
orient.fwd would be in stand-alone dada2 (in R), not q2-dada2 (in QIIME 2), but it is great to know that exposing those options may be a way to solve this in QIIME 2 in the future!
if using dada2 in R, just trim the first N nucleotides off each end, where N = length of primer + 2
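The arithmetic is simple enough to show in a few lines of Python. In dada2 itself the fixed-length trim would go through the trimLeft argument of filterAndTrim, as far as I know; the helper below is only a hypothetical stand-in to show what gets removed.

```python
def trim_left(seq, qual, primer, spacer_len=2):
    """Drop the first N positions of a read, where N = len(primer) + spacer_len.

    The quality string is trimmed in step so it stays aligned with the bases.
    """
    n = len(primer) + spacer_len
    return seq[n:], qual[n:]
```

With a 20 nt primer this removes 22 positions from each read. One caveat with fixed-length trimming: reads that lack the 2 nt spacer lose two extra bases of real sequence, so the primer-matching route is the safer one if the spacer is only present in some reads.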
Please let us know what route you take and what works/does not work. I am curious to hear whether the ugly Q2 hack I proposed above solves this…
cc:ing @wasade who has been working on a solution to similar problems recently and may have some better advice on real programmatic methods to orient reads prior to trimming/denoising!