Primer-embedded dual-indexed paired-end data importing

devonorourke · November 2, 2018, 2:34pm

In looking through the roadmap in the overview, I wasn't clear what importing strategy in QIIME was most appropriate. I (think I) have an outlier design where I have EMPpairedEnd data, except that it's got barcodes embedded in the sequence, but I don't have a third fastq file with the barcode info...
What I can say confidently is that I have a pair of *_R1 and *_R2 files for each sample (it's already demultiplexed), but the barcodes and primers are still in the reads (and their reverse complements may be present at the 3' end of any read that runs through the whole amplicon).

I built these primers following Schloss' design outlined here. It looks pretty similar to EMP's dual indexed approach, right?

Examples of my forward and reverse primers are as follows:

## Example PCR primer design
<illumina link>                <i5>      <10bp pad>  <2bp link>  <forward primer>
AATGATACGGCGACCACCGAGATCTACAC  ATCGTACG  TATGGTAATT  CG          GGTCAACAAATCATAAAGATATTGG
<illumina link>                <i7>      <10bp pad>  <2bp link>  <reverse primer>
CAAGCAGAAGACGGCATACGAGAT       AACTCTCG  AGTCAGTCAG  CC          GGWACTAATCAATTTCCAAATCC

## full Forward primer 
AATGATACGGCGACCACCGAGATCTACACATCGTACGTATGGTAATTCGGGTCAACAAATCATAAAGATATTGG
## full Reverse primer 
CAAGCAGAAGACGGCATACGAGATAACTCTCGAGTCAGTCAGCCGGWACTAATCAATTTCCAAATCC
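
For anyone checking a similar construct, here is a minimal Python sketch that rebuilds the two full primers from their segments and confirms that they match the strings above. The dictionary keys are just illustrative names for the table columns, not anything QIIME 2 or the Schloss protocol defines.

```python
# Segment names are illustrative labels for the columns of the primer table.
FORWARD_PARTS = {
    "illumina_link": "AATGATACGGCGACCACCGAGATCTACAC",
    "i5_index": "ATCGTACG",
    "pad": "TATGGTAATT",
    "link": "CG",
    "primer": "GGTCAACAAATCATAAAGATATTGG",
}
REVERSE_PARTS = {
    "illumina_link": "CAAGCAGAAGACGGCATACGAGAT",
    "i7_index": "AACTCTCG",
    "pad": "AGTCAGTCAG",
    "link": "CC",
    "primer": "GGWACTAATCAATTTCCAAATCC",
}


def assemble(parts):
    """Concatenate the segments in table order (dicts keep insertion order)."""
    return "".join(parts.values())


full_forward = assemble(FORWARD_PARTS)
full_reverse = assemble(REVERSE_PARTS)

# Print each construct with its total length for a quick visual check.
print(len(full_forward), full_forward)
print(len(full_reverse), full_reverse)
```

Keeping the gene-specific primer as its own segment is convenient later, because that is the part a trimming tool needs to search for.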

I'm wondering if the pros at QIIME can see an obvious way to import these data.

Is it using a manifest approach? Or is that outdated (the link throws a warning about referring to old documentation)...
Maybe there's a way to follow @thermokarst's example with the cutadapt approach. Unfortunately I'm not sure that the cutadapt parameter descriptions will work in my situation because I have dual-indexed reads.

My other option would be to use cutadapt natively, or some other read-trimming program to produce a single demultiplexed file. I do indeed have a file from a process that produces a demultiplexed set of representative sequences for each sample - it looks something like this:

@R_1;barcodelabel=sample001;
AACTCTCTATTTTATTTTTGGGGCTTGAGC
+
7JJJHJJJJJJJJJJJJJJJJJJJJJJJJJ
@R_2;barcodelabel=sample001;
AGCCTACTGATCCGAGCTGAACTATTTTAG
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@R_1;barcodelabel=sample002;
AACTCTCTATTTTATTTTTGGGGCTTGAGC
+
7JJJHJJJJJJJJJJJJJJJJJJJJJJJJJ

I'm curious what the format needs to be to import that file into QIIME, and what the appropriate command for doing so would be. I would think that it's going to require a metadata file of some kind, but given the nature of how my barcodes are embedded in with the primer sequences, I wasn't sure what this would look like.
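
To make the structure of that file concrete, here is a small Python sketch that reads the sample label out of each header and counts reads per sample. It assumes every header carries a `barcodelabel=<sample>;` field exactly as in the excerpt above. The function names are hypothetical, and nothing here says whether QIIME 2 can import such a file directly.

```python
import re
from collections import Counter

# Matches the sample label in headers such as "@R_1;barcodelabel=sample001;".
LABEL_RE = re.compile(r"barcodelabel=([^;]+);")


def iter_fastq(lines):
    """Yield (header, sequence, quality) tuples from 4-line FASTQ records."""
    records = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(records) - 3, 4):
        header, seq, plus, qual = records[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        yield header, seq, qual


def reads_per_sample(lines):
    """Count reads per sample label found in the FASTQ headers."""
    counts = Counter()
    for header, _seq, _qual in iter_fastq(lines):
        match = LABEL_RE.search(header)
        if match is None:
            raise ValueError(f"no barcodelabel field in header: {header}")
        counts[match.group(1)] += 1
    return counts


example = """\
@R_1;barcodelabel=sample001;
AACTCTCTATTTTATTTTTGGGGCTTGAGC
+
7JJJHJJJJJJJJJJJJJJJJJJJJJJJJJ
@R_2;barcodelabel=sample001;
AGCCTACTGATCCGAGCTGAACTATTTTAG
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@R_1;barcodelabel=sample002;
AACTCTCTATTTTATTTTTGGGGCTTGAGC
+
7JJJHJJJJJJJJJJJJJJJJJJJJJJJJJ
"""

print(reads_per_sample(example.splitlines()))
```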

Thanks for your expertise QIIME folk!

Nicholas_Bokulich · November 2, 2018, 6:13pm

You are correct, this is an outlier — but it is not EMPpairedEnd (which is not demultiplexed).

The great news is that your data are already demultiplexed, so that resolves the issue most folks have with dual-indexed PE reads: that QIIME 2 does not yet have a method for demultiplexing these.

Yes, import as a PE manifest format. Ignore the warning about old docs — follow this link, which points to the latest version.
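
As a rough sketch of the manifest route, the Python below writes a paired-end manifest (CSV with `sample-id,absolute-filepath,direction` columns) and builds the matching `qiime tools import` command as an argument list without running it. The `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` naming pattern and the helper names are assumptions for illustration. The Phred33 variant of the manifest format is also an assumption, and `--input-format` was called `--source-format` in releases before 2018.11, so check `qiime tools import --help` for your version.

```python
import csv
import shlex
from pathlib import Path


def write_manifest(fastq_dir, manifest_path):
    """Write a paired-end manifest for files named <sample>_R1/_R2.fastq.gz."""
    fastq_dir = Path(fastq_dir).resolve()  # the manifest needs absolute paths
    rows = []
    for r1 in sorted(fastq_dir.glob("*_R1.fastq.gz")):
        sample_id = r1.name[: -len("_R1.fastq.gz")]
        r2 = r1.with_name(f"{sample_id}_R2.fastq.gz")
        if not r2.exists():
            raise FileNotFoundError(f"missing reverse reads for {sample_id}")
        rows.append((sample_id, str(r1), "forward"))
        rows.append((sample_id, str(r2), "reverse"))
    with open(manifest_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample-id", "absolute-filepath", "direction"])
        writer.writerows(rows)
    return rows


def import_command(manifest_path, output_path="demux.qza"):
    """Build (but do not run) the qiime tools import command."""
    return [
        "qiime", "tools", "import",
        "--type", "SampleData[PairedEndSequencesWithQuality]",
        "--input-path", str(manifest_path),
        "--input-format", "PairedEndFastqManifestPhred33",
        "--output-path", output_path,
    ]


# Print the command so it can be inspected or pasted into a shell.
print(shlex.join(import_command("manifest.csv")))
```

Passing the returned list to `subprocess.run(..., check=True)` would execute the import in an environment where QIIME 2 is installed.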

Yes! The cutadapt approach will work for you. Your reads are already demultiplexed, so the dual index is not an issue for demultiplexing. You will want to use cutadapt trim-paired to trim primer/index/adapter from each PE read.

That is the key here. You just need to import as manifest format, use cutadapt trim-paired to remove primers/adapters, and proceed from there as usual (e.g., to denoising or OTU clustering).
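
One way the trimming step might look, sketched in Python: the gene-specific primers from the original post go in as 5' ("front") adapters, and their reverse complements go in as 3' adapters to catch read-through. This is an assumed parameter choice, not necessarily what was run in this thread. It relies on cutadapt removing everything upstream of a non-anchored 5' match (so the index, pad and link go with it) and on its default handling of IUPAC codes such as W in adapter sequences. The helper names and output filenames are hypothetical.

```python
import shlex

# IUPAC-aware complement table (W and S are their own complements).
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

# Gene-specific primers taken from the primer design in the original post.
FWD_PRIMER = "GGTCAACAAATCATAAAGATATTGG"
REV_PRIMER = "GGWACTAATCAATTTCCAAATCC"


def reverse_complement(seq):
    """Return the reverse complement of a DNA string with IUPAC codes."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def trim_paired_command(demux="demux.qza", trimmed="trimmed_sequences.qza"):
    """Build (but do not run) a qiime cutadapt trim-paired command."""
    return [
        "qiime", "cutadapt", "trim-paired",
        "--i-demultiplexed-sequences", demux,
        # 5' primers: each match and everything before it is removed.
        "--p-front-f", FWD_PRIMER,
        "--p-front-r", REV_PRIMER,
        # 3' read-through: the opposite primer appears reverse-complemented.
        "--p-adapter-f", reverse_complement(REV_PRIMER),
        "--p-adapter-r", reverse_complement(FWD_PRIMER),
        "--o-trimmed-sequences", trimmed,
    ]


print(shlex.join(trim_paired_command()))
```

Running `qiime demux summarize` on the trimmed artifact afterwards is a quick way to confirm that read lengths dropped as expected.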

How did you demultiplex? We get lots of user questions about demultiplexing dual-indexed PE reads and having a "community tutorial" that shows how to demux using a 3rd-party tool and import as manifest would be useful if you are interested in putting one together.

I hope that helps!

devonorourke · November 16, 2018, 7:00pm

These data were processed with Jon Palmer's amptk program, using his amptk illumina process to strip the primers and merge PE data into a single concatenated .fastq.gz file.

Working through the cutadapt portion of trimming my raw data now. The manifest import worked (I think!?). Just trying to figure out what the appropriate parameters are for my data...

devonorourke · November 17, 2018, 1:19am

Quick update and question about workflow:

I can confirm that the manifest approach that @Nicholas_Bokulich suggested worked with my data, and that the cutadapt trimming appears to have done the trick.
I'm curious how to go about setting the --p-trim parameters for the qiime dada2 denoise-paired step next. I'm following the Atacama soil tutorial, and there seems to be no step in which the paired data are joined. Specifically, I'm wondering how I can evaluate what value to enter for the --p-trim-left-* parameters. Is it as simple as viewing the interactive quality plot in the output of this command?

qiime demux summarize --i-data trimmed_sequences.qza --output-dir demuxStats

I'm working with 180bp amplicons, so the read stats make it pretty clear that I should be thinking about trimming somewhere around 160-180 bases...

I would have thought the order of operations would be to join the paired data with the Vsearch plugin, then quality filter with q-score-joined, then go ahead with the denoising. But perhaps that's defeating the purpose of denoising?

Thanks for any help you can provide in helping me understand what the appropriate order would be once the cutadapt-trimmed data is produced.

thermokarst · November 17, 2018, 1:32am

Hey there @devonorourke!

Neither: DADA2 joins the reads as part of its internal pipeline.

Remove any non-biological sequence using cutadapt, then use demux summarize to look at the output from cutadapt to determine appropriate trim/trunc params for both fwd and rev reads (there is a lot of discussion all over this forum about picking those values, as well as a lot of info in the DADA2 docs). Then, use the fwd/rev trim/trunc params you picked in q2-dada2.
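
Here is a hedged Python sketch of that last step. It checks that a pair of truncation lengths still leaves the forward and reverse reads overlapping, then builds a `qiime dada2 denoise-paired` command as an argument list. The 160/160 truncation values and the 180 bp amplicon length are placeholders taken from numbers mentioned in this thread, `min_overlap=12` is an assumed threshold to adjust against the DADA2 docs, and the helper names and output filenames are hypothetical. The overlap formula is a simplification that assumes neither truncation length exceeds the amplicon length.

```python
import shlex


def merged_overlap(trunc_len_f, trunc_len_r, amplicon_len):
    """Approximate number of bases shared by the truncated read pair.

    Assumes both truncation lengths are no longer than the amplicon.
    """
    return trunc_len_f + trunc_len_r - amplicon_len


def denoise_paired_command(trunc_len_f, trunc_len_r, amplicon_len,
                           trim_left_f=0, trim_left_r=0, min_overlap=12,
                           demux="trimmed_sequences.qza"):
    """Build (but do not run) a qiime dada2 denoise-paired command."""
    overlap = merged_overlap(trunc_len_f, trunc_len_r, amplicon_len)
    if overlap < min_overlap:
        raise ValueError(
            f"only {overlap} bases of overlap; reads are unlikely to merge"
        )
    return [
        "qiime", "dada2", "denoise-paired",
        "--i-demultiplexed-seqs", demux,
        "--p-trim-left-f", str(trim_left_f),
        "--p-trim-left-r", str(trim_left_r),
        "--p-trunc-len-f", str(trunc_len_f),
        "--p-trunc-len-r", str(trunc_len_r),
        "--o-table", "table.qza",
        "--o-representative-sequences", "rep-seqs.qza",
        "--o-denoising-stats", "denoising-stats.qza",
    ]


# Placeholder values; pick real ones from the demux summarize quality plots.
print(shlex.join(denoise_paired_command(160, 160, amplicon_len=180)))
```

One thing to keep in mind after cutadapt: DADA2 discards reads shorter than the truncation length, so values chosen from the quality plot should also be checked against the post-trimming read length distribution.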

I highly recommend reviewing the DADA2 docs (and the paper).

Keep us posted! :qiime2:

devonorourke · November 17, 2018, 1:36am

Thanks @thermokarst for the follow up and links.
It wasn't clear to me (clearly!) that DADA2 did the joining internally. That makes it a great deal simpler to understand the workflow now.
The only thing I'm stuck with at the moment is what the :qiime2: symbolizes...
Is it the token for signifying when you can't understand what keys to hit on the keyboard, and you get frustrated, like a t-rex would if it had to type on a keyboard too? I might be overthinking things. It's Friday at 8:30pm after all.

Thanks again.

system · December 18, 2018, 7:36am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.