Argonne Sequences - Issue with demux or cutadapt - Finding Linker Primer Sequences in Rep-Seq

morgvevans · March 9, 2020, 3:31pm

Hi all,
I have been trying to process sequences generated from Argonne national lab, which should run the EMP. I had no issues importing as EMP-Paired-End, but demultiplexing either a) doesn't work (generates the 'can't map samples') or works, but after DADA2, examining the rep-seqs file reveals that the Linker Primer Sequence remains, but in the middle of the sequences? I tried using cutadapt after demultiplexing to remove the linker primer sequences, but this did not work, I assume because the Linker is in the middle and not on the ends of the seqs.
The person who emailed us the sequences said "Note that you no longer need to reverse complement the reads to demultiplex". But when I leave these options out, the demux step does not work. I have tried the "no-golay-error-correction" as well as both rev-comp settings (mapping and non mapping) individually and together with no luck, the same result occurs each time. Again, since the Linker is in the middle of the sequences I believe this is a demultiplexing issue... Please provide advice as to how to proceed. Thanks, and below is an example of my code

qiime demux emp-paired --m-barcodes-file argonne_metadata.tsv --m-barcodes-column BarcodeSequence --p-rev-comp-barcodes --i-seqs argonne_jan2020.qza --o-per-sample-sequences argonne_demux_test.qza --o-error-correction-details argonne_demux-details_test.qza

I have tried
-no-golay-error-correction
-rev-comp-barcodes
-rev-comp-barcodes and -rev-comp-mapping-barcodes
-rev-comp-mapping-barcodes

Here's an example of a barcode - ATCAGGTAAACA
Here is what all the linker primer sequences look like- GTGTGYCAGCMGCCGCGGTAA

Oh I also tried cutadapt on the demultiplexed sequences (cutadapt demux paired) and put the same metadata column for forward and reverse barcodes, this also generated the same rep-seq result after DADA2.

Thanks!

thermokarst · March 9, 2020, 11:12pm

Hi @morgvevans!

The demultiplexing step doesn't concatenate reads, so if the linker is is winding up in the middle of the sequences it is because they are being joined by DADA2 while denoising. You were on the right track earlier - you need to remove all non-biological sequence from your reads prior to running DADA2. If your linker is a fixed length, you can use the trim/trunc params of q2-dada2 to remove, otherwise, use q2-cutadapt on the demultiplexed reads.

I am pretty sure cutadapt can find nts anywhere in a read, take a closer look at this section of the cutadapt docs for more guidance:

https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types

https://docs.qiime2.org/2020.2/plugins/available/cutadapt/trim-paired/

Keep us posted, and good luck! :qiime2:

morgvevans · March 13, 2020, 2:45pm

Hi,
Thanks for the reply and assist!
Some updates. I received the demultiplexed sequences from the sequencing center and have also tried working on those.
Basically, regardless of demultiplexed and multiplexed, there is something wrong here.

In the rep-seqs from dada2, there are a handful (10-15%) of sequences that are 280bp or longer, much higher than the mean (around 200). In those sequences, and those alone, are the barcodes and linker primers. This occurs regardless of settings I place - I’ve tried trimming the first 33 bp from the reads using dada2 - since this is the length of the linker primer and barcode - and it occurs in every case, whether I started multiplexed or demultiplexed.

We also had a handful of samples that the center said did not amplify well - I removed those from the analysis altogether, and this phenomenon still occurs. It basically seems like the commands to trim don’t work at all on these weird specific sequences.

Can someone assist? I have attached my .qzv file from the import step, then the .qzvs from the dada2 runs (trimmed using 33bp = rep-reqs2, untrimmed = rep-seqs).

argonne_march2020.qzv (321.0 KB) argonne_march2020_rep_seqs.qzv (2.3 MB) argonne_march2020_rep_seqs2.qzv (2.1 MB) argonne_metadata_removed.tsv (42.0 KB)

thermokarst · March 13, 2020, 10:10pm

Have you tried my suggestion above, about using q2-cutadapt? If so, please share what you have done.

Thanks!

morgvevans · March 13, 2020, 11:11pm

Yes, I used the cutadapt initially on the multiplexed sequences with no change in result.

I tried both of these commands with various settings changed, no luck.

qiime cutadapt demux-paired --i-seqs argonne_jan2020.qza --m-forward-barcodes-file argonne_metadata.tsv --m-forward-barcodes-column BarcodeSequences --m-reverse-barcodes-file argonne_metadata.tsv --m-reverse-barcodes-column BarcodeSequences --o-per-sample-sequences argonne_demux_cutadapt.qza --o-untrimmed-sequences argonne_demux_cutadapt_untrimmed.qza

qiime cutadapt trim-paired --i-demultiplexed-sequences argonne_demux_cutadapt.qza --p-cores 14 --p-anywhere-f GTGTGYCAGCMGCCGCGGTAA --p-anywhere-r GTGTGYCAGCMGCCGCGGTAA --o-trimmed-sequences argonne_demux_cutadapt_linkerprimertrimmed.qza

Thanks

morgvevans · March 13, 2020, 11:18pm

To clarify I tried the first on the multiplexed sequences, the second command after demux.

thermokarst · March 13, 2020, 11:25pm

Can you rerun this command, but this time include the --verbose flag? Please include the text output that is printed to the screen. This should given us more details on what cutadapt is doing under the hood.

morgvevans · March 14, 2020, 12:03am

cutadapt_out_verbose.txt (1.4 MB)

Let me know if I should re-run dada2 with --verbose! Thank you!

thermokarst · March 16, 2020, 11:33pm

Great, thanks for sharing (and no need to share your dada2 logs, let's not worry about doing anything with these reads until you have cleaned them up)!

Did you get a chance to review that log, yourself? It looks like for most of your ~480 samples, the adapter is only found 0.0%-0.5% of the time.

This is cutadapt 2.8 with Python 3.6.7
Command line parameters: --cores 14 --error-rate 0.1 --times 1 --overlap 3 --minimum-length 1 -o /tmp/pbstmp.9661118/q2-CasavaOneEightSingleLanePerSampleDirFmt-eqof2oe7/61_S61_L001_R1_001.fastq.gz -p /tmp/pbstmp.9661118/q2-CasavaOneEightSingleLanePerSampleDirFmt-eqof2oe7/61_S61_L001_R2_001.fastq.gz --anywhere GTGTGYCAGCMGCCGCGGTAA -B GTGTGYCAGCMGCCGCGGTAA /tmp/pbstmp.9661118/qiime2-archive-xr1jy5_4/8a91d19d-672d-4305-99ae-ed6d4308b50e/data/61_S61_L001_R1_001.fastq.gz /tmp/pbstmp.9661118/qiime2-archive-xr1jy5_4/8a91d19d-672d-4305-99ae-ed6d4308b50e/data/61_S61_L001_R2_001.fastq.gz
Processing reads on 14 cores in paired-end mode ...
Finished in 1.49 s (50 us/read; 1.19 M reads/minute).

=== Summary ===

Total read pairs processed:             29,469
  Read 1 with adapter:                      67 (0.2%)
  Read 2 with adapter:                     111 (0.4%)
Pairs that were too short:                   0 (0.0%)
Pairs written (passing filters):        29,469 (100.0%)

Total basepairs processed:    14,793,438 bp
  Read 1:     7,396,719 bp
  Read 2:     7,396,719 bp
Total written (filtered):     14,792,780 bp (100.0%)
  Read 1:     7,396,495 bp
  Read 2:     7,396,285 bp

=== First read: Adapter 1 ===

Sequence: GTGTGYCAGCMGCCGCGGTAA; Type: variable 5'/3'; Length: 21; Trimmed: 67 times; Reverse-complemented: 0 times
5 times, it overlapped the 5' end of a read
62 times, it overlapped the 3' end or was within the read

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-21 bp: 2

Overview of removed sequences (5')
length	count	expect	max.err	error counts
3	4	460.5	0	4
4	1	115.1	0	1



Overview of removed sequences (3' or within)
length	count	expect	max.err	error counts
3	47	460.5	0	47
4	9	115.1	0	9
5	5	28.8	0	5
6	1	7.2	0	1


=== Second read: Adapter 2 ===

Sequence: GTGTGYCAGCMGCCGCGGTAA; Type: variable 5'/3'; Length: 21; Trimmed: 111 times; Reverse-complemented: 0 times
5 times, it overlapped the 5' end of a read
106 times, it overlapped the 3' end or was within the read

No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-21 bp: 2

Overview of removed sequences (5')
length	count	expect	max.err	error counts
3	4	460.5	0	4
6	1	7.2	0	1



Overview of removed sequences (3' or within)
length	count	expect	max.err	error counts
3	8	460.5	0	8
4	98	115.1	0	98

Does that coincide with what you're seeing? It sounds to me like the answer is "no", these observation rates are probably a lot lower than what you're seeing. I think now is the time to roll up your sleeves and get to work, you need to be able to tell cutadapt a little but more about the situation. I noticed you're using the anywhere parameters, but I'll point out this little caveat, pulled from the cutadapt docs:

This option is mostly for rescuing failed library preparations - do not use if you know which end your adapter was ligated to.

How about you check with your sequencing center and track down that information (what end is the adapter ligated to), then we can revise your q2-cutadapt command as needed. Keep us posted!

:qiime2:

morgvevans · March 17, 2020, 7:07pm

Hi @thermokarst

Thanks for all your help thus far. I reached out to Argonne last week for assistance, they will be getting back to me hopefully.

I went ahead and tried both "p-front-" and "p-adapter-" settings for cutadapt and both gave the same result as the one above - the adapter is rarely being found.

qiime cutadapt trim-paired --i-demultiplexed-sequences argonne_march2020.qza --p-cores 14 --p-front-f GTGTGYCAGCMGCCGCGGTAA --p-front-r GTGTGYCAGCMGCCGCGGTAA --o-trimmed-sequences argonne_cutadapt_front_march172020.qza --verbose

Result:
Total read pairs processed: 25,653
Read 1 with adapter: 16 (0.1%)
Read 2 with adapter: 9 (0.0%)
Pairs that were too short: 0 (0.0%)
Pairs written (passing filters): 25,653 (100.0%)

qiime cutadapt trim-paired --i-demultiplexed-sequences argonne_march2020.qza --p-cores 14 --p-adapter-f GTGTGYCAGCMGCCGCGGTAA --p-adapter-r GTGTGYCAGCMGCCGCGGTAA --o-trimmed-sequences argonne_cutadapt_adapter_march172020.qza --verbose

Result:
Total read pairs processed: 25,653
Read 1 with adapter: 9 (0.0%)
Read 2 with adapter: 31 (0.1%)
Pairs that were too short: 0 (0.0%)
Pairs written (passing filters): 25,653 (100.0%)

The seq center said she removed all adapters and linkers in the demultiplexed sequences I am now working with, so that may be why I dont find them in this cutadapt output, but then I find them in the rep-seqs from dada2, so they have to still be in there...

Here's what the sequencing tech said in response to my inquiry:

"To answer your questions, these are golay barcodes. You do not need to use the reverse compliment of these to demultiplex because the barcode is on the forward primer 515F per Parada et al. 2016. I’m going to include a layout of the primer below in case there’s been a miscommunication and this clears things up:

Field descriptions (space-delimited):

5′ Illumina adapter
Golay barcode
Forward primer pad
Forward primer linker
Forward primer (515F)

AATGATACGGCGACCACCGAGATCTACACGCT XXXXXXXXXXXX TATGGTAATT GT GTGYCAGCMGCCGCGGTAA

As a reminder, the way the primers are designed, the sequencing starts off of a primer pad, so the primers themselves don’t get sequenced, only the region amplified by the primers is sequenced."

Let me know if you have other thoughts, otherwise I will just wait for the sequencing center to sort things out on their end. Thanks again!

morgvevans · March 17, 2020, 7:43pm

Do you think it has something to do with wildcards in the linker primer ?

Thanks

thermokarst · March 23, 2020, 3:00pm

Is it possible that its just noise? Perhaps related to how DADA2 is joining your reads?

Maybe? The default in cutadapt (and q2-cutadapt) is to match wildcards in the adapters (but not in the biological sequence). You could try experimenting with that. I think now would be a good time to spend 30-45 minutes reviewing the cutadapt documentation and making sure that everything is lining up with your command invocation (w.r.t. the sequencing design).

Keep us posted!