Is there a way to get cutadapt demux-single to run in parallel?

perdita · December 9, 2020, 2:12pm

Regardless of the argument to --p-batch-size (assuming that is the parallel control), qiime cutadapt demux-single prints out "Processing reads on 1 core in single-end mode" and only uses one core. Is there a way to pass an argument through to cutadapt for --cores=N?

Using the 2020.8 release, ami-06f673d6f55b3c8c3, as it doesn't seem the 2020.11.1 is out yet.

Thanks.

thermokarst · December 9, 2020, 2:45pm

Hi @perdita!

Unfortunately, --p-batch-size isn't a parallelism control. Here is the help text for the batch-size parameter:

  --p-batch-size INTEGER  The number of samples cutadapt demultiplexes
    Range(0, None)        concurrently. Demultiplexing in smaller batches will
                          yield the same result with marginal speed loss, and
                          may solve "too many files" errors related to sample
                          quantity. Set to "0" to process all samples at once.

Basically, there are cases where you might not be able to demux because you have too many samples (and the host machine can't open enough files to cover all the samples), so the batch-size setting is a little "escape hatch" to help you out.
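A minimal sketch of how one might check whether the "too many files" case applies, assuming the barcode file is a tab-separated table with one header row and that cutadapt keeps roughly one output file open per sample. The function names are hypothetical, and the comparison is only approximate because a process also holds other files open.

```shell
# count_samples FILE: number of data rows in a barcode TSV (header excluded).
count_samples() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# needs_batching N_SAMPLES LIMIT: succeeds (exit 0) when the sample count
# reaches the open-file limit. A limit of "unlimited" never needs batching.
needs_batching() {
    [ "$2" != "unlimited" ] && [ "$1" -ge "$2" ]
}

# Example usage (Barcodes.tsv is the file name used later in this thread):
#   if needs_batching "$(count_samples Barcodes.tsv)" "$(ulimit -n)"; then
#       echo "consider a non-zero --p-batch-size"
#   fi
```

With a few hundred samples and a typical limit of 1024 open files, the check would not fire, which matches the advice to leave batching off unless it is needed.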

As for passing --cores through to cutadapt: not yet, but it's on our radar. Please note that cutadapt only gained support for multi-core demultiplexing last month:

https://cutadapt.readthedocs.io/en/stable/changes.html#v3-0-2020-11-10

perdita · December 9, 2020, 2:47pm

Thank you very much for the information. Guess it's just back to waiting for 7+ hours for this to finish then!

thermokarst · December 9, 2020, 2:49pm

You could also just run cutadapt on its own and then import the demultiplexed results into QIIME 2. The new 2020.11 release of q2-cutadapt has cutadapt 3 in it, so you'll be able to use the new multi-core features. The VMs will come out soon, but in the meantime you could install in your existing AMI using the "native linux" installation guide ("Natively installing QIIME 2" in the QIIME 2 2020.11.1 documentation).
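A rough sketch of that standalone route, assuming cutadapt 3.0 or later and QIIME 2 are on the PATH, the multiplexed reads are already available as a plain gzipped FASTQ file, the barcode table is tab-separated with one header row, and quality scores are Phred+33. The function names and the files `barcodes.fasta`, `demux/`, `untrimmed.fastq.gz`, and `manifest.tsv` are hypothetical.

```shell
# barcodes_to_fasta TSV: convert a two-column (id <TAB> barcode) table with a
# header row into the FASTA layout that cutadapt reads via "file:".
barcodes_to_fasta() {
    awk -F '\t' 'NR > 1 && NF >= 2 { printf ">%s\n%s\n", $1, $2 }' "$1"
}

# write_manifest DIR: print a QIIME 2 single-end manifest for DIR/*.fastq.gz,
# using each file's base name (minus .fastq.gz) as the sample ID.
write_manifest() {
    manifest_dir=$(cd "$1" && pwd) || return 1
    printf 'sample-id\tabsolute-filepath\n'
    for f in "$manifest_dir"/*.fastq.gz; do
        [ -e "$f" ] || continue
        printf '%s\t%s\n' "$(basename "$f" .fastq.gz)" "$f"
    done
}

# demux_standalone BARCODES_TSV READS_FASTQ_GZ CORES
# Demultiplex with cutadapt directly, then import the result into QIIME 2.
demux_standalone() {
    mkdir -p demux
    barcodes_to_fasta "$1" > barcodes.fasta
    cutadapt --cores "$3" --error-rate 0 -g file:barcodes.fasta \
        -o "demux/{name}.fastq.gz" \
        --untrimmed-output untrimmed.fastq.gz "$2"
    write_manifest demux > manifest.tsv
    qiime tools import \
        --type 'SampleData[SequencesWithQuality]' \
        --input-path manifest.tsv \
        --input-format SingleEndFastqManifestPhred33V2 \
        --output-path demux.qza
}
```

Something like `demux_standalone Barcodes.tsv forward.fastq.gz 8` would then run the whole chain. Reads without a matching barcode are written outside `demux/` so they do not end up in the manifest.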


perdita · December 10, 2020, 3:35pm

It is interesting that --p-batch-size 0 gives 0.03M reads/minute, and --p-batch-size 70 around 0.16M reads/minute. With around 14 million reads, I wonder what the ideal parameters or EC2 instance type are.
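For a sense of scale, here is a back-of-the-envelope estimate, taking the 14 million figure as the total read count and assuming a constant throughput (both are simplifications). The helper name is hypothetical.

```shell
# estimate_minutes READS_IN_MILLIONS RATE_IN_MILLIONS_PER_MINUTE
# Prints the expected wall-clock time in minutes at a constant rate.
estimate_minutes() {
    awk -v reads="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", reads / rate }'
}

estimate_minutes 14 0.03   # rate seen with --p-batch-size 0
estimate_minutes 14 0.16   # rate seen with --p-batch-size 70
```

This prints 466.7 and 87.5 minutes, i.e. nearly 8 hours versus about an hour and a half.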

perdita · December 20, 2020, 6:20am

Is --p-batch-size X (X > 0) supposed to throw away all but the first X samples?

thermokarst · December 20, 2020, 2:31pm

Hi @perdita!

No, it shouldn't throw away any samples. Are you observing a case where samples are being discarded? If so please share any relevant commands and/or data.

perdita · December 20, 2020, 6:28pm

We are new to both QIIME 1 and QIIME 2, but it appears that samples are discarded under QIIME 2 2020.8 with "qiime cutadapt demux-single --p-batch-size 70":

(qiime2-2020.8) qiime2@ip-xxx-xx-xx-xxx:~$ wc -l Barcodes.tsv
389 Barcodes.tsv

(qiime2-2020.8) qiime2@ip-xxx-xx-xx-xxx:~$ /usr/bin/time -v qiime cutadapt demux-single --i-seqs seqs.qza --m-barcodes-file Barcodes.tsv --m-barcodes-column BarcodeSequence --p-error-rate 0 --o-per-sample-sequences test70seqs.qza --o-untrimmed-sequences test70seqs_untrimmed.qza --p-batch-size 70 --verbose
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: cutadapt --front file:/tmp/tmpaa40v14c --error-rate 0.0 --minimum-length 1 -o /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-x8b60w88/{name}.1.fastq.gz --untrimmed-output /tmp/q2-MultiplexedSingleEndBarcodeInSequenceDirFmt-w66plnnj/forward.fastq.gz /tmp/qiime2-archive-vwh1xeal/ce2b8486-3126-4471-b281-aaa063f3fc54/data/forward.fastq.gz

This is cutadapt 2.10 with Python 3.6.10
Command line parameters: --front file:/tmp/tmpaa40v14c --error-rate 0.0 --minimum-length 1 -o /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-x8b60w88/{name}.1.fastq.gz --untrimmed-output /tmp/q2-MultiplexedSingleEndBarcodeInSequenceDirFmt-w66plnnj/forward.fastq.gz /tmp/qiime2-archive-vwh1xeal/ce2b8486-3126-4471-b281-aaa063f3fc54/data/forward.fastq.gz
Processing reads on 1 core in single-end mode ...
[  8=--------] 00:05:24       940,000 reads  @    426.0 µs/read;   0.14 M reads/minute

...Some output truncated...

Command: cutadapt --front file:/tmp/tmpwgm4n_a0 --error-rate 0.0 --minimum-length 1 -o /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-x8b60w88/{name}.1.fastq.gz --untrimmed-output /tmp/q2-MultiplexedSingleEndBarcodeInSequenceDirFmt-0sz3ee_p/forward.fastq.gz /tmp/q2-MultiplexedSingleEndBarcodeInSequenceDirFmt-imo04mk4/forward.fastq.gz

This is cutadapt 2.10 with Python 3.6.10
Command line parameters: --front file:/tmp/tmpwgm4n_a0 --error-rate 0.0 --minimum-length 1 -o /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-x8b60w88/{name}.1.fastq.gz --untrimmed-output /tmp/q2-MultiplexedSingleEndBarcodeInSequenceDirFmt-0sz3ee_p/forward.fastq.gz /tmp/q2-MultiplexedSingleEndBarcodeInSequenceDirFmt-imo04mk4/forward.fastq.gz
Processing reads on 1 core in single-end mode ...

No reads processed!
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: cutadapt --front file:/tmp/tmpc77bocke --error-rate 0.0 --minimum-length 1 -o /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-x8b60w88/{name}.1.fastq.gz --untrimmed-output /tmp/q2-MultiplexedSingleEndBarcodeInSequenceDirFmt-voo3bvdl/forward.fastq.gz /tmp/q2-MultiplexedSingleEndBarcodeInSequenceDirFmt-0sz3ee_p/forward.fastq.gz

This is cutadapt 2.10 with Python 3.6.10
Command line parameters: --front file:/tmp/tmpc77bocke --error-rate 0.0 --minimum-length 1 -o /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-x8b60w88/{name}.1.fastq.gz --untrimmed-output /tmp/q2-MultiplexedSingleEndBarcodeInSequenceDirFmt-voo3bvdl/forward.fastq.gz /tmp/q2-MultiplexedSingleEndBarcodeInSequenceDirFmt-0sz3ee_p/forward.fastq.gz
Processing reads on 1 core in single-end mode ...

No reads processed!
Saved SampleData[SequencesWithQuality] to: test70seqs.qza
Saved MultiplexedSingleEndBarcodeInSequence to: test70seqs_untrimmed.qza
Command being timed: "qiime cutadapt demux-single --i-seqs seqs.qza --m-barcodes-file Barcodes.tsv --m-barcodes-column BarcodeSequence --p-error-rate 0 --o-per-sample-sequences test70seqs.qza --o-untrimmed-sequences test70seqs_untrimmed.qza --p-batch-size 70 --verbose"
    User time (seconds): 5492.59
    System time (seconds): 25.83
    Percent of CPU this job got: 100%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:31:28
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 219840
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 223
    Minor (reclaiming a frame) page faults: 638359
    Voluntary context switches: 1682834
    Involuntary context switches: 484718
    Swaps: 0
    File system inputs: 5976744
    File system outputs: 16243160
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

And then to check the output:

(qiime2-2020.8) qiime2@ip-xxx-xx-xx-xxx:~$ qiime tools extract --input-path test70seqs.qza --output-path test70seqs_extract
Extracted test70seqs.qza to directory test70seqs_extract/061c09bb-5861-478e-9459-141469b6e185

(qiime2-2020.8) qiime2@ip-xxx-xx-xx-xxx:~$ ls test70seqs_extract/061c09bb-5861-478e-9459-141469b6e185/data/ | wc -l
72

It looks like 388 samples went in (389 lines minus 1 for the column headers) and 70 samples came out (72 entries minus 2 for the MANIFEST and metadata.yml files). When we run the same command as above but omit the batch size parameter, we get 377 entries in the extracted result (fewer than 388, presumably due to the sequence data, but obviously more than 70).
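To see which samples are absent rather than just how many, something along these lines could compare the barcode file against the extracted data directory. It is only a sketch: it assumes a tab-separated barcode file with one header row, and it assumes the per-sample files are named like `<sample-id>_<n>_L001_R1_001.fastq.gz`. The function names are hypothetical.

```shell
# expected_ids TSV: sample IDs from the first column, header skipped.
expected_ids() {
    awk -F '\t' 'NR > 1 && NF > 0 { print $1 }' "$1" | sort
}

# observed_ids DIR: sample IDs recovered from per-sample FASTQ file names.
# Files that do not match the assumed pattern (MANIFEST, metadata.yml) are ignored.
observed_ids() {
    for f in "$1"/*.fastq.gz; do
        [ -e "$f" ] || continue
        basename "$f"
    done | sed -n -E 's/_[0-9]+_L[0-9]+_R[12]_001\.fastq\.gz$//p' | sort
}

# missing_samples TSV DIR: IDs listed in the barcode file with no output file.
missing_samples() {
    observed_tmp=$(mktemp)
    observed_ids "$2" > "$observed_tmp"
    expected_ids "$1" | comm -23 - "$observed_tmp"
    rm -f "$observed_tmp"
}
```

For example, `missing_samples Barcodes.tsv test70seqs_extract/061c09bb-5861-478e-9459-141469b6e185/data` would list the IDs with no output file. A sample can also be legitimately missing because no reads carried its barcode.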

thermokarst · December 20, 2020, 6:58pm

Thanks for sharing, @perdita.

This looks like it might be related to the bug reported here: Discarded results produced by q2-cutadapt's trim-paired when working with mixed-orientation reads

This was addressed in the 2020.11 release - can you re-run using 2020.11 to check?

If not, please share the entire log produced when running with --verbose and/or DM me links to the files to reproduce these results locally.

Bigger picture - the batch-size parameter is only there to assist with cases where you might have a massive number of samples - Unix systems have a limit on how many files can be open at a time. This parameter subdivides the samples and processes them in smaller units to get around that issue. It comes at the cost of increasing the runtime, so if you don't need it, I wouldn't use it (and it sounds like in the case above, you didn't need it).

Keep me posted.


thermokarst · December 20, 2020, 9:09pm

Just a quick update - based on my local testing, I don't actually think the bug I linked to above is related to the issue you reported. I think the problem is you might not be anchoring your barcodes in your sample metadata. The anchor tells cutadapt where to search for the barcode in the forward reads. I played around with some test data, and I can confirm that if I don't anchor, but do enable the batch processing, I lose a bunch of samples. This is easily remedied by anchoring the barcodes in the metadata:

id	barcode
s0	^CGGGAATCTCCG
s1	^GCTCCGGAAGAG
s2	^GTCTATGCATGT
s3	^GGGACCAAATGA
s4	^ATAAAATATCAC
s5	^TCCATAAAGGTA
s6	^ATACCAGAGCTT
s7	^GGAAATTCCATC
s8	^TTAGCTCCGGGA
...

The position of the anchor depends on how your sequences are laid out, but the most common is for them to be on the 5' end, or thereabouts (I think). You can read more about adapter types in the cutadapt docs:

https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types

There is one other really nice benefit to anchoring - it significantly speeds up the runtime of cutadapt, because cutadapt performs a more targeted search.

No need to upgrade to 2020.11, but please update your barcodes to include a more specific anchor or adapter type, and you should hopefully be all set.
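One possible way to add the anchor without editing the file by hand, sketched with awk. It assumes a tab-separated metadata file whose first row is the header, and that a 5' anchor (`^`) is the right choice for your read layout. The function name and output file name are hypothetical.

```shell
# anchor_barcodes TSV COLUMN_NAME: prefix every value in the named column
# with "^", leaving the header, comment rows, empty values, and values that
# are already anchored untouched. Writes the new table to stdout.
anchor_barcodes() {
    awk -F '\t' -v OFS='\t' -v col="$2" '
        NR == 1 {
            # Locate the barcode column by its header name.
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print; next
        }
        /^#/ { print; next }    # keep comment/directive rows as they are
        NF >= c && $c != "" && $c !~ /^\^/ { $c = "^" $c }
        { print }
    ' "$1"
}

# Example usage:
#   anchor_barcodes Barcodes.tsv BarcodeSequence > Barcodes.anchored.tsv
```

The anchored file can then be passed to --m-barcodes-file in place of the original.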

Keep us posted!


perdita · December 20, 2020, 9:18pm

Interesting, and thank you very much for the additional information. Will definitely look into it! The barcode TSV file looked like this:

SampleID        BarcodeSequence
A-01-Sample519wF        AAAGCCCT
A-02-Sample519wF        AAAAGTTC
A-03-Sample519wF        AAAACAAA
A-04-Sample519wF        AAACCGGG
A-05-Sample519wF        AAATAGCT
A-06-Sample519wF        AAACTCTG
A-07-Sample519wF        AAACAGCC
A-08-Sample519wF        AAACCAAT
A-09-Sample519wF        AAAACTGG
A-10-Sample519wF        AAAACCGC

perdita · January 6, 2021, 8:48pm

The 2020.11.1 release seemed to work correctly when using the --p-batch-size argument, without anchoring the barcodes.

perdita · January 9, 2021, 9:43pm

Anchoring the barcodes, as suggested above, increases throughput to an average of 1.8 M reads/minute. That is tremendous, thank you for the tip.
