Regardless of the argument to --p-batch-size (assuming that is the parallel control), qiime cutadapt demux-single prints "Processing reads on 1 core in single-end mode" and only uses one core. Is there a way to pass an argument through to cutadapt for --cores=N?
Using the 2020.8 release, ami-06f673d6f55b3c8c3, as it doesn't seem that 2020.11.1 is out yet.
Unfortunately, there isn't. Here is the help text for the batch-size parameter:
--p-batch-size INTEGER The number of samples cutadapt demultiplexes
Range(0, None) concurrently. Demultiplexing in smaller batches will
yield the same result with marginal speed loss, and
may solve "too many files" errors related to sample
quantity. Set to "0" to process all samples at once.
Basically, there are cases where you might not be able to demux because you have too many samples (and the host machine can't open enough files to cover all the samples), so the batch-size setting is a little "escape hatch" to help you out.
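For example (a sketch using the same flags as the command later in this thread; file names are placeholders), processing 50 samples at a time instead of all at once:

qiime cutadapt demux-single \
  --i-seqs seqs.qza \
  --m-barcodes-file metadata.tsv \
  --m-barcodes-column BarcodeSequence \
  --p-batch-size 50 \
  --o-per-sample-sequences demux.qza \
  --o-untrimmed-sequences untrimmed.qza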
Not yet, but it's on our radar. Please note, cutadapt only just got support for multi-core demuxing last month:
You could also just run cutadapt on its own and then import the demultiplexed results into QIIME 2. The new 2020.11 release of q2-cutadapt ships with cutadapt 3, so you'll be able to use the new multi-core features. The VMs will come out soon, but in the meantime you could install into your existing AMI using the "native Linux" installation guide: Natively installing QIIME 2 — QIIME 2 2020.11.1 documentation
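Here is a minimal sketch of that workflow (the file names, barcode FASTA, and manifest are placeholders you'd replace with your own):

# Demultiplex with standalone cutadapt 3.x across multiple cores.
# barcodes.fasta maps sample IDs to barcode sequences; a "^" prefix on
# each sequence anchors the barcode to the 5' end of the read.
cutadapt \
  --cores 8 \
  -g file:barcodes.fasta \
  --error-rate 0 \
  -o "demux/{name}.fastq.gz" \
  --untrimmed-output demux/untrimmed.fastq.gz \
  forward.fastq.gz

# Import the per-sample files with a fastq manifest (manifest.tsv lists
# sample-id and absolute-filepath columns pointing at the files above).
qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-format SingleEndFastqManifestPhred33V2 \
  --input-path manifest.tsv \
  --output-path demux.qza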
It is interesting that --p-batch-size 0 gives 0.03M reads/minute, while --p-batch-size 70 gives around 0.16M reads/minute. With around 14 million reads, I wonder what the ideal parameters or EC2 instance type would be.
No, it shouldn't throw away any samples. Are you observing a case where samples are being discarded? If so please share any relevant commands and/or data.
We are new to both QIIME 1 and QIIME 2, but it appears that samples are discarded under QIIME 2 2020.8 with "qiime cutadapt demux-single --p-batch-size 70":
(qiime2-2020.8) qiime2@ip-xxx-xx-xx-xxx:~$ wc -l Barcodes.tsv
389 Barcodes.tsv
(qiime2-2020.8) qiime2@ip-xxx-xx-xx-xxx:~$ /usr/bin/time -v qiime cutadapt demux-single --i-seqs seqs.qza --m-barcodes-file Barcodes.tsv --m-barcodes-column BarcodeSequence --p-error-rate 0 --o-per-sample-sequences test70seqs.qza --o-untrimmed-sequences test70seqs_untrimmed.qza --p-batch-size 70 --verbose
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.
Command: cutadapt --front file:/tmp/tmpaa40v14c --error-rate 0.0 --minimum-length 1 -o /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-x8b60w88/{name}.1.fastq.gz --untrimmed-output /tmp/q2-MultiplexedSingleEndBarcodeInSequenceDirFmt-w66plnnj/forward.fastq.gz /tmp/qiime2-archive-vwh1xeal/ce2b8486-3126-4471-b281-aaa063f3fc54/data/forward.fastq.gz
This is cutadapt 2.10 with Python 3.6.10
Command line parameters: --front file:/tmp/tmpaa40v14c --error-rate 0.0 --minimum-length 1 -o /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-x8b60w88/{name}.1.fastq.gz --untrimmed-output /tmp/q2-MultiplexedSingleEndBarcodeInSequenceDirFmt-w66plnnj/forward.fastq.gz /tmp/qiime2-archive-vwh1xeal/ce2b8486-3126-4471-b281-aaa063f3fc54/data/forward.fastq.gz
Processing reads on 1 core in single-end mode ...
[ 8=--------] 00:05:24 940,000 reads @ 426.0 µs/read; 0.14 M reads/minute
...Some output truncated...
Command: cutadapt --front file:/tmp/tmpwgm4n_a0 --error-rate 0.0 --minimum-length 1 -o /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-x8b60w88/{name}.1.fastq.gz --untrimmed-output /tmp/q2-MultiplexedSingleEndBarcodeInSequenceDirFmt-0sz3ee_p/forward.fastq.gz /tmp/q2-MultiplexedSingleEndBarcodeInSequenceDirFmt-imo04mk4/forward.fastq.gz
This is cutadapt 2.10 with Python 3.6.10
Command line parameters: --front file:/tmp/tmpwgm4n_a0 --error-rate 0.0 --minimum-length 1 -o /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-x8b60w88/{name}.1.fastq.gz --untrimmed-output /tmp/q2-MultiplexedSingleEndBarcodeInSequenceDirFmt-0sz3ee_p/forward.fastq.gz /tmp/q2-MultiplexedSingleEndBarcodeInSequenceDirFmt-imo04mk4/forward.fastq.gz
Processing reads on 1 core in single-end mode ...
No reads processed!
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.
Command: cutadapt --front file:/tmp/tmpc77bocke --error-rate 0.0 --minimum-length 1 -o /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-x8b60w88/{name}.1.fastq.gz --untrimmed-output /tmp/q2-MultiplexedSingleEndBarcodeInSequenceDirFmt-voo3bvdl/forward.fastq.gz /tmp/q2-MultiplexedSingleEndBarcodeInSequenceDirFmt-0sz3ee_p/forward.fastq.gz
This is cutadapt 2.10 with Python 3.6.10
Command line parameters: --front file:/tmp/tmpc77bocke --error-rate 0.0 --minimum-length 1 -o /tmp/q2-CasavaOneEightSingleLanePerSampleDirFmt-x8b60w88/{name}.1.fastq.gz --untrimmed-output /tmp/q2-MultiplexedSingleEndBarcodeInSequenceDirFmt-voo3bvdl/forward.fastq.gz /tmp/q2-MultiplexedSingleEndBarcodeInSequenceDirFmt-0sz3ee_p/forward.fastq.gz
Processing reads on 1 core in single-end mode ...
No reads processed!
Saved SampleData[SequencesWithQuality] to: test70seqs.qza
Saved MultiplexedSingleEndBarcodeInSequence to: test70seqs_untrimmed.qza
Command being timed: "qiime cutadapt demux-single --i-seqs seqs.qza --m-barcodes-file Barcodes.tsv --m-barcodes-column BarcodeSequence --p-error-rate 0 --o-per-sample-sequences test70seqs.qza --o-untrimmed-sequences test70seqs_untrimmed.qza --p-batch-size 70 --verbose"
User time (seconds): 5492.59
System time (seconds): 25.83
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:31:28
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 219840
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 223
Minor (reclaiming a frame) page faults: 638359
Voluntary context switches: 1682834
Involuntary context switches: 484718
Swaps: 0
File system inputs: 5976744
File system outputs: 16243160
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
That looks like 388 samples went in (389 - 1 for the header row), but only 70 samples came out (72 files - 2 for the MANIFEST and metadata.yml files). When we run the same command as above but omit the batch-size parameter, we get 377 entries in the extracted result (fewer than 388, presumably due to the sequence data, but obviously more than 70).
This was addressed in the 2020.11 release - can you re-run using 2020.11 to check?
If not, please share the entire log produced when running with --verbose and/or DM me links to the files to reproduce these results locally.
Bigger picture - the batch-size parameter is only there to assist with cases where you have massive numbers of samples - Unix systems limit how many files a process can have open at a time. This parameter subdivides the samples and processes them in smaller batches to get around that limit. It comes at the cost of increased runtime, so if you don't need it, I wouldn't use it (and it sounds like in the case above, you didn't need it).
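If you're not sure whether you'd hit that limit, you can check it directly on the host (standard on Linux; the values vary by system):

# Soft limit on open file descriptors for the current shell session.
ulimit -n
# Hard limit - the ceiling you could raise the soft limit to.
ulimit -Hn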
Just a quick update - based on my local testing, I don't actually think the bug I linked to above is related to the issue you reported. I think the problem is you might not be anchoring your barcodes in your sample metadata. The anchor tells cutadapt where to search for the barcode in the forward reads. I played around with some test data, and I can confirm that if I don't anchor, but do enable the batch processing, I lose a bunch of samples. This is easily remedied by anchoring the barcodes in the metadata:
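For example, using cutadapt's "^" anchor prefix in the barcode column (the sample IDs and barcode sequences below are made up; the column name just needs to match what you pass to --m-barcodes-column):

sample-id	BarcodeSequence
sample-1	^ACGTACGT
sample-2	^TGCATGCA

The "^" tells cutadapt that the barcode must occur at the very beginning (5' end) of the read, rather than anywhere within it.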
The position of the anchor depends on how your sequences are laid out, but most commonly the barcodes are at the 5' end, or thereabouts (I think). You can read more about adapter types in the cutadapt docs:
There is one other really nice benefit to anchoring - it significantly speeds up the runtime of cutadapt, because cutadapt performs a more targeted search.
No need to upgrade to 2020.11, but please update your barcodes to include a more specific anchor or adapter type, and you should hopefully be all set.