Does q2-cutadapt support dual-indexed reads?

LuSanto · September 28, 2019, 12:36am

Hi, any news on this?

I have an R1.fastq and the corresponding R2.fastq including multiple samples. I could not find a way to demultiplex into per sample .fastq files, given that a dual-indexing strategy was used. The only alternative seems to be bcl2fastq, but I don't have access and people at the sequencing facility couldn't manage to do it either.

Nicholas_Bokulich · September 28, 2019, 12:41am

Yes, see the help docs and note the reverse barcode option:
https://docs.qiime2.org/2019.7/plugins/available/cutadapt/demux-paired/

LuSanto · October 1, 2019, 3:36pm

Thanks!

Based on the info provided, I’m running this command:

qiime cutadapt demux-paired --i-seqs raw-data.qza --m-forward-barcodes-file metadata.tsv --m-forward-barcodes-column BarcodeSequenceForward --m-reverse-barcodes-file metadata.tsv --m-reverse-barcodes-column BarcodeSequenceReverse --o-per-sample-sequences cutadapt-demux.qza --o-untrimmed-sequences unmatched-barcodes.qza

But I get this error:

Invalid value for "--i-seqs": Expected an artifact of at least type MultiplexedPairedEndBarcodeInSequence.

So I am trying to re-import the raw data using:

qiime tools import --type 'SampleData[MultiplexedPairedEndBarcodeInSequence]' --input-path manifest.txt --output-path raw-data.qza --input-format PairedEndFastqManifestPhred33V2

But I get this error:

Semantic type SampleData[MultiplexedPairedEndBarcodeInSequence] does not have a compatible directory format.

If I check with:

qiime tools import --show-importable-types

This is what I get for SampleData:
...
SampleData[AlphaDiversity]
SampleData[BooleanSeries]
SampleData[ClassifierPredictions]
SampleData[DADA2Stats]
SampleData[FirstDifferences]
SampleData[JoinedSequencesWithQuality]
SampleData[PairedEndSequencesWithQuality]
SampleData[Probabilities]
SampleData[RegressorPredictions]
SampleData[SequencesWithQuality]
SampleData[Sequences]
...

Do I have to somehow add SampleData[MultiplexedPairedEndBarcodeInSequence] as a semantic type? Or use a different semantic type for importing?

Nicholas_Bokulich · October 1, 2019, 3:45pm

The manifest format implies that your reads are already demultiplexed, so it is incompatible with MultiplexedPairedEndBarcodeInSequence.

See here for an importing example (you will need to adjust it for paired-end data, of course, since this is a single-end example, but the idea is the same):

LuSanto · October 1, 2019, 6:16pm

Thanks. These are the final commands for importing and demultiplexing, and both run with no errors.

qiime tools import --type MultiplexedPairedEndBarcodeInSequence --input-path raw-data --output-path raw-data.qza

qiime cutadapt demux-paired --i-seqs raw-data.qza --m-forward-barcodes-file metadata.tsv --m-forward-barcodes-column BarcodeSequenceForward --m-reverse-barcodes-file metadata.tsv --m-reverse-barcodes-column BarcodeSequenceReverse --o-per-sample-sequences cutadapt-demux.qza --o-untrimmed-sequences unmatched-barcodes.qza

However, I have run into a new problem. Sequences are only being demultiplexed based on the F barcode. Furthermore, because the F barcodes are repeated, each one is only used for demultiplexing the first time it appears in the metadata file. I’m attaching part of metadata.tsv and the manifest obtained after demultiplexing. Could the problem be the repeated F barcodes? Why aren't the R barcodes being used? I tried inputting the R barcodes in both 5'-3' and 3'-5' orientation.

MANIFEST.txt (585 Bytes) metadata.tsv (713 Bytes)

thermokarst · October 1, 2019, 6:18pm

Please rerun this command with the --verbose flag and paste the results here; this will tell us what cutadapt is matching on. My guess is that you need to reverse complement your reverse barcodes.
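For anyone following along, reverse complementing a barcode column is easy to script. The snippet below is only a minimal sketch: `reverse_complement` is a hypothetical helper (not part of QIIME 2 or cutadapt), and the 8-nt sequences are example barcodes.

```python
# Hypothetical helper for illustration; not a QIIME 2 or cutadapt function.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(barcode: str) -> str:
    """Return the reverse complement of a DNA barcode (A/C/G/T/N only)."""
    # Complement every base, then reverse the resulting string.
    return barcode.translate(_COMPLEMENT)[::-1]


# Example 8-nt barcodes; replace with the values from your own metadata.
for barcode in ["GAATTCGT", "GAGATTCC", "ATTCAGAA"]:
    print(barcode, "->", reverse_complement(barcode))
```

The converted values would then go into the reverse barcode column of the metadata file before rerunning the demultiplexing command.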

LuSanto · October 1, 2019, 6:30pm

Command: cutadapt --front file:/var/folders/z7/2djb6s4j1hjdj1vvp9p4bm940000gn/T/tmpaq98eu4z --error-rate 0.1 --minimum-length 1 -o /var/folders/z7/2djb6s4j1hjdj1vvp9p4bm940000gn/T/q2-CasavaOneEightSingleLanePerSampleDirFmt-6r3jfxn0/{name}.1.fastq.gz --untrimmed-output /var/folders/z7/2djb6s4j1hjdj1vvp9p4bm940000gn/T/q2-MultiplexedPairedEndBarcodeInSequenceDirFmt-omma8zf8/forward.fastq.gz --pair-adapters -G file:/var/folders/z7/2djb6s4j1hjdj1vvp9p4bm940000gn/T/tmpz0ucd4n5 -p /var/folders/z7/2djb6s4j1hjdj1vvp9p4bm940000gn/T/q2-CasavaOneEightSingleLanePerSampleDirFmt-6r3jfxn0/{name}.2.fastq.gz --untrimmed-paired-output /var/folders/z7/2djb6s4j1hjdj1vvp9p4bm940000gn/T/q2-MultiplexedPairedEndBarcodeInSequenceDirFmt-omma8zf8/reverse.fastq.gz /var/folders/z7/2djb6s4j1hjdj1vvp9p4bm940000gn/T/qiime2-archive-rrj3e5zp/df1aaf33-eaf9-4944-bb88-aff020a1c541/data/forward.fastq.gz /var/folders/z7/2djb6s4j1hjdj1vvp9p4bm940000gn/T/qiime2-archive-rrj3e5zp/df1aaf33-eaf9-4944-bb88-aff020a1c541/data/reverse.fastq.gz

This is cutadapt 2.4 with Python 3.6.7

Command line parameters: --front file:/var/folders/z7/2djb6s4j1hjdj1vvp9p4bm940000gn/T/tmpaq98eu4z --error-rate 0.1 --minimum-length 1 -o /var/folders/z7/2djb6s4j1hjdj1vvp9p4bm940000gn/T/q2-CasavaOneEightSingleLanePerSampleDirFmt-6r3jfxn0/{name}.1.fastq.gz --untrimmed-output /var/folders/z7/2djb6s4j1hjdj1vvp9p4bm940000gn/T/q2-MultiplexedPairedEndBarcodeInSequenceDirFmt-omma8zf8/forward.fastq.gz --pair-adapters -G file:/var/folders/z7/2djb6s4j1hjdj1vvp9p4bm940000gn/T/tmpz0ucd4n5 -p /var/folders/z7/2djb6s4j1hjdj1vvp9p4bm940000gn/T/q2-CasavaOneEightSingleLanePerSampleDirFmt-6r3jfxn0/{name}.2.fastq.gz --untrimmed-paired-output /var/folders/z7/2djb6s4j1hjdj1vvp9p4bm940000gn/T/q2-MultiplexedPairedEndBarcodeInSequenceDirFmt-omma8zf8/reverse.fastq.gz /var/folders/z7/2djb6s4j1hjdj1vvp9p4bm940000gn/T/qiime2-archive-rrj3e5zp/df1aaf33-eaf9-4944-bb88-aff020a1c541/data/forward.fastq.gz /var/folders/z7/2djb6s4j1hjdj1vvp9p4bm940000gn/T/qiime2-archive-rrj3e5zp/df1aaf33-eaf9-4944-bb88-aff020a1c541/data/reverse.fastq.gz

Processing reads on 1 core in paired-end mode ...

[ 8<--] 00:01:45 1,214,093 reads @ 87.0 µs/read; 0.69 M reads/minute

Finished in 105.66 s (87 us/read; 0.69 M reads/minute).

=== Summary ===

Total read pairs processed: 1,214,093

Read 1 with adapter: 1,175 (0.1%)

Read 2 with adapter: 1,175 (0.1%)

Pairs that were too short: 1 (0.0%)

Pairs written (passing filters): 2,428,184 (200.0%)

Total basepairs processed: 609,474,686 bp

Read 1: 304,737,343 bp

Read 2: 304,737,343 bp

Total written (filtered): 609,457,317 bp (100.0%)

Read 1: 304,726,734 bp

Read 2: 304,730,583 bp

=== First read: Adapter D3.31.15 ===

Sequence: CCTATCCT; Type: regular 5'; Length: 8; Trimmed: 453 times.

No. of allowed errors:

0-8 bp: 0

Overview of removed sequences

length count expect max.err error counts

3 263 18970.2 0 263

4 72 4742.6 0 72

5 50 1185.6 0 50

6 1 296.4 0 1

8 1 18.5 0 1

10 2 18.5 0 2

12 34 18.5 0 34

19 1 18.5 0 1

23 1 18.5 0 1

31 1 18.5 0 1

32 1 18.5 0 1

36 1 18.5 0 1

39 1 18.5 0 1

46 1 18.5 0 1

50 1 18.5 0 1

53 1 18.5 0 1

57 1 18.5 0 1

60 1 18.5 0 1

62 1 18.5 0 1

64 1 18.5 0 1

77 2 18.5 0 2

84 1 18.5 0 1

96 1 18.5 0 1

98 1 18.5 0 1

108 1 18.5 0 1

112 1 18.5 0 1

120 1 18.5 0 1

131 1 18.5 0 1

138 1 18.5 0 1

158 1 18.5 0 1

174 1 18.5 0 1

191 1 18.5 0 1

217 1 18.5 0 1

234 1 18.5 0 1

250 1 18.5 0 1

251 1 18.5 0 1

=== First read: Adapter D5.5.15 ===

Sequence: CCTATCCT; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== First read: Adapter D5.27.15 ===

Sequence: CCTATCCT; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== First read: Adapter D6.2.15 ===

Sequence: CCTATCCT; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== First read: Adapter D7.9.15 ===

Sequence: AGGCGAAG; Type: regular 5'; Length: 8; Trimmed: 205 times.

No. of allowed errors:

0-8 bp: 0

Overview of removed sequences

length count expect max.err error counts

3 65 18970.2 0 65

4 13 4742.6 0 13

5 5 1185.6 0 5

12 122 18.5 0 122

=== First read: Adapter D7.24.15 ===

Sequence: AGGCGAAG; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== First read: Adapter D8.7.15 ===

Sequence: AGGCGAAG; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== First read: Adapter D8.21.15 ===

Sequence: TAATCTTA; Type: regular 5'; Length: 8; Trimmed: 195 times.

No. of allowed errors:

0-8 bp: 0

Overview of removed sequences

length count expect max.err error counts

3 90 18970.2 0 90

4 21 4742.6 0 21

5 7 1185.6 0 7

6 4 296.4 0 4

12 73 18.5 0 73

=== First read: Adapter D9.11.15 ===

Sequence: TAATCTTA; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== First read: Adapter D9.25.15 ===

Sequence: TAATCTTA; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== First read: Adapter D10.9.15 ===

Sequence: TAATCTTA; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== First read: Adapter D10.24.15 ===

Sequence: TAATCTTA; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== First read: Adapter D11.6.15 ===

Sequence: CAGGACGT; Type: regular 5'; Length: 8; Trimmed: 153 times.

No. of allowed errors:

0-8 bp: 0

Overview of removed sequences

length count expect max.err error counts

3 69 18970.2 0 69

4 12 4742.6 0 12

5 4 1185.6 0 4

12 67 18.5 0 67

124 1 18.5 0 1

=== First read: Adapter D11.20.15 ===

Sequence: CAGGACGT; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== First read: Adapter D12.7.15 ===

Sequence: CAGGACGT; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== First read: Adapter D12.21.15 ===

Sequence: CAGGACGT; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== First read: Adapter D1.12.16 ===

Sequence: CAGGACGT; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== First read: Adapter D1.26.16 ===

Sequence: GTACTGAC; Type: regular 5'; Length: 8; Trimmed: 169 times.

No. of allowed errors:

0-8 bp: 0

Overview of removed sequences

length count expect max.err error counts

3 71 18970.2 0 71

4 16 4742.6 0 16

5 4 1185.6 0 4

6 1 296.4 0 1

7 1 74.1 0 1

12 72 18.5 0 72

35 1 18.5 0 1

36 1 18.5 0 1

122 1 18.5 0 1

157 1 18.5 0 1

=== First read: Adapter D2.26.16 ===

Sequence: GTACTGAC; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== First read: Adapter D3.9.16 ===

Sequence: GTACTGAC; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== First read: Adapter D3.24.16 ===

Sequence: GTACTGAC; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== First read: Adapter D4.8.16 ===

Sequence: GTACTGAC; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== Second read: Adapter D3.31.15 ===

Sequence: GAATTCGT; Type: regular 5'; Length: 8; Trimmed: 453 times.

No. of allowed errors:

0-8 bp: 0

Overview of removed sequences

length count expect max.err error counts

3 410 18970.2 0 410

4 31 4742.6 0 31

5 3 1185.6 0 3

6 3 296.4 0 3

7 1 74.1 0 1

10 1 18.5 0 1

12 3 18.5 0 3

22 1 18.5 0 1

=== Second read: Adapter D5.5.15 ===

Sequence: GAGATTCC; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== Second read: Adapter D5.27.15 ===

Sequence: ATTCAGAA; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== Second read: Adapter D6.2.15 ===

Sequence: CGCTCATT; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== Second read: Adapter D7.9.15 ===

Sequence: GAGATTCC; Type: regular 5'; Length: 8; Trimmed: 205 times.

No. of allowed errors:

0-8 bp: 0

Overview of removed sequences

length count expect max.err error counts

3 104 18970.2 0 104

4 96 4742.6 0 96

5 2 1185.6 0 2

8 1 18.5 0 1

22 1 18.5 0 1

194 1 18.5 0 1

=== Second read: Adapter D7.24.15 ===

Sequence: ATTCAGAA; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== Second read: Adapter D8.7.15 ===

Sequence: GAATTCGT; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== Second read: Adapter D8.21.15 ===

Sequence: CGCTCATT; Type: regular 5'; Length: 8; Trimmed: 195 times.

No. of allowed errors:

0-8 bp: 0

Overview of removed sequences

length count expect max.err error counts

3 133 18970.2 0 133

4 38 4742.6 0 38

5 10 1185.6 0 10

6 2 296.4 0 2

16 1 18.5 0 1

18 1 18.5 0 1

23 1 18.5 0 1

42 1 18.5 0 1

61 1 18.5 0 1

92 1 18.5 0 1

102 1 18.5 0 1

106 3 18.5 0 3

125 1 18.5 0 1

194 1 18.5 0 1

=== Second read: Adapter D9.11.15 ===

Sequence: GAGATTCC; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== Second read: Adapter D9.25.15 ===

Sequence: ATTCAGAA; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== Second read: Adapter D10.9.15 ===

Sequence: GAATTCGT; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== Second read: Adapter D10.24.15 ===

Sequence: CTGAAGCT; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== Second read: Adapter D11.6.15 ===

Sequence: CGCTCATT; Type: regular 5'; Length: 8; Trimmed: 153 times.

No. of allowed errors:

0-8 bp: 0

Overview of removed sequences

length count expect max.err error counts

3 113 18970.2 0 113

4 21 4742.6 0 21

5 10 1185.6 0 10

6 1 296.4 0 1

7 3 74.1 0 3

49 1 18.5 0 1

63 1 18.5 0 1

80 1 18.5 0 1

84 1 18.5 0 1

106 1 18.5 0 1

=== Second read: Adapter D11.20.15 ===

Sequence: GAGATTCC; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== Second read: Adapter D12.7.15 ===

Sequence: ATTCAGAA; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== Second read: Adapter D12.21.15 ===

Sequence: GAATTCGT; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== Second read: Adapter D1.12.16 ===

Sequence: CTGAAGCT; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== Second read: Adapter D1.26.16 ===

Sequence: CGCTCATT; Type: regular 5'; Length: 8; Trimmed: 169 times.

No. of allowed errors:

0-8 bp: 0

Overview of removed sequences

length count expect max.err error counts

3 128 18970.2 0 128

4 24 4742.6 0 24

5 7 1185.6 0 7

6 1 296.4 0 1

26 1 18.5 0 1

64 1 18.5 0 1

69 1 18.5 0 1

101 1 18.5 0 1

106 1 18.5 0 1

125 1 18.5 0 1

185 1 18.5 0 1

202 1 18.5 0 1

235 1 18.5 0 1

=== Second read: Adapter D2.26.16 ===

Sequence: GAGATTCC; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== Second read: Adapter D3.9.16 ===

Sequence: ATTCAGAA; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== Second read: Adapter D3.24.16 ===

Sequence: GAATTCGT; Type: regular 5'; Length: 8; Trimmed: 0 times.

=== Second read: Adapter D4.8.16 ===

Sequence: CTGAAGCT; Type: regular 5'; Length: 8; Trimmed: 0 times.

Saved SampleData[PairedEndSequencesWithQuality] to: cutadapt-demux.qza

Saved MultiplexedPairedEndBarcodeInSequence to: unmatched-barcodes.qza

LuSanto · October 2, 2019, 11:16am

Update: I tried with reverse complemented R barcodes, and got the same result.

ben · October 2, 2019, 12:58pm

Are these Nextera Illumina primers? I had the same question. I ran cutadapt on the adapters and primers in the forward, reverse, and reverse-complement orientations, then did a search, and got similar results (minimal adapter/primer found in any position). I think that in Illumina runs, which mine were, sequencing starts after the adapter position, so the adapters are not found in the sequences.

Did you run DADA2 without cutadapt? I would suggest doing that and then looking at the rep-seqs. Click on some and see whether the adapter sequences are found (they will not match the 16S sequences).

Ben

thermokarst · October 2, 2019, 11:35pm

@LuSanto, thanks for sharing your log file! Well, it looks pretty straightforward from here: only 0.1% of the reads are being matched with your barcodes. Scrolling through, it actually looks to me like something is wrong with your forward barcodes. Have you tried running the reverse complement of them?
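The 0.1% figure can be reproduced from the numbers in the pasted log. This is just a worked tally, assuming the per-adapter "Trimmed" counts and the "Total read pairs processed" value are copied correctly from that log; the variable names are illustrative.

```python
# Figures copied from the verbose log pasted earlier in this thread.
total_pairs = 1_214_093

# Only these five adapters report a non-zero "Trimmed" count on the first read.
trimmed_first_read = {
    "D3.31.15": 453,
    "D7.9.15": 205,
    "D8.21.15": 195,
    "D11.6.15": 153,
    "D1.26.16": 169,
}

matched = sum(trimmed_first_read.values())
percent = 100 * matched / total_pairs

# prints "1175 of 1214093 read pairs matched (0.1%)"
print(f"{matched} of {total_pairs} read pairs matched ({percent:.1f}%)")
```

The sum equals the "Read 1 with adapter: 1,175" line in the summary, so the log is internally consistent.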

LuSanto · October 11, 2019, 1:02pm

Hi. I double-checked the barcodes and they are correct. The reason it seems that only a small proportion of reads are matched is that there are reads from a different project in the same fastq files. In fact, if I run DADA2 on the samples that were demultiplexed (only 5 out of 22), they look absolutely normal (about 40K reads per sample, as expected). Anyway, I did try running the reverse complement of either the F or R barcodes, and I always get the same result: the same 5 samples are always demultiplexed, while the remaining 17 are not.

For some reason, cutadapt is only recognizing the F barcode the first time it appears in the metadata file. The 5 samples being demultiplexed are shown in yellow on the attached screenshot.

This is confirmed when I reshuffle the sample order in the metadata file. Again, only the samples in yellow are demultiplexed (correct screenshot below).
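To see which samples share a forward barcode without reading the file by eye, the metadata can be grouped by that column. A minimal sketch, assuming a tab-separated file whose first column holds the sample IDs and whose forward barcodes are in the BarcodeSequenceForward column used in the commands above. The function name, the `sample-id` header, and the inline example rows (pairings as they appear in the verbose log) are illustrative assumptions.

```python
import csv
import io
from collections import defaultdict

# Illustrative excerpt; pairings taken from the verbose log in this thread.
EXAMPLE_TSV = (
    "sample-id\tBarcodeSequenceForward\tBarcodeSequenceReverse\n"
    "D3.31.15\tCCTATCCT\tGAATTCGT\n"
    "D5.5.15\tCCTATCCT\tGAGATTCC\n"
    "D5.27.15\tCCTATCCT\tATTCAGAA\n"
    "D6.2.15\tCCTATCCT\tCGCTCATT\n"
    "D7.9.15\tAGGCGAAG\tGAGATTCC\n"
    "D7.24.15\tAGGCGAAG\tATTCAGAA\n"
)


def samples_by_forward_barcode(handle, barcode_column="BarcodeSequenceForward"):
    """Map each forward barcode to the sample IDs that use it, in file order."""
    reader = csv.DictReader(handle, delimiter="\t")
    id_column = reader.fieldnames[0]  # assumes the first column is the ID
    groups = defaultdict(list)
    for row in reader:
        if row[id_column].startswith("#q2:"):
            continue  # skip an optional QIIME 2 type-directive row
        groups[row[barcode_column]].append(row[id_column])
    return dict(groups)


groups = samples_by_forward_barcode(io.StringIO(EXAMPLE_TSV))
for barcode, samples in groups.items():
    print(barcode, "first:", samples[0], "others:", samples[1:])
```

With a real file, pass `open("metadata.tsv", newline="")` instead of the `io.StringIO` wrapper. In the log above, only the first sample listed for each forward barcode reports a non-zero "Trimmed" count.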

thermokarst · October 14, 2019, 5:03pm

Hi @LuSanto - you are now describing a separate and, I believe, unrelated issue. The primary issue in this post is that your forward barcodes are not identifiable in your reads. The secondary issue is the one you just posted, about only the first sample in a group of matching forward barcodes being demultiplexed. I will take a closer look at the secondary issue, but the primary issue can only be resolved by you and/or your sequencing center: you need to make sure that the orientation of the reads matches that of the barcodes. The fact that you are getting such low recovery is a smoking gun that there is an issue here.

LuSanto · October 15, 2019, 2:03pm

Hi @thermokarst, thanks again for your answer, and sorry this is becoming so long.

The issue of only the first sample in a group of F barcodes being demultiplexed has been mentioned since Oct 1 and, along with the R barcode being ignored by cutadapt, it IS the main problem for me.
As mentioned, I did verify that the F and R barcode sequences and orientations are correct. Something I had not mentioned is that the same raw reads were successfully demultiplexed with the exact same barcode sequences in QIIME 1 some time ago (obviously getting an .fna, rather than the .qza now needed for DADA2). This proves that the problem are not the barcodes themselves.
I’m not sure I understand the log file, but the “0.1% reads matched” is strange. As mentioned, the 5 samples that were demultiplexed actually contain many more reads than 1,175 (about 40,000+ reads per sample). See here stats_dada2.tsv (305 Bytes).

I have run out of options on my side. It would be great if this could be solved. Thanks!

thermokarst · October 15, 2019, 2:25pm

Hi @LuSanto!

No worries - that is what we are here for!

I'm sorry, I could've made my post above clearer. A better way to put it is that, yes, the "only matching on the first sample" issue is a problem, but we should first sort out the bigger issue of your barcodes matching hardly any of the reads. Once we sort that out, I suspect that the "only matching on the first sample" problem will be reconciled.

That is good info to have! And I agree, that does seem to imply that these barcodes should work, but I disagree that it "proves that the problem are not the barcodes themselves": it is possible that QIIME 1 performed RCing for you (perhaps by default?). I'm not a QIIME 1 dev, so I can't say for sure. Either way, it doesn't change what I said before: we need to make sure that we are able to get your reads and barcodes in the same orientation.

Any chance you didn't provide the entire log file? Maybe you only copied and pasted part of it?

Please double check the read orientation and the barcode orientation. I hope I have demonstrated above that, while this might've worked in QIIME 1, it doesn't mean that it will necessarily work without adjustment in q2-cutadapt.

Thanks!

LuSanto · October 17, 2019, 10:33pm

Quick update. I could not solve the issue of only the first sample in a group of matching forward barcodes being demultiplexed. The only workaround was to split the metadata into several files, so that each F barcode appears only once per file. Then I ran cutadapt separately with each metadata file. Finally, I reimported all the .fastq files into a single .qza for downstream analysis. The separate read numbers add up to the total, so it seems that I finally have correctly demultiplexed samples (which also means that the barcodes were in the correct orientation; the confounding factor was that I picked the wrong log file).
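In case it helps someone else, the metadata split can be scripted. This is only a sketch of one way to do it, assuming "unique F barcode" means that no forward barcode repeats within a file: the n-th sample using a given forward barcode goes into the n-th file. The function names, the `metadata-part-N.tsv` file names, the `sample-id` header, and the inline rows are all illustrative, and the sketch assumes the metadata has no comment or type-directive rows.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows; real input would come from metadata.tsv.
EXAMPLE_TSV = (
    "sample-id\tBarcodeSequenceForward\tBarcodeSequenceReverse\n"
    "D3.31.15\tCCTATCCT\tGAATTCGT\n"
    "D5.5.15\tCCTATCCT\tGAGATTCC\n"
    "D5.27.15\tCCTATCCT\tATTCAGAA\n"
    "D7.9.15\tAGGCGAAG\tGAGATTCC\n"
    "D7.24.15\tAGGCGAAG\tATTCAGAA\n"
)


def split_by_unique_forward_barcode(rows, barcode_column="BarcodeSequenceForward"):
    """Partition rows into batches in which no forward barcode repeats."""
    seen = defaultdict(int)  # how many times each barcode has appeared so far
    batches = []
    for row in rows:
        index = seen[row[barcode_column]]
        seen[row[barcode_column]] += 1
        if index == len(batches):
            batches.append([])  # first time any barcode needs this batch
        batches[index].append(row)
    return batches


def write_batches(batches, fieldnames, prefix="metadata-part"):
    """Write each batch to its own TSV file (hypothetical file naming)."""
    for number, batch in enumerate(batches, start=1):
        with open(f"{prefix}-{number}.tsv", "w", newline="") as handle:
            writer = csv.DictWriter(
                handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
            )
            writer.writeheader()
            writer.writerows(batch)


reader = csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter="\t")
batches = split_by_unique_forward_barcode(list(reader))
for number, batch in enumerate(batches, start=1):
    print(number, [row["sample-id"] for row in batch])
```

Each resulting file is then used for its own qiime cutadapt demux-paired run, as described above.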