I have an R1.fastq and the corresponding R2.fastq including multiple samples. I could not find a way to demultiplex into per sample .fastq files, given that a dual-indexing strategy was used. The only alternative seems to be bcl2fastq, but I don’t have access and people at the sequencing facility couldn’t manage to do it either.
Semantic type SampleData[MultiplexedPairedEndBarcodeInSequence] does not have a compatible directory format.
If I check with:
qiime tools import --show-importable-types
This is what I get for SampleData:
…
SampleData[AlphaDiversity]
SampleData[BooleanSeries]
SampleData[ClassifierPredictions]
SampleData[DADA2Stats]
SampleData[FirstDifferences]
SampleData[JoinedSequencesWithQuality]
SampleData[PairedEndSequencesWithQuality]
SampleData[Probabilities]
SampleData[RegressorPredictions]
SampleData[SequencesWithQuality]
SampleData[Sequences]
…
Do I have to somehow add SampleData[MultiplexedPairedEndBarcodeInSequence] as a semantic type? Or use a different semantic type for importing?
However, I have run into a new problem. Sequences are only being demultiplexed based on the F barcode. Furthermore, as the F barcodes are repeated, they are only used for demultiplexing the first time they appear in the metadata file. I’m attaching a part of metadata.tsv and the manifest obtained after demultiplexing. Could the problem be the repeated F barcodes? Why aren't the R primers being used? I tried inputting R barcodes in either 5'-3' or 3'-5' orientation. MANIFEST.txt (585 Bytes) metadata.tsv (713 Bytes)
Please rerun this command with the --verbose flag and paste the results here, this will tell us what cutadapt is matching on. My guess is you need to reverse complement your reverse barcodes.
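In case it helps with trying the reverse-complement suggestion, here is a minimal Python sketch. The barcode strings are made-up placeholders, not values from the attached metadata, and the helper name `reverse_complement` is just for illustration.

```python
# Reverse-complement DNA barcodes (A<->T, C<->G); N is left as N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Hypothetical reverse barcodes, for illustration only.
reverse_barcodes = ["AACCGGTA", "GGATCCAA"]
rc_barcodes = [reverse_complement(b) for b in reverse_barcodes]
print(rc_barcodes)
```

The converted values could then be pasted into the reverse barcode column of a copy of the metadata file before rerunning the demultiplexing step.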
Are these Nextera Illumina primers? I had the same question. I tried the same thing: I ran cutadapt on the adapters and primers in the forward, reverse, and reverse-complement orientations, then searched the reads, and got similar results (minimal adapter/primer found in any position). I think that for Illumina runs, which mine were, sequencing starts after the adapter position, so the adapters are not found in the sequences.
Did you run DADA2 without cutadapt? I would suggest doing that and THEN looking at the rep-seqs. Click on some and see if the adapter sequences are found (which will not match the 16S sequences).
@LuSanto, thanks for sharing your log file! Well, it looks pretty straightforward from here: only 0.1% of the reads are being matched with your barcodes. Scrolling through, it actually looks to me like something is wrong with your forward barcodes. Have you tried running the reverse complement of them?
Hi. I double-checked the barcodes and they are correct. The reason why it seems that only a small proportion of reads are matched is because there are reads from a different project in the same fastq files. In fact, if I run DADA2 on the samples that were demultiplexed (only 5 out of 22), they look absolutely normal (about 40K reads per sample as expected). Anyways, I did try running the reverse complement of either the F or R barcodes, and I always get the same result: the same 5 samples are always demultiplexed, while the remaining 17 are not.
For some reason, cutadapt is only recognizing the F barcode the first time it appears in the metadata file. The 5 samples being demultiplexed are shown in yellow in the attached screenshot.
Hi @LuSanto - you are now describing a separate and, I believe, unrelated issue. The primary issue in this post is that your forward barcodes are not identifiable in your reads. The secondary issue is the one you just posted, about only the first sample in a group of matching forward barcodes being demultiplexed. I will take a closer look at the secondary issue, but the primary issue can only be resolved by you and/or your sequencing center: you need to make sure that the orientation of the reads matches that of the barcodes. The fact that you are getting such low recovery is a smoking gun that there is an issue here.
Hi @thermokarst, thanks again for your answer, and sorry this is becoming so long.
The issue of only the first sample in a group of F barcodes being demultiplexed has been mentioned since Oct 1. That issue and the R barcode being ignored by cutadapt ARE the main problems for me.
As mentioned, I did verify that the F and R barcode sequences and orientations are correct. Something I had not mentioned is that the same raw reads were successfully demultiplexed with the exact same barcode sequences in QIIME 1 some time ago (obviously getting an .fna, rather than the .qza now needed for DADA2). This proves that the problem is not the barcodes themselves.
I’m not sure I understand the log file, but the “0.1% reads matched” is strange. As mentioned, the 5 samples that were demultiplexed actually contain many more reads than 1,175 (about 40,000+ reads per sample). See here stats_dada2.tsv (305 Bytes).
I have run out of options on my side. It would be great if this could be solved. Thanks!
I’m sorry, I could’ve made my post above clearer. A better way to put it is that, yes, the “only matching on the first sample” issue is a problem, but we should probably first sort out the bigger issue of your barcodes not matching really any of the samples. Once we sort that out, I suspect that the “only matching on the first sample” problem will be reconciled.
That is good info to have! And I agree, that does seem to imply that these barcodes should work, but I disagree that it proves the barcodes themselves are not the problem: it is possible that QIIME 1 performed RCing for you (perhaps by default?). I’m not a QIIME 1 dev, so I can’t say for sure. Either way, it doesn’t change what I said before: we need to make sure that we are able to get your reads and barcodes in the same orientation.
Any chance you didn’t provide the entire log file? Maybe you only copied and pasted part of it?
Please double check the read orientation and the barcode orientation. I hope I have demonstrated above that, while this might’ve worked in QIIME 1, it doesn’t mean that it will necessarily work without adjustment in q2-cutadapt.
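One rough way to double check orientation outside of QIIME 2 is to count how many raw reads begin with a barcode as written versus its reverse complement. The sketch below is only an approximation under stated assumptions: it expects standard 4-line FASTQ records, looks for an exact match at the very start of the read (cutadapt itself tolerates some mismatches), and uses toy reads and a made-up barcode. The function name is hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def count_barcode_orientations(fastq_lines, barcode):
    """Count reads starting with the barcode as given, or with its reverse complement."""
    rc = barcode.translate(_COMPLEMENT)[::-1]
    counts = Counter(total=0, as_given=0, reverse_complement=0)
    for i, line in enumerate(fastq_lines):
        if i % 4 != 1:  # the sequence is the 2nd line of each 4-line record
            continue
        seq = line.strip().upper()
        counts["total"] += 1
        if seq.startswith(barcode):
            counts["as_given"] += 1
        elif seq.startswith(rc):
            counts["reverse_complement"] += 1
    return counts


# Toy records and a made-up barcode, for illustration only.
toy_fastq = [
    "@r1", "AACCGGTATTTT", "+", "IIIIIIIIIIII",
    "@r2", "TACCGGTTGGGG", "+", "IIIIIIIIIIII",
    "@r3", "CCCCCCCCCCCC", "+", "IIIIIIIIIIII",
]
result = count_barcode_orientations(toy_fastq, "AACCGGTA")
print(dict(result))
```

With real data, an open file handle (for example `open("R1.fastq")`) can be passed in place of the toy list. If most hits land in the reverse-complement count, that would point to an orientation mismatch between barcodes and reads.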
Quick update. I could not solve the issue of only the first sample in a group of matching forward barcodes being demultiplexed. The only workaround was to split the metadata into several files, so each of them only includes a unique F barcode. Then I used cutadapt separately with each metadata file. Finally, I reimported all the .fastq files into a single .qza for downstream analysis. The separate read numbers add up to the total, so it seems that I finally have correctly demultiplexed samples (which also means that the barcodes were in the correct orientation; the confounding factor was that I picked the wrong log file).
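For anyone hitting the same problem, the metadata split could be scripted roughly as follows. This is a sketch of one reading of the workaround (one distinct forward barcode per output file), not the exact procedure used above. The column names `sample-id`, `barcode-f`, and `barcode-r` and the barcode values are hypothetical, and a QIIME 2 `#q2:types` comment row, if present, is not handled.

```python
import csv
import io
from collections import defaultdict


def split_metadata_by_barcode(tsv_text, barcode_column):
    """Group metadata rows by forward barcode; return {barcode: TSV text}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[row[barcode_column]].append(row)

    chunks = {}
    for barcode, rows in groups.items():
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                                delimiter="\t", lineterminator="\n")
        writer.writeheader()  # every chunk keeps the original header row
        writer.writerows(rows)
        chunks[barcode] = out.getvalue()
    return chunks


# Hypothetical metadata, for illustration only.
example = (
    "sample-id\tbarcode-f\tbarcode-r\n"
    "s1\tAACCGGTA\tGGATCCAA\n"
    "s2\tAACCGGTA\tTTGACCGA\n"
    "s3\tCCTTAAGG\tGGATCCAA\n"
)
chunks = split_metadata_by_barcode(example, "barcode-f")
for barcode, text in chunks.items():
    print(barcode)
    print(text)
```

Each chunk can be written to its own .tsv file and passed to a separate demultiplexing run, after which the per-sample read counts can be summed and compared with the total as a sanity check.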