Where do the per-sample barcodes come from in sample-metadata.tsv?

bpscherer · September 8, 2020, 2:10pm

I'm running QIIME2 2019.7 in a Conda environment. It is installed on my institution's server, so I am running it remotely via the command line.

I have 280 multiplexed samples that I am trying to get demultiplexed. I think I have imported correctly using the following command. They were sequenced on an Illumina MiSeq, and I think they are EMP-type data.

qiime tools import \
  --type EMPPairedEndSequences \
  --input-path fastq \
  --output-path multiplexed-seqs.qza

I am having difficulty with demultiplexing. I don't know what to use for per-sample barcodes in m-barcodes-file and m-barcodes-column. I have a tsv with the forward and reverse barcodes in separate columns, but I don't know if these barcodes are the same as the ones I need for demultiplexing. If they are, I don't know what to do with them to get QIIME to properly demultiplex the samples.

I ran the following command telling QIIME to use the forward barcodes, but I got the following error message.

qiime demux emp-paired \
  --i-seqs multiplexed-seqs.qza \
  --m-barcodes-file barcodes.txt \
  --m-barcodes-column forward \
  --o-per-sample-sequences demux.qza \
  --o-error-correction-details ErrorCorrectionDetails.qza

Plugin error from demux:

A duplicate barcode was detected. The barcode TAAGGCGA was observed for samples LSGBK6 and SGBK2.

Debug info has been saved to /tmp/qiime2-q2cli-err-_1y0f5wp.log
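
The error above arises because one barcode value maps to more than one sample in the chosen column. As a rough illustration (not part of QIIME 2), the following sketch groups sample IDs by barcode and reports any barcode shared by several samples. The column names `sample-id`, `forward`, and `reverse`, the reverse barcodes, and the third row are made up for the example; only `TAAGGCGA`, `LSGBK6`, and `SGBK2` come from the error message.

```python
from collections import defaultdict


def find_duplicate_barcodes(rows, column):
    """Return {barcode: [sample IDs]} for barcodes used by more than one sample."""
    samples_by_barcode = defaultdict(list)
    for row in rows:
        samples_by_barcode[row[column]].append(row["sample-id"])
    return {bc: ids for bc, ids in samples_by_barcode.items() if len(ids) > 1}


# Hypothetical metadata rows; the reverse barcodes and EXAMPLE3 are invented.
rows = [
    {"sample-id": "LSGBK6", "forward": "TAAGGCGA", "reverse": "CTCTCTAT"},
    {"sample-id": "SGBK2", "forward": "TAAGGCGA", "reverse": "TATCCTCT"},
    {"sample-id": "EXAMPLE3", "forward": "CGTACTAG", "reverse": "CTCTCTAT"},
]

duplicates = find_duplicate_barcodes(rows, "forward")
for barcode, sample_ids in duplicates.items():
    print(f"{barcode} is shared by: {', '.join(sample_ids)}")
```

Running a check like this on the real barcodes file before demultiplexing would show whether a single column can identify every sample on its own.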

I also tried to import the data as "Multiplexed Paired End Barcode In Sequence," and use cutadapt to demultiplex, but when I go this route I end up with only 24 samples in the end, which can't be correct.

qiime tools import \
  --type 'MultiplexedPairedEndBarcodeInSequence' \
  --input-path fastq \
  --output-path mux.qza

qiime cutadapt demux-paired \
  --i-seqs mux.qza \
  --m-forward-barcodes-file barcodes.txt \
  --m-forward-barcodes-column forward \
  --m-reverse-barcodes-file barcodes.txt \
  --m-reverse-barcodes-column reverse \
  --o-per-sample-sequences redemux.qza \
  --o-untrimmed-sequences re-untrimmed.qza

Thanks for any help!!

kmz · September 8, 2020, 5:52pm

Hmm, are you sure you have the barcodes on both the forward and reverse primers? When I sequence, I only have the barcodes on the reverse primer (some people may have them only on the forward primer).

I generate a tsv file by modifying the sample tsv file in the Moving Pictures tutorial. To get the barcodes I just use the string corresponding to the barcode on each reverse primer. For example, one of my reverse primers may look like this:

AATGATACGGCGACCACCGAGATCTACACGCT XXXXXXXXXXXX TATGGTAATT GT GTGYCAGCMGCCGCGGTAA

where XXXXXXXXXXXX is the barcode. I use this in the tsv file.
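
As a small sketch of that lookup, the barcode can be sliced out of the primer construct by position, given the fixed segments on either side of it. The segment sequences are taken from the example primer above; the names `ADAPTER` and `PAD`, the 12-base length (read off the twelve X's), and the barcode `ACGTACGTACGT` are assumptions for illustration only.

```python
# Fixed segments flanking the barcode in the example primer above.
ADAPTER = "AATGATACGGCGACCACCGAGATCTACACGCT"
PAD = "TATGGTAATT"
BARCODE_LENGTH = 12  # assumed from the twelve X's in the example


def extract_barcode(primer):
    """Return the barcode sitting between ADAPTER and PAD in a primer construct."""
    construct = primer.replace(" ", "").upper()
    if not construct.startswith(ADAPTER):
        raise ValueError("primer does not start with the expected adapter")
    start = len(ADAPTER)
    end = start + BARCODE_LENGTH
    if not construct.startswith(PAD, end):
        raise ValueError("expected pad sequence not found after the barcode")
    return construct[start:end]


# Hypothetical primer: ACGTACGTACGT stands in for a real barcode.
primer = (
    "AATGATACGGCGACCACCGAGATCTACACGCT ACGTACGTACGT "
    "TATGGTAATT GT GTGYCAGCMGCCGCGGTAA"
)
barcode = extract_barcode(primer)
print(barcode)
```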

Hope this helps.

EDIT: To answer the question in your title, you know which sample received which primer, and therefore you know what the barcode for that sample is.

bpscherer · September 9, 2020, 9:55pm

So I've done some more digging, and it turns out I have dual-indexed barcodes. There are forward and reverse barcodes for each sample. Looking at both the F and R barcodes, each sample has a unique pair, but the individual F or R barcodes are not unique to a given sample. The image below shows the barcodes for my first 20 or so samples. As you can see, each combination of barcodes is unique, but individual barcodes are reused. Apparently this makes it possible to sequence more samples on a single run.

When I run the cutadapt command it is failing to demultiplex properly because it is looking for unique forward barcodes. Only 24 samples are making it through because that is the number of unique forward barcodes that I have.
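
To make the arithmetic concrete, here is a toy sketch comparing how many samples can be told apart by the forward barcode alone versus by the forward/reverse pair. The barcodes and sample IDs are invented, and the column names `forward` and `reverse` simply mirror the commands in this thread.

```python
from itertools import product


def count_distinct_keys(rows):
    """Count distinct forward barcodes and distinct (forward, reverse) pairs."""
    forward_only = {row["forward"] for row in rows}
    pairs = {(row["forward"], row["reverse"]) for row in rows}
    return len(forward_only), len(pairs)


# Hypothetical combinatorial layout: 3 forward x 2 reverse barcodes = 6 samples.
forward_barcodes = ["AAAAAAAA", "CCCCCCCC", "GGGGGGGG"]
reverse_barcodes = ["TTTTTTTT", "ACACACAC"]
rows = [
    {"sample-id": f"S{i}", "forward": fwd, "reverse": rev}
    for i, (fwd, rev) in enumerate(product(forward_barcodes, reverse_barcodes), start=1)
]

n_forward, n_pairs = count_distinct_keys(rows)
print(f"distinct forward barcodes: {n_forward}, distinct pairs: {n_pairs}")
```

In this toy layout only 3 groups are distinguishable by forward barcode, even though all 6 samples have a unique pair, which is the same pattern as getting 24 samples out of 280.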

There is a post here from @LuSanto that explains how they managed to work around this issue (Post #17 of 26). I'm going to attempt to use their workaround to see if I can get it to work.

If anyone else has found another way to get qiime2 to demultiplex Dual Indexed Barcodes, I would be most appreciative of any assistance!

bpscherer · September 4, 2020, 1:58pm

Hi all,

I recently obtained 16S sequence data from ~280 samples for my dissertation.
I got the data from the sequencer in both demultiplexed and non-demultiplexed form.

I'm most familiar with data that is already demultiplexed, but my collaborator said I should start with non-demultiplexed data because QIIME2's demultiplexing is better than Illumina's.

I've taken the demultiplexed data through DADA2 and into some alpha diversity analysis with no problems. Of the 280 samples, only a handful had fewer than 100 reads, and only 1 completely dropped out after DADA2. From what I can tell, the dataset as a whole is sound and should be good enough to proceed with.

However, when I have tried to import and demultiplex the other (non-demultiplexed) version of the data, I only end up with 24 samples. This version of the data consists of two large files containing all of the forward and reverse reads, respectively. I also have a file with all of the barcodes that I produced from the information given to me by the sequencing center.

Below is the code I have used to get this far, and I have also attached my demux.qzv. Thank you for any insight or ideas!

demux.qzv (300.8 KB)

qiime tools import \
  --type MultiplexedPairedEndBarcodeInSequence \
  --input-path fastq \
  --output-path multiplexed-seqs.qza

qiime cutadapt demux-paired \
  --i-seqs multiplexed-seqs.qza \
  --m-forward-barcodes-file barcodes.txt \
  --m-forward-barcodes-column forward \
  --m-reverse-barcodes-file barcodes.txt \
  --m-reverse-barcodes-column reverse \
  --o-per-sample-sequences demux.qza \
  --o-untrimmed-sequences untrimmed.qza

bpscherer · September 7, 2020, 1:53pm

Update: I got two barcode fastq files from the sequencing center, one for the forward barcodes and one for the reverse. Is it possible to just combine these two files in some easy way? Looking at qiime tools import, I need to have my barcodes as one file.

Thanks for any help!
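
One mechanical way to combine two index-read files is to walk both in lockstep and join the sequence and quality strings of each record. The sketch below does only that; it assumes both files are gzipped, hold the same reads in the same order, and use four-line FASTQ records. Whether a concatenated barcode file is acceptable to the downstream import or demultiplexing step is not settled in this thread, so treat this as an illustration rather than a recommended workflow. The function and file names are hypothetical.

```python
import gzip
from itertools import zip_longest


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def concatenate_index_reads(forward_path, reverse_path, output_path):
    """Write one FASTQ whose reads are forward + reverse index reads joined."""
    count = 0
    with gzip.open(forward_path, "rt") as fwd, \
            gzip.open(reverse_path, "rt") as rev, \
            gzip.open(output_path, "wt") as out:
        for rec_f, rec_r in zip_longest(read_fastq(fwd), read_fastq(rev)):
            if rec_f is None or rec_r is None:
                raise ValueError("index files have different numbers of records")
            # Compare read IDs up to the first space to confirm matching order.
            if rec_f[0].split()[0] != rec_r[0].split()[0]:
                raise ValueError(f"read IDs differ: {rec_f[0]} vs {rec_r[0]}")
            out.write(f"{rec_f[0]}\n{rec_f[1]}{rec_r[1]}\n+\n{rec_f[2]}{rec_r[2]}\n")
            count += 1
    return count
```

A call might look like `concatenate_index_reads("forward_index.fastq.gz", "reverse_index.fastq.gz", "barcodes.fastq.gz")`, where all three file names are placeholders.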

thermokarst · September 9, 2020, 11:27pm

I have merged your two threads @bpscherer - in the future please don't crosspost.

thermokarst · September 9, 2020, 11:31pm

Hi @bpscherer!

As you've already noted, q2-cutadapt can demultiplex UDI schemes, just not CDI (as you have). The workaround you linked to is the best option at present.

On the claim that QIIME 2's demultiplexing is better than Illumina's: I don't agree with that statement (plus, we don't develop cutadapt, which is the tool wrapped by q2-cutadapt). IMO you should use the demultiplexed data that was provided to you, unless you have a specific reason not to.

Keep us posted!

bpscherer · September 10, 2020, 12:04pm

Thanks @thermokarst, sorry about the crossposting!

I had hoped I could just use my already demultiplexed dataset, but I got about 60,000 reads that didn't demultiplex and ended up listed under an "undefined" sample. I didn't think I could do anything about them, so I wanted to at least try to demultiplex with qiime or cutadapt and see if I still had those 60,000 as undefined. My collaborator had had some issues with her data not demultiplexing well with Illumina in the past, so she was strongly encouraging me to demultiplex myself.

bpscherer · September 14, 2020, 2:11pm

Okay, so I've gotten a little further but am getting some confusing results.

I used @LuSanto's method for demultiplexing dual-indexed barcodes, but when I go all the way through importing and creating a visualization, my dataset appears to be missing an enormous number of reads.

This screenshot is from my original dataset which was demultiplexed using the Illumina workflow.

This second screenshot is from the dataset I just built using the cutadapt method.

When running the cutadapt plugin you get output files of "untrimmed sequences," which in my case are much much larger than the "sample-sequence" outputs. The documentation for cutadapt says that these files are the sequences which it couldn't match to any barcodes. In my case it makes sense that these files are large, given the method of generating them. To generate them I ran cutadapt 24 times, with each run only working on 12 samples. Each run would only associate barcodes with a subset of the data, so each run should have a lot of sequences it couldn't associate with a barcode. However, I don't know if there was some additional loss of data for some reason in this process.
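
A back-of-the-envelope check of that expectation, under two simplifying assumptions that are not from the thread (every sample has the same read depth, and every read belongs to one of the 280 samples): each run sees barcodes for 12 of 280 samples, so most reads should land in the untrimmed output.

```python
def expected_untrimmed_fraction(samples_per_run, total_samples):
    """Fraction of reads a single run cannot assign, assuming equal depth per sample."""
    return 1 - samples_per_run / total_samples


# 12 samples per run and 280 samples in total, as described above.
fraction = expected_untrimmed_fraction(12, 280)
print(f"expected untrimmed fraction per run: {fraction:.1%}")
```

Under those assumptions roughly 96% of reads would be untrimmed in any one run, so large untrimmed files alone do not show that reads were lost. Comparing the summed per-sample read counts across all 24 runs against the input total would be a more direct check.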

That said, I'm still concerned about why there is such a large discrepancy between the two datasets. My original Illumina dataset had a ton of reads which it could not associate with a barcode and pooled as "undefined." This undefined sample had more reads than any other sample. I had originally thought that this was a big issue, but now I'm not sure I can get around that.

If anyone has any help at all I would be extremely appreciative!

Brendan

cutadapt-imported.qzv (312.0 KB)
Illumina-demux.qzv (313.4 KB)

thermokarst · September 21, 2020, 3:16pm

Hi @bpscherer, can you share the commands you ran outside of QIIME 2? I suspect there is just a minor issue in how the barcode strings are specified. Thanks!

bpscherer · September 21, 2020, 3:29pm

Hi @thermokarst,

First I ran the following to bring my multiplexed data into qiime.

qiime tools import \
  --type 'MultiplexedPairedEndBarcodeInSequence' \
  --input-path fastq \
  --output-path mux.qza

I also created 24 barcodes.txt files so that each one only contained unique forward barcodes.
barcodes1.txt (366 Bytes)
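
The splitting step can be sketched as a greedy partition: each sample goes into the first group that does not already contain its forward barcode. This is only one possible way to build such files and is not necessarily how the attached ones were made. The barcodes, sample IDs, and function name are hypothetical, and the sketch checks only the forward column because that is the constraint described here.

```python
from itertools import product


def split_into_unique_forward_groups(rows):
    """Partition rows so that no forward barcode repeats within a group."""
    groups = []  # list of lists of rows
    used = []    # parallel list of sets of forward barcodes already in each group
    for row in rows:
        for group, seen in zip(groups, used):
            if row["forward"] not in seen:
                group.append(row)
                seen.add(row["forward"])
                break
        else:
            # Every existing group already holds this forward barcode.
            groups.append([row])
            used.append({row["forward"]})
    return groups


# Hypothetical layout: 2 forward x 3 reverse barcodes = 6 samples.
rows = [
    {"sample-id": f"S{i}", "forward": fwd, "reverse": rev}
    for i, (fwd, rev) in enumerate(
        product(["AAAAAAAA", "CCCCCCCC"], ["TTTTTTTT", "ACACACAC", "GTGTGTGT"]),
        start=1,
    )
]

groups = split_into_unique_forward_groups(rows)
for number, group in enumerate(groups, start=1):
    print(f"barcodes{number}.txt:", [row["sample-id"] for row in group])
```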

Next I ran the following command 24 times (changing the numbers each time) to cover my entire dataset by generating 24 redemux##.qza files.

qiime cutadapt demux-paired \
  --i-seqs mux.qza \
  --m-forward-barcodes-file barcodes/barcodes24.txt \
  --m-forward-barcodes-column forward \
  --m-reverse-barcodes-file barcodes/barcodes24.txt \
  --m-reverse-barcodes-column reverse \
  --o-per-sample-sequences demux/redemux24.qza \
  --o-untrimmed-sequences demux/re-untrimmed24.qza &

At this point I took the 24 redemux##.qza files and manually changed their file extensions to .zip. From there I unzipped them to obtain folders with forward and reverse fastqs for each sample. I threw all of these files into one folder, then used the following command to import them.
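
The rename-and-unzip step can also be scripted, since a .qza file can be opened as a zip archive. The sketch below copies every `.fastq.gz` found under a `data/` directory inside each archive into one flat folder. The `<top-level directory>/data/` layout is an assumption about the archive contents, and the function name and paths are hypothetical; `qiime tools export` is the supported way to get data out of an artifact.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_fastqs(qza_paths, output_dir):
    """Copy all .fastq.gz members under */data/ from each archive into output_dir."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for qza_path in qza_paths:
        with zipfile.ZipFile(qza_path) as archive:
            for member in archive.namelist():
                parts = PurePosixPath(member).parts
                # Assumed layout: <top-level dir>/data/<file>.fastq.gz
                if len(parts) < 3 or parts[1] != "data":
                    continue
                if not member.endswith(".fastq.gz"):
                    continue
                target = output_dir / parts[-1]
                if target.exists():
                    # Refuse to overwrite a sample file from an earlier archive.
                    raise FileExistsError(f"{target} already exists")
                target.write_bytes(archive.read(member))
                extracted.append(target)
    return extracted
```

For example, `extract_fastqs(sorted(Path("demux").glob("redemux*.qza")), "import")` would gather the per-sample files from all runs, and it fails loudly if two runs produce a file with the same name.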

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path import \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path imported.qza &

I then ran the imported.qza file through DADA2 and that's when I discovered that my data was mostly gone lol.

Thanks for any ideas! I'm currently forging ahead with my Illumina-demultiplexed dataset.

thermokarst · September 23, 2020, 6:17pm

Thanks for explaining! I suspect a step is missing, namely recombining the untrimmed reads output at each intermediate step with the original multiplexed reads, but I'm not sure.

You might just have better luck running this outside of QIIME 2 - modern versions of cutadapt now have a detailed protocol to follow:

https://cutadapt.readthedocs.io/en/stable/guide.html#demultiplexing-paired-end-reads-with-combinatorial-dual-indexes
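
As a rough sketch of how a barcode table could be prepared for that protocol: write one FASTA of the distinct forward barcodes and one of the distinct reverse barcodes, and keep a lookup from each name pair back to the sample ID. The cutadapt arguments at the end are reproduced from memory of the linked guide and should be treated as an assumption to verify against the installed version. The barcode names (`F1`, `R1`, ...), the file names, and the example rows are all hypothetical.

```python
def build_cutadapt_inputs(rows):
    """Return (forward FASTA text, reverse FASTA text, {"Fx-Ry": sample ID})."""
    forward_names = {}  # barcode sequence -> short name such as F1
    reverse_names = {}  # barcode sequence -> short name such as R1
    sample_by_pair = {}
    for row in rows:
        fwd = forward_names.setdefault(row["forward"], f"F{len(forward_names) + 1}")
        rev = reverse_names.setdefault(row["reverse"], f"R{len(reverse_names) + 1}")
        sample_by_pair[f"{fwd}-{rev}"] = row["sample-id"]
    forward_fasta = "".join(f">{name}\n{bc}\n" for bc, name in forward_names.items())
    reverse_fasta = "".join(f">{name}\n{bc}\n" for bc, name in reverse_names.items())
    return forward_fasta, reverse_fasta, sample_by_pair


# Hypothetical rows with a reused forward barcode and a reused reverse barcode.
rows = [
    {"sample-id": "S1", "forward": "AAAAAAAA", "reverse": "TTTTTTTT"},
    {"sample-id": "S2", "forward": "AAAAAAAA", "reverse": "ACACACAC"},
    {"sample-id": "S3", "forward": "CCCCCCCC", "reverse": "TTTTTTTT"},
]
forward_fasta, reverse_fasta, sample_by_pair = build_cutadapt_inputs(rows)

# Assumed invocation, following the linked guide; built here but not executed.
command = [
    "cutadapt", "-e", "0.15", "--no-indels",
    "-g", "^file:barcodes_fwd.fasta",
    "-G", "^file:barcodes_rev.fasta",
    "-o", "{name1}-{name2}.1.fastq.gz",
    "-p", "{name1}-{name2}.2.fastq.gz",
    "input.1.fastq.gz", "input.2.fastq.gz",
]
print(" ".join(command))
print(sample_by_pair)
```

The `sample_by_pair` lookup is what would let the output files, named by barcode pair, be renamed to sample IDs afterwards.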

Let us know how that goes for you, if you decide to try.


system · October 25, 2020, 12:17am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.