Demultiplexing (dmx) with 4 initial read files (I1, I2, R1, R2)

Hello,

I have a question regarding the demultiplexing workflow I used in QIIME 2 with my sequencing data. I know similar questions have been asked before (e.g. "Importing and Demultiplex process for 4 Fastq Files: R1, R2, Index1 and Index2" and "Importing multiplexed paired-end data with separated barcode sequence files"), but I decided to write something anyway, also because I know it's an issue other users might run into.

My situation:

I have soil/leaf/root microbiome sequencing data from an Illumina MiSeq. I prepared my libraries using standard primers and barcodes, and I have my paired-end reads (2x250) in four files:
I1
I2
R1
R2

Following the forum posts mentioned above, I decided to give extract_barcodes.py a try, which worked and produced a single barcodes.fastq file.
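For anyone finding this later, the command looked roughly like this (the file names are placeholders for my index reads, which need to be uncompressed, and the barcode lengths should match your own setup):

```bash
# QIIME 1 script: combines the two index reads into one barcodes.fastq,
# keeping the record order of the original files.
extract_barcodes.py \
  -f I1.fastq \
  -r I2.fastq \
  -c barcode_paired_end \
  --bc1_len 8 --bc2_len 8 \
  -o parsed_barcodes/
# parsed_barcodes/ then contains the combined barcodes.fastq
```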

I imported this back into QIIME 2 as EMP Paired-End sequences, now having my

- forward.fastq.gz
- reverse.fastq.gz
- barcodes.fastq.gz

files in one folder.
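The import itself was something along these lines (the folder name is just what I called the directory holding those three files):

```bash
qiime tools import \
  --type EMPPairedEndSequences \
  --input-path emp-paired-end-sequences/ \
  --output-path emp-paired-end-sequences.qza
```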

The import worked, so I moved on to demultiplexing, giving demux emp-paired a try, because why not.
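Roughly like this (the metadata file name is a placeholder; the concatenated barcodes sit in its BarcodeSequence column, and the summarize step is what produced the demux.qzv I mention below):

```bash
qiime demux emp-paired \
  --i-seqs emp-paired-end-sequences.qza \
  --m-barcodes-file metadata.tsv \
  --m-barcodes-column BarcodeSequence \
  --o-per-sample-sequences demux.qza

# summary visualization of per-sample read counts and quality plots
qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux.qzv
```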
... Surprisingly, it worked! Or at least I think so. :exploding_head:
Now the thing I wonder is: HOW? I was literally just telling demux to pull the barcodes from my metadata .csv file, in which I had concatenated my forward and reverse barcodes (or indices, i.e. the D701 and D501 barcodes). How does this work in demux, given that I never specify the length of the barcodes (sometimes more than 8 bp) it should look for? How would it split the barcodes?
I thought maybe it is smart enough to pull this information from the (also merged!) barcodes.fastq.gz file that I fed in? Still, other parts of the concatenated barcodes might match my insert sequences somewhere else...

I know the paired-end dmx isn't implemented yet, but somehow the results I obtained look okay. I hope you'll be able to see this from the screenshot of my demux.qzv below.

and my read counts are:

[screenshot: per-sample read counts from demux.qzv]

So basically I'm really confused, also because I don't come from a bioinformatics background at all... I apologize if this is complete gibberish, but I still hope someone might help me out :innocent:
I'd really love this to work with QIIME 2; all the other options are much more complicated, and apart from the dmx part it is so straightforward! I worked through all the other analyses of this data in QIIME 2, so in the end I learned a lot, even if someone now tells me that what I did was completely wrong :rofl:

Thanks a lot for your help in advance! :slight_smile:


Hi @Jana_Greta,

I'll give this explanation a try, but it would be good to have one of the developers confirm it, since I'm basing my answer on my own assumptions about what the scripts are actually doing. Your question/concern is totally valid though, glad you brought it up!

So the reads in your forward, reverse, forward-barcode, and reverse-barcode files should all be in the same order, meaning record 1 in each of those 4 files corresponds to the same read/sample.
With the extract_barcodes.py script in QIIME 1 (at least the way we are using it in this scenario), all you're really doing is combining the forward and reverse barcodes; they retain the same sequential order as the forward and reverse read files.
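Just to illustrate the idea (a toy shell sketch with hypothetical file names, not what extract_barcodes.py actually runs internally):

```bash
# Pull the sequence line (line 2 of every 4-line FASTQ record) from each
# index file and glue the pairs together; record order is preserved.
paste <(awk 'NR % 4 == 2' I1.fastq) \
      <(awk 'NR % 4 == 2' I2.fastq) | tr -d '\t' | head -n 3
# e.g. ACACACACTGTGTGTG  (forward barcode + reverse barcode of read 1)
```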

QIIME 1's extract_barcodes.py, however, does require you to specify how long your barcodes are, using the --bc1_len and --bc2_len parameters.

This may be potentially problematic; could you describe your protocol a bit more here, please?

This is important because if your barcodes (as you mentioned) have variable lengths, let's say with spacers or something inserted, then this method wouldn't work properly for you and I would be concerned about your output... even though it looks as though things have worked. There have been a few discussions on the forum lately about the use of variable-length barcodes, and unfortunately there isn't anything currently in QIIME 2 to support this with dual indexing.

When you import your combined barcodes into QIIME 2, they should now be combined exactly as you would have them written under the BarcodeSequence column of your metadata file. Let's say sample1's barcode is now ACACACACTGTGTGTG, where the ACACACAC portion comes from the forward barcode and the TGTGTGTG portion comes from the reverse barcode. Then all the demultiplexing has to do is take record 1 from each file, pair them, read the barcode, and match it to the corresponding sample in the metadata. Repeat for record 2, and so on...
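Conceptually it's nothing more than a lookup table. Here's a toy sketch of that matching step (not q2-demux's actual implementation; barcode_map.tsv is a hypothetical sample-to-barcode table pulled from the metadata):

```bash
# Count how many reads would be assigned to each sample (or discarded).
awk 'NR == FNR { map[$2] = $1; next }   # 1st file: barcode -> sample
     FNR % 4 == 2 {                     # 2nd file: FASTQ sequence lines
       print (($0 in map) ? map[$0] : "unmatched-discarded")
     }' barcode_map.tsv barcodes.fastq | sort | uniq -c
```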

Now here's where it gets a bit tricky in your situation: if there are some barcodes that can't be matched exactly in the metadata file, I don't know whether it would throw an error stating this or simply ignore those reads and carry on. So if you have some variable-length barcodes which didn't match, I would take a second look at your output to make sure all your samples are indeed represented.

I hope I didn't confuse you any further :neutral_face:; if I did, let us know more about your dual-indexing barcode methods and we can take it from there!


Awesome answer @Mehrbod_Estaki!

Exactly! This used to trip me up every time I thought about it, but the key is that the barcodes aren't in the sequences themselves if you have the I1/I2 files. So nothing needs to be searched within your reads.

The QIIME 2 demux implementation will discard anything that doesn't match the barcodes it received.


How does the Per-sample sequence counts table in your viz look (it's on the first tab; scroll down)? Are you missing any samples?

P.S. you can attach .qza/.qzv files to the forum as well (as long as they aren't too big).


Wow, thank you both for your quick and thorough answers, very helpful! I'm super excited someone even read this :smiley:

This is great! Nothing is attached to or searched for within the reads; the matching record order across all three files is how my sample names get associated with the reads, right?!

About the barcode length: I'm really sorry for the confusion. I looked back through all my indices/barcodes and they are all 8 bp long (I used the Illumina D5... and D7... indices as well as some N/A pairs). I must have counted the linker, too...
So that means as long as my index reads, concatenated into a 16-bp string (as @Mehrbod_Estaki wrote in his reply), match a barcode in my metadata, the reads can be assigned and nothing is discarded, right?

As for the sequence counts:
I have around 50K sequence counts on average; the max is 200K, and the lowest sample (a water control) has 120. Is this okay?

I was a bit confused because after the dmx I have around 11 million reads (or sequence counts), and after DADA2 only about a tenth is left (1.4 million total frequency). Is this a normal number, or am I doing something wrong? I only trimmed the first 5 bases, and my coverage seems okay even there... I'm attaching my demux.qzv and dada-table.qzv; hopefully you'll be able to see it.

demux.qzv (291.3 KB)
dada-table.qzv (591.0 KB)

Another question: dmx at the moment does not allow mismatches in the barcodes. In your experience, how drastically would the read count change if I allowed, for example, 1 mismatch? Or is there another way I can recover more of my sequences? :slight_smile:

Thanks again for the help!!

Hi @Jana_Greta,

Glad you find our replies useful! Someone will always read...

Oh good, that makes things a lot simpler!

Right! Nothing is discarded as long as there is a recognizable barcode associated with that read.

The spread of your sequence counts looks fine to me, though I do have some concerns about the results (explained below).

Going from 11 million reads to 1.4 million after DADA2 is a bit on the high end of loss from what I've experienced, but others have certainly reported similar numbers, so it's not really anything to worry about at the moment; it is, though, a step to revisit later on if something doesn't seem right. To me this just indicates that there were a lot of chimeras or singletons that had to be dealt with.

Good question! Personally, I would be very strict at this step and allow no mismatches. It's too risky to assign reads to the wrong samples; that can have drastic effects on the results. Just not worth the risk, in my opinion.

Thanks for providing your artifacts; they are very helpful. Your forward and reverse reads look like they are in great shape... suspiciously good, even :stuck_out_tongue: You probably wouldn't even need to do any trimming or truncating, really; perhaps truncate at 240 bp on the reverse reads.
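If you rerun it, something along these lines (the output names are just examples; --p-trunc-len-f 0 means no truncation of the forward reads):

```bash
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 5 \
  --p-trim-left-r 5 \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 240 \
  --o-table dada-table.qza \
  --o-representative-sequences rep-seqs.qza
```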

My concern comes from looking at the feature-detail tab in your table artifact. There is a tremendous amount of diversity, which I guess is possible in soil/root samples, but the vast majority of features only occur in 1 sample. This may be true given your experimental setup, but that output structure will do you no favors when it comes to doing stats, as it will carry a lot of noise. We might find out more once you assign some taxonomy to those reads. What is the amplicon target? 16S? Do you know the overlap coverage of the primer set? I'm wondering if it might be worth redoing your denoising with a larger number of training sequences, given the vast diversity in this data set. @Nicholas_Bokulich might be able to make better sense of this, or maybe my concerns are not even really justified!


Hello, and thanks again for the answer!
Also, I'm sorry I went quiet for a while (busy times in microbiome research!), but I certainly have some answers (and of course some questions as well :wink: ).

All in all, it seems this really is a workable way to handle data in the 4-file format, which is great.
That's also good news about DADA2/filtering and the issue with singletons/chimeras; I will keep my eyes open for that later on. I will repeat DADA2 with the truncation you suggested, just to see what changes. Interesting about the barcodes... I guess being stringent is justifiable! :innocent:

About my samples: 16S, and yes, there is high diversity, but this can be explained easily: in this run I combined two different primer pairs and just haven't separated the libraries yet.
Now I wonder: is it wrong to cluster and assign OTUs for both together, or should they be split right from the start? And surely I have to separate them before taxonomic assignment? I'll redo this and let you know about the taxonomy!

I'm sure you're right, but here I'm lost: what do you mean by the overlap coverage of the primer set? And do you mean training sequences in terms of taxonomic assignment? I only know of training sequences from training my own classifier (I now use my own trained SILVA classifier, which works quite well, I think!).

Thanks for the help!!

Hi @Jana_Greta,

That's certainly an important criterion to keep in mind. Do the primers target the same region or different ones? Does each sample have 2 regions sequenced, or do different samples cover different regions? There have been a few posts about how to deal with this that may be of help; see this discussion, for example, with lots of other useful links.

So this refers to the region where your forward and reverse reads, approaching from opposite directions, meet and 'overlap'. The overlap region is what is used to merge your reads into one longer read, and perhaps correct some errors. Its size depends on your primers and how many sequencing cycles were run. In your case you have 2x250 cycles, so a total of 500 bp of read. Based on the total size of your amplicon, you need to calculate how much of that 500 bp is overlap. Your truncation parameters should take this into consideration. For a more thorough explanation, see here.
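As a quick back-of-the-envelope sketch (the ~460 bp amplicon size here is hypothetical, just to show the arithmetic):

```bash
# 2x250 cycles against a ~460 bp amplicon:
echo $(( 2 * 250 - 460 ))    # => 40 bp of overlap with untruncated reads
echo $(( 250 + 240 - 460 ))  # => 30 bp left after truncating R2 at 240 bp
```

Whatever is left has to stay above the minimum overlap DADA2 needs to merge the pairs (on the order of 12 bp by default), ideally with some slack.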

I was actually referring to the --p-n-reads-learn parameter in the dada2 plugin, which dictates how many reads the algorithm uses to train its error models. But don't worry about it right now; that suggestion is no longer important in this case, and the default settings should be plenty.

