Multi-gene amplicon sequences with dada2

Stream_biofilm · October 3, 2017, 8:53pm

Hello,

We amplified four gene markers (16S, 18S, 23S, rbcL) for each sample and combined all four markers before adding the barcode and sequencing. I now have demultiplexed sequences but each sample still contains sequences of all four amplicons. Do I need to create individual files for each sample and marker before processing with dada2? Each amplicon is a different size but I figure I can use the primer to parse them out. Is there anything in place to assist me with this?

Thanks,
Danny

ebolyen · October 9, 2017, 5:55pm

Hi @Stream_biofilm,

Sorry for the delayed response. I'm afraid I can't speak to whether DADA2 can handle multiple amplicons (of different regions) at once, but hopefully @benjjneb can tell us. In any-case, supposing you did denoise multiple amplicons, you'd still need to separate the results as those features aren't directly comparable. So you are probably best served by separating your data first, then processing each amplicon separately.

Unfortunately, we don't have any functionality to separate mixed amplicons (with shared barcordes) at this time (and we don't have any immediate plans either). You may be able to do something with cutadapt or similar tools, but we can't really help you with that here.

Sorry I couldn't be of more help!

Stream_biofilm · October 9, 2017, 7:39pm

Thanks for the response, I have a script that allows me to parse out each gene for each sample and make separate fastq files. However the result consists of similar but different numbers of forward and reverse reads for each gene within a sample. When I put the individual forward and reverse reads of a gene within a sample into dada2 I got an error -> "(1) Filtering Error in fastqPairedFilter(c(unfiltsF[[i]], unfiltsR[[i]]), c(filteredFastqF, : Mismatched forward and reverse sequence files: 59521, 59049. ". I assume this error is due to the unequal number of forward and reverse reads using my script.

benjjneb · October 9, 2017, 7:43pm

I assume this error is due to the unequal number of forward and reverse reads using my script.

Yes that is the problem. Is there a way to modify the script so that it only keeps F/R reads if they both get assigned to the same locus?

There is an option to check read IDs and match F/R pairs based on that (rather than read order) in the dada2 R package, but its not accessible through the Q2 plugin right now. Right now the plugin requires that the F/R reads match each other (same number, same order).

Stream_biofilm · October 10, 2017, 3:51pm

Would I be able to do this using the dada2 R package and then transfer the files back into qiime2? Also, would there be an issue on running dada2 with multiple genes not parsed and then separating them downstream?

benjjneb · October 10, 2017, 5:22pm

Yes* using the filterAndTrim(..., matchIDs=TRUE) function. I could help you with that at the dada2 R package forum if you'd like to post an issue there. I also wonder if one of the Q2 masters might know if a method that can do this is already included in Q2?

^ If they are in a normal format, e.g. w/ Illumina IDs. Also we do plan on bringing a lot of these functions into Q2 once Pipelines are implemented.

If they were sequenced together, then that would probably be just fine. The algorithm should easily separate the different gene regions from each other.

jairideout · October 10, 2017, 5:54pm

Unfortunately I don't know of a method in qiime2 that handles paired-end data that's not in the same record order. That would be great to expose in q2-dada2 when qiime2 is able to support pipelines (I think that's blocking this feature right now?).

ebolyen · October 10, 2017, 7:34pm

I wrote a post for importing your data after following the DADA2 tutorial. Getting the FeatureData[Sequence] artifact is pretty easy, but the FeatureTable[Frequency] takes a bit of massaging.

That is super cool!

I can't think of a good way to separate them after the fact in QIIME 2. So splitting them, then using DADA2 directly (with matchIDs=TRUE) may be your best option.

In the future we might be able to frame this problem as removing "contamination" (@Nicholas_Bokulich is working/thinking about plugins for that), but I couldn't say for sure.

Stream_biofilm · October 10, 2017, 7:48pm

Yes they are sequenced together. I tried to run the reads through q2-dada2 without separating out the genes. It produced a warning message: "Self-consistency loop terminated before convergence" after it hit selfConsist step 10. I believe I read somewhere on this forum about that warning being okay depending on the error rate

Thanks! It sounds like this might be the best option to try now. Ill report back with my progress.

I did some more research and found using the Illumina bcl2fastq program I might be able to take the raw multiplexed sequences and demultiplex them using their software but it looks a little confusing.

Stream_biofilm · October 11, 2017, 5:40pm

Is there any reason why I cant just demultiplex the samples twice? First time using the sample index and then take those and demultiplex again using the primer sequence for each gene within the sample?

ebolyen · October 12, 2017, 10:36pm

If I'm understanding correctly, you are suggesting using the primer as a barcode for the second demultiplexing step?

It'll probably be more work than it's worth to do it with QIIME 2 since demux emp-single/paired assumes that your "barcodes" are the same length, and that there is a barcodes.fastq.gz file with the "barcodes" in exactly the same order. It seems like you would need to script something to meet those assumptions, and at that point you may as well do the demultiplexing yourself. Not to mention all of the importing/exporting you would need to do.

Stream_biofilm · October 17, 2017, 7:23pm

Just an update on this issue. I was able to use repair.sh from BBTools to fix my forward and reverse fastq files putting them in the same order and making sure only sequences found in both files remain.
Thanks for the help everyone!
Danny