Multiplexing Different Genes Into One FASTQ vs. Keeping Them Separate

I ran a little experiment to see if I could save on lab time. I'm curious if you think my conclusions are sound:

Overall goal

Detect taxa from eDNA samples using 12S and cytb genes.

Experiment

Simulate what it would be like to sequence the 12S and cytb genes of one sample together with the same indexes. (E.g. ED501112s_R1_001.fastq.gz and ED5011cytb_R1_001.fastq.gz become ED5011allgenes_R1_001.fastq.gz, etc.) If the results were similar, we would save a lot of lab time and money on indexes.

What I did

Concatenated the already-separate 12S and cytb FASTQ files for two individuals, then ran them through the QIIME pipeline. Compared the results from the genes run separately to the results from a single combined file for each sample.
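The concatenation step can be sketched as follows. This is a minimal illustration, not necessarily how the files were actually combined. It relies on the fact that gzip files appended byte-for-byte still form a valid multi-member gzip file, and it assumes standard four-line FASTQ records. The function names `concatenate_fastq_gz` and `count_reads` are made up for this example.

```python
import gzip
import shutil


def concatenate_fastq_gz(input_paths, output_path):
    """Append gzipped FASTQ files byte-for-byte into one gzipped file.

    Concatenated gzip members form a valid gzip stream, so no
    decompression or recompression is needed.
    """
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


def count_reads(fastq_gz_path):
    """Count reads, ASSUMING every FASTQ record spans exactly four lines."""
    with gzip.open(fastq_gz_path, "rt") as handle:
        return sum(1 for _ in handle) // 4
```

For example, `concatenate_fastq_gz(["ED501112s_R1_001.fastq.gz", "ED5011cytb_R1_001.fastq.gz"], "ED5011allgenes_R1_001.fastq.gz")` would build the combined forward-read file. The R2 files need to be combined in the same gene order so that read pairs stay in sync, and `count_reads` gives a quick check that the combined total equals the sum of the inputs.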

The overall pipeline

Import -> denoise with DADA2 -> classify against the reference database for the proper gene -> export results. For the separated files, I used a different reference database for the 12S files and the cytb files (obviously). For the files that included both 12S and cytb reads, I classified them twice: once with a 12S reference database and once with a cytb database. The hope was that, when using the 12S database, all the cytb reads would be discarded as bad matches while all the 12S reads would align well (and vice versa for the cytb database).

Results

Combining the genes together added a few hundred thousand reads for each sample, bringing the total to around 1 million reads per sample. Overall, the samples with both 12S and cytb reads detected slightly (~10%) fewer taxa than when the genes were separated. The taxa discarded were not "spurious" results (e.g. an orangutan in Alabama), but were potentially believable detections.
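The "fewer taxa" comparison boils down to a set difference between the taxa detected in the separate runs and those detected in the combined run. A minimal sketch of that bookkeeping is below. The taxon labels are placeholders, not real detections from this experiment, and `compare_detections` is a made-up helper name.

```python
def compare_detections(separate_taxa, pooled_taxa):
    """Return (lost, gained, fraction_lost) between two lists of taxa.

    lost   = taxa seen only when the genes were run separately
    gained = taxa seen only in the combined-file run
    """
    separate_set, pooled_set = set(separate_taxa), set(pooled_taxa)
    lost = sorted(separate_set - pooled_set)
    gained = sorted(pooled_set - separate_set)
    fraction_lost = len(lost) / len(separate_set) if separate_set else 0.0
    return lost, gained, fraction_lost


# PLACEHOLDER labels for illustration only (not real results).
separate_run = ["taxon_%02d" % i for i in range(10)]
pooled_run = separate_run[:9]

lost, gained, fraction_lost = compare_detections(separate_run, pooled_run)
print(lost, gained, fraction_lost)
```

Listing the lost taxa by name, rather than only counting them, makes it easier to judge whether each one is a plausible detection for the sampling location.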

Final thoughts

  • Combining the reads gives the denoising algorithm more reads with which to parameterize its error model, so more reads were discarded as errors, potentially improving the final dataset.
  • The number of bp that should be trimmed could be gene specific. By combining the genes into one file and setting trim thresholds on that single file, you could be trimming one gene more aggressively than if you had left the genes separate.
  • Although the final dataset might be "better" with more reads in the samples that contained both 12S and cytb, it's hard to know if the taxa that were discarded were truly false positives.

Overall, my gut says that I should leave the genes separate, do a bit more lab work, and use a few more of my indexes. I may be improving the dataset a bit by better parameterizing the DADA2 models, but the results are hard to interpret.

Is my logic sound here? Is there an accepted practice in this situation?

Thanks for your help.

-Alex

Hi @alexkrohn ,

Thanks for the clearly structured questions/sections! I am moving this to "general discussion" as it is not tech support.

Long story short: I recommend keeping these separate. There are some other discussions on the forum about pooling 16S + ITS amplicons, and using the same barcodes for both can lead to various issues. One can only fit so many samples in a single sequencing run anyway, so it might not even be economically advantageous to share barcodes. Pooling these for sequencing should be fine, as long as they have different barcodes (to make separation and analysis easier).

On combining the reads to better parameterize the error model: I don't think that this is correct. As I recall, DADA2 cannot work with metagenome reads because it assumes that reads are from a single amplicon when training its error model. So a mix of amplicons might break it...

On trimming being gene specific: definitely! That is one reason to use different barcodes, so that it is very easy to split the reads prior to trimming and denoising.
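If the amplicons did end up sharing barcodes, one possible workaround is to separate reads by their leading primer sequence instead. This is a different technique from barcode demultiplexing, and the sketch below is only a rough illustration. It assumes each read still begins with its gene-specific forward primer, that records are standard four-line FASTQ, and that the input is a single file (for paired-end data, R2 would have to follow the assignment made for R1). The primer sequences are hypothetical placeholders, not real 12S or cytb primers, and the function names are made up.

```python
import gzip

# HYPOTHETICAL placeholder primers; substitute the real forward primers.
PRIMERS = {"12S": "ACGTACGTAC", "cytb": "TTGGCCAATT"}


def assign_gene(sequence, primers=PRIMERS, max_mismatches=1):
    """Return the gene whose primer matches the start of the read, else None."""
    for gene, primer in primers.items():
        prefix = sequence[: len(primer)]
        if len(prefix) < len(primer):
            continue  # read is shorter than the primer
        mismatches = sum(a != b for a, b in zip(prefix, primer))
        if mismatches <= max_mismatches:
            return gene
    return None


def split_by_primer(pooled_path, output_paths):
    """Write each read to the file for its gene; return per-gene read counts.

    output_paths maps gene name -> gzipped FASTQ path to create.
    Reads matching no primer are counted as 'unassigned' and not written.
    """
    counts = {gene: 0 for gene in output_paths}
    counts["unassigned"] = 0
    handles = {gene: gzip.open(path, "wt") for gene, path in output_paths.items()}
    try:
        with gzip.open(pooled_path, "rt") as pooled:
            while True:
                record = [pooled.readline() for _ in range(4)]
                if not record[0]:
                    break  # end of file
                gene = assign_gene(record[1].strip())
                if gene in handles:
                    handles[gene].writelines(record)
                    counts[gene] += 1
                else:
                    counts["unassigned"] += 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

A large "unassigned" count would be a warning sign that the primers or the mismatch tolerance do not fit the data. Distinct barcodes avoid all of this, which is why they remain the simpler route.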

Exactly.

Go with your gut! It should only be a little more work, and will save you time/work downstream...

But if the amount of work upstream is high, then yes, technically this could be done (again, many have done this with 16S + ITS, and there are some commercial kits that do this), but it creates some hassles downstream.

Good luck!
