I ran a little experiment to see if I could save on lab time. I'm curious if you think my conclusions are sound:
Overall goal
Detect taxa from eDNA samples using 12S and cytb genes.
Experiment
Simulate what it would be like to sequence the 12S and cytb genes of one sample together with the same indexes. (E.g. ED501112s_R1_001.fastq.gz and ED5011cytb_R1_001.fastq.gz become ED5011allgenes_R1_001.fastq.gz, etc.) If the results were similar, we would save a lot of lab time and money on indexes.
What I did
Concatenated the already-separate 12S and cytb FASTQ files for two individuals, then ran them through the QIIME pipeline. Compared the results from the genes run separately with the results from a single combined file for each sample.
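For anyone wanting to reproduce the concatenation step, here is a minimal sketch of how it could be done in Python. It relies on the fact that gzip files appended byte-for-byte still form a valid multi-member gzip file. The function names `concat_fastq_gz` and `count_reads` are my own, and the read counter assumes plain four-line FASTQ records. For paired-end data, R1 and R2 would need to be concatenated in the same gene order so the pairs stay in sync.

```python
import gzip
import shutil


def concat_fastq_gz(inputs, output):
    """Append gzipped FASTQ files byte-for-byte into one gzipped file.

    Concatenated gzip members are still a valid gzip stream, so no
    decompression is needed.
    """
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


def count_reads(path):
    """Count reads in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


# Example call, using the file names from the post:
# concat_fastq_gz(
#     ["ED501112s_R1_001.fastq.gz", "ED5011cytb_R1_001.fastq.gz"],
#     "ED5011allgenes_R1_001.fastq.gz",
# )
```

A quick sanity check is that `count_reads` on the combined file equals the sum of the counts for the two input files.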
The overall pipeline
Import -> denoise with DADA2 -> classify against the reference database for the proper gene -> export results. For the separated files, I used a different reference database for the 12S files and the cytb files (obviously). For the files that included both 12S and cytb genes, I classified them twice: once with a 12S reference database and once with a cytb database. The hope was that, when using the 12S database, all the cytb reads would be discarded as bad matches while all the 12S reads would align well, and vice versa with the cytb database.
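The "discard bad matches" idea can be sketched roughly as below. This is only an illustration, not the actual classifier logic: it assumes the taxonomy results have already been loaded into a list of dicts with hypothetical keys `feature_id`, `taxon`, and `confidence`, that unmatched features are labeled "Unassigned", and that 0.9 is a reasonable cutoff. All of those are assumptions that would need adjusting to the classifier and export format actually used.

```python
def split_by_assignment(rows, min_confidence=0.9):
    """Split classified features into kept and dropped lists.

    A feature is dropped if it is labeled "Unassigned" or its confidence
    is below min_confidence. Keys and threshold are hypothetical.
    """
    kept, dropped = [], []
    for row in rows:
        unassigned = row["taxon"] == "Unassigned"
        if unassigned or row["confidence"] < min_confidence:
            dropped.append(row)
        else:
            kept.append(row)
    return kept, dropped


# Made-up rows for a combined sample classified against a 12S database.
example_rows = [
    {"feature_id": "f1", "taxon": "taxon_a", "confidence": 0.99},
    {"feature_id": "f2", "taxon": "Unassigned", "confidence": 0.40},
    {"feature_id": "f3", "taxon": "taxon_b", "confidence": 0.55},
]
kept, dropped = split_by_assignment(example_rows)
```

Running the same split on the results from the cytb database, and checking that no feature is kept by both, would be one way to see whether the two genes really separate cleanly.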
Results
Combining the genes added a few hundred thousand reads to each sample, bringing the total to around 1 million reads per sample. Overall, the samples with both 12S and cytb reads detected slightly (~10%) fewer taxa than when the genes were run separately. The taxa that were lost were not "spurious" results (e.g. an orangutan in Alabama), but potentially believable detections.
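The comparison itself is just set arithmetic on the taxon lists from each run. A minimal sketch, assuming the detected taxa for each approach are available as lists of names (the function name `compare_detections` and the placeholder taxa are made up for illustration):

```python
def compare_detections(separate, combined):
    """Compare taxa detected by separate-gene runs vs. a combined run."""
    separate, combined = set(separate), set(combined)
    lost = separate - combined      # detected only when genes were separate
    gained = combined - separate    # detected only in the combined run
    if separate:
        pct_fewer = 100 * (len(separate) - len(combined)) / len(separate)
    else:
        pct_fewer = 0.0
    return {
        "lost": sorted(lost),
        "gained": sorted(gained),
        "pct_fewer": pct_fewer,
    }


# Placeholder taxa, not real results.
separate_taxa = [f"taxon_{i}" for i in range(10)]
combined_taxa = separate_taxa[:9]
summary = compare_detections(separate_taxa, combined_taxa)
```

The `lost` list is the interesting part here, since those are the detections that have to be judged as either true positives or errors.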
Final thoughts
- Combining the reads gives the denoising algorithm more reads to parameterize the error-finding model, so more reads were discarded as errors, potentially improving the final dataset.
- The number of bp trimmed could be gene-specific. By combining the genes into one file and setting trim thresholds on the single file, you could be trimming more aggressively than if you had left the genes separate.
- Although the final dataset might be "better" with more reads in the samples that contained both 12S and cytb, it's hard to know if the taxa that were discarded were truly false positives.
Overall, my gut says that I should leave the genes separate, do a bit more lab work, and use a few more of my indexes. I may be improving the dataset a bit by better parameterizing the DADA2 models, but the results are hard to interpret.
Is my logic sound here? Is there an accepted practice in this situation?
Thanks for your help.
-Alex