Statistical advice - comparing samples with different indexing/metabarcoding cycle lengths

MBugay · November 13, 2024, 4:42pm

Hi all,

For indexing/metabarcoding, I used 7 cycles. Unfortunately, a few samples did not meet the concentration threshold for my final library, and I plan to redo them with 10 indexing cycles.

What statistical tests/methods would you recommend to compare the 7 cycle samples vs the 10 cycle samples?

Thank you.

gregcaporaso · November 13, 2024, 5:24pm

Hi @MBugay,
Is it possible to redo all of the samples with 10 cycles? That will be the most robust, as you'll be sure to not have run-specific biases impacting your results.

If not, I would recommend at least including some of the samples (maybe 5-10) that did meet your concentration threshold from the first run in your second run, re-amplifying them with 10 cycles. Then, after sequencing, confirm the similarity of those samples across the two runs. You should expect each resequenced sample to be more similar to its own replicate from the other run than to any other sample. You can also do pairwise comparisons across those groups of samples to see if any taxa are systematically over- or under-represented in the second run. Controlling for this can be challenging if you do discover differences, so again, to minimize the chance of issues, the best approach would be to redo all of the samples if you plan to redo any of them.
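As a rough illustration of the replicate check described above, here is a minimal Python sketch. It assumes you have per-sample count vectors for the first run and for the re-amplified replicates, keyed by the same sample IDs. The function names, the toy counts, the choice of Bray-Curtis on relative abundances, and the pseudocount of 0.5 are all assumptions made for this example, not part of any QIIME 2 API.

```python
import math

def to_relative(counts):
    """Convert a count vector to relative abundances."""
    total = sum(counts)
    return [c / total for c in counts] if total else list(counts)

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two equal-length vectors."""
    den = sum(a) + sum(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / den if den else 0.0

def replicates_are_nearest(run1, run2):
    """For each replicate in run2, is its closest run1 sample the same ID?"""
    result = {}
    for sid, counts in run2.items():
        p2 = to_relative(counts)
        nearest = min(
            run1, key=lambda other: bray_curtis(to_relative(run1[other]), p2)
        )
        result[sid] = nearest == sid
    return result

def mean_log2_ratio_per_taxon(run1, run2, taxa, pseudocount=0.5):
    """Mean log2(run2 / run1) relative abundance per taxon over shared samples.

    Values consistently far from 0 across replicates hint at a taxon that is
    over- (positive) or under-represented (negative) in the second run.
    """
    shared = sorted(set(run1) & set(run2))
    ratios = {taxon: [] for taxon in taxa}
    for sid in shared:
        p1 = to_relative([c + pseudocount for c in run1[sid]])
        p2 = to_relative([c + pseudocount for c in run2[sid]])
        for taxon, a, b in zip(taxa, p1, p2):
            ratios[taxon].append(math.log2(b / a))
    return {taxon: sum(v) / len(v) for taxon, v in ratios.items() if v}

# Toy counts (hypothetical): three taxa, three 7-cycle samples,
# two of which were re-amplified with 10 cycles.
taxa = ["taxon_1", "taxon_2", "taxon_3"]
run1 = {"S1": [80, 15, 5], "S2": [10, 70, 20], "S3": [30, 30, 40]}
run2 = {"S1": [75, 18, 7], "S2": [12, 66, 22]}

print(replicates_are_nearest(run1, run2))
print(mean_log2_ratio_per_taxon(run1, run2, taxa))
```

With only 5-10 replicate pairs, the per-taxon log ratios are best read as a screen for obvious shifts rather than a formal test.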

MBugay · November 13, 2024, 6:43pm

Hi @gregcaporaso

Unfortunately, I have limited volume of the indices and 2x 96-well plates that I would have to re-do if I were to run all the samples with 10 cycles. I definitely would re-do everything if I could.

There were 30 samples (20 in Sample Type A and 10 in Sample Type B) that didn't meet my concentration threshold (with the 7 cycles). The total sample size for Sample Type A is 120 and for B is 60. I would still have a decent sample size even if I were to exclude the 30 samples, but a concern is that I could later exclude more of the remaining samples during the data processing.

Someone suggested I re-do only the 30 samples since I would need to dilute all the samples to the same concentration for the final library, but, as you mentioned, my concern was the run-specific biases even if everything is diluted to the same concentration.

After comparing the groups (e.g., differential abundance testing with ANCOM-BC or similar?), how would you recommend controlling for over/under representation?

Thank you for the suggestion of including samples from the previous run.

gregcaporaso · November 13, 2024, 8:59pm

@MBugay, there isn't a foolproof approach that I can recommend for controlling for over- or under-representation of taxa. Others might have suggestions for approaches that I'm not aware of. If using ANCOM-BC, I would include the sequencing run as a variable in the formula, so that you're at least accounting for it as a potential confounding variable. Given your sample sizes, I might also perform the same tests on both the full data set (with the 10 cycle samples) and the data set from the first run only, and confirm that similar patterns are observed. You could always consider presenting parallel results from the first run only as a supplementary analysis (assuming this is work for a publication) that confirms the findings from the full data that are presented in your main text.
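One way to make the "confirm that similar patterns are observed" step concrete is to line up the two result tables taxon by taxon. The sketch below assumes you have already exported, for each analysis, a log fold change and an adjusted p-value per taxon. The function name, the dictionary layout, the 0.05 cutoff, and the toy values are hypothetical and chosen only for illustration.

```python
def compare_results(full, first_run_only, alpha=0.05):
    """Compare differential abundance calls from two analyses.

    Each input maps taxon -> (log_fold_change, adjusted_p_value).
    """
    sig_full = {t for t, (_, q) in full.items() if q < alpha}
    sig_first = {t for t, (_, q) in first_run_only.items() if q < alpha}
    both = sig_full & sig_first
    # Significant in both analyses and changing in the same direction.
    concordant = {
        t for t in both if (full[t][0] > 0) == (first_run_only[t][0] > 0)
    }
    return {
        "concordant": concordant,
        "opposite_direction": both - concordant,
        "only_full": sig_full - sig_first,
        "only_first_run": sig_first - sig_full,
    }

# Toy values (hypothetical), not real results.
full = {
    "g__A": (1.2, 0.001),
    "g__B": (-0.8, 0.03),
    "g__C": (0.5, 0.04),
    "g__D": (0.1, 0.6),
}
first_run_only = {
    "g__A": (1.0, 0.004),
    "g__B": (0.6, 0.02),
    "g__C": (0.4, 0.20),
    "g__D": (0.2, 0.5),
}

for label, members in compare_results(full, first_run_only).items():
    print(label, sorted(members))
```

Taxa that are significant only in the full data set are not necessarily a problem, since the first-run subset has fewer samples. Taxa that flip direction between the two analyses are the ones I would look at most closely.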

I would still highly recommend including 5-10 samples that worked from the first run on the second run, so you can test whether there are obvious differences - even if you don't have a great way to correct for them, it's important to know if they are there. If I were selecting 5-10 samples to include in an analysis like this, I would try to pick ones that were as different in composition as possible (e.g., based on a taxonomy bar plot or a PCoA plot).
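If picking the samples by eye from a PCoA plot is awkward, a simple greedy "farthest point" selection over a pairwise distance matrix is one possible way to do it. The sketch below is only an illustration: the function name is made up, and the toy distances stand in for whatever beta diversity metric you computed on the first run (Bray-Curtis or UniFrac, for example).

```python
def pick_diverse(dist, k):
    """Greedily pick k sample IDs that are far apart from each other.

    dist is a symmetric nested dict: dist[a][b] is the distance between a and b.
    """
    ids = sorted(dist)
    if k >= len(ids):
        return ids
    # Seed with the two most distant samples.
    pairs = [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]]
    seed = max(pairs, key=lambda p: dist[p[0]][p[1]])
    chosen = list(seed)[:k]
    # Repeatedly add the sample whose nearest chosen sample is farthest away.
    while len(chosen) < k:
        remaining = [s for s in ids if s not in chosen]
        chosen.append(
            max(remaining, key=lambda s: min(dist[s][c] for c in chosen))
        )
    return chosen

# Toy example (hypothetical): samples placed on a line, distance = gap.
positions = {"S1": 0, "S2": 1, "S3": 5, "S4": 10, "S5": 9}
dist = {
    a: {b: abs(positions[a] - positions[b]) for b in positions}
    for a in positions
}

print(pick_diverse(dist, 3))
```

It is still worth checking the selection against a taxonomy bar plot, and making sure both sample types are represented among the chosen samples.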

Good luck!