Best way to analyse samples from different sequencing batches separately (per batch)?

fgara · November 12, 2020, 8:22am

Hello everyone,
I hope you are well.

I have many samples from different batches of sequencing that arrive at different times (usually once a month). These are paired-end Illumina MiSeq data (2 x 300 bp).

I remember reading somewhere on this forum that, ideally, I should analyze all of the samples together or, failing that, separately (per batch) but with the same set of parameters. So I did that.

However, I recently noticed that the set of parameters that worked well for my first 3 batches does not work well for my last batch. This seems to be caused by a difference in the quality of the raw reads.

For example, with the first 3 batches, using the same parameters, the majority of the reads were retained after filtering. But with the last batch I lost a lot of reads after filtering (in some cases, I lost 85% of the reads). And this seems to cause problems in the downstream analyses.

Ideally, I would like to analyze each batch separately, without having to analyze all the samples together at the same time, because it would take too much computational resources and time (I have hundreds of samples in a batch).

May I ask for your advice on this please?
Do you think sticking with the same set of parameters for each batch is a good way to do this?
But if the quality of the raw reads in a batch differs so much that the same set of parameters gives a bad result (like what I'm facing right now), what should I do?

Thank you so much for your generous time and help!

jwdebelius · November 12, 2020, 10:45am

Hi @fgara,

I would recommend that you retain the same set of parameters across all your batches. This is what we did with the American Gut Project, which was sequenced in something like 100 runs. The parameters were not always optimal (we might lose 50% of the reads on a given run) but you would be surprised at how much information you get out of shallow samples. (Like 5000-10,000 sequences/sample are more than sufficient.)

If you have one run that's lower quality, maybe look into why, or discuss with your sequencing facility whether there are any options for a re-run. You may also just have one run that's lower quality (unfortunately). If you continue to have multiple runs where the parameters aren't working, you may also choose to re-run the earlier runs with the new parameters.
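As an illustration of keeping one parameter set across runs, here is a minimal sketch that defines the denoising parameters once and builds the same `qiime dada2 denoise-paired` command for every run. The run names, file paths, and truncation lengths are assumptions made up for the example, not recommendations, and the script only prints the commands rather than executing them.

```python
# Shared denoising parameters, defined once and reused for every run.
# The values are illustrative assumptions, not recommended settings.
SHARED_PARAMS = {
    "--p-trunc-len-f": 270,
    "--p-trunc-len-r": 210,
}


def denoise_command(run_id):
    """Build (but do not execute) a DADA2 command for one sequencing run.

    The directory layout (<run_id>/demux.qza and so on) is hypothetical.
    """
    cmd = [
        "qiime", "dada2", "denoise-paired",
        "--i-demultiplexed-seqs", f"{run_id}/demux.qza",
    ]
    for flag, value in SHARED_PARAMS.items():
        cmd += [flag, str(value)]
    cmd += [
        "--o-table", f"{run_id}/table.qza",
        "--o-representative-sequences", f"{run_id}/rep-seqs.qza",
        "--o-denoising-stats", f"{run_id}/stats.qza",
    ]
    return cmd


# Hypothetical run identifiers, one per monthly batch.
runs = ["run01", "run02", "run03", "run04"]
commands = {run: denoise_command(run) for run in runs}
for run in runs:
    print(" ".join(commands[run]))
```

If the parameters ever have to change, editing `SHARED_PARAMS` and regenerating the commands for all runs keeps the earlier and later batches consistent.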

You might also consider deblur over dada2 because I find it's quicker and less computationally intensive. However, the assumptions may not be as good, so that's something to keep in mind. But I guess the question here is about what works best for you.

I think the step that becomes potentially challenging is how you build your phylogenetic tree. I feel like you may need to merge the representative sequences. (Hopefully at some point, you saturate the representative sequence space). I think there are also algorithms that let you calculate your distance metrics pairwise instead of on all the samples.

Best,
Justine

fgara · November 12, 2020, 2:26pm

Hi Justine, thank you so much for sharing your valuable experience! I truly appreciate it. I really admire what you did with the American Gut Project!

Ah alright, that's what I've been trying to do.

Many thanks once again for sharing your insights and experience - I truly appreciate them

wasade · November 12, 2020, 5:05pm

Hi @fgara, to expand on @jwdebelius's comments, Deblur was designed specifically for integrating samples from multiple sequencing runs. Its application is sample independent (if --p-min-reads=1 is set). For AGP, we applied Deblur across all the sequencing runs, and removed low read count samples and features after the merge.
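The "merge first, filter afterwards" order can be sketched in plain Python. This is only a toy illustration under stated assumptions: the tables are small dictionaries of the form {sample ID: {feature: count}}, the sample IDs, sequences, and thresholds are invented, and Deblur itself is not run. In QIIME 2 the corresponding steps would be done on artifacts with the feature-table plugin (for example `qiime feature-table merge`, `filter-samples`, and `filter-features`).

```python
from collections import Counter


def merge_tables(tables):
    """Merge per-run tables shaped {sample_id: {feature: count}}.

    Assumes every sample ID is unique across runs.
    """
    merged = {}
    for table in tables:
        for sample, counts in table.items():
            if sample in merged:
                raise ValueError(f"duplicate sample ID: {sample}")
            merged[sample] = dict(counts)
    return merged


def filter_merged(table, min_sample_reads, min_feature_reads):
    """Drop low-depth samples, then features that are rare in what remains."""
    kept = {s: c for s, c in table.items() if sum(c.values()) >= min_sample_reads}
    feature_totals = Counter()
    for counts in kept.values():
        feature_totals.update(counts)
    keep_features = {f for f, n in feature_totals.items() if n >= min_feature_reads}
    return {
        s: {f: n for f, n in c.items() if f in keep_features}
        for s, c in kept.items()
    }


# Toy per-run tables; sequences are shortened stand-ins for real features.
run1 = {"S1": {"ACGT": 900, "TTGA": 200}, "S2": {"ACGT": 3}}
run2 = {"S3": {"ACGT": 500, "GGCC": 1, "TTGA": 700}}

merged = merge_tables([run1, run2])
# Thresholds are arbitrary example values.
filtered = filter_merged(merged, min_sample_reads=100, min_feature_reads=10)
print(filtered)
```

Filtering after the merge means a feature is judged on its total count across all runs, rather than being dropped run by run.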

Regardless of the sequence assessment method, I recommend including the sequencing run as a variable in the sample metadata. That variable may end up associated as a significant effect with, for example, PERMANOVA. And it may be important in your analysis to see whether a run effect is stronger or weaker than the biological signals under investigation.
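To make the run-effect test concrete, here is a minimal pure-Python sketch of a one-way PERMANOVA (pseudo-F statistic plus a label-permutation p-value) with the sequencing run as the grouping variable. It is a simplified illustration, not the implementation used by QIIME 2 or in the AGP analyses; the distance matrix and run labels are toy data invented for the example. For real data an established implementation, such as `qiime diversity beta-group-significance`, is the better choice.

```python
import random
from itertools import combinations


def pseudo_f(dist, labels):
    """PERMANOVA pseudo-F for a square distance matrix and one label per sample."""
    n = len(labels)
    groups = {}
    for idx, label in enumerate(labels):
        groups.setdefault(label, []).append(idx)
    a = len(groups)
    # Total and within-group sums of squares from pairwise distances.
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = sum(
        sum(dist[i][j] ** 2 for i, j in combinations(members, 2)) / len(members)
        for members in groups.values()
    )
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permanova(dist, labels, permutations=999, seed=0):
    """Return (observed pseudo-F, permutation p-value)."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy data: six samples placed on a line, three per hypothetical run.
positions = [0, 1, 2, 10, 11, 12]
run_labels = ["runA", "runA", "runA", "runB", "runB", "runB"]
dist = [[abs(x - y) for y in positions] for x in positions]

f_stat, p_value = permanova(dist, run_labels)
print(f_stat, p_value)
```

Running the same test with a biological variable as the grouping gives a second pseudo-F that can be compared against the one for the run.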

For a tree, regardless of analyzing per-run or over merged sequencing runs, I recommend considering fragment-insertion to ensure the backbone tree is consistent. De novo trees, with a single tree per sequencing run, would likely magnify run differences.
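One possible way to get a single, consistent tree is to merge the representative sequences from every run and place them all into the same reference backbone. The sketch below only builds the two QIIME 2 command lines (`qiime feature-table merge-seqs` and `qiime fragment-insertion sepp`) and does not execute them. The run names, file paths, and the reference database filename are hypothetical placeholders.

```python
def merge_seqs_command(run_ids, merged_path):
    """Build a command that merges per-run representative sequences."""
    cmd = ["qiime", "feature-table", "merge-seqs"]
    for run_id in run_ids:
        # Hypothetical layout: one rep-seqs artifact per run directory.
        cmd += ["--i-data", f"{run_id}/rep-seqs.qza"]
    cmd += ["--o-merged-data", merged_path]
    return cmd


def sepp_command(rep_seqs_path, reference_path):
    """Build a fragment-insertion command against one shared reference."""
    return [
        "qiime", "fragment-insertion", "sepp",
        "--i-representative-sequences", rep_seqs_path,
        "--i-reference-database", reference_path,
        "--o-tree", "insertion-tree.qza",
        "--o-placements", "insertion-placements.qza",
    ]


runs = ["run01", "run02", "run03"]  # hypothetical run identifiers
merge_cmd = merge_seqs_command(runs, "merged-rep-seqs.qza")
# "sepp-reference.qza" stands in for whichever SEPP reference database is used.
tree_cmd = sepp_command("merged-rep-seqs.qza", "sepp-reference.qza")

print(" ".join(merge_cmd))
print(" ".join(tree_cmd))
```

Because every run is inserted into the same reference, samples from different runs end up on one shared tree instead of one de novo tree per run.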

All the best,
Daniel

fgara · November 12, 2020, 5:45pm

Hi @wasade,

Thank you so much for your kind reply!

Ooh I see, so does this mean you would not recommend Deblur for analyzing samples per sequencing run/batch separately?

Thank you for pointing these out! Yes, I read that you used PERMANOVA in your AGP papers, but I need to learn about it in more depth. I also need to research more about the fragment-insertion method for the tree that you mentioned.

Thank you once again for your valuable advice! Much appreciated

wasade · November 12, 2020, 6:00pm

Deblur is totally okay for per run/batch analyses as well. You can see some of its uses here.

For fragment insertion, the benchmarking performed to support the plugin is here.

Good luck!!

Best,
Daniel