suggestions for meta-analysis


I've been using QIIME 2 to work on a meta-analysis study in which I am comparing datasets from the same type of environment in different locations. Basically, what I did was retrieve the sequences for each study from NCBI, import them into QIIME 2, trim primers to look at the V4 region, and denoise. After that I used the merge function for the tables and representative sequences. Finally, I assigned taxonomy, constructed a phylogenetic tree (using fragment-insertion) and calculated diversity metrics.

In terms of composition each dataset looks alright, meaning very similar to what the original papers reported (that was a relief!). But at the moment of truth, when I compared beta diversity across sites, the results showed that the communities are very different (no shared ASVs across all of them, not even one!). So I am wondering whether my strategy was biased, or whether I can improve it somehow.

Any suggestion is very much appreciated :slight_smile:


Hi @Natali_Hernandez,
Your overall approach sounds good to me.
The first thing I'd want to check is the trim/truncate parameters you used for denoising. Which denoising tool are you using, and by chance are you using different trim/truncate parameters for each run? Different truncating parameters wouldn't be an issue if these are paired-end reads that you merge, but if you are using, say, just the forward reads, then even a single-nucleotide difference in length between the runs will yield unique ASVs for each run. Trimming, on the other hand, must be the same for all of them; otherwise it will result in unique run-specific ASVs, even if there are real ASVs shared across sites.
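To make the length issue concrete, here is a minimal sketch. It assumes feature IDs are MD5 hashes of the ASV sequence (which is how hashed feature IDs are typically produced); the reads and the `feature_id` helper are made up for illustration.

```python
import hashlib


def feature_id(seq: str) -> str:
    """Return the MD5 hex digest of a sequence (hypothetical helper)."""
    return hashlib.md5(seq.encode("ascii")).hexdigest()


# Made-up reads: the same amplicon truncated at 20 nt in one run
# and at 21 nt in another.
run_a = "TACGGAGGGTGCAAGCGTTA"
run_b = run_a + "A"

# One extra nucleotide gives a completely different ID.
print(feature_id(run_a) == feature_id(run_b))               # False
# Cutting both to the same length makes the IDs match again.
print(feature_id(run_a) == feature_id(run_b[:len(run_a)]))  # True
```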
Using fragment insertion here is a very good idea for this type of meta-analysis. Does the PCoA plot of weighted UniFrac look as you expect here?
Keep in mind that even if you were to do everything right in the processing steps, since these are samples processed from different sites, likely with different wet-lab kits, and maybe even different sequencing technologies, we expect some level of batch effect. There's a whole body of literature on the topic of batch-effects in microbiome studies that you should familiarize yourself with before interpreting your results.


Hi @Mehrbod_Estaki,

Many thanks for your response.

The dataset includes pyrosequencing and Illumina reads, so I used dada2. Since I removed the primers with cutadapt before denoising, p-trim was always 0. I did use different p-trunc values depending on the quality of the sequences, and the dataset contains both paired-end and single-end reads. Could this be the issue?

Not really, I was expecting some marked clustering and this is what I got.

Although weighted UniFrac looks much better than Bray-Curtis.


I think I will repeat everything using the same truncating parameters to see if that makes any difference. Other than that, I don't think I can "standardise" the samples any further. And as you mentioned, there is an inherent batch effect in this type of analysis which I cannot control.

Thanks :grin:


Hi @Natali_Hernandez,
Thanks for the update!

Yes, absolutely! If you want to work with ASVs while combining different datasets like this, you'll want to make sure your ASVs target the exact same region AND are of the same length because, like I mentioned before, even if you have 2 otherwise identical ASVs and one has 1 extra nt at the end, those 2 sequences will have unique ASV IDs. So you should make sure that the final sequences are all the same length. Note that simply using the same "truncating" parameter in DADA2 won't solve this, because in studies where you had paired-end reads you will still end up with longer reads after merging. You can either truncate your reads after DADA2 to the same length, or use just your forward reads and make sure your trim/truncate parameters are the same.
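As a rough sketch of the "truncate after DADA2" option, the snippet below cuts every ASV to a common length and sums the counts of ASVs that become identical. It uses plain Python dictionaries rather than real QIIME 2 artifacts; the sample names, sequences, and the `truncate_and_collapse` helper are all hypothetical. It also assumes every sequence starts at the same position (primers already removed), since cutting from the 3' end only helps in that case.

```python
from collections import defaultdict


def truncate_and_collapse(table, length):
    """Cut every sequence to `length` and sum counts of identical results.

    `table` maps sample -> {sequence: count}. Sequences shorter than
    `length` are dropped, because they cannot be compared at that length.
    """
    collapsed = {}
    for sample, counts in table.items():
        merged = defaultdict(int)
        for seq, n in counts.items():
            if len(seq) >= length:
                merged[seq[:length]] += n
        collapsed[sample] = dict(merged)
    return collapsed


# Toy data: site A was single-end (10 nt), site B paired-end (12 nt).
table = {
    "siteA_s1": {"ACGTACGTAC": 10, "TTGCATGCAA": 5},
    "siteB_s1": {"ACGTACGTACGG": 7, "ACGTACGTACGT": 3, "GGG": 1},
}

collapsed = truncate_and_collapse(table, 10)
shared = set(collapsed["siteA_s1"]) & set(collapsed["siteB_s1"])
print(collapsed["siteB_s1"])  # {'ACGTACGTAC': 10}
print(shared)                 # {'ACGTACGTAC'}
```

Before truncation the two toy samples share no sequences at all; afterwards one ASV is shared. Note that two distinct longer ASVs can collapse into one, so some resolution is lost.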

This actually looks like pretty clear clustering to me, assuming the colors are of your different sites? If you have samples from different "runs" but from the same site, those samples will be key in determining if what you are seeing is real or perhaps exaggerated by a "batch effect".

This is actually exactly what I expected to see if your ASVs were not of the same region/length. Those linear clusters strongly indicate that there are no shared features among those groups, which is what you indicated earlier. The reason why the weighted UniFrac looks good is that you are not relying on the ASVs themselves, but rather on their phylogenetic position relative to the backbone tree they were inserted into. In fact, this is exactly (one of) the reasons why the fragment-insertion tool was developed: to accommodate analyzing data across different regions with short reads. I bet once you fix all of your ASVs to the same length you'll see Bray-Curtis also start to look more like weighted UniFrac. That being said, I would still stick with the UniFrac results. They are more reliable in these situations.
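For intuition on why Bray-Curtis behaves this way, here is a small sketch with made-up counts and a hypothetical `bray_curtis` helper. When two samples share no features, the dissimilarity is always at its maximum of 1, regardless of how similar the underlying organisms are.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two {feature: count} samples."""
    features = set(a) | set(b)
    differences = sum(abs(a.get(f, 0) - b.get(f, 0)) for f in features)
    total = sum(a.values()) + sum(b.values())
    return differences / total


# Made-up samples: site_a and site_b share no ASV IDs,
# while site_a and site_a_rep share both of theirs.
site_a = {"asv1": 30, "asv2": 70}
site_b = {"asv3": 50, "asv4": 50}
site_a_rep = {"asv1": 40, "asv2": 60}

print(bray_curtis(site_a, site_b))      # 1.0
print(bray_curtis(site_a, site_a_rep))  # 0.1
```

A phylogenetic metric such as UniFrac does not have this problem, because near-identical sequences with different IDs still sit close together on the tree.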

That is true to some extent. There are some methods that try to mitigate these effects, but they will require a little more reading and learning on your end. Here is an example of a study that does some successful batch correction on samples from different sites and technologies, like yours here. The R code for how they do this is also available, so you can give that a try to see if it helps!
Let us know how it goes :slight_smile: