cluster-features-closed-reference for non overlapping amplicons

I have nonoverlapping amplicons for 18S eukaryote genes (V1-V2 and V9). After merging the two datasets, I used 'qiime vsearch cluster-features-closed-reference' to cluster features.

Some features are in common in both databases (looking at the feature table) but the representative sequences include either one amplicon (V1-V2) or the other one (V9) for that feature. Why is that? I expected that whenever a feature is in common between the two datasets (V1-V2 and V9), I would find two different representative sequences, one matching the V1-V2 and the other one matching the V9 region of the long 18S sequence.

Thanks for the help!

Hi @francileasi,

Welcome to the :qiime2: forum!

There are lots of reasons for potential these differences. Are the datasets two sequences amplified for the sample sample (i.e. V12 and V9) or are they different samples - say combining two published datasets? If its the former, its a lot easier to both explain and discuss. If its the later, there's potentially more going on that it's worth talking about.

Best,
Justine

1 Like

Hi, Thanks for replying. They are two different datasets, amplification was performed on different samples. I did the V9 on some samples collected from one location and wanted to compare results with other (published) samples from another location, in which V1-V2 was targeted. Thank you! Francesca

1 Like

The problem seems to be solved if I use the closed reference clustering also on the separated databases. In this way, I am able to retrieve representative sequences of both regions. A little twisted but seems efficient enough.

2 Likes

Hi @francileasi,

There are multiple other reasons that that you might see different regions when you go to do OTU clustering.

  1. Actual biological differences between the studies. For example, if you're looking at soil in two states in the US, there are likely real differences in the community because of weather, pH, salinity, ect...
  2. Collection, storage and extraction can all influence the observed community. These dictate what microbes show up in the sample, and so which ones will actually be sequences. If your V12 study had a bias against a hard to extract organism, it may not be present as an OTU.
  3. Primer bais. Certain primers will target specific organisms (or give you better resolution, and therefore more broad identification). So, you might be missing things if you cluster that way.
  4. Database coverage. I dont know the stats for 18S as well as I do for 16S, but I recently looked at database coverage with Greengenes. I had far lower coverage in the extreme regions (V13 and V8) in my 16S data, which means that some of your bias may just be because of where your database is.

...You can actually see an example of issues 3 and 4 in a paper we published a few years ago:

Figure 1 from Tiny Microbes, Enormous Impacts by Debelius et al. The foru panel figures shows two sets of PCoAs from the Human Microbiome Project. The top row shows the same PCoA, of all bodysites colored by hypervariable region (left) and bodysite(right). The plot shows that PC2 splits by hypervariable region, but PC1 and PC3 vary by bodysite. The bottom row shows the oral samples only,. colored by hypervaraible region and bodysite. For oral samples, speration on PC1 is associated with the hypervariable region PC2 with oral sample.

Figure 1 from https://doi.org/10.1186/s13059-016-1086-x

In these samples, the data was extracted using the exact sample protocol, and just amplified using different regions. Then, we did closed reference clustering against the same database. If you're not used to human 16S, you can basically see bodysite level differences (like oral vs fecal) from a phylum level bar chart, but for any regular, biologically relevant effect sizes, wed expect the hypervariable region to be really strong.

So, I think for your data, I'm not suprised you're seeing differences and study effects and a lack of overlap. If you're using the second cohort as a validaation or something, you may want to discuss the database issues, or other problems as one of your limitations? And you may only be able to replicate a subset of the results. Otherwise, just pay attention to how you intrepret the intersection.

I'm also curious how filtering the datasets changed yoru results. Did you like, filter so you only had an overlap? Or like, trim before you clustered? Can you explain that a bit more?

Best,
Justine

5 Likes

Thank you @jwdebelius ! My problem was only about being able to retrieve the representative sequences from the merged dataset. I solved it by analyzing datasets separately. But knowing about these biases is also very useful!
About the filtering, I used dada2 and trimmed the sequences based on their quality scores. I did that separately also because V1-2 and V9 have different lengths, on top of other biases.

1 Like

Hi @francileasi,

So they you'll do alpha and beta diversity seperately and like meta-analyze the stats at the end? Because you will struggle to combine the ASV tables without a scaffold.

Best,
Justine

1 Like

No, I use the separate datasets only to retrieve the rep-seqs from both regions (V1-2 and V9 - I need the rep-seqs just to deposit them in genbank - see my very first message. The rep-seqs outcome using the merged dataset was my only problem. As such, when a feature was represented in both datasets, the outcome would give me either V1-2 or V9, not both). The diversity analyses are done using the merged dataset, aware of the biases due to different regions of the gene. But this is another (very interesting) story. I hope I clarified it. Thanks very much!

Hi @francileasi,

I apologize for my delay and misunderstanding. It sounds like you have things sorted out! I'll be excited to read your paper or preprint.

Best,
Justine

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.