Merge counts of different regions for same samples

Hi all,
I need a suggestion from this wonderful community! I have data from 70 samples for two hypervariable regions, v3v4 and v4v5, sequenced in the same MiSeq run.
I analyzed the two regions separately and obtained similar results from both (yeahhh).

From a biological point of view, however, it is very difficult to discuss the data while keeping the two regions separate. My idea is therefore to do an in silico PCR for V4 and use the counts from both regions.
The plan is to sum the V4 counts obtained for each sample, on the assumption that the two observations are independent and that the summed counts would therefore give representative proportions.
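
A minimal sketch of that summing step, assuming each region's table is a dict mapping sample ID to per-feature counts and that features from both regions share identifiers (for example, the same trimmed V4 sequence). The function name `sum_tables`, the table names and the toy counts are all hypothetical, not taken from the actual data:

```python
from collections import Counter


def sum_tables(table_a, table_b):
    """Sum per-sample feature counts from two tables.

    Each table maps sample ID -> {feature ID: count}. Samples or features
    missing from one table are treated as zero counts.
    """
    merged = {}
    for sample in sorted(set(table_a) | set(table_b)):
        counts = Counter(table_a.get(sample, {}))
        counts.update(table_b.get(sample, {}))  # Counter.update adds counts
        merged[sample] = dict(counts)
    return merged


# Hypothetical toy counts for V4 fragments extracted from each amplicon.
v3v4 = {"S1": {"featA": 10, "featB": 5}, "S2": {"featA": 3}}
v4v5 = {"S1": {"featA": 7, "featC": 2}, "S2": {"featB": 4}}

merged = sum_tables(v3v4, v4v5)
print(merged["S1"])
```

Note that a plain sum only makes sense if the feature IDs are truly comparable across the two regions; it does nothing to correct for differences between the primer pairs.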

Do you think it makes sense? Is there a batch effect that I'm not considering?

Thank you so much for any suggestions, and I hope you are all well :hugs: :hugs:

Hello Anna,

Using qPCR to normalize amplicons is a good idea. I’ve heard this suggested before, but I don’t know of a methods paper on this. If you choose to try this method, I would love to see the paper.

One possible way to combine these regions together in a single feature table is to perform closed-reference OTU clustering. While it does have limitations, it’s definitely the easiest and simplest way to proceed. Might be worth trying!
https://docs.qiime2.org/2020.2/plugins/available/vsearch/cluster-features-closed-reference/

Colin

P.S. In this post, Lauren is also trying to analyze data from multiple regions at once. It’s different regions sequenced, but it’s the same underlying problem and could have the same solution.


Hi Colin, thank you for your suggestions!

That's a great suggestion, and I will pass the advice on to those who commissioned the analyses from me. Unfortunately, in this case, I am only the data analyst.

Regarding Lauren's post: my problem is not obtaining the fragments, which is what that post seems to focus on. I'm sorry, that came out wrong.

These are the steps I used to obtain the V4 fragment from the two regions (v3v4 and v4v5):

  • align all the reads (around 6 million) against 16S
  • trim at the two conserved blocks flanking V4 (more or less where the two standard HiSeq primers are designed)
  • then, as you also said, cluster the reads at 99% similarity in order to avoid thousands of sequences that differ by only 1 or 2 bp
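
The trimming step in the list above can be sketched roughly as follows. This is only an illustration under assumptions: reads are already aligned to a common 16S alignment, the V4 window is given as hypothetical column positions `start` and `end`, and exact dereplication stands in for the 99% clustering (which would need a real clustering tool). The helper names and toy sequences are made up:

```python
from collections import Counter


def trim_aligned(aligned_seq, start, end, gap_chars="-."):
    """Return the ungapped sequence found in alignment columns [start, end)."""
    window = aligned_seq[start:end]
    return "".join(ch for ch in window if ch not in gap_chars)


def dereplicate(seqs):
    """Count identical trimmed sequences, skipping reads with no bases left."""
    return Counter(s for s in seqs if s)


# Hypothetical aligned reads; columns 4..10 play the role of the V4 window.
aligned_reads = {
    "read1": "--ACGT-ACGTTT--",
    "read2": "AAACGT-ACGT----",
    "read3": "....GTTACGA....",
}

trimmed = {name: trim_aligned(seq, 4, 11) for name, seq in aligned_reads.items()}
unique_counts = dereplicate(trimmed.values())
print(unique_counts)
```

Reads from v3v4 and v4v5 that cover the same window end up as the same trimmed string, which is what makes their counts comparable at all.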

I got my feature table, which I was going to use for the subsequent analyses, when the doubt arose... (tadannn... :thinking: :woman_facepalming:)

From the literature, and from what I have seen myself, it is known that there is primer bias when different regions are amplified (and consequently during sequencing). This means that the probability of observing a sequence from a certain region is different if different primer pairs are used... right?

So I was wondering whether combining the counts of one region observed with two different probabilities could somehow alter the distribution of my read counts and lead to wrong conclusions when I use them, for example, in an lmer.

In my head the answer is "there should be no problems", but I'm not sure, and I was looking for a second opinion from the forum.
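
A toy calculation may help frame the question. It assumes a simple model in which each primer pair amplifies each taxon with its own relative efficiency; the efficiencies, depths and taxon names below are invented purely for illustration:

```python
def expected_proportions(true_props, efficiencies):
    """Expected observed proportions when each taxon has its own efficiency."""
    weighted = {t: true_props[t] * efficiencies[t] for t in true_props}
    total = sum(weighted.values())
    return {t: w / total for t, w in weighted.items()}


def summed_proportions(true_props, eff_a, eff_b, depth_a, depth_b):
    """Expected proportions after summing counts from two primer pairs."""
    pa = expected_proportions(true_props, eff_a)
    pb = expected_proportions(true_props, eff_b)
    total = depth_a + depth_b
    return {t: (pa[t] * depth_a + pb[t] * depth_b) / total for t in true_props}


# Hypothetical community and primer efficiencies (assumptions, not data).
true_props = {"taxon1": 0.5, "taxon2": 0.5}
eff_v3v4 = {"taxon1": 1.0, "taxon2": 0.5}  # v3v4 under-amplifies taxon2
eff_v4v5 = {"taxon1": 1.0, "taxon2": 1.0}  # v4v5 unbiased in this toy case

equal_depth = summed_proportions(true_props, eff_v3v4, eff_v4v5, 10000, 10000)
unequal_depth = summed_proportions(true_props, eff_v3v4, eff_v4v5, 30000, 10000)
print(equal_depth, unequal_depth)
```

Under this model the summed proportions are a depth-weighted mix of the two biased views, so they shift when the ratio of v3v4 to v4v5 reads differs between samples. Whether that matters in practice depends on how variable that ratio is in the real run.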

Thanks as always for all the work you do :slight_smile:

Right.

That post from Lauren starts with trying to figure out which regions they sequenced.
But then they find the regions :smile_cat: ...
and run into the same problem you have now :scream_cat:

That's exactly what happens. Some microbes get counted twice, and others don't. And it's not clear how to normalize between these different regions.

I'm not sure what the best way forward is here. We talked about unifying these using a tree built with SEPP, but I'm not sure how well that worked for them.

Colin