I have nonoverlapping amplicons for 18S eukaryote genes (V1-V2 and V9). After merging the two datasets, I used 'qiime vsearch cluster-features-closed-reference' to cluster features.
Some features are in common in both databases (looking at the feature table) but the representative sequences include either one amplicon (V1-V2) or the other one (V9) for that feature. Why is that? I expected that whenever a feature is in common between the two datasets (V1-V2 and V9), I would find two different representative sequences, one matching the V1-V2 and the other one matching the V9 region of the long 18S sequence.
There are lots of reasons for potential these differences. Are the datasets two sequences amplified for the sample sample (i.e. V12 and V9) or are they different samples - say combining two published datasets? If its the former, its a lot easier to both explain and discuss. If its the later, there's potentially more going on that it's worth talking about.
Hi, Thanks for replying. They are two different datasets, amplification was performed on different samples. I did the V9 on some samples collected from one location and wanted to compare results with other (published) samples from another location, in which V1-V2 was targeted. Thank you! Francesca
The problem seems to be solved if I use the closed reference clustering also on the separated databases. In this way, I am able to retrieve representative sequences of both regions. A little twisted but seems efficient enough.
There are multiple other reasons that that you might see different regions when you go to do OTU clustering.
Actual biological differences between the studies. For example, if you're looking at soil in two states in the US, there are likely real differences in the community because of weather, pH, salinity, ect...
Collection, storage and extraction can all influence the observed community. These dictate what microbes show up in the sample, and so which ones will actually be sequences. If your V12 study had a bias against a hard to extract organism, it may not be present as an OTU.
Primer bais. Certain primers will target specific organisms (or give you better resolution, and therefore more broad identification). So, you might be missing things if you cluster that way.
Database coverage. I dont know the stats for 18S as well as I do for 16S, but I recently looked at database coverage with Greengenes. I had far lower coverage in the extreme regions (V13 and V8) in my 16S data, which means that some of your bias may just be because of where your database is.
...You can actually see an example of issues 3 and 4 in a paper we published a few years ago:
In these samples, the data was extracted using the exact sample protocol, and just amplified using different regions. Then, we did closed reference clustering against the same database. If you're not used to human 16S, you can basically see bodysite level differences (like oral vs fecal) from a phylum level bar chart, but for any regular, biologically relevant effect sizes, wed expect the hypervariable region to be really strong.
So, I think for your data, I'm not suprised you're seeing differences and study effects and a lack of overlap. If you're using the second cohort as a validaation or something, you may want to discuss the database issues, or other problems as one of your limitations? And you may only be able to replicate a subset of the results. Otherwise, just pay attention to how you intrepret the intersection.
I'm also curious how filtering the datasets changed yoru results. Did you like, filter so you only had an overlap? Or like, trim before you clustered? Can you explain that a bit more?
Thank you @jwdebelius ! My problem was only about being able to retrieve the representative sequences from the merged dataset. I solved it by analyzing datasets separately. But knowing about these biases is also very useful!
About the filtering, I used dada2 and trimmed the sequences based on their quality scores. I did that separately also because V1-2 and V9 have different lengths, on top of other biases.
So they you'll do alpha and beta diversity seperately and like meta-analyze the stats at the end? Because you will struggle to combine the ASV tables without a scaffold.
No, I use the separate datasets only to retrieve the rep-seqs from both regions (V1-2 and V9 - I need the rep-seqs just to deposit them in genbank - see my very first message. The rep-seqs outcome using the merged dataset was my only problem. As such, when a feature was represented in both datasets, the outcome would give me either V1-2 or V9, not both). The diversity analyses are done using the merged dataset, aware of the biases due to different regions of the gene. But this is another (very interesting) story. I hope I clarified it. Thanks very much!