Combining data after assigning taxa for a meta-analysis


I am completing a meta-analysis on datasets that target different hypervariable regions. I understand that guidelines require closed-reference OTU picking in this case, because sequences from different hypervariable regions cannot be compared to each other directly.

However, I am curious as to why researchers cannot run de novo OTU picking on each study separately, assign taxa to the OTUs (using QIIME or other means), then combine the results after assigning taxa. To me, this would lose less data than closed-reference OTU picking, but I don’t know what other implications it might have. Is there some sort of bias that occurs because of this? Or other problems that I don’t see?

Also, I have not seen how to merge results beyond FeatureTable[Frequency] (table.qza) and FeatureData[Sequence] (rep-seqs.qza), at least in QIIME 2. If anyone knows how to merge taxa tables, I would like to try both ways mentioned above to see what results are produced.

I hope my explanation made sense. If required, I can provide a list of commands that I would use to perform the two separate analyses above to better explain what I am trying to say.

Thank you for your help. I look forward to your response.

Hi @cbippert,

I think the ability to do this relies on the assumptions that (a) naming conventions are consistent (which is true within your re-classified data, but not across studies), (b) organismal naming conventions make sense, and possibly (c) behaviour is consistent within a clade. Organism naming conventions suck. I think this is becoming abundantly clear across many fields of biology in the age of molecular ecology, but it shows up a lot in microbiome research. We have major issues with polyphyletic clades, where organisms with the same genus designation belong to different branches of the phylogenetic tree. Complaints about naming conventions come up every few years. While phylogeny definitely has its own issues, it may cause less confusion than taxonomic names do. Plus, with a reference, you buy yourself the computational advantage that someone has already clustered the sequences and constructed a phylogenetic tree for you using full-length sequences, which is available for all your phylogenetic desires.

The second piece is, again, a general issue with microbiome data and relates to naming. In both phylogenetic analyses and collapsed taxonomic evaluations, we often assume that things that are evolutionarily similar or clustered together behave the same way. This is a questionable assumption at best, but sometimes a decent working hypothesis. I’d argue the quality of your aggregation gets weaker as you go up in taxonomic levels, but that it can also be hard to classify at lower levels.
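To make the "combine after assigning taxa" idea concrete, here is a minimal sketch in Python with pandas. It assumes each study has already been reduced to a count table (features by samples) plus a taxonomy string per feature. The function names, the toy lineages, and the three-rank format are all made up for illustration, and `level` is the number of ranks you keep. The merged table is only as trustworthy as the names themselves: two studies line up only where their lineage strings match exactly. In QIIME 2, `qiime taxa collapse` followed by `qiime feature-table merge` should cover the same steps.

```python
import pandas as pd


def collapse_to_level(table, taxonomy, level):
    """Sum counts of features that share a lineage down to `level` ranks.

    table:    DataFrame, rows = feature IDs, columns = sample IDs
    taxonomy: Series mapping feature ID -> 'p__X; g__Y; s__Z' style string
    level:    number of ranks to keep (hypothetical 3-rank scheme here)
    """
    def truncate(lineage):
        ranks = [r.strip() for r in lineage.split(";")]
        # Pad short lineages so every label has the same depth.
        ranks += ["__"] * (level - len(ranks))
        return ";".join(ranks[:level])

    labels = taxonomy.reindex(table.index).fillna("Unassigned").map(truncate)
    return table.groupby(labels).sum()


def merge_studies(collapsed_tables):
    """Outer-join collapsed tables on taxon label; absent taxa become 0.

    Assumes sample IDs are unique across studies.
    """
    return pd.concat(collapsed_tables, axis=1).fillna(0).astype(int)


# Toy data: two studies whose feature IDs have nothing in common.
table_a = pd.DataFrame(
    {"A.s1": [10, 5, 2], "A.s2": [0, 3, 7]}, index=["a1", "a2", "a3"]
)
tax_a = pd.Series({
    "a1": "p__Firmicutes; g__Lactobacillus; s__reuteri",
    "a2": "p__Firmicutes; g__Lactobacillus; s__gasseri",
    "a3": "p__Bacteroidetes; g__Bacteroides; s__fragilis",
})
table_b = pd.DataFrame({"B.s1": [4, 9]}, index=["b1", "b2"])
tax_b = pd.Series({
    "b1": "p__Firmicutes; g__Lactobacillus",
    "b2": "p__Proteobacteria; g__Escherichia; s__coli",
})

# Collapse each study to the second rank (genus in this toy scheme), then merge.
merged = merge_studies([
    collapse_to_level(table_a, tax_a, level=2),
    collapse_to_level(table_b, tax_b, level=2),
])
print(merged)
```

Note that nothing here checks whether "g__Lactobacillus" means the same clade in both studies, which is exactly the naming problem described above.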

Also, I suppose, you assume all your sequences are equally valuable and carry meaning. Given that de novo picking requires chimera slaying, you’re still discarding sequences. IMO, this assumption may be true for environmental samples. However, when you’re dealing with human data in a well-defined environment, it feels to me like a disservice to the community not to use a closed-reference method, since skipping it reduces the external validity of your study.