Building a database of an environment

gregcaporaso · October 11, 2018, 10:52pm

Hi @Vanessa_Fernandes, thanks for the clarification. What you're trying to do is challenging for a couple of reasons, and I apologize in advance that I don't have a simple answer for you. Qiita is designed for performing these kinds of analyses, so the easiest way to start might be to upload your data to Qiita and perform a meta-analysis there.

There are two main challenges for this type of analysis with microbiome data:

First, different extraction chemistries and primers introduce biases: certain organisms will be observed with some combinations of chemistry and primer pair but not with others. This may be because the extraction processes differ in their effectiveness for these organisms, and/or because the primers differ in how well they match (and therefore amplify) those organisms' gene sequences. As far as I know, there is no way around this. You'll just need to watch for differences between data sets that track the protocol rather than the biology. It's possible that batch correction methods like q2-perc-norm might help here (@cduvallet may be able to comment on whether this would be a reasonable application of q2-perc-norm).

Second, when using non-overlapping primer pairs (such as the 27F/338R and 515F/806R 16S primers), some other information is needed to link sequences that come from the same organism. To date, that information has most commonly been full-length sequences: you would use a closed-reference OTU clustering approach on each data set independently, and then merge the resulting feature tables. Alternatively, that information could be taxonomy assignments for your sequences: you would generate feature tables and taxonomy assignments for each data set independently, collapse at some taxonomic level, and then merge the collapsed feature tables. A third option is a phylogenetic tree: you would again generate feature tables and taxonomy assignments for each data set independently, use q2-fragment-insertion to insert your observed sequence variants into a reference tree, and then merge your feature tables (@Stefan might be able to comment on whether this is a reasonable approach). Each of these approaches has its pros and cons.
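To make the taxonomy-based option a bit more concrete, here is a rough sketch of the collapse-then-merge logic in plain pandas. This is only an illustration, not how QIIME 2 does it internally: in practice you would most likely use `qiime taxa collapse` and `qiime feature-table merge` on your artifacts. The function names (`collapse_by_taxon`, `merge_collapsed`), the toy counts, and the shortened three-rank lineages are all made up for the example. It also assumes features are rows, samples are columns, and sample IDs don't repeat across studies.

```python
import pandas as pd


def collapse_by_taxon(table: pd.DataFrame, taxonomy: pd.Series, level: int) -> pd.DataFrame:
    """Sum counts of features whose lineage agrees down to `level` ranks.

    table:    rows are feature IDs, columns are sample IDs, values are counts.
    taxonomy: maps feature ID -> semicolon-delimited lineage string.
    """
    def truncate(lineage: str) -> str:
        ranks = [rank.strip() for rank in lineage.split(";")]
        return ";".join(ranks[:level])

    # Features with no taxonomy assignment are pooled under "Unassigned".
    labels = taxonomy.reindex(table.index).fillna("Unassigned").map(truncate)
    return table.groupby(labels).sum()


def merge_collapsed(*tables: pd.DataFrame) -> pd.DataFrame:
    """Outer-join collapsed tables on taxon label; absent taxa get a count of 0."""
    merged = pd.concat(tables, axis=1).fillna(0).astype(int)
    return merged.sort_index()


# Hypothetical study A (e.g. amplified with 27F/338R): three features, two samples.
table_a = pd.DataFrame({"A1": [10, 5, 3], "A2": [0, 2, 7]}, index=["a1", "a2", "a3"])
taxonomy_a = pd.Series({
    "a1": "Bacteria;Firmicutes;Bacillus",
    "a2": "Bacteria;Firmicutes;Bacillus",
    "a3": "Bacteria;Proteobacteria;Escherichia",
})

# Hypothetical study B (e.g. amplified with 515F/806R): two features, one sample.
table_b = pd.DataFrame({"B1": [4, 6]}, index=["b1", "b2"])
taxonomy_b = pd.Series({
    "b1": "Bacteria;Firmicutes;Bacillus",
    "b2": "Bacteria;Actinobacteria;Streptomyces",
})

# Collapse each study independently, then merge on the shared taxon labels.
collapsed_a = collapse_by_taxon(table_a, taxonomy_a, level=3)
collapsed_b = collapse_by_taxon(table_b, taxonomy_b, level=3)
merged = merge_collapsed(collapsed_a, collapsed_b)
print(merged)
```

The feature IDs from the two studies never need to match here. The truncated lineage string is the only thing linking them, which is also why the taxonomic level you collapse to matters: the deeper the level, the more you depend on both primer regions resolving that rank reliably.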

Again, if you're new to this, I would probably think about starting with Qiita. If nothing else, it could give you some results to sanity-check other approaches against.

Let me know if you have more questions about this and I'll try to help out.