I am trying to gather all the sequencing that has been done so far on biocrust (microbial communities that live on desert topsoil) and to create a database that could be referenced whenever information is needed about that community.
The problem is that the data currently available uses different primers and chemistries and I don’t know if there is a good way to work with this kind of data and get sequences that could be trusted. I checked some papers like this one:
Hi @Vanessa_Fernandes,
Information on the different chemistries and primers will be important to track along with your sequences. My recommendation would be to continue assembling this data while keeping track of these variables (sequencing approach, DNA extraction approach, sequencing primers, …), as that information will be essential for some applications but less important for others. I would also recommend taking a look at the MIxS standards to get an idea of the information that will be important to track as you build this resource.
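As a rough illustration of what "keeping track of these variables" could look like, here is a minimal Python sketch. The field names, study IDs, and kit/platform labels are all made up for the example and are not MIxS terms; the primer names are the ones mentioned in this thread.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunMetadata:
    # Hypothetical field names; check MIxS for the standard terms.
    study_id: str
    sequencing_approach: str
    dna_extraction: str
    forward_primer: str
    reverse_primer: str

# Made-up example records, one per sequencing run.
runs = [
    RunMetadata("study_A", "amplicon, platform 1", "kit X", "27F", "338R"),
    RunMetadata("study_B", "amplicon, platform 2", "kit Y", "515F", "806R"),
]

def group_by_primers(records):
    """Group study IDs by the primer pair that was used."""
    groups = {}
    for record in records:
        key = (record.forward_primer, record.reverse_primer)
        groups.setdefault(key, []).append(record.study_id)
    return groups

primer_groups = group_by_primers(runs)
```

Grouping by primer pair like this makes it easy to see which runs can be compared directly and which would need some linking information.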
Maybe I wasn’t clear, but I don’t just want to build a database; I also want to analyze all of that data together in order to create a tree. I need to be able to match sequences from different runs that used different primers and possibly different chemistries. I’ve seen some of the posts about merging datasets with different primers, but they didn’t appear to have a consistent answer. I saw some papers that did the same type of analysis I’m trying to do (like the one cited on top), but since I am a novice at this, I believe I need some guidance.
Hi @Vanessa_Fernandes, Thanks for the clarification. What you’re trying to do is challenging for a couple of reasons - I apologize in advance that I don’t have a simple answer for you. Qiita is designed for performing these kinds of analyses, so the easiest path to start on might be to upload your data to Qiita and perform a meta-analysis there.
There are two main challenges for this type of analysis with microbiome data:
First, different extraction chemistries and primers will lead to biases where certain organisms will be observed with some combinations of chemistry and primer pair but not with others. This may be because the extraction processes differ in their effectiveness for these organisms, and/or because the primers differ in how well they match (and therefore amplify) those organisms’ gene sequences. As far as I know, there is no way around this - you’ll just need to keep watch for issues - but it’s possible that batch correction methods like q2-perc-norm might be helpful for this (@cduvallet may be able to comment on whether this would be a reasonable application of q2-perc-norm).
Second, when using non-overlapping primer pairs (such as the 27F/338R and 515F/806R 16S primers) some other information will be needed to link sequences that come from the same organism. To date, that information has most commonly been full-length sequences (in which case you would use a closed-reference OTU clustering approach on each data set independently, and then merge the resulting feature tables). That information could alternatively be taxonomy assignments for your sequences (in which case you could generate feature tables and taxonomy assignments for each dataset independently, and then collapse at some taxonomic level and merge the collapsed feature tables), or a phylogenetic tree (in which case you could generate feature tables and taxonomy assignments independently for each data set again, and use q2-fragment-insertion to insert your observed sequence variants into a reference tree, and then merge your feature tables - @Stefan might be able to comment on whether this is a reasonable approach). Each of these approaches has its pros and cons.
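To make the closed-reference option a bit more concrete, here is a minimal sketch in plain Python of why it allows merging: once each dataset has been clustered against the same reference, features from non-overlapping regions share reference IDs. The table layout, sample names, and reference IDs below are invented for illustration; in QIIME 2 the merge itself would be done on feature table artifacts rather than dicts.

```python
from collections import defaultdict

def merge_feature_tables(tables):
    """Merge tables shaped {sample_id: {feature_id: count}}.

    Counts are summed if the same sample/feature pair occurs in
    more than one table.
    """
    merged = defaultdict(lambda: defaultdict(int))
    for table in tables:
        for sample_id, counts in table.items():
            for feature_id, count in counts.items():
                merged[sample_id][feature_id] += count
    return {sample: dict(counts) for sample, counts in merged.items()}

# Hypothetical closed-reference results for two primer pairs.
# Feature IDs are made-up reference OTU IDs.
table_27f_338r = {"siteA_run1": {"ref_0001": 12, "ref_0002": 3}}
table_515f_806r = {"siteB_run2": {"ref_0001": 7, "ref_0003": 5}}

merged = merge_feature_tables([table_27f_338r, table_515f_806r])

# Reference IDs observed with both primer pairs.
shared = set(table_27f_338r["siteA_run1"]) & set(table_515f_806r["siteB_run2"])
```

Here `ref_0001` is seen in both datasets even though the reads themselves never overlap, which is the linking information the closed-reference step provides. Anything that fails to match the reference is lost, which is one of the cons of this approach.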
Again, I would probably think about starting with Qiita if you’re new to this - if nothing else, it could give you some results to sanity check other approaches against.
Let me know if you have more questions about this and I’ll try to help out.
q2-perc-norm is probably not the right batch correction method for this application, since it requires that (1) you are trying to compare abundances between samples in different groups within each dataset and (2) you have some subset of samples that are comparable across all datasets. It won't help you solve the problem of needing to make a tree, etc.
If you are trying to do some sort of differential abundance analysis, though, happy to chat more and see if q2-perc-norm is appropriate!
Also CCing @seangibbons in case he has anything else to add.
I agree with Claire. q2-perc-norm is not the solution here. It sounds like you should use some form of closed-reference feature selection, as Greg stated. Alternatively, you could use a method like dada2, call taxonomy, and then collapse the features to the genus level; this has worked well for us in the past. After that, you should probably run your various analyses within each batch, and then compare your analysis results across batches.
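A minimal sketch of the collapse step in plain Python, assuming Greengenes-style semicolon-separated taxonomy strings where genus is the sixth rank. The feature IDs, counts, and taxonomy strings are made up for the example, and this is only meant to show the idea, not how any particular tool implements it.

```python
def collapse_to_level(table, taxonomy, level=6):
    """Sum counts of features that share the same taxonomy down to `level`.

    table:    {sample_id: {feature_id: count}}
    taxonomy: {feature_id: "k__...; p__...; ..."}
    level:    number of ranks to keep (6 = genus, by assumption)
    """
    collapsed = {}
    for sample_id, counts in table.items():
        row = {}
        for feature_id, count in counts.items():
            ranks = [r.strip() for r in taxonomy[feature_id].split(";")]
            # Truncate deeper assignments and pad shallower ones with "__".
            ranks = ranks[:level] + ["__"] * (level - len(ranks))
            key = "; ".join(ranks)
            row[key] = row.get(key, 0) + count
        collapsed[sample_id] = row
    return collapsed

# Made-up data: two sequence variants assigned to the same genus.
genus = ("k__Bacteria; p__Cyanobacteria; c__X; o__Y; f__Z; g__Microcoleus")
taxonomy = {"asv_1": genus + "; s__", "asv_2": genus}
batch_table = {"sample_1": {"asv_1": 10, "asv_2": 4}}

collapsed = collapse_to_level(batch_table, taxonomy)
```

Each batch would be collapsed separately like this, and the genus-level tables (whose row labels are now comparable across primer pairs) are what you would analyze within each batch and then compare.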
Regarding q2-fragment-insertion: sequences are inserted at the correct position of a given reference phylogeny (the default is Greengenes), so that this phylogeny is extended by the novel sequences. Assume you sequence a taxon with two different primers / regions from two environments: how would you know that the resulting sequences belong to the same taxon? Typically, those sequences don’t overlap. You could either map them via closed reference to OTU-IDs, as Greg mentioned, or use q2-fragment-insertion to place them into a reference phylogeny. But in your case that would yield the same results, unless you invent fancy methods to decide for any two placed sequences whether they belong to the same taxon. I don’t see how to do that :-/
Thus, my recommendation here is to start with traditional closed reference picking.