Comparing amplicon studies that use different primers

Hey QIIME 2 community!

I am currently trying to compare samples from several papers, all of which used different primers. Some of the primer pairs overlap or cover a similar region, and others do not. I have found these forum topics very helpful:

I can follow the cutadapt trimming method just fine. I compared the relative abundances of some known samples before and after trimming (341F-806R before the trim, 515F-806R after it), and the comparisons look okay. However, this only works when the samples share a variable region that I can trim to. I still can't compare all of my samples; for example, I can't compare samples from primers 515F-806R and 751F-1204R.

I am therefore hoping to make the q2-fragment-insertion approach work, but I am a little confused about the files I need for it. Should I merge all of my sequence files even if they come from different primers, or do I upload them separately? I have not found a good tutorial for this, so if there is one, please let me know. I know I can merge them with

qiime feature-table merge-seqs

But I am not sure whether I should merge sequence files that do not cover the same regions. If I merge them, could I just use the merged file as the rep-seqs.qza in these commands?

qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences rep-seqs.qza \
  --o-alignment aligned-rep-seqs.qza \
  --o-masked-alignment masked-aligned-rep-seqs.qza \
  --o-tree unrooted-tree.qza \
  --o-rooted-tree rooted-tree.qza

qiime fragment-insertion sepp \
  --i-representative-sequences rep-seqs.qza \
  --i-reference-alignment aligned-rep-seqs.qza \
  --i-reference-phylogeny rooted-tree.qza \
  --o-tree insertion-tree.qza \
  --o-placements insertion-placements.qza

qiime fragment-insertion classify-otus-experimental \
  --i-representative-sequences rep-seqs.qza \
  --i-tree insertion-tree.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classification taxonomy.qza

Or is there another way to compare all of the sequences from different amplified regions? Also, where do I get the ref-taxonomy.qza file?

Any information would be greatly appreciated. Thanks!
Hannah


Hi @hdoris!

Yeah, this is a general problem with meta-analysis.

I would start by figuring out how you want your features to be defined.

It is common to do one of two things:

  1. Slice to a shared primer region (which you can do, and have done, for a few of your samples), in which case you simply proceed as you normally would once everything is cut to the same region.
  2. Use taxonomic features at a rank where you feel comfortable hand-waving the difference in sensitivity and specificity of your assignments from different primer regions (this will never be perfect, but it's pretty common and generally accepted as one of the few ways to work around the issue).

It also sounds like you are thinking about using phylogenetic placement to resolve this issue. I think that's a pretty good idea, although I believe it relies on the SEPP tree being built from the taxonomic database you are interested in. (This is incidentally what the reference taxonomy for classify-otus-experimental is for. It's the same taxonomy file you would use to train a scikit-learn classifier in feature-classifier, and it should match the SEPP database you are using for placements.)

I think the main issue is going to be getting the right SEPP reference database. @cherman2 has more experience with that than I do.

It would also be possible to simply classify each study's sequences with your favorite taxonomic procedure (use a full-length 16S database so you don't accidentally miss a primer region), then merge the resulting tables as you would expect, and finally use taxa collapse to put them all in a "hand-wave-ably equivalent" feature space.
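
As a rough sketch of that last route (not a tested pipeline): the study directory names, the classifier file name, and the `run` helper below are placeholders I made up. `run` just prints each command and only executes it where the `qiime` CLI is installed. Level 6 assumes a 7-rank taxonomy in which rank 6 is genus.

```shell
# Echo each command; execute it only where the QIIME 2 CLI is installed.
run() {
    echo "$*"
    if command -v qiime >/dev/null 2>&1; then "$@"; fi
}

# 1. Classify each study's ASVs with the same full-length 16S classifier.
#    (study-A, study-B and full-length-classifier.qza are hypothetical names.)
for study in study-A study-B; do
    run qiime feature-classifier classify-sklearn \
        --i-classifier full-length-classifier.qza \
        --i-reads "$study/rep-seqs.qza" \
        --o-classification "$study/taxonomy.qza"
done

# 2. Merge the per-study feature tables and taxonomies.
run qiime feature-table merge \
    --i-tables study-A/table.qza \
    --i-tables study-B/table.qza \
    --o-merged-table merged-table.qza
run qiime feature-table merge-taxa \
    --i-data study-A/taxonomy.qza \
    --i-data study-B/taxonomy.qza \
    --o-merged-data merged-taxonomy.qza

# 3. Collapse to a shared rank so features are comparable across primer sets.
run qiime taxa collapse \
    --i-table merged-table.qza \
    --i-taxonomy merged-taxonomy.qza \
    --p-level 6 \
    --o-collapsed-table genus-table.qza
```

Pick the `--p-level` you are comfortable hand-waving: a coarser rank is more forgiving of primer differences but less informative.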

I hope that gets to the majority of your question!


@ebolyen Thank you for the response.

I think there are two options I might use: phylogenetic placement or a slightly hand-wavy taxa collapse. I have a few more questions about both before I move forward. I have been reading forum topics about the different ways to build SEPP reference databases. These have been the most helpful so far:

One question: one of the topics talks about trimming the reference alignment, while others say you should not. Ideally I would use the Silva 138 reference (either 138.1 or 138.2). Trimming it would not be ideal, because I want the entire 16S reference since my amplicons come from different regions. I am therefore assuming it is okay not to trim the reference?

I am also wondering how to create one sequence file from all of my different sequences. If I were to run the following commands, how would I create a single rep-seqs.qza file from sequences that come from different amplified regions?

qiime fragment-insertion sepp \
  --i-representative-sequences rep-seqs.qza \
  --i-reference-alignment aligned-rep-seqs.qza \
  --i-reference-phylogeny rooted-tree.qza \
  --o-tree insertion-tree.qza \
  --o-placements insertion-placements.qza

qiime fragment-insertion classify-otus-experimental \
  --i-representative-sequences rep-seqs.qza \
  --i-tree insertion-tree.qza \
  --i-reference-taxonomy ref-taxonomy.qza \
  --o-classification taxonomy.qza

For taxa collapse, is there a way to keep the corresponding ASV labels that you would otherwise lose after collapsing the table?

Thanks!

Agreed, I would not trim the reference either; otherwise, what is the point of the approach?

Sorry for missing the merging question the first time around. Just use feature-table merge-seqs (and remain suspicious of the merged file should you use it further downstream, since it contains mixed features).
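
For example, something along these lines, where the directory names are hypothetical and the `run` helper is only there so the sketch prints the command on machines without QIIME 2:

```shell
# Echo each command; execute it only where the QIIME 2 CLI is installed.
run() {
    echo "$*"
    if command -v qiime >/dev/null 2>&1; then "$@"; fi
}

# One --i-data flag per study; the primer-named directories are placeholders.
run qiime feature-table merge-seqs \
    --i-data study-515F-806R/rep-seqs.qza \
    --i-data study-751F-1204R/rep-seqs.qza \
    --o-merged-data merged-rep-seqs.qza
```

The merged artifact can then be passed as the representative sequences to fragment insertion.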

Unfortunately, no, there is currently no way to keep the ASV labels through a collapse, but we've talked about having an OTU map for some time. I believe the type exists now for the metagenomics distro, but I do not think anyone has added it to the various collapse-y actions in amplicon.
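
As a stopgap, you could build the map yourself from an exported taxonomy. This is only a sketch under some assumptions: it expects the three-column taxonomy.tsv that `qiime tools export` writes, the feature IDs and taxa below are toy data, and the truncated labels may not match the collapsed table's labels exactly (for example, for taxa with fewer ranks than the chosen level).

```shell
# Toy taxonomy in the exported layout (normally produced with
# `qiime tools export --input-path taxonomy.qza --output-path exported`).
printf 'Feature ID\tTaxon\tConfidence\n' > taxonomy.tsv
printf 'asv1\td__Bacteria; p__Firmicutes; c__Bacilli\t0.99\n' >> taxonomy.tsv
printf 'asv2\td__Bacteria; p__Firmicutes; c__Clostridia\t0.98\n' >> taxonomy.tsv

# Keep the first LEVEL ranks of each taxon string and record which ASV maps to it.
LEVEL=2
awk -F'\t' -v level="$LEVEL" '
    NR == 1 { print "Feature ID\tCollapsed taxon"; next }
    {
        n = split($2, ranks, /; */)
        taxon = ranks[1]
        for (i = 2; i <= level && i <= n; i++) taxon = taxon ";" ranks[i]
        print $1 "\t" taxon
    }
' taxonomy.tsv > asv-to-taxon.tsv

cat asv-to-taxon.tsv
```

Here both toy ASVs map to the same phylum-level label, which is the information a collapse would otherwise discard.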

Just to add on to what @ebolyen said: I have found that creating your own SEPP database is computationally expensive and not very well documented, so I am here to warn you that this may be more complicated than it seems.

Additionally, I think the current SEPP databases on our data resources page may contain full-length sequences. I think inserting your sequences may work. I'd give it a shot!


The most up-to-date SEPP database is the 132 one, correct? Is that found on Silva? Here:

Or is it located somewhere else? I honestly haven't been able to find a good tutorial for a process like this, so I will try the provided 132 database to start and go from there. Thanks!

Hannah


That looks right to me.

I will mention that we have the following pre-baked SEPP databases (scroll to the bottom of this page):
https://docs.qiime2.org/2024.10/data-resources/#sepp-reference-databases

As @cherman2 suggested, it might be worth trying that first to see how you like the approach before you start the involved task of generating your own.
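
A first attempt with a pre-baked database might look roughly like the sketch below. It assumes a recent release in which `sepp` takes a single `--i-reference-database` artifact; `sepp-refs.qza` stands in for whichever database you download from that page, the other file names are placeholders, and the `run` helper only prints the commands unless the `qiime` CLI is installed.

```shell
# Echo each command; execute it only where the QIIME 2 CLI is installed.
run() {
    echo "$*"
    if command -v qiime >/dev/null 2>&1; then "$@"; fi
}

# Insert the merged ASVs into the reference tree.
# (sepp-refs.qza and merged-rep-seqs.qza are placeholder file names.)
run qiime fragment-insertion sepp \
    --i-representative-sequences merged-rep-seqs.qza \
    --i-reference-database sepp-refs.qza \
    --o-tree insertion-tree.qza \
    --o-placements insertion-placements.qza

# Drop features that could not be placed, so the table matches the tree.
run qiime fragment-insertion filter-features \
    --i-table merged-table.qza \
    --i-tree insertion-tree.qza \
    --o-filtered-table filtered-table.qza \
    --o-removed-table removed-table.qza
```

Checking how many features end up in the removed table is a quick way to see whether a given primer region places poorly against the reference.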
