Sequencing dataset with four different primers: should I merge the datasets?

Hi all,

I’d like to pose a question to all of you about how to handle my sequencing data.
Background: I sequenced 400 soil samples. In the library prep, I used 4 primer sets: 1 for bacteria, 1 for fungi, 1 for protists and 1 for metazoa. I first ran four different PCRs on each sample. The amplification products were then pooled by sample, and I ran a second PCR on each pool to attach sample-specific barcodes. After the barcodes were attached, I pooled all the samples together and sent them for sequencing (MiSeq 2x300 bp).

When I got the demultiplexed sequences back, I first divided them into organismal groups. Based on the locus-specific primer used in the first PCR, I separated the reads into bacterial, fungal, protist and metazoan datasets. I then ran QIIME2 separately on the four datasets.

For the taxonomy assignment, I trained Silva 138 in three different ways: 1 for bacteria, 1 for fungi and 1 for metazoa. For protists I trained PR2.

Once the taxonomy was assigned, I could see that all the ‘locus-specific’ eukaryotic primers also picked up organisms from other groups (e.g. in the fungal dataset I also got many protists and some metazoa, in the metazoa dataset I got many protists, and in the protist dataset I got many fungi). Although this wasn’t a surprise, I was wondering what to do with the groups that turned up in the ‘wrong’ dataset.
As a first step, for each dataset I filtered out the ASVs that weren’t assigned to the group of interest (e.g. from the fungal dataset I discarded all protist and metazoan ASVs and kept only the fungal ASVs).

But my question is: would it be a waste to just discard those ASVs? Wouldn’t it be possible to merge, for example, the protists found in the fungal dataset with the protists in the protist dataset?
To do this I would need a way to merge the tables after the taxonomy assignment.
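
A minimal sketch of what such a merge could look like, in plain Python rather than any QIIME 2 command. Because ASVs from different primer sets are different sequences, the sketch first collapses each table to taxonomy labels and only then adds the tables together. The function names, sample IDs, ASV IDs and labels are all made up for illustration, and whether summed counts from different primers are meaningful is a separate question.

```python
from collections import defaultdict

def collapse_by_taxon(table, taxonomy):
    """Sum ASV counts per sample for all ASVs that share a taxonomy label."""
    collapsed = defaultdict(lambda: defaultdict(int))
    for asv_id, counts in table.items():
        taxon = taxonomy[asv_id]
        for sample, n in counts.items():
            collapsed[taxon][sample] += n
    return collapsed

def merge_collapsed(*tables):
    """Add taxon-level tables together, sample by sample."""
    merged = defaultdict(lambda: defaultdict(int))
    for table in tables:
        for taxon, counts in table.items():
            for sample, n in counts.items():
                merged[taxon][sample] += n
    return {taxon: dict(counts) for taxon, counts in merged.items()}

# Toy data (hypothetical): one protist ASV turned up in the fungal dataset.
fungal_table = {"f_asv1": {"S1": 10, "S2": 0}, "f_asv2": {"S1": 3, "S2": 4}}
fungal_tax = {"f_asv1": "Fungi;GroupA", "f_asv2": "Protist;GroupB"}
protist_table = {"p_asv1": {"S1": 5, "S2": 7}}
protist_tax = {"p_asv1": "Protist;GroupB"}

merged = merge_collapsed(
    collapse_by_taxon(fungal_table, fungal_tax),
    collapse_by_taxon(protist_table, protist_tax),
)
print(merged["Protist;GroupB"])
```

The merge only works when both taxonomy assignments give the same organism exactly the same label, which is why the choice of database matters here.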

But now, my second question would be: does it make sense to merge taxa that come from different taxonomy assignment methods? For fungi and metazoa I used the same Silva database, but trained with different settings. For protists, I even used a different database for the taxonomy assignment. So I’m afraid that merging the data (if even possible) isn’t a good idea.

Thank you in advance for your support! I hope I was clear enough in explaining.

Hello Sara,

Welcome to the forums! :wave:

(Before we dive in: Thank you for that excellent summary of methods! There are a lot of ways to do a multi-phylum amplicon study and your description was crystal clear. :sparkles:)

This is a really good question, and I’m not sure there’s a perfect answer. I can share what I have done on past studies, and discuss why the other options are really hard.

Let’s dive in!

We observed the same ‘off target’ effect with our 16S and 18S primers while working on this paper, and had exactly the same question. In the end, we chose to only analyze the taxa intentionally targeted by each primer, even though we used the same Silva database to classify all our amplicons.
subset_taxa(full16s, Rank1 == "D_0__Bacteria")
subset_taxa(full18s, Rank1 == "D_0__Eukaryota")


If reviewer #3 asked about the data we ‘threw away,’ we were ready to argue that our primers were only designed to target specific taxa. For example, the composition of protists found by our 16S primers would be less representative compared to our 18S primers, so why use bad data?

We processed everything separately. We only ‘mixed’ 16S and 18S taxa in our figures.
During review, no one even asked.
(Maybe everyone does this, idk :man_shrugging:)

Merging the off-target ASVs into the matching dataset is not a bad idea! Especially if some of your taxa are only observed by a primer that was not meant to detect them. (Did you observe this? That would be very interesting! :thinking:)

I would describe it as a challenging idea, though.

All the questions you asked outline the challenges ahead.

  1. consistent taxa names across databases

QIIME 2 can merge on taxon names, but only if the databases use exactly the same name for every taxon.

Silva should be consistent throughout! But I’m not sure if it’s consistent with PR2…
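
One rough way to check this before attempting a merge is to compare the lineages the two classifiers actually produced. The sketch below is plain Python, not a QIIME 2 or database API. The `normalize` and `compare_labels` helpers, the rank-prefix patterns and the example labels are assumptions for illustration (the labels are not real Silva or PR2 lineages), so adapt them to what your own taxonomy files look like.

```python
import re

def normalize(label):
    """Split a taxonomy string on ';' and strip rank prefixes like 'd__' or 'D_0__'."""
    names = []
    for part in label.split(";"):
        name = re.sub(r"^(?:D_\d+__|[a-z]{1,2}__)", "", part.strip())
        if name:
            names.append(name.lower())
    return tuple(names)

def compare_labels(labels_a, labels_b, depth=2):
    """Compare the lineages (truncated to `depth` ranks) seen in two outputs."""
    set_a = {normalize(label)[:depth] for label in labels_a}
    set_b = {normalize(label)[:depth] for label in labels_b}
    return {
        "shared": sorted(set_a & set_b),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
    }

# Hypothetical labels in two different formatting styles.
classifier_a = ["d__Eukaryota; p__GroupA", "d__Eukaryota; p__GroupB"]
classifier_b = ["Eukaryota;GroupA", "Eukaryota;GroupC"]

report = compare_labels(classifier_a, classifier_b)
for key, lineages in report.items():
    print(key, lineages)
```

Lineages that land in `only_a` or `only_b` are the ones that would fail to line up in a merge, either because the organism really is absent from one output or because the two databases name or rank it differently.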

  2. trained database bias

I really like the idea of training each classifier on only the taxa of interest as it should improve quality for those taxa, but conversely I would expect it to reduce the quality for taxa it did not see during training.

  3. amplification bias

Given these challenges, I think you are 100% good to only present taxa targeted by your primers. :+1:


P.S. Sorry for the long answer.
P.P.S. Even if you choose to process your data separately, these questions could still be explored!

For example, you could train a classifier using Silva bacteria + fungi + metazoa and see how the output taxonomy compares to the three classifiers trained separately. That would answer question #2. And if PR2 strives to be compatible with Silva, question #1 could be pretty easy.
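
A comparison like that could be summarized as the fraction of ASVs on which two classifiers agree at each rank. This is a minimal sketch in plain Python, assuming each classifier's output has already been parsed into a dict of ASV ID to a tuple of rank names. The `agreement_by_rank` helper, the ASV IDs and the lineages are hypothetical.

```python
def agreement_by_rank(tax_a, tax_b, max_rank=3):
    """For ASVs classified by both, return the fraction agreeing down to each rank."""
    shared = sorted(set(tax_a) & set(tax_b))
    result = {}
    for depth in range(1, max_rank + 1):
        matches = 0
        for asv_id in shared:
            a, b = tax_a[asv_id], tax_b[asv_id]
            # Count a match only if both lineages reach this depth and are identical.
            if len(a) >= depth and len(b) >= depth and a[:depth] == b[:depth]:
                matches += 1
        result[depth] = matches / len(shared) if shared else 0.0
    return result

# Hypothetical output of the separately trained classifiers.
separate = {
    "asv1": ("Eukaryota", "GroupA", "GenusX"),
    "asv2": ("Eukaryota", "GroupB", "GenusY"),
    "asv3": ("Bacteria", "GroupC", "GenusZ"),
    "asv4": ("Eukaryota", "GroupA", "GenusV"),
}
# Hypothetical output of one classifier trained on everything together.
combined = {
    "asv1": ("Eukaryota", "GroupA", "GenusX"),
    "asv2": ("Eukaryota", "GroupB"),
    "asv3": ("Eukaryota", "GroupD", "GenusW"),
}

agreement = agreement_by_rank(separate, combined)
for depth, fraction in agreement.items():
    print(f"rank {depth}: {fraction:.2f}")
```

A sharp drop in agreement at deeper ranks would suggest the training set changes the fine-scale assignments even when the broad groups still match.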

If you choose to investigate further, let us know what you find! :female_detective: