Sanger sequences as a database to look for contamination

Hi everyone,

First of all, congratulations on the tool and its environment. It is amazing how it works and how everything gets streamlined with all the tutorials and the support in the forums.

I apologize if there is already a similar topic, but after a few days of looking I could not find one (perhaps you can also point me in other directions).

We work with axenic D. melanogaster and we inoculated them with 4 specific Acetobacter, Leuconostoc and Lactobacillus strains that were Sanger sequenced. Flies were fed different diets and we wanted to track how the bacterial composition changes over several generations.

Nevertheless, it seems contamination might have occurred along the way, as some Gluconobacter are detected after a few generations. We also want to see whether there is contamination with other Acetobacter, as they are frequently found in D. melanogaster.

In our case, we are amplifying the V3-V4 region and, as I understand it, it is quite complicated to get down to the species level with such a short fragment. Also, given my limited computational power, I use the SILVA or NCBI databases with consensus BLAST. That consensus seems quite hard to reach for Acetobacter and, as I understand it, taking the top BLAST hit is not correct.

As a result, we can only see a lot of Acetobacter, but not down to the species level, so we cannot see whether there is contamination with other Acetobacter.

My supervisor suggested that I compare the 16S Illumina reads with the initial Sanger sequences, and to be honest I have no idea how to do it! (I am new to bioinformatics.)

I thought that I could build a database from the Sanger sequences I have and use it in QIIME 2. Reads that are only partially classified would then point to contamination with strains other than our 4 original ones. Is this feasible? If it is, where do I start to construct the sequence and taxonomy files?

If this approach is incorrect, could you guide me to a more realistic strategy?

Thank you very much


Hello Jaime,

Good morning, and welcome to the forums!

This sounds like a cool study. A few years ago I worked on a germ-free mouse model that sounds similar. :mosquito: + :microbe:
(Can you believe there's an emoji for mosquito but not Drosophila... :crying_cat_face:)

We had similar concerns with our gnotobiotic hosts. I like your idea of tracking it back to common host microbes using amplicon sequencing.

Getting down to the species level is not complicated, it just might not be possible.

The core issue is that the V3-V4 amplicons are shorter than the full 16S region.

Full length 16S, Sanger:      |----------------|
V3-V4 16S amplicon, Illumina:      >-----<

Because there are fewer base pairs in the short amplicon, there is less resolution to tell microbes apart. There's no way to "Enhance!" the data to get the same resolution as the full-length sequence.

Sometimes closely related taxa have enough differences in their V3-V4 amplicon that you can get all the way down to species or even strain level! Sometimes very different taxa have 100% identical V3-V4 regions. :man_shrugging:

You could align some of your most common V3-V4 reads against your full length reference using vsearch or an online tool like MAFFT. This will let you compare how similar the reads are in the area of overlap, and also see the areas of your 16S gene not covered by the V3-V4 amplicon.
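To make the idea of "comparing in the area of overlap" concrete, here is a minimal Python sketch. It slides a short read along a longer reference without gaps and reports where it fits best and at what identity. The function name and the toy sequences are hypothetical, and real reads contain indels, so this is only an illustration and not a replacement for vsearch or MAFFT.

```python
def best_ungapped_hit(read, reference):
    """Slide `read` along `reference`; return (offset, identity) of the best ungapped match."""
    read, reference = read.upper(), reference.upper()
    if len(read) > len(reference):
        raise ValueError("read must not be longer than the reference")
    best_offset, best_identity = 0, -1.0
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        matches = sum(a == b for a, b in zip(read, window))
        identity = matches / len(read)
        if identity > best_identity:  # ties keep the first (leftmost) placement
            best_offset, best_identity = offset, identity
    return best_offset, best_identity


# Hypothetical toy sequences, far shorter than real 16S data.
full_length = "TTGACCTACGGGAGGCAGCAGTAGGGAATCTTCCGGATTAGATACCCTGGTAGTCAA"
amplicon_read = "GGCAGCAGTAGGGTATCTTCC"  # one mismatch relative to the reference
offset, identity = best_ungapped_hit(amplicon_read, full_length)
print(f"best offset: {offset}, identity: {identity:.3f}")
```

Everything in the reference before `offset` and after `offset + len(read)` is the part of the 16S gene that the amplicon cannot see.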

I think the simplest way to start looking for contamination is to just work with your amplicons to see what you can find, and incorporate full-length Sanger later in the process. You might find you have enough resolution to identify and trace multiple strains of Acetobacter with just your amplicons!

You have probably found these already, but the FMT and Parkinson's Mouse tutorials provide a great starting point for this sort of analysis within QIIME 2.

Let us know if you have any questions!

Hi Colin,

Thanks for your answer.

I will definitely try the MAFFT strategy. Perhaps it will show us that a different set of primers could improve our taxonomic classification.

A few hours ago, I found this thread where @llenzi suggests manually adding the Sanger sequences to the Greengenes database, giving them a taxonomy with a distinct name so they are easy to spot after taxonomic classification.

Given that I have Sanger sequences from several colonies for each of my individual species, would it be possible to generate a set of consensus sequences and then add them manually to the database as suggested in that thread?
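To show what I mean by a consensus, here is a rough Python sketch of a column-wise majority vote. It assumes the colony sequences for one strain have already been aligned to the same length (for example with MAFFT); the function name, the `min_fraction` threshold and the toy sequences are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_seqs, min_fraction=0.5):
    """Column-wise majority consensus; columns without a clear majority become 'N'."""
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    for column in zip(*(seq.upper() for seq in aligned_seqs)):
        base, count = Counter(column).most_common(1)[0]
        # Keep the base only if strictly more than min_fraction of colonies agree.
        consensus.append(base if count / len(column) > min_fraction else "N")
    return "".join(consensus)


# Hypothetical aligned colony sequences for one strain.
colonies = ["ACGTACGT", "ACGTACGT", "ACGAACGT"]
consensus = majority_consensus(colonies)
print(consensus)
```

Gap characters from the alignment are treated like any other symbol here, so they would need to be handled before the result goes into a database.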

So far I have been following the Moving Pictures tutorial, but I have to admit that the clawback and taxonomic weights from the tutorial you suggested are really interesting.

Hello Jaime,

That thread is a great find. That sounds just like your data set, and I really like Luca's suggestion about how to add Sanger regions to an existing database. This step is important to mention: "restrict the Sanger sequences to the section amplified in the Illumina run"
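Here is a small Python sketch of what "restricting to the amplified section" means in practice: find the forward primer and the reverse complement of the reverse primer in the full-length sequence and keep what lies between them. The primers and sequence below are hypothetical toy values, the function names are mine, and the sketch assumes exact primer matches with the Sanger sequence in the same orientation as the forward primer. Dedicated tools (for example `extract-reads` in q2-feature-classifier) tolerate mismatches and are the better choice for real data.

```python
import re

# IUPAC degenerate codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    return "".join(IUPAC[base] for base in primer.upper())


def extract_amplicon(sequence, fwd_primer, rev_primer):
    """Return the region between the primers (primers excluded), or None if one is not found."""
    sequence = sequence.upper()
    fwd = re.search(primer_regex(fwd_primer), sequence)
    if fwd is None:
        return None
    downstream = sequence[fwd.end():]
    rev = re.search(primer_regex(reverse_complement(rev_primer)), downstream)
    if rev is None:
        return None
    return downstream[:rev.start()]


# Hypothetical toy primers and sequence, not real 16S primers.
sanger = "TTTTACGGACCCCGGGGGTAGTCAAAA"
amplicon = extract_amplicon(sanger, "ACGGN", "GACTAY")
print(amplicon)
```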

Yes! When you are ready to try database building, make sure to check out the RESCRIPt plugin.

Depending on your timeline and scope of work for this project, it might be easier/faster to present the amplicon analysis and Sanger sequences as two complementary ways to measure the host microbes. Seamless integration might be a big lift, but you can still present them side by side in the context of a paper.

Hi again Colin,

Apologies for the delay. This is a side project and not part of my main research, so I come back to it whenever I have a gap in my schedule.

My idea is to proceed as follows:

1-Get the FASTA files from the Sanger sequencing (several colonies were tested for each member) and import them into QIIME 2.
2-With DADA2, generate a set of rep-seqs for each of the strains. This step will also trim the reads to get rid of the primers.
3-Manually add the rep-seqs and their taxonomy to the database.
4-Using RESCRIPt, limit the sequences in the whole database (including the newly added rep-seqs) to the Illumina-amplified region. In my case, I will probably use the NCBI database, as GG is not very up to date and SILVA melts the computer. I assume that for this purpose I can follow the tutorial here.
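For step 3, this is roughly how I imagine writing the two files, as a Python sketch. The feature IDs, taxonomy strings and file names are hypothetical, and I am assuming a tab-separated taxonomy file with a `Feature ID` / `Taxon` header and rank labels that match whatever the target database uses.

```python
import tempfile
from pathlib import Path


def write_reference_files(records, fasta_path, taxonomy_path):
    """Write (feature_id, taxon, sequence) records to a FASTA file and a TSV taxonomy file."""
    seen = set()
    with open(fasta_path, "w") as fasta, open(taxonomy_path, "w") as taxonomy:
        taxonomy.write("Feature ID\tTaxon\n")  # assumed header, check against the database
        for feature_id, taxon, sequence in records:
            if feature_id in seen:
                raise ValueError(f"duplicate feature ID: {feature_id}")
            seen.add(feature_id)
            fasta.write(f">{feature_id}\n{sequence.upper()}\n")
            taxonomy.write(f"{feature_id}\t{taxon}\n")


# Hypothetical records; the distinct "SANGER" names make them easy to spot later.
records = [
    ("SANGER_Aceto_1", "d__Bacteria; g__Acetobacter; s__SANGER_strain_1", "acgtacgt"),
    ("SANGER_Lacto_1", "d__Bacteria; g__Lactobacillus; s__SANGER_strain_2", "ttgacctg"),
]

with tempfile.TemporaryDirectory() as tmp:
    fasta_path = Path(tmp) / "sanger-seqs.fasta"
    taxonomy_path = Path(tmp) / "sanger-taxonomy.tsv"
    write_reference_files(records, fasta_path, taxonomy_path)
    fasta_text = fasta_path.read_text()
    taxonomy_text = taxonomy_path.read_text()

print(fasta_text)
print(taxonomy_text)
```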

Is this a good plan to proceed? Might I be missing something?

Kind regards


Good morning Jaime,

Yes, that sounds like a good plan!

After you build a database of Sanger reads, what's your step 5?

Every time I try to build a database, it sort of takes over my life, so be careful! :scream_cat:

Hi Colin,

Once I build the database, I plan to use consensus BLAST (again, my laptop is limited). I am planning to use 95-99% similarity, to force the identification of sequences similar to the ones I added to the database. Those not identified as such should be contaminants from the same family or genus.
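To check my own understanding of how the identity threshold and the consensus interact, I wrote this simplified Python sketch. It is not the plugin's actual code: the function name, the default values and the toy hits are my assumptions, with the parameter names loosely borrowed from the classifier's identity and consensus settings.

```python
from collections import Counter


def consensus_taxonomy(hits, perc_identity=0.97, min_consensus=0.51):
    """hits: list of (taxonomy_string, identity between 0 and 1).

    Keep hits at or above perc_identity, then extend the lineage rank by rank
    for as long as at least min_consensus of the kept hits share the same prefix.
    """
    kept = [
        [rank.strip() for rank in taxonomy.split(";")]
        for taxonomy, identity in hits
        if identity >= perc_identity
    ]
    if not kept:
        return "Unassigned"
    consensus = []
    for depth in range(1, min(len(lineage) for lineage in kept) + 1):
        prefix, count = Counter(tuple(lineage[:depth]) for lineage in kept).most_common(1)[0]
        if count / len(kept) < min_consensus:
            break
        consensus = list(prefix)
    return "; ".join(consensus) if consensus else "Unassigned"


# Hypothetical hits for a single read.
hits = [
    ("g__Acetobacter; s__SANGER_strain_1", 0.99),
    ("g__Acetobacter; s__SANGER_strain_1", 0.98),
    ("g__Acetobacter; s__other", 0.97),
    ("g__Gluconobacter; s__x", 0.90),
]
print(consensus_taxonomy(hits))
print(consensus_taxonomy(hits, min_consensus=0.9))
print(consensus_taxonomy(hits, perc_identity=0.995))
```

With a strict consensus the read falls back to the genus, and with a very high identity threshold it ends up unassigned, which is the behaviour I would use to flag possible contaminants.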

I hope this is what you meant by step 5!

