Integrate Sanger sequences and Illumina run

nora · January 25, 2021, 7:26pm

Dear community,

I am unsure how to analyze one of my datasets and would like a sanity check before proceeding.
From the same biological samples, we obtained:

16S amplicon sequencing on Illumina MiSeq with the 515F/806R primers
complete 16S Sanger sequences of a few strains we isolated from the whole community after plating

We therefore have a double aim:

To characterize the whole community
To identify the species we could isolate (and that we Sanger sequenced) in the Illumina batch, to see which samples contain them and in what proportion.

My idea would therefore be:

Classify the Sanger sequences using either BLAST or the pre-trained classifier for the Greengenes 13_8 99% OTUs full-length sequences.
Restrict the Sanger sequences to the region amplified in the Illumina run, and build a home-made classifier from them.
Split my Illumina run sequences by alignment into two groups, using exclude-seqs: those that align to the Sanger sequences (group A) and those that don't (group B).
Group A will use the home-made classifier to be assigned to the corresponding Sanger sequence.
Group B will use the standard classifier for the Greengenes 13_8 99% OTUs from the 515F/806R region of sequences.
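The trimming in the second step could be sketched as below. This is a minimal illustration, not a replacement for a dedicated tool (in QIIME 2, `qiime feature-classifier extract-reads` is normally used for this). It assumes the standard 515F/806R primer sequences, requires exact primer matches (degenerate bases allowed, mismatches not), and assumes the Sanger sequence is in the forward orientation. The function names are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

# Assumed primer sequences; check them against your own sequencing protocol.
FWD_PRIMER = "GTGCCAGCMGCCGCGGTAA"   # 515F
REV_PRIMER = "GGACTACHVGGGTWTCTAAT"  # 806R


def reverse_complement(seq):
    """Reverse-complement a sequence, keeping IUPAC ambiguity codes valid."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a degenerate primer into a regex that matches any of its variants."""
    return "".join(IUPAC[base] for base in primer)


def extract_amplicon(seq, fwd=FWD_PRIMER, rev=REV_PRIMER):
    """Return the region between the primers (primers excluded), or None."""
    seq = seq.upper()
    fwd_match = re.search(primer_regex(fwd), seq)
    if fwd_match is None:
        return None
    start = fwd_match.end()
    # The reverse primer appears as its reverse complement on the forward strand.
    rev_match = re.search(primer_regex(reverse_complement(rev)), seq[start:])
    if rev_match is None:
        return None
    return seq[start:start + rev_match.start()]
```

A sequence for which this returns None may simply be in the reverse orientation, or may carry a mismatch in a primer site, so it is worth checking those by hand rather than discarding them.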

I should then be able to merge the feature tables and taxonomies for downstream analysis, but I will have to build a single phylogeny with all the rep-seqs using align-to-tree-mafft-fasttree.

Does this workflow have major flaws that I am not seeing? Is there an easier way to extract which ASVs in the FeatureTable correspond to the species seen via Sanger sequencing?

Thanks for the support, greatly appreciated.

Kind regards,
Eleonora

llenzi · January 27, 2021, 3:12pm

Hi @nora,

I wonder if you could avoid splitting the sequences into two groups. The steps I would use are:

Classify your Sanger sequences, to check whether any similar species are already in GG.
Add your Sanger sequences (and their taxonomy) to Greengenes, maybe using artificial species names for your Sanger sequences so they are easily traceable in the final taxonomy plots. If GG already contains species similar to yours, you may want to place your Sanger sequences before the GG sequences, so you will have 'Sanger Seqs + GG' in this order. If GG lacks sequences similar to yours, the order is less important.
Having your sequences at the beginning will help you at the taxonomic assignment step. If you use 'qiime feature-classifier classify-consensus-blast' with '--p-maxaccepts 1', then when a representative sequence hits one of your Sanger sequences at the beginning, BLAST+ will stop the search and output that Sanger sequence as the best match.
Visualise the taxonomy to search for your 'artificial species', or use 'qiime taxa filter-table' (filter-table: Taxonomy-based feature table filter, QIIME 2 2020.11.1 documentation) to create a new abundance table including only your artificial species.
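As a rough sketch of the 'Sanger Seqs + GG' ordering and the artificial species names, something like the following could be used to prepare the reference files before importing them into QIIME 2. All IDs, sequences, lineages, and function names here are made up for illustration. The sketch assumes a headerless, tab-separated taxonomy file (ID, then lineage) and a 'SANGER_' tag in the species label.

```python
def combine_references(sanger_seqs, sanger_tax, gg_seqs, gg_tax):
    """Return (fasta_text, taxonomy_text) with the Sanger records listed first.

    Each *_seqs argument maps a sequence ID to its sequence and each *_tax
    argument maps the same IDs to a lineage string. Dicts keep insertion
    order, so the Sanger records stay ahead of the Greengenes ones.
    """
    overlap = set(sanger_seqs) & set(gg_seqs)
    if overlap:
        raise ValueError(f"Duplicate reference IDs: {sorted(overlap)}")
    fasta_lines, tax_lines = [], []
    for seqs, tax in ((sanger_seqs, sanger_tax), (gg_seqs, gg_tax)):
        for seq_id, seq in seqs.items():
            fasta_lines.append(f">{seq_id}\n{seq}")
            tax_lines.append(f"{seq_id}\t{tax[seq_id]}")
    return "\n".join(fasta_lines) + "\n", "\n".join(tax_lines) + "\n"


def features_with_tag(assignments, tag="SANGER_"):
    """List feature IDs whose assigned taxon contains the artificial tag."""
    return [fid for fid, taxon in assignments.items() if tag in taxon]


# Hypothetical toy data: one isolate and one Greengenes-style record.
sanger_seqs = {"sanger_iso1": "ACGTACGTACGT"}
sanger_tax = {
    "sanger_iso1": "k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; "
                   "f__Bacillaceae; g__Bacillus; s__SANGER_iso1"
}
gg_seqs = {"gg_0001": "ACGTTCGTACGA"}
gg_tax = {
    "gg_0001": "k__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; "
               "f__Bacillaceae; g__Bacillus; s__"
}

fasta_text, taxonomy_text = combine_references(
    sanger_seqs, sanger_tax, gg_seqs, gg_tax
)
```

The second helper only mimics, on a plain dictionary of assignments, what the taxonomy-based filter does on the real feature table. It can be handy for a quick check of which ASVs ended up assigned to the isolates.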

Hope it makes sense.