I am unsure how to analyze one of my datasets and would like a sanity check before proceeding.
From the same biological samples, we obtained
- 16S amplicon sequences from an Illumina MiSeq run using the 515F/806R primers
- complete 16S Sanger sequences of a few strains we isolated from the whole community after plating
We therefore have a double aim:
- To characterize the whole community
- To identify the species we could isolate (and that we Sanger sequenced) in the Illumina data, so as to see which samples contain them and in what proportion.
My idea would therefore be:
- classify the Sanger sequences using either BLAST or the pre-trained full-length Greengenes 13_8 99% OTUs classifier;
- restrict the Sanger sequences to the section amplified in the Illumina run, and build a home-made classifier for that
- divide my Illumina sequences by alignment into two groups using exclude-seqs: those that align to the Sanger sequences (group A) and those that don’t (group B)
- classify group A with the home-made classifier, so that each sequence is assigned to the corresponding Sanger sequence
- classify group B with the standard pre-trained Greengenes 13_8 99% OTUs classifier for the 515F/806R region
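
To make the "restrict to the amplified region, then match" idea concrete, here is a minimal Python sketch of the logic, not a replacement for the QIIME 2 steps above. It assumes the original 515F/806R primer variants (swap in the ones actually used), assumes the Sanger sequences are in forward orientation and free of ambiguous bases around the primer sites, and uses exact substring matching where a real run would use an alignment with an identity threshold. The function names (`extract_amplicon`, `match_asvs_to_isolates`) are hypothetical helpers written for this illustration.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also handles degenerate bases.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

# Assumption: original 515F/806R variants; replace with the primers actually used.
PRIMER_515F = "GTGCCAGCMGCCGCGGTAA"
PRIMER_806R = "GGACTACHVGGGTWTCTAAT"


def reverse_complement(seq):
    """Reverse complement of a (possibly degenerate) DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def extract_amplicon(seq, fwd=PRIMER_515F, rev=PRIMER_806R):
    """Return the region between the primer sites (primers excluded), or None."""
    seq = seq.upper()
    fwd_hit = primer_regex(fwd).search(seq)
    if fwd_hit is None:
        return None
    # The reverse primer binds the opposite strand, so search for its
    # reverse complement downstream of the forward primer site.
    rev_hit = primer_regex(reverse_complement(rev)).search(seq, fwd_hit.end())
    if rev_hit is None:
        return None
    return seq[fwd_hit.end():rev_hit.start()]


def match_asvs_to_isolates(asvs, isolates):
    """Map each ASV id to the isolate ids whose amplicon region contains it exactly.

    asvs and isolates are dicts of {id: sequence}. ASVs are often truncated,
    so an ASV is expected to be a substring of the isolate's amplicon region.
    """
    regions = {}
    for iso_id, full_seq in isolates.items():
        region = extract_amplicon(full_seq)
        if region is not None:
            regions[iso_id] = region
    matches = {}
    for asv_id, asv_seq in asvs.items():
        hits = [iso_id for iso_id, region in regions.items()
                if asv_seq.upper() in region]
        if hits:
            matches[asv_id] = hits
    return matches
```

Note that several isolates can share an identical V4 region, in which case one ASV maps to more than one isolate and the amplicon data alone cannot tell them apart.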
I should then be able to merge the feature tables and taxonomies for downstream analysis, but I will have to build a single phylogeny from all the rep-seqs using
Does this workflow have major flaws that I am not seeing? Is there an easier way to extract which ASVs in the FeatureTable correspond to the species seen via Sanger sequencing?
Thanks for the support, greatly appreciated.