Hello,
I am pretty new to bioinformatics and recently started working on a pipeline for NGS 16S data analysis using qiime2, with a small "get-out, get back in" step, in that I'm using the Kraken classifier for the taxonomic assignment.
One of my coworkers, who is doing Sanger 16S, was interested and asked if I could add an option for him to identify his sequences. I tried to assemble his 16S sequences (multiple primers to get overlapping sequences), and most of the time they assemble well into a single contig (I'm using CAP3 for this task). By feeding this sequence to a classifier I'm able to get a taxon name for the contig(s). However, he also wants a phylogenetic tree like the one BLAST provides.
So my question is basically: instead of using a classifier like Kraken, Centrifuge, or sk-learn, can I simply BLAST the sequence against a 16S database and build a phylogenetic tree using the top N results, then look at the taxonomic assignment of the closest sequence(s) in said tree (and return the parent in case of multiple taxa)? This would give me both the taxonomic and the phylogenetic information.
I feel like I'm missing something very important about the difference between BLAST and taxonomy classifiers, because I only remember people using classifiers to get the taxonomy and not BLAST. Plus, the majority of papers and tutorials I've read about 16S data analysis only build phylogenetic trees among the OTUs/ASVs themselves.
Before I try to answer your questions, I want to make sure we are using the same terms for the same things.
You mention 16S data analysis, but also the Kraken taxonomy classifier. Kraken, Kraken2, and KrakenUniq are all built to annotate shotgun reads, that is, untargeted DNA from anywhere in the microbial genome. But 16S reads are targeted and amplified from a single region, and different databases are used to classify them.
Do you have 16S amplicons or untargeted shotgun data?
Are you sequencing a complex microbiome with 100s or 1000s of microbes, or an axenic isolate with one type of microbe? The use of Sanger sequencing and blasting a single contig suggests you are analyzing an isolate. In this case, you could absolutely use NCBI BLAST to see the phylogeny around your single microbe.
Building trees only among the OTUs/ASVs themselves makes sense when you have 100s or 1000s of ASVs and you are interested in changes in phylogenetic composition between groups. And you can't have composition with just 1 microbe.
If you just want to place your isolate in a tree, BLAST would work well. In QIIME 2, you could try out the fragment-insertion plugin, which includes methods for taxonomy classification and also for placing a read in an existing tree!
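For what it's worth, here is a rough sketch of the "top N hits, then return the parent taxon" idea from the original question. It assumes BLAST was run with -outfmt 6 (default 12 columns) and that you have your own mapping from subject IDs to semicolon-separated lineages. The function names, sequence IDs, scores, and lineages are all made up for illustration, and it does not build the tree itself.

```python
from typing import Dict, List, Tuple

# Default column order of BLAST tabular output (-outfmt 6).
BLAST6_FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def top_hits(blast6_text: str, n: int) -> List[Tuple[str, float]]:
    """Return the n best (subject id, bitscore) pairs, one per subject, best first."""
    best: Dict[str, float] = {}
    for line in blast6_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(BLAST6_FIELDS, line.split("\t")))
        score = float(row["bitscore"])
        # Keep only the best-scoring alignment for each subject sequence.
        if score > best.get(row["sseqid"], float("-inf")):
            best[row["sseqid"]] = score
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


def common_lineage(lineages: List[str], sep: str = ";") -> str:
    """Longest shared prefix of rank-ordered lineages (the 'parent' taxon)."""
    split = [[rank.strip() for rank in lin.split(sep)] for lin in lineages]
    shared = []
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break  # the lineages disagree from this rank onward
        shared.append(ranks[0])
    return sep.join(shared)


# Hypothetical BLAST hits for one assembled contig (illustrative values only).
example_blast6 = "\n".join([
    "contig1\tref_A\t99.5\t1400\t7\t0\t1\t1400\t1\t1400\t0.0\t2540",
    "contig1\tref_B\t99.1\t1400\t12\t0\t1\t1400\t1\t1400\t0.0\t2510",
    "contig1\tref_C\t95.0\t1400\t70\t0\t1\t1400\t1\t1400\t0.0\t2200",
])

# Hypothetical subject-ID-to-lineage table, e.g. from your reference database.
example_taxonomy = {
    "ref_A": "Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Bacillus subtilis",
    "ref_B": "Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Bacillus velezensis",
    "ref_C": "Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus aureus",
}

hits = top_hits(example_blast6, n=2)
print(common_lineage([example_taxonomy[sid] for sid, _ in hits]))
```

With the two best hits disagreeing only at species level, the sketch falls back to the genus-level lineage. The subject IDs of the top N hits are also the sequences you would pull out of the database to align and build the tree.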
The use of Sanger sequencing and blasting a single contig suggests you are analyzing an isolate. In this case, you could absolutely use NCBI BLAST to see the phylogeny around your single microbe.
Yes, for this particular Sanger case I'm analyzing isolates. I suppose I can just build a new pipeline using BLAST for this case, then. Thanks a lot!
Kraken, Kraken2, and KrakenUniq are all built to annotate shotgun reads
Well, this raises a new question. My main pipeline is indeed designed for 16S amplicons, and because I'm using an in-house 16S database that was built for Kraken2, I went with this classifier. So far the results have been just fine.
I tried the built-in sk-learn classifier when going through the qiime2 tutorials and found it really slow and resource-hungry compared to Kraken2 or Centrifuge (which I tried too). I guess I should give it another try after finding out if and how I can build my own database for it.
I think I made a mistake about Kraken. I've always used it for untargeted DNA, but it looks like the devs do suggest using it with amplicon databases, including Greengenes and SILVA.
Instead of "get-out, get back in", why not write a QIIME 2 plugin to do the job for you, or even add a Kraken or Centrifuge classifier to q2-feature-classifier? The advantages would be:
More streamlined for you (no exporting, re-importing)
Provenance will be preserved! So you can trace what was done prior to re-importing after classification, instead of leaving it up to the imagination.
Others will be able to use your methods! Which I think would be pretty great.
The main reason to add this as a method to q2-feature-classifier instead of making your own plugin is that it is less work overall, and would come pre-installed with QIIME 2 releases with q2-feature-classifier.
Let us know if you are interested and we can help you get started.
I do like the idea, and I had already thought of tweaking the dada2 one to get access to other features the R package offers. I'll consider it after finishing my pipeline and validating the results.