I need a complete species-level classification for my research and any advice would be appreciated.
I've analyzed 16S amplicon sequencing data from human stool samples using QIIME 2 (v. 2022.02).
First, I ran taxonomic classification with qiime feature-classifier classify-sklearn (SILVA 138), and most feature IDs were assigned only at the genus or family level or higher.
To increase the species-level assignment rate, I applied a weighted taxonomic classifier (515f-806r-human-stool-classifier.qza), and things got much better.
I also tried classify-consensus-blast, but the result was not what I expected.
Anyway, some feature IDs (more than one third) were still not assigned at species level.
I know that when I click on a sequence in rep-seqs.qzv, it is automatically BLASTed at NCBI. But that search runs against the nt database, not the curated 16S rRNA database.
So I BLASTed the sequences from rep-seqs.qzv at NCBI against the curated 16S rRNA sequence database, and I got species-level taxonomy.
My questions are:
Q1) Do BLAST searches of the representative sequences make sense?
Also, can I use the corrected classifications based on the curated 16S BLAST results for further analysis?
(e.g. can I change the classification for the feature ID shown in the example above to
d__Bacteria; p__Firmicutes; c__Clostridia; o__Lachnospirales; f__Lachnospiraceae; g__Blautia; s__Blautia_wexlerae ?)
Q2) If possible, could you recommend any method to solve the problem?
It would be very laborious to BLAST the large number of unassigned sequences one by one, and I really don't feel up to it.
Hi @Soyoung_Yeo,
With 16S sequencing it's not always possible to get species-level resolution. The resolution of 16S amplicon sequencing is typically considered reliable at the family or genus level, and species-level assignment is sometimes obtainable. This limitation is inherent in the approach: there is not always variation in the short amplicon sequence at the species level (e.g., all members of a genus often have the same 16S sequence for the short fragments that we sequence).
That said, one thing you can do to try to improve the resolution of your classification is to use environment-weighted taxonomy classifiers. These are discussed in this paper, and you can find a tutorial here. Note that the readytowear project provides weights that can be used to train classifiers for different environments. That is the approach that I would recommend in response to your Q2. This approach doesn't get around the limited information in the sequences, but it includes additional externally derived information about which organisms are most likely to be found in the environment that you're working in.
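To make the role of the weights concrete, here is a toy sketch in Python. It is not how q2-feature-classifier is implemented. It only shows Bayes' rule with two hypothetical species whose amplicons are identical, so the sequence likelihoods are tied. The species names, the weights, and the helper functions `posterior` and `call_species` are all invented for illustration, and the 0.7 threshold is assumed to mirror the default confidence setting of classify-sklearn.

```python
def posterior(likelihoods, priors):
    """Combine per-species likelihoods with prior weights via Bayes' rule."""
    unnormalized = {sp: likelihoods[sp] * priors[sp] for sp in likelihoods}
    total = sum(unnormalized.values())
    return {sp: value / total for sp, value in unnormalized.items()}


def call_species(post, confidence=0.7):
    """Return the top species if its posterior reaches the threshold, else None."""
    best = max(post, key=post.get)
    return best if post[best] >= confidence else None


# Hypothetical numbers: the amplicon fits both species equally well.
likelihoods = {"species_A": 0.5, "species_B": 0.5}

# Uniform weights: every species is assumed equally likely beforehand.
uniform = {"species_A": 0.5, "species_B": 0.5}
# Made-up habitat weights: species_A is far more common in this environment.
weighted = {"species_A": 0.9, "species_B": 0.1}

uniform_call = call_species(posterior(likelihoods, uniform))
weighted_call = call_species(posterior(likelihoods, weighted))
print(uniform_call, weighted_call)  # None species_A
```

With uniform weights the posterior is split 0.5/0.5, so no species clears the threshold and the assignment would stop at genus. With the habitat weights the same sequence evidence gives 0.9 for species_A, which is why weighting can raise species-level assignment rates without adding any information to the sequence itself.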
In response to your Q1, the species-level assignments that you're getting from BLAST against NCBI are not reliable species-level classifications. Those are showing the closest matches in the NCBI database that you're using, but since that search isn't designed for assigning taxonomy to amplicon sequences it isn't going to give you partial assignments with associated confidence scores at different taxonomic levels. In the BLAST results that you shared, this is illustrated by the fact that there are two nearly identical quality matches (the first and the third matches) that are associated with different Blautia species. The way to interpret that is that you can have confidence in the Blautia (genus) assignment, but the sequence is ambiguous at the species level.
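If you do want to summarize BLAST hits programmatically, a safer rule than taking the top hit is to keep only the ranks on which all near-best hits agree. The Python sketch below assumes that each hit has already been mapped to a SILVA-style lineage string. The function name `consensus_taxonomy`, the 0.5 percent identity tolerance, the identity values, and the placeholder species names are all assumptions for illustration, not values from your results.

```python
def consensus_taxonomy(hits, tolerance=0.5):
    """Keep the ranks shared by all hits within `tolerance` percent identity of the best hit.

    hits: list of (percent_identity, taxonomy_string) tuples.
    """
    best = max(identity for identity, _ in hits)
    # Split the lineage of every near-best hit into its individual ranks.
    tied = [
        [rank.strip() for rank in taxonomy.split(";")]
        for identity, taxonomy in hits
        if best - identity <= tolerance
    ]
    shared = []
    # Walk down the ranks and stop at the first disagreement.
    for ranks in zip(*tied):
        if len(set(ranks)) > 1:
            break
        shared.append(ranks[0])
    return "; ".join(shared)


prefix = (
    "d__Bacteria; p__Firmicutes; c__Clostridia; "
    "o__Lachnospirales; f__Lachnospiraceae; g__Blautia"
)
# Illustrative hits only: the first and third are near-tied but name different species.
hits = [
    (99.6, prefix + "; s__Blautia_wexlerae"),
    (98.0, prefix + "; s__placeholder_species_1"),
    (99.4, prefix + "; s__placeholder_species_2"),
]
consensus = consensus_taxonomy(hits)
print(consensus)
```

For these hits the result stops at g__Blautia, which matches the interpretation above: the genus is supported, the species is ambiguous.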
Thank you for your advice. It's been a big help in resolving my concerns.
By comparing the results of various classifiers or methods in my study, I found that the environment-weighted taxonomic classifier had the highest species-level resolution.
Once again, thank you very much.