Hi Mike and thanks for this very helpful tutorial. I'm interested in using the GTDB because relative to some of the other options it contains taxonomy to the species level. While I'm aware that there are some challenges in properly classifying to species level, I'd at least like to get an estimate.
I followed the steps here and then attempted classification using the following commands:
qiime tools import --type 'FeatureData[Sequence]' --input-path input.fasta --output-path test1.qza
qiime feature-classifier classify-sklearn --i-classifier gtdb-214-both-classifier.qza --i-reads test1.qza --o-classification test_tax.qza
qiime metadata tabulate --m-input-file test_tax.qza --o-visualization test_tax.qzv
While it successfully identified the input species to the genus level, I don't see any information about classification of species. Do you have any insight about something I may be missing, or is this an issue outside of my control?
Likely outside your control. Assuming you are trying to classify short / non-full-length sequences, this is a common issue. In fact, being unable to disambiguate genera and species from one another is not uncommon even with full-length 16S rRNA gene sequences.
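If it helps to quantify how often the classifier stops at genus, here is a minimal sketch that counts classifications by the deepest rank they reach. It assumes you have exported the classification to a tab-separated file (for example with qiime tools export, which should give a taxonomy.tsv with a header row), that the taxon string is in column 2, and that it uses GTDB-style prefixes (d__, p__, ..., g__, s__) separated by semicolons. The function name deepest_rank_counts is made up for this example.

```shell
# Hypothetical helper: count classifications by deepest assigned rank.
# Assumes a header row, and the taxon string in column 2 with
# GTDB-style rank prefixes (d__ ... s__) separated by semicolons.
deepest_rank_counts() (
  awk -F '\t' 'NR > 1 {
      n = split($2, ranks, ";")
      deepest = "none"
      for (i = 1; i <= n; i++) {
        r = ranks[i]
        gsub(/^[ ]+|[ ]+$/, "", r)   # trim spaces around each rank
        # a rank counts only if a name follows the two-underscore prefix
        if (r ~ /^[a-z]__./) deepest = substr(r, 1, 1)
      }
      count[deepest]++
    }
    END { for (k in count) print k "\t" count[k] }' "$1" | sort
)

# Usage (path is an assumption): deepest_rank_counts exported/taxonomy.tsv
```

Each output line is a rank letter and the number of features whose classification ends at that rank, so a large "g" count next to a small "s" count would match what you are seeing.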
Got it, thanks for the quick reply. We're working with ~1500bp sequences targeting the 16S region as part of an assay to identify pathogens in unknown samples. Our old assay had relied on both NCBI and the RDP database, which is now discontinued.
I don't suppose there is any way to modify the classifier to output the top 10-20 matches at lower (species) levels, even if we aren't confident in distinguishing between them? Some of our results are reported out as species complexes or multiple possible species, based on something like this along with subject matter expertise after the calls are made.
RDP has not been discontinued, but is now hosted elsewhere. It is at least available on SourceForge. In fact, you can use the RESCRIPt plugin to fetch the latest version of RDP following this tutorial.
You can also follow our SSU tutorial for GenBank.
Assuming the rank designations are similar, and the taxonomic schemas do not differ too much, you should be able to merge the taxonomy and sequence files together, then make a classifier. I've not tried this before, so I can't comment on how well it'd work.
The closest thing I can currently think of is feature-classifier classify-consensus-vsearch with the --p-top-hits-only flag. I am sure someone else might be able to provide a better or more creative solution.
You can also consider using the weighted classifiers constructed through q2-clawback.
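On the idea of reporting the top 10-20 matches per query: a rough sketch of one way to post-process search hits yourself, not a built-in QIIME 2 feature. It assumes you have a BLAST6-style tab-separated hits table (query ID, reference ID, and percent identity in the first three columns, as vsearch can produce) and a reference taxonomy TSV with reference ID and taxon string in its first two columns. The function name top_hits_with_taxonomy and the file names are hypothetical.

```shell
# Hypothetical helper: list the top N reference hits per query with taxonomy.
# Assumes hits are BLAST6-style (col 1 = query, col 2 = reference,
# col 3 = percent identity) and the taxonomy file has reference ID and
# taxon string in its first two tab-separated columns.
top_hits_with_taxonomy() (
  hits=$1; taxonomy=$2; n=${3:-10}
  tab=$(printf '\t')
  # Sort by query ID, then by descending percent identity.
  sort -t "$tab" -k1,1 -k3,3nr "$hits" |
    awk -F '\t' -v OFS='\t' -v n="$n" '
      NR == FNR { tax[$1] = $2; next }   # first file: taxonomy lookup
      ++seen[$1] <= n {                  # keep the first n hits per query
        print $1, $2, $3, (($2 in tax) ? tax[$2] : "no taxonomy found")
      }
    ' "$taxonomy" -
)

# Usage (file names are assumptions):
#   top_hits_with_taxonomy hits.tsv reference-taxonomy.tsv 20
```

Each output line is the query ID, reference ID, percent identity, and the taxonomy string of that reference, so near-identical hits to several species show up side by side and can be reviewed as a species complex.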
Thanks again. My understanding was that there was a discontinuation of funding for RDP and that it is no longer being curated, but perhaps I'm misinformed. If it is just the server that has been discontinued, we can continue to use it (we're currently querying a static copy of the latest version of the database, but were operating under the assumption that this is the "final" version and we'll need to move on from it eventually).
I'll look into your suggestions here. Appreciate your taking the time to respond again.
I was unaware of the discontinued funding. It explains why their main website has been offline for a while.
I corrected my "discontinued" statement in my earlier reply.