I'm working on a metabarcoding project involving the fungal genus Fusarium. We are using the TEF1 gene instead of ITS, as TEF1 is expected to provide higher taxonomic resolution of Fusarium species (PMC6329491). I trained a Naive Bayes classifier in QIIME 2 using Fusarium-ID as my reference database, which contains around 1k sequences from various Fusarium species, along with some outgroup sequences from other genera in the same family.
We designed specific primers for Fusarium and verified them using Primer-BLAST against the nt database. They seem to be highly specific. However, after running DADA2 and assigning taxonomy with the custom Naive Bayes classifier, I found that about half of my ASVs are only classified at the family level, while the rest are assigned to Fusarium, with no species-level assignment at all.
I manually checked some of the ASVs that were only classified at the family level by running a standard BLAST search. Surprisingly, they don’t seem to match Fusarium, even though they should, based on our primer validation. Although I trust the sequences in Fusarium-ID more than some of the sequences uploaded to GenBank, this raises some concerns.
My question is: since Naive Bayes relies on k-mers rather than full-sequence alignment, do you think it would be better to use classify-consensus-blast or classify-consensus-vsearch instead? Other options that come to mind are to BLAST directly against the original database FASTA, or to manually add more sequences to the database, e.g. from the GenBank Nucleotide database.
Anyway, I feel that instead of looking for another option, I should focus on why the standard approach within QIIME 2 (i.e. Naive Bayes) is not working as expected here. Does anyone have any thoughts on this? Any help would be highly appreciated!
I would start here, personally, because it sounds like you are getting unexpected results across the board, including with NCBI BLAST. So some re-optimization may be needed all around.
It is still worth a shot to test classify-consensus-blast or classify-consensus-vsearch, of course, but it seems like the issue here is just that you are breaking into new territory and the methods (optimized for 16S/ITS) may need to be re-optimized to work with a different target.
One place to start could also be adjusting the confidence parameter with the Naive Bayes classifier.
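To make the effect of the confidence parameter concrete, here is a minimal sketch of the general idea: the assignment is truncated at the deepest rank whose summed support still meets the threshold. This is a simplified illustration, not the q2-feature-classifier implementation; the function name `truncate_by_confidence` and the posterior values are made up for the example. In QIIME 2 itself the threshold is set with `--p-confidence` on `classify-sklearn` (default 0.7).

```python
def truncate_by_confidence(posteriors, confidence=0.7, sep=";"):
    """Return the deepest lineage prefix whose summed posterior >= confidence.

    posteriors: dict mapping full lineage strings to toy posterior
    probabilities (hypothetical values, not real classifier output).
    """
    best = max(posteriors, key=posteriors.get)
    ranks = best.split(sep)
    # Walk from the deepest rank (species) up to the shallowest (family).
    for depth in range(len(ranks), 0, -1):
        prefix = ranks[:depth]
        # Support for a prefix = sum over all lineages that share it.
        support = sum(
            p for lineage, p in posteriors.items()
            if lineage.split(sep)[:depth] == prefix
        )
        if support >= confidence:
            return sep.join(prefix)
    return "Unassigned"


# Toy posteriors for one ASV: two Fusarium species split the support.
toy = {
    "Nectriaceae;Fusarium;Fusarium_oxysporum": 0.40,
    "Nectriaceae;Fusarium;Fusarium_solani": 0.35,
    "Nectriaceae;Neonectria;Neonectria_ditissima": 0.25,
}
print(truncate_by_confidence(toy, confidence=0.7))  # Nectriaceae;Fusarium
print(truncate_by_confidence(toy, confidence=0.3))  # Nectriaceae;Fusarium;Fusarium_oxysporum
```

Under these toy numbers, lowering the threshold yields a species-level label at the cost of lower certainty, which is why any relaxed setting should be checked against known sequences.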
Also check read orientation, for both the queries and the reference (some databases are not in a fixed orientation). Mixed orientation would cause havoc with the Naive Bayes classifier, which assumes a fixed orientation (the vsearch- and blast-based classifiers, on the other hand, can search both).
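A quick way to see why orientation matters for a k-mer-based method: a sequence and its reverse complement share almost no k-mers. The sketch below is a hypothetical helper (not part of QIIME 2) that guesses the orientation of a query relative to a reference by counting shared k-mers in each direction. The sequences and the choice of k=7 are arbitrary toy values.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a plain A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k=7):
    """Set of all overlapping k-mers in seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def guess_orientation(query, reference, k=7):
    """Compare shared k-mer counts for the query as-is vs. reverse-complemented."""
    ref_kmers = kmers(reference, k)
    fwd = len(kmers(query, k) & ref_kmers)
    rev = len(kmers(revcomp(query), k) & ref_kmers)
    if fwd == rev:
        return "ambiguous"
    return "same" if fwd > rev else "reverse-complement"


# Toy reference and a query cut from it (made-up sequence, not real TEF1).
reference = "ATGGGTAAGGAAGACAAGACTCACATCAACGTCGTCGTCATCGG"
query = reference[5:35]
print(guess_orientation(query, reference))
print(guess_orientation(revcomp(query), reference))
```

Running a check like this over a sample of reference sequences and ASVs is one way to spot a mixed-orientation database before training.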
So don't give up; with some fiddling I think you can solve this!
Are there any out-groups / decoy sequences contained within your Fusarium classifier? If not, this is likely the reason why some of your BLAST results are not returning Fusarium, but the classifier is...
If the classifier only contains, or "only knows about", Fusarium, then a query sequence would have to be a really bad match to be left "unclassified". That is, there is a high chance that your sequences will erroneously be classified as Fusarium, even though they are not.
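The point above can be shown with a toy example. The sketch below is not how the Naive Bayes classifier computes probabilities; it only mimics the fact that support is normalized across the classes the classifier knows about. The function names, the Jaccard scoring, and the sequences are all assumptions made for illustration.

```python
def kmer_set(seq, k=4):
    """Set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_posteriors(query, references, k=4):
    """Score the query against each reference label and normalize to sum to 1.

    references: dict mapping a label to one representative sequence.
    """
    q = kmer_set(query, k)
    scores = {}
    for label, seq in references.items():
        r = kmer_set(seq, k)
        # Jaccard similarity plus a small pseudocount so no class is exactly zero.
        scores[label] = len(q & r) / len(q | r) + 1e-6
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


# Made-up sequences: the query is identical to the out-group, not to Fusarium.
fusarium_only = {"Fusarium": "ATGGGTAAGGAAGACAAGACTCACATCAACG"}
query = "ATGGCCAAGGAGAAGACCCACATTAACATCG"

# With a single known class, all of the normalized support lands on it.
print(toy_posteriors(query, fusarium_only))

# Adding an out-group gives the query somewhere better to go.
with_outgroup = dict(fusarium_only, Outgroup=query)
print(toy_posteriors(query, with_outgroup))
```

With only Fusarium in the reference, the query has no alternative label to compete with; once an out-group is present, most of the support moves to it.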
This is a common issue with other amplicon targets. Classifiers built from reference databases such as UNITE and SILVA can often be constructed with out-group taxa. For example, UNITE has the option to provide non-fungal eukaryote taxa, and SILVA contains eukaryotic 16S & 18S sequences as out-groups for bacteria and archaea. This way you can remove those sequences from your data, if needed. See these threads for more detail:
I circle back to this only to thank you again for your comments because I finally managed to get the classifier working!
Following this advice I retrieved TEF1 sequences from NCBI GenBank (non-Fusarium fungi and other eukaryotes) in order to make a complete classifier. RESCRIPt was really helpful for that (downloading, culling sequences, etc.). At first I did not get any ASVs classified as Fusarium (not even as family Nectriaceae!), but one of the RESCRIPt tutorials notes that amplicon-specific classifiers tend to perform better (at least for 16S). Although this is not the case for ITS (at least based on my previous ITS metabarcoding experiments), I decided to trim the reference sequences to the region amplified by our primers.
Following this advice, I searched in both orientations. My final classifier worked like a charm! I am able to identify Fusarium at the species level with reasonable confidence values¹.
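For readers wondering what trimming a reference to the amplicon in both orientations looks like, here is a minimal sketch of an in-silico primer search. It is not the code used in this thread, and within QIIME 2 this step is normally done with `qiime feature-classifier extract-reads`. The primers and sequences below are invented placeholders, and the sketch requires exact matches (apart from IUPAC degenerate bases), which real tools relax.

```python
import re

# IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMP = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def revcomp(seq):
    """Reverse complement, including IUPAC degenerate codes."""
    return seq.upper().translate(COMP)[::-1]


def primer_regex(primer):
    """Turn a (possibly degenerate) primer into a regex fragment."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_amplicon(seq, fwd_primer, rev_primer):
    """Return the region between the primers, oriented like the forward primer.

    Tries the sequence as given, then its reverse complement.
    Returns None if the primer pair is not found in either orientation.
    """
    pattern = re.compile(
        primer_regex(fwd_primer) + "(.+?)" + primer_regex(revcomp(rev_primer))
    )
    for candidate in (seq.upper(), revcomp(seq)):
        match = pattern.search(candidate)
        if match:
            return match.group(1)
    return None


# Hypothetical primers and a toy reference built around a known insert.
FWD, REV = "ATGGGTAAR", "GCCATYCC"
insert = "GACAAGACTCACATC"
ref = "TTT" + "ATGGGTAAG" + insert + revcomp("GCCATCCC") + "AAA"

print(extract_amplicon(ref, FWD, REV) == insert)
print(extract_amplicon(revcomp(ref), FWD, REV) == insert)
```

Because the function also tries the reverse complement, references deposited on the opposite strand end up in the same orientation as the rest, which is what a fixed-orientation classifier needs.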
Sergio
--
¹ I need further manual validation, but still – results look so good