Hi! I think the following should be important parameters to take into consideration in feature-classifier classify-consensus-blast, since they are also part of BLAST+:
Total/Max score: score of the alignment based on
megablast/blastn
Total/Max score and query cover should also be important parameters when classifying a species (e.g., a sequence with 98% identity and 100% query cover should ideally carry more weight in the species consensus than a sequence with 99.5% identity and 75% query cover; the same applies to a higher total/max score). I don't know whether this is being considered.
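To make the idea concrete, here is a minimal sketch of what such a weighting could look like. Everything in it is hypothetical: `hit_weight` and `weighted_species_vote` are made-up names, and multiplying identity by query cover is just one arbitrary choice of weight, not something classify-consensus-blast or BLAST+ does.

```python
from collections import defaultdict


def hit_weight(pident, qcov):
    """Hypothetical weight: fraction identity times fraction query cover."""
    return pident * qcov


def weighted_species_vote(hits):
    """Return each species' share of the total weight.

    hits: iterable of (species, pident, qcov), with pident and qcov in [0, 1].
    """
    totals = defaultdict(float)
    for species, pident, qcov in hits:
        totals[species] += hit_weight(pident, qcov)
    grand_total = sum(totals.values())
    return {species: weight / grand_total for species, weight in totals.items()}


# The two hits from the example above (species names are placeholders).
shares = weighted_species_vote([
    ("species A", 0.98, 1.00),   # 98% identity, 100% query cover
    ("species B", 0.995, 0.75),  # 99.5% identity, 75% query cover
])
```

With this weighting the full-length 98% hit gets the larger share, even though its identity is lower. Whether that actually improves species calls would need benchmarking.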
Yes, I've noticed. That parameter filters out sequences with less query cover than the specified value, but I don't know whether the taxonomic assignment takes into account that a hit with high query cover is more trustworthy than one with low query cover. This is not specified in the docstring, but maybe it is taken into account internally in the function; I don't know.
Total/Max score are values reported in BLAST+ results. This score is based on rewards/penalties for matches/mismatches between base pairs.
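As a rough illustration of that scoring, here is a small sketch. The function names are made up, the reward/penalty values are only illustrative defaults, and gap costs are ignored. BLAST itself reports normalized bit scores derived from the raw score; that conversion is left out here.

```python
def raw_score(matches, mismatches, reward=1, penalty=-2):
    """Raw score of one ungapped aligned segment (reward/penalty are illustrative)."""
    return matches * reward + mismatches * penalty


def max_and_total_score(segment_scores):
    """Max score is the best single segment; total score sums all segments
    aligned to the same subject sequence."""
    return max(segment_scores), sum(segment_scores)


# One subject with two aligned segments: 98 matches / 2 mismatches, and 40 / 0.
segment_scores = [raw_score(98, 2), raw_score(40, 0)]
max_score, total_score = max_and_total_score(segment_scores)
```

For a subject with a single aligned segment, max score and total score are the same number.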
No, it is not. But neither is %id, evalue, etc. used in this way. In classify-consensus-blast, all of these parameters are used as thresholds for filtering relevant hits, so they should be set to reasonable threshold values. This is also, by the way, similar to how blastn itself operates (depending on how the max-accepts and max-rejects parameters are set)... it just searches through the reference, collects all hits above the threshold values for these parameters, and scores them until it reaches the defined quota. The difference with classify-consensus-blast is that it then uses a consensus function to find the most confident lineage from among a selection of hits.
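The filter-then-consensus idea can be sketched roughly like this. It is a simplified illustration, not the plugin's actual code: `filter_hits` and `consensus_lineage` are hypothetical helpers, the argument names only echo the plugin's options, and the hit table is invented.

```python
from collections import Counter


def filter_hits(hits, perc_identity=0.8, query_cov=0.8, maxaccepts=10):
    """Keep hits at or above both thresholds, up to maxaccepts hits."""
    kept = [h for h in hits
            if h["pident"] >= perc_identity and h["qcov"] >= query_cov]
    return kept[:maxaccepts]


def consensus_lineage(lineages, min_consensus=0.51):
    """Walk down the ranks; stop when the top label's share drops below min_consensus."""
    if not lineages:
        return []
    consensus = []
    for depth in range(max(len(lineage) for lineage in lineages)):
        # Only count hits that agree with the consensus reached so far.
        labels = Counter(lineage[depth] for lineage in lineages
                         if len(lineage) > depth and lineage[:depth] == consensus)
        if not labels:
            break
        label, count = labels.most_common(1)[0]
        if count / len(lineages) < min_consensus:
            break
        consensus.append(label)
    return consensus


# Invented hits; every accepted hit counts equally, whatever its identity or cover.
hits = [
    {"pident": 0.995, "qcov": 0.75, "lineage": ["Bacteria", "Firmicutes", "Bacillus", "B. subtilis"]},
    {"pident": 0.98, "qcov": 1.00, "lineage": ["Bacteria", "Firmicutes", "Bacillus", "B. cereus"]},
    {"pident": 0.97, "qcov": 0.95, "lineage": ["Bacteria", "Firmicutes", "Bacillus", "B. cereus"]},
    {"pident": 0.70, "qcov": 1.00, "lineage": ["Bacteria", "Proteobacteria"]},
]
kept = filter_hits(hits)
assignment = consensus_lineage([h["lineage"] for h in kept])
```

Here the 75% cover hit and the 70% identity hit are dropped by the thresholds, and the two remaining hits agree down to species. Once a hit passes the filter, its identity and cover play no further role in the vote.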
So I like your proposal for using these parameters, but you are in effect proposing a much more complex decision function for selecting hits. Implementing this would not be simple or straightforward, and it will take time and testing. This is what @colinbrislawn is getting at with his request for benchmarks: to see whether any such decision function is implemented elsewhere and whether it actually shows adequate performance. If yes, then implementation would be easier.
If you would like to open a more detailed issue request on the GitHub repository, we could take a closer look at this later.
Great! Thank you both. I have seen some approaches that use these values to identify species in barcoding and metabarcoding. It would be great to open a discussion on GitHub later with more details provided!