BLAST parameters

Luciano_Pastorelli_B · December 7, 2023, 2:36pm

Hi! I think the following should be important parameters to take into consideration in feature-classifier classify-consensus-blast, since they are also part of BLAST+:

Total/Max score: score of the alignment based on megablast/blastn

Total/Max score and Query cover should also be important parameters when classifying a species (e.g., a sequence with 98% identity and 100% query cover should ideally have more weight in the species consensus than a sequence with 99.5% identity and 75% query cover; the same applies when a hit has a higher total/max score). I don't know if this is being considered.
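To make the idea concrete, here is a minimal sketch of what such a weighting could look like. It is not how classify-consensus-blast works; the function name `weighted_species_vote` and the weight formula (identity times query cover) are assumptions chosen only for illustration.

```python
from collections import defaultdict


def weighted_species_vote(hits):
    """Hypothetical weighting: each hit votes with identity * query cover.

    hits: list of (species, percent_identity, query_cover_percent) tuples.
    Returns the species with the largest summed weight and all weights.
    """
    weights = defaultdict(float)
    for species, pident, qcov in hits:
        # Both values are percentages, so scale each to the 0-1 range.
        weights[species] += (pident / 100.0) * (qcov / 100.0)
    best = max(weights, key=weights.get)
    return best, dict(weights)


# The two hits from the example above (species names are placeholders).
hits = [
    ("Species A", 98.0, 100.0),
    ("Species B", 99.5, 75.0),
]
best, weights = weighted_species_vote(hits)
print(best, weights)
```

With this particular formula the 98% / 100% hit outweighs the 99.5% / 75% hit, but other weightings (for example, using bit score) would be equally plausible and would need benchmarking.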

Thanks for the attention!

colinbrislawn · December 7, 2023, 3:03pm

Hi Luciano,

Welcome to the forums! :qiime2:

Have you taken a look at all the settings for the blast+ classifier?
--p-query-cov is already implemented, but I'm not sure that's what you are asking about...
https://docs.qiime2.org/2023.9/plugins/available/feature-classifier/classify-consensus-blast/

Is this like percent identity, but for bit score? Has this been published in a paper?

Luciano_Pastorelli_B · December 7, 2023, 3:22pm

Hi, Colin!

Yes, I've noticed. That parameter filters out sequences with less query cover than the specified value, but I don't know whether the taxonomic assignment takes into consideration that a hit with high query cover is more trustworthy than one with lower cover. This is not specified in the docstring, but maybe it is taken into account internally in the function, I don't know.

Total/Max score are results when using BLAST+. This score is based on rewards/penalties for matches/mismatches between bps.
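As a rough illustration of the reward/penalty idea, the sketch below sums a raw score over an ungapped alignment and then derives a "max" and "total" value across several aligned segments of one subject. The function name `raw_score` and the reward/penalty values of 1 and -2 are illustrative assumptions; the values BLAST actually reports are bit scores, which are a normalized form of the raw score, and real alignments also include gap costs that this sketch ignores.

```python
def raw_score(query, subject, reward=1, penalty=-2):
    """Sum rewards for matches and penalties for mismatches (no gaps)."""
    if len(query) != len(subject):
        raise ValueError("ungapped sketch: sequences must have equal length")
    return sum(reward if q == s else penalty for q, s in zip(query, subject))


# Two aligned segments against the same subject sequence.
segment_scores = [
    raw_score("ACGTACGT", "ACGTACGT"),  # 8 matches
    raw_score("ACGTACGT", "ACGAACGT"),  # 7 matches, 1 mismatch
]

max_score = max(segment_scores)    # best single segment
total_score = sum(segment_scores)  # all segments from this subject combined
print(max_score, total_score)
```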

colinbrislawn · December 7, 2023, 3:40pm

Do you have any benchmarks to share?

Luciano_Pastorelli_B · December 8, 2023, 6:07am

Nicholas_Bokulich · December 8, 2023, 7:45pm

Hi @Luciano_Pastorelli_B ,

Thank you for the suggestion.

No, it is not. But neither %id, evalue, nor the other parameters are used in this way. In classify-consensus-blast, all of these parameters are used as thresholds for filtering relevant hits, so they should be set to reasonable threshold values. This is also, by the way, similar to how blastn itself operates (depending on how the max-accepts and max-rejects parameters are set): it just searches through the reference, collects all hits above the threshold values for these parameters, and scores them until it reaches the defined quota. The difference with classify-consensus-blast is that it then uses a consensus function to find the most confident lineage from among a selection of hits.
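A simplified sketch of that two-step logic (threshold filtering, then consensus) might look like the following. This is not the plugin's source code: the helper names `filter_hits` and `consensus_lineage`, the dictionary layout of a hit, and the example lineages are all assumptions. The keyword arguments mirror the classifier's `--p-perc-identity`, `--p-query-cov`, `--p-evalue`, `--p-maxaccepts`, and `--p-min-consensus` options.

```python
from collections import Counter


def filter_hits(hits, perc_identity=0.8, query_cov=0.8, evalue=0.001,
                maxaccepts=10):
    """Keep hits that pass every threshold, up to maxaccepts of them."""
    kept = [h for h in hits
            if h["pident"] >= perc_identity
            and h["qcov"] >= query_cov
            and h["evalue"] <= evalue]
    return kept[:maxaccepts]


def consensus_lineage(lineages, min_consensus=0.51):
    """Walk down the ranks, keeping a label while enough hits agree."""
    if not lineages:
        return "Unassigned"
    split = [lin.split(";") for lin in lineages]
    result = []
    for level in range(min(len(s) for s in split)):
        # Only count lineages that agree with the consensus so far.
        counts = Counter(s[level] for s in split if s[:level] == result)
        label, n = counts.most_common(1)[0]
        if n / len(split) < min_consensus:
            break
        result.append(label)
    return ";".join(result) if result else "Unassigned"


hits = [
    {"taxon": "k__Bacteria;g__Bacillus;s__subtilis",
     "pident": 0.99, "qcov": 1.00, "evalue": 1e-50},
    {"taxon": "k__Bacteria;g__Bacillus;s__subtilis",
     "pident": 0.98, "qcov": 0.95, "evalue": 1e-48},
    {"taxon": "k__Bacteria;g__Bacillus;s__cereus",
     "pident": 0.97, "qcov": 0.90, "evalue": 1e-45},
    {"taxon": "k__Bacteria;g__Bacillus;s__cereus",
     "pident": 0.995, "qcov": 0.75, "evalue": 1e-40},  # fails query_cov
]
kept = filter_hits(hits)
assignment = consensus_lineage([h["taxon"] for h in kept])
print(len(kept), assignment)
```

Note that every hit that survives filtering gets one equal vote in this sketch; identity, coverage, and e-value only decide whether a hit is kept, not how much it counts.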

So I like your proposal for using these parameters, but you are in effect proposing a much more complex decision function for selecting hits. Implementing this would not be simple or straightforward, and it will take time and testing. This is what @colinbrislawn is getting at with his request for benchmarks: to see whether any such decision function is implemented elsewhere and whether it actually shows adequate performance. If so, implementation would be easier.

If you would like to open a more detailed issue on the GitHub repository, we could take a closer look at this at a later time.

Thanks for the suggestions!

Luciano_Pastorelli_B · December 8, 2023, 8:30pm

Great! Thank you both. I have seen some approaches using these values to identify species by barcoding and metabarcoding. It would be great to open a discussion on GitHub later with more details!