Do QIIME steps downstream of feature-classifier need E-value and bitscore?

splaisan · October 29, 2019, 11:33am

I am wondering whether the QIIME steps after the classifier use the E-value and bitscore columns of the blastn format 6 output in any way (alignment quality filtering?).
If they do not, it may be possible to substitute minimap2 for blastn/vsearch for long-read assignments. However, minimap2 does not produce E-values or bitscores, so if these are required, it does not qualify.
Thanks for your info

colinbrislawn · October 29, 2019, 1:51pm

Good to see you again Stephane,

Let's open up the source code and see what that plugin is doing under the hood.

It looks like after running vsearch and blast, the _consensus_assignments method is called.

github.com

qiime2/q2-feature-classifier/blob/dev/q2_feature_classifier/_consensus_assignment.py#L21


      
          
import pandas as pd

from qiime2.plugin import Str, Float, Range
from .plugin_setup import plugin
from q2_types.feature_data import FeatureData, Taxonomy, BLAST6


min_consensus_param = {'min_consensus': Float % Range(
    0.5, 1.0, inclusive_end=True, inclusive_start=False)}

min_consensus_param_description = {
    'min_consensus': 'Minimum fraction of assignments must match top '
                     'hit to be accepted as consensus assignment.'}

DEFAULTUNASSIGNABLELABEL = "Unassigned"


def find_consensus_annotation(search_results: pd.DataFrame,
                              reference_taxonomy: pd.Series,
                              min_consensus: int = 0.51,

That in turn calls _compute_consensus_annotations, which is further down the same file.

github.com

qiime2/q2-feature-classifier/blob/dev/q2_feature_classifier/_consensus_assignment.py#L135


      
    taxa_hits = taxa_hits.groupby(taxa_hits.index).apply(list)

    return taxa_hits


def _compute_consensus_annotations(
        query_annotations, min_consensus,
        unassignable_label=DEFAULTUNASSIGNABLELABEL):
    """
        Parameters
        ----------
        query_annotations : pd.Series of lists
            Indices are query identifiers, and values are lists of all
            taxonomic annotations associated with that identifier.
        Returns
        -------
        pd.DataFrame
            Indices are query identifiers, and values are the consensus of the
            input taxonomic annotations, and the consensus score.
    """
    # define function to apply to each list of taxa hits
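
To make the idea concrete, here is a rough sketch of what a consensus step of this kind can look like. This is not the plugin's actual code: the function name `consensus_taxonomy`, the ";" separator, and the 0.0 score returned for unassigned queries are all assumptions for illustration. The point is that it only looks at the list of taxonomy strings per query and never touches E-values or bitscores.

```python
from collections import Counter


def consensus_taxonomy(annotations, min_consensus=0.51,
                       unassignable_label="Unassigned"):
    """Hypothetical helper: deepest taxonomy prefix shared by at least
    `min_consensus` of the hits, plus the fraction of hits supporting it."""
    # Split each "k__A; p__B; ..." string into a tuple of ranks.
    split = [tuple(rank.strip() for rank in a.split(";")) for a in annotations]
    if not split:
        return unassignable_label, 0.0  # assumed convention for "no hits"

    total = len(split)
    consensus, score = (), 0.0
    for depth in range(1, max(len(s) for s in split) + 1):
        # Count how often each prefix of this depth occurs among the hits.
        prefixes = Counter(s[:depth] for s in split if len(s) >= depth)
        prefix, count = prefixes.most_common(1)[0]
        fraction = count / total
        if fraction < min_consensus:
            break  # no sufficiently supported label at this depth
        consensus, score = prefix, fraction

    if not consensus:
        return unassignable_label, 0.0
    return "; ".join(consensus), score


hits = ["k__Bacteria; p__Firmicutes; g__A",
        "k__Bacteria; p__Firmicutes; g__B",
        "k__Bacteria; p__Firmicutes; g__A"]
print(consensus_taxonomy(hits, min_consensus=0.51))
print(consensus_taxonomy(hits, min_consensus=0.8))
```

With the lower threshold the genus is kept because two of three hits agree; with the stricter threshold the label is truncated at the phylum.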

Here, I don't see any 'alignment quality filtering', but keep in mind that both vsearch and blast let you define filters to control what they consider as a hit.

def classify_consensus_vsearch(query: DNAFASTAFormat,
                               reference_reads: DNAFASTAFormat,
                               reference_taxonomy: pd.Series,
                               maxaccepts: int = 10,
                               perc_identity: float = 0.8,
                               query_cov: float = 0.8,
                               strand: str = 'both',
                               min_consensus: float = 0.51,
                               unassignable_label: str =
                               _get_default_unassignable_label(),
                               search_exact: bool = False,
                               top_hits_only: bool = False,
                               threads: str = 1) -> pd.DataFrame:

You can see that, by default, vsearch will not report a hit unless it is at least 80% identical (perc_identity) and covers at least 80% of the query (query_cov). I think this is what you were asking about.

Keep in mind that vsearch does not report E-values either, so this should not conflict with minimap2.

Colin

P.S. I feel a little pedantic saying this, but E-values are different from bitscores. A bitscore is a normalized, weighted total of the matches and mismatches in the alignment. An E-value is the number of hits with a bitscore at least that good that you would expect to see by chance alone in the database. E-values depend on the size and complexity of the database and also the size and complexity of the query. I think this is why modern programs often don't report E-values.
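
As a small illustration of that difference, the standard BLAST statistics relate the two as E = m * n * 2^(-S'), where S' is the bitscore and m and n are the (effective) query and database lengths. The helper name below is made up, and raw lengths stand in for BLAST's effective lengths, so treat this as a sketch rather than a reimplementation of BLAST.

```python
def evalue_from_bitscore(bitscore, query_len, db_len):
    """Hypothetical helper: E = m * n * 2**(-S'), using raw lengths
    in place of BLAST's effective lengths (a simplifying assumption)."""
    return query_len * db_len * 2.0 ** (-bitscore)


# Same alignment (same bitscore), two database sizes:
small_db = evalue_from_bitscore(50, query_len=100, db_len=1_000_000)
large_db = evalue_from_bitscore(50, query_len=100, db_len=2_000_000)
print(small_db, large_db)  # the E-value doubles when the database doubles
```

The bitscore is a property of the alignment alone, while the E-value changes as soon as the database does.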

splaisan · October 29, 2019, 3:26pm

Thanks a lot Colin,

So if I manage to derive a blastn format 6-like table from the minimap2 PAF output, I may be able to plug minimap2 into QIIME, provided I can build a stringency filter that keeps only long matches (80%) from the mapping results. There is still some way to go, but I'll try.

colinbrislawn · October 29, 2019, 4:38pm

PAF is a lot like blast6. Here's all the PAF columns:
https://lh3.github.io/minimap2/minimap2.html#10
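
Here is a minimal sketch of how such a conversion and filter might look, using only the 12 mandatory PAF columns. The function name `paf_to_blast6` and both thresholds are assumptions, as are the placeholder values: mismatch and gapopen are set to 0 because PAF does not give them directly, and E-value and bitscore are set to -1 and 0, which as far as I know is what vsearch writes in its own blast6 output. Whether QIIME accepts such a table is something you would need to test.

```python
def paf_to_blast6(paf_line, min_identity=0.8, min_query_cov=0.8):
    """Hypothetical helper: convert one PAF record to a blast6-like row,
    or return None if it fails the identity/coverage filter."""
    f = paf_line.rstrip("\n").split("\t")
    qname, qlen, qstart, qend = f[0], int(f[1]), int(f[2]), int(f[3])
    strand, tname = f[4], f[5]
    tstart, tend = int(f[7]), int(f[8])
    matches, block_len = int(f[9]), int(f[10])

    identity = matches / block_len        # matches over alignment block length
    query_cov = (qend - qstart) / qlen    # fraction of the query that is aligned
    if identity < min_identity or query_cov < min_query_cov:
        return None

    # PAF coordinates are 0-based, end-exclusive; blast6 is 1-based, inclusive.
    sstart, send = tstart + 1, tend
    if strand == "-":
        sstart, send = send, sstart       # BLAST swaps subject coords on minus strand

    row = [qname, tname, f"{100 * identity:.1f}", str(block_len),
           "0", "0",                      # mismatch, gapopen: placeholders
           str(qstart + 1), str(qend), str(sstart), str(send),
           "-1", "0"]                     # evalue, bitscore: placeholders
    return "\t".join(row)


paf = "read1\t1000\t50\t950\t+\tref1\t1500\t100\t1000\t855\t900\t60"
print(paf_to_blast6(paf))
```

A record that aligns only half of the read, or at low identity, comes back as None and can simply be skipped when writing the output table.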

I wanted to check in about your project a little more. I've found vsearch plenty fast, even at a very large scale. How many reads are you looking to assign taxonomy to? Have you reduced their number by dereplication, denoising, or high-identity clustering?

Colin