The parameter perc_identity for classify_consensus_vsearch affects runtime?

Maurice_Barrett · May 15, 2024, 1:05pm

Hi,

I am using classify_consensus_vsearch to classify 16S reads from ONT. I notice that the value given to perc_identity affects run time. My understanding of the tool is that this parameter is just to filter the results/alignments and should not have a significant burden on the runtime. I have given examples below between the difference of 80 and 90. Note the higher the number the longer the runtime. In this case there is a significant increase in runtime.

In [11]: # Start timer for the whole classifier
...: script_start_time = time.time()
...:
...:
...: # Perform taxonomic classification using VSEARCH-based consensus taxonomy classifier
...: taxonomy_classification = classify_consensus_vsearch(
...: query=rep_seq_artifact,
...: reference_reads=sequence_artifact,
...: reference_taxonomy=taxonomy_artifact,
...: threads=10,
...: perc_identity=0.80,
...: strand='plus'
...: )
...:
...:
...: # Record the time at end of script
...: script_end_time = time.time()
...: # calculate total run time and convert to minutes
...: processing_time_minutes = (script_end_time - script_start_time) / 60
...: print(f"Time for classifcation: {processing_time_minutes:.2f} minutes", flush=True)
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --usearch_global /tmp/qiime2/concertbio/data/3d3d5247-179d-4db4-96a1-035fbaefac55/data/dna-sequences.fasta --id 0.8 --query_cov 0.8 --strand plus --maxaccepts 10 --maxrejects 0 --db /tmp/qiime2/concertbio/data/4017e970-1395-4cda-bc53-215c5ecfdff5/data/dna-sequences.fasta --threads 10 --output_no_hits --blast6out /tmp/q2-BLAST6Format-l77qwapk

vsearch v2.22.1_linux_x86_64, 125.6GB RAM, 20 cores

Reading file /tmp/qiime2/concertbio/data/4017e970-1395-4cda-bc53-215c5ecfdff5/data/dna-sequences.fasta 100%
303412040 nt in 201059 seqs, min 1200, max 1800, avg 1509
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%
Searching 100%
Matching unique query sequences: 1966 of 1973 (99.65%)
Time for classifcation: 4.21 minutes

In [12]: # Start timer for the whole classifier
...: script_start_time = time.time()
...:
...:
...: # Perform taxonomic classification using VSEARCH-based consensus taxonomy classifier
...: taxonomy_classification = classify_consensus_vsearch(
...: query=rep_seq_artifact,
...: reference_reads=sequence_artifact,
...: reference_taxonomy=taxonomy_artifact,
...: threads=10,
...: perc_identity=0.90,
...: strand='plus'
...: )
...:
...:
...: # Record the time at end of script
...: script_end_time = time.time()
...: # calculate total run time and convert to minutes
...: processing_time_minutes = (script_end_time - script_start_time) / 60
...: print(f"Time for classifcation: {processing_time_minutes:.2f} minutes", flush=True)
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.

Command: vsearch --usearch_global /tmp/qiime2/concertbio/data/3d3d5247-179d-4db4-96a1-035fbaefac55/data/dna-sequences.fasta --id 0.9 --query_cov 0.8 --strand plus --maxaccepts 10 --maxrejects 0 --db /tmp/qiime2/concertbio/data/4017e970-1395-4cda-bc53-215c5ecfdff5/data/dna-sequences.fasta --threads 10 --output_no_hits --blast6out /tmp/q2-BLAST6Format-ku9dzd5a

vsearch v2.22.1_linux_x86_64, 125.6GB RAM, 20 cores

Reading file /tmp/qiime2/concertbio/data/4017e970-1395-4cda-bc53-215c5ecfdff5/data/dna-sequences.fasta 100%
303412040 nt in 201059 seqs, min 1200, max 1800, avg 1509
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%
Searching 100%
Matching unique query sequences: 1913 of 1973 (96.96%)
Time for classifcation: 42.24 minutes

Nicholas_Bokulich · May 15, 2024, 1:29pm

Hi @Maurice_Barrett ,

Thanks for reporting this — I was not aware of this behavior and we had not seen it before (with Illumina short reads) but it does not surprise me at all, esp. for long-read data. Let me explain...

That is correct, but nothing is ever that simple

VSEARCH operates by searching the database until a certain number of hits (or rejects) is reached. So if you have a low perc-identity, you will reach the number of hits in a shorter amount of time. If you have a high perc-identity, then vsearch will keep on performing alignments until you reach the maxaccepts or maxrejects threshold.

For short-read data this probably makes a negligible difference most of the time, as the alignment time is usually quite short. But for long reads the alignment step is much more expensive, and so the effect on runtime would be much more substantial.

So my advice (if you are trying to reduce runtime) is to adjust maxaccepts or maxrejects to find a suitable threshold for your use case.

Maurice_Barrett · May 15, 2024, 1:57pm

HI @Nicholas_Bokulich ,

Thanks for the speedy reply! I think you have given me a better understanding of the tool. So the VSEARCH alignment is not done on the entire reference database, but the kmer based sorting is? This is followed by the alignment of the database versus the query until certain parameters are satisfied, namely maxaccepts and perc_identity?

Thanks again for the help!

colinbrislawn · May 15, 2024, 3:24pm

The "termination_options" are based on those from usearch: termination options -maxaccepts and -maxrejects

That's right!

From the vsearch manual:

--maxaccepts positive integer

Maximum number of hits to accept before stopping the search. The default value is 1. This option works in pair with --maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptation criteria, it is accepted as best hit and the search process stops for that query. If --maxaccepts is set to a higher value, more hits are accepted.

--maxrejects positive integer

Maximum number of non-matching target sequences to consider before stopping the search. The default value is 32. This option works in pair with --maxaccepts. The
search process sorts target sequences by decreasing number of k-mers they have in
common with the query sequence, using that information as a proxy for sequence similarity.
After pairwise alignments, if none of the first 32 examined target sequences pass
the acceptation criteria, the search process stops for that query (no hit). If --maxrejects is set to a higher value, more target sequences are considered. If --maxaccepts and --maxrejects are both set to 0, the complete database is searched.

Maurice_Barrett · May 15, 2024, 3:46pm

@colinbrislawn Thanks for clearing that up for me! Appreciate all the help and work from the QIIME2 team!

colinbrislawn · May 15, 2024, 5:09pm

github.com

qiime2/q2-feature-classifier/blob/dev/q2_feature_classifier/_vsearch.py#L25-L32


      
          DEFAULTMAXACCEPTS = 10
          DEFAULTPERCENTID = 0.8
          DEFAULTQUERYCOV = 0.8
          DEFAULTSTRAND = 'both'
          DEFAULTSEARCHEXACT = False
          DEFAULTTOPHITS = False
          DEFAULTMAXHITS = 'all'
          DEFAULTMAXREJECTS = 'all'

Here are the default settings used by the Qiime2 plugin when running vsearch!

It looks like we are using --maxaccepts 10 by default, which makes sense because it terminates the search when 10 good hits are found.

We still use --maxrejects 0, meaning the full database is searched if no hits are found. I can see why this conservative/optimistic setting is used, especially if it matches the blast settings.

It does lead to a worst-case of O(query x target) where every target is aligned against the full database.

In this case, I would argue that vsearch is the wrong tool for the job or there's been an error upstream...

Maurice_Barrett · May 16, 2024, 8:51am

@colinbrislawn Thanks! I will play around with the parameters as well as the database and the query sequences. Get a good Goldilocks!

Maurice_Barrett · May 16, 2024, 1:21pm

@Nicholas_Bokulich @colinbrislawn I am wondering if there is a way of ordering the database based on assumed presence/abundance of certain taxa due to sample type etc. Or would the kmer sort negate such efforts?

Nicholas_Bokulich · May 16, 2024, 2:06pm

Hi @Maurice_Barrett ,

Correct. The kmer sort orders sequences in order of kmer profile similarity so order in the db does not matter.

You might be interested in this approach for weighting taxa based on expected frequency/prior observations. See this paper and the q2-clawback plugin:
https://www.nature.com/articles/s41467-019-12669-6

system · June 16, 2024, 8:06pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.