Hi,
I am using classify_consensus_vsearch to classify 16S reads from ONT, and I have noticed that the value given to perc_identity affects the run time. My understanding of the tool is that this parameter only filters the resulting alignments, so it should not have a significant effect on run time. Below are two runs that differ only in perc_identity (0.80 vs. 0.90). The higher value takes much longer: about 42 minutes compared with about 4 minutes.
In [11]: # Start timer for the whole classifier
...: script_start_time = time.time()
...:
...:
...: # Perform taxonomic classification using VSEARCH-based consensus taxonomy classifier
...: taxonomy_classification = classify_consensus_vsearch(
...:     query=rep_seq_artifact,
...:     reference_reads=sequence_artifact,
...:     reference_taxonomy=taxonomy_artifact,
...:     threads=10,
...:     perc_identity=0.80,
...:     strand='plus'
...: )
...:
...:
...: # Record the time at end of script
...: script_end_time = time.time()
...: # Calculate total run time and convert to minutes
...: processing_time_minutes = (script_end_time - script_start_time) / 60
...: print(f"Time for classifcation: {processing_time_minutes:.2f} minutes", flush=True)
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.
Command: vsearch --usearch_global /tmp/qiime2/concertbio/data/3d3d5247-179d-4db4-96a1-035fbaefac55/data/dna-sequences.fasta --id 0.8 --query_cov 0.8 --strand plus --maxaccepts 10 --maxrejects 0 --db /tmp/qiime2/concertbio/data/4017e970-1395-4cda-bc53-215c5ecfdff5/data/dna-sequences.fasta --threads 10 --output_no_hits --blast6out /tmp/q2-BLAST6Format-l77qwapk
vsearch v2.22.1_linux_x86_64, 125.6GB RAM, 20 cores
Reading file /tmp/qiime2/concertbio/data/4017e970-1395-4cda-bc53-215c5ecfdff5/data/dna-sequences.fasta 100%
303412040 nt in 201059 seqs, min 1200, max 1800, avg 1509
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%
Searching 100%
Matching unique query sequences: 1966 of 1973 (99.65%)
Time for classifcation: 4.21 minutes
In [12]: # Start timer for the whole classifier
...: script_start_time = time.time()
...:
...:
...: # Perform taxonomic classification using VSEARCH-based consensus taxonomy classifier
...: taxonomy_classification = classify_consensus_vsearch(
...:     query=rep_seq_artifact,
...:     reference_reads=sequence_artifact,
...:     reference_taxonomy=taxonomy_artifact,
...:     threads=10,
...:     perc_identity=0.90,
...:     strand='plus'
...: )
...:
...:
...: # Record the time at end of script
...: script_end_time = time.time()
...: # Calculate total run time and convert to minutes
...: processing_time_minutes = (script_end_time - script_start_time) / 60
...: print(f"Time for classifcation: {processing_time_minutes:.2f} minutes", flush=True)
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.
Command: vsearch --usearch_global /tmp/qiime2/concertbio/data/3d3d5247-179d-4db4-96a1-035fbaefac55/data/dna-sequences.fasta --id 0.9 --query_cov 0.8 --strand plus --maxaccepts 10 --maxrejects 0 --db /tmp/qiime2/concertbio/data/4017e970-1395-4cda-bc53-215c5ecfdff5/data/dna-sequences.fasta --threads 10 --output_no_hits --blast6out /tmp/q2-BLAST6Format-ku9dzd5a
vsearch v2.22.1_linux_x86_64, 125.6GB RAM, 20 cores
Reading file /tmp/qiime2/concertbio/data/4017e970-1395-4cda-bc53-215c5ecfdff5/data/dna-sequences.fasta 100%
303412040 nt in 201059 seqs, min 1200, max 1800, avg 1509
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%
Searching 100%
Matching unique query sequences: 1913 of 1973 (96.96%)
Time for classifcation: 42.24 minutes
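
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two runs above could be timed in one loop instead of two copied cells. The helper names `time_runs` and `slowdown` are hypothetical and not part of QIIME 2, and the commented-out `classify` wrapper assumes the same artifacts and arguments as the cells above.

```python
import time
from typing import Any, Callable, Dict, Iterable


def time_runs(run: Callable[[float], Any], values: Iterable[float]) -> Dict[float, float]:
    """Call run(value) once per value and return the elapsed minutes for each value."""
    minutes = {}
    for value in values:
        start = time.perf_counter()
        run(value)
        minutes[value] = (time.perf_counter() - start) / 60
    return minutes


def slowdown(baseline_minutes: float, other_minutes: float) -> float:
    """Return how many times longer the second run took than the baseline run."""
    return other_minutes / baseline_minutes


# Ratio of the two timings reported above (4.21 min at 0.80, 42.24 min at 0.90).
print(f"Slowdown: {slowdown(4.21, 42.24):.1f}x")

# Hypothetical usage with the classifier (assumes the artifacts from the cells above):
# def classify(perc_identity):
#     return classify_consensus_vsearch(
#         query=rep_seq_artifact,
#         reference_reads=sequence_artifact,
#         reference_taxonomy=taxonomy_artifact,
#         threads=10,
#         perc_identity=perc_identity,
#         strand='plus',
#     )
# print(time_runs(classify, [0.80, 0.90]))
```

With the timings from the logs above, the ratio works out to roughly a tenfold increase in run time for the 0.90 setting.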