Hi,
I am using:
- QIIME 2 Plugin 'rescript' version 2024.5.1 (from package 'rescript' version 2024.5.1).
- Cluster (--mem=200G, -c 12)
- Partial RESCRIPt code:
qiime rescript extract-seq-segments \
    --i-input-sequences ${X} \
    --i-reference-segment-sequences /scratch/username/rescript/filtered/${Y}.qza \
    --p-perc-identity 0.8 \
    --p-min-seq-len ${min_seq_len} \
    --p-threads 12 \
    --o-extracted-sequence-segments ${extracted_output} \
    --o-unmatched-sequences ${unmatched_output} \
    --verbose
One of the slurm*.out files gives me:
SLURM_ARRAY_TASK_ID: 188
X: /scratch/username/rescript/filtered/genbank_8_b/genbank_8_b.fasta.split/genbank_8_b.part_035.qza
Y: ref_ch
min_seq_len: 29
vsearch v2.22.1_linux_x86_64, 376.4GB RAM, 32 cores
https://github.com/torognes/vsearch
Reading file /scratch/username/metacurator/temp/qiime2/username/data/0f6bab7d-ff06-41ad-9942-81d12653a6a3/data/dna-sequences.fasta 100%
171767 nt in 1174 seqs, min 109, max 211, avg 146
Masking 100%
Counting k-mers 100%
Creating k-mer index 100%
Searching 100%
Matching unique query sequences: 95 of 769 (12.35%)
Running external command line application. This may print messages to stdout and/or stderr.
The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist.
Command: vsearch --usearch_global /scratch/username/metacurator/temp/qiime2/username/data/d90eac1e-3c72-4a54-a72f-a606981bc119/data/dna-sequences.fasta --db /scratch/username/metacurator/temp/qiime2/username/data/0f6bab7d-ff06-41ad-9942-81d12653a6a3/data/dna-sequences.fasta --id 0.8 --strand plus --threads 12 --qmask none --qsegout /scratch/username/metacurator/temp/q2-DNAFASTAFormat-4gbjoyb9 --notmatched /scratch/username/metacurator/temp/q2-DNAFASTAFormat-_lpy2p7u --minseqlength 29
Saved FeatureData[Sequence] to: /scratch/username/rescript/filtered/genbank_8_b/genbank_8_b.fasta.split/processed/genbank_8_b.part_035.qza_ref_ch_extracted.qza
Saved FeatureData[Sequence] to: /scratch/username/rescript/filtered/genbank_8_b/genbank_8_b.fasta.split/processed/genbank_8_b.part_035.qza_ref_ch_unmatched.qza
I am running multiple files against the same reference sequences (4 trnL reference sequences: CD, CH, GH, GD). It seems the reference (seed) sequence is always saved under the same temporary folder (--db /scratch/username/metacurator/temp/qiime2/username/data/0f6bab7d-ff06-41ad-9942-81d12653a6a3/data/dna-sequences.fasta). By contrast, the query path (--usearch_global /scratch/username/metacurator/temp/qiime2/username/data/d90eac1e-3c72-4a54-a72f-a606981bc119/data/dna-sequences.fasta) appears only once in the SLURM output. The following command confirms this, as the same temporary folder ID appears in multiple SLURM output files:
grep -l "0f6bab7d-ff06-41ad-9942-81d12653a6a3" ./slurm*
./slurm-4266924_158.out
./slurm-4266924_161.out
./slurm-4266924_164.out
./slurm-4266924_167.out
./slurm-4266924_170.out
./slurm-4266924_173.out
./slurm-4266924_176.out
./slurm-4266924_179.out
./slurm-4266924_182.out
./slurm-4266924_185.out
./slurm-4266924_188.out
./slurm-4266924_191.out
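In case it helps anyone reproduce the check, here is a rough sketch of how the grep above could be generalized to tally every --db folder ID across the logs, rather than searching for one ID at a time. The function name db_uuid_per_log is made up for illustration, and the pattern assumes the --db path always ends in <36-character ID>/data/dna-sequences.fasta, as it does in the log shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of QIIME 2 or RESCRIPt):
# print "<log file><TAB><folder ID>" for the --db path found in each log.
db_uuid_per_log() {
    local f uuid
    for f in "$@"; do
        # Keep only the "--db <path>" token, then reduce it to the 36-character
        # folder ID that sits directly before /data/dna-sequences.fasta.
        grep -o -- '--db [^ ]*' "$f" \
            | sed -E 's#.*/([0-9a-f-]{36})/data/dna-sequences\.fasta#\1#' \
            | sort -u \
            | while read -r uuid; do
                  printf '%s\t%s\n' "$f" "$uuid"
              done
    done
}
```

Something like `db_uuid_per_log ./slurm-*.out | cut -f2 | sort | uniq -c` would then show how many logs share each reference folder ID.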
This shared temporary folder may be what leads to the message: "The command being run is below. This command cannot be manually re-run as it will depend on temporary files that no longer exist."
However, the output file is still created, likely using a partially converted seed FASTA file from another run.
My question is: Is my understanding correct? If so, this implies that I cannot run multiple files using the same seed sequence simultaneously. Do you have any suggestions to circumvent this issue?
(I have hundreds of files to run.)