Looping through several files

asajoh · March 2, 2023, 11:29am

We have recently downloaded taxonomy data from NCBI in order to build a classifier. We had to download the data piecewise because NCBI rejects requests that are too large. As part of tidying up these data, we've used RESCRIPt to, for example, filter and dereplicate sequences. We've done this for the individual downloads one by one, but it's been a bit tedious, and we expect to need to process several files in the same way again in the future.

Is there a way to loop through files in QIIME 2 the way you would in, for example, R or Python? Or is it possible to run QIIME 2 code where you specify a directory instead of a file, so that QIIME 2 processes all the files in that directory?

timanix · March 2, 2023, 11:41am

Hello!
Good question, with several possible solutions. One can use bash scripting to loop through several files in the terminal. Another option is to use Jupyter Lab/Notebook to run a mixture of Python code and terminal commands (just prefix a terminal command with "!" in a Jupyter notebook cell and it will work). Jupyter or its kernel can be installed inside the QIIME 2 environment.

Example of Jupyter code:

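import os

# BIG is assumed to hold the path to the base directory (set earlier in the notebook)
for run in os.listdir(f'{BIG}/Import'):

    demux = f'{BIG}/{run}_demux.qza'
    deqzv = f'{BIG}/{run}_demux.qzv'  # path for the matching visualization (unused in this excerpt)

    !qiime tools import \
        --type 'SampleData[PairedEndSequencesWithQuality]' \
        --input-path $BIG/Import/$run \
        --input-format CasavaOneEightSingleLanePerSampleDirFmt \
        --output-path $demux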
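
If you would rather run the loop as a plain Python script than from a notebook, a minimal sketch along the same lines might look like this. The function names `build_import_cmd` and `import_all` and the `dry_run` switch are made up for illustration; the `qiime tools import` options are copied from the cell above, and `base` plays the role of `BIG`.

```python
import os
import subprocess


def build_import_cmd(base, run):
    """Return the argument list for importing one run directory."""
    return [
        "qiime", "tools", "import",
        "--type", "SampleData[PairedEndSequencesWithQuality]",
        "--input-path", os.path.join(base, "Import", run),
        "--input-format", "CasavaOneEightSingleLanePerSampleDirFmt",
        "--output-path", os.path.join(base, f"{run}_demux.qza"),
    ]


def import_all(base, dry_run=True):
    """Build one import command per entry in <base>/Import.

    With dry_run=True (the default in this sketch) nothing is executed,
    so the commands can be inspected first.
    """
    commands = []
    for run in sorted(os.listdir(os.path.join(base, "Import"))):
        cmd = build_import_cmd(base, run)
        commands.append(cmd)
        if not dry_run:
            # check=True raises CalledProcessError if qiime exits non-zero
            subprocess.run(cmd, check=True)
    return commands
```

Passing the arguments as a list means no shell quoting is needed, so the square brackets in the `--type` value do not have to be wrapped in quotes.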

In Jupyter one can also use QIIME 2's Python API (q2-api), so that all commands can be run as Python functions; check the "Moving Pictures" tutorial for the different interfaces.
Best,

asajoh · March 2, 2023, 2:16pm

Hi,

thanks so much. I'm a bit of a noob. I have a directory with many different files (I should probably put them in subdirectories, but there you are), so we've tried to do a bit of pattern matching, to get only the files that we need.

Here is what we tried

dir = 'path/to/files'
pattern = "string in files"
matching_files = [f for f in os.listdir(dir) if pattern in f]

for group in matching_files:
   clean_refseqs = f'path/to/files/{group}_clean-refseq.qza'

   !qiime rescript cull-seqs \
   --i-sequences group \  #we also tried $path/to/files/$group
   --p-num-degenerates 5 \
   --p-homopolymer-length 10 \
   --o-clean-sequences $clean_refseqs

We can't get qiime to accept the input. It just tells us that it's an invalid path regardless of whether we just put "group" or the full path with dollar signs around it. My problem is most likely that I don't quite understand what the dollar signs mean and also the f' that you put in front of the path.

timanix · March 2, 2023, 2:33pm

Looks good to me!

Did you also try it like this?

dir = 'path/to/files'
pattern = "string in files"
matching_files = [f for f in os.listdir(dir) if pattern in f]

for group in matching_files:
   clean_refseqs = f'{dir}/{group}_clean-refseq.qza'

   !qiime rescript cull-seqs \
     --i-sequences $dir/$group \
     --p-num-degenerates 5 \
     --p-homopolymer-length 10 \
     --o-clean-sequences $clean_refseqs
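
One thing to watch with the loop above: each `group` still carries its file extension, so the output ends up named like `something.qza_clean-refseq.qza`. A possible way to tidy that up, assuming the inputs all end in `.qza`, is sketched below. The helper name `plan_outputs` and its `suffix` argument are hypothetical, not part of QIIME 2 or RESCRIPt.

```python
from pathlib import Path


def plan_outputs(directory, pattern, suffix="_clean-refseq.qza"):
    """Pair each matching .qza file with an output path in the same directory."""
    pairs = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file() or path.suffix != ".qza":
            continue  # skip subdirectories and non-artifact files
        if path.name.endswith(suffix):
            continue  # skip outputs left over from an earlier run
        if pattern in path.name:
            # path.stem drops the ".qza" extension, e.g. "a.qza" -> "a"
            pairs.append((path, path.with_name(path.stem + suffix)))
    return pairs
```

Each `(input, output)` pair can then be fed to the `qiime rescript cull-seqs` call in place of `$dir/$group` and `$clean_refseqs`. Skipping files that already end in the suffix keeps a second run from picking up its own earlier results.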

asajoh · March 2, 2023, 2:39pm

We put some test files in their own directory and then tried your code exactly as described, but it would only work after we removed the initial dollar sign from the input variable in the qiime code. Not sure what that's about. We're just waiting for it to finish running (it's apparently super taxing), and then we'll try what you have suggested above to see if that works.

Also, thank you so much for your help.

timanix · March 2, 2023, 2:46pm

The most important thing is that it is working.
You do not need a dollar sign when you write a path out literally; you need it only when you refer to a Python variable.

All of the commands below are equivalent:

dir = 'path/to/files'
file = 'file.qza'

!qiime \
  --input path/to/files/file.qza

!qiime \
  --input $dir/file.qza

!qiime \
  --input $dir/$file
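
As for the `f'` in front of a path: that is a Python formatted string literal (f-string), where `{name}` inside the quotes is replaced by the value of the variable `name`. The `$name` form does the same job, but only on a line that starts with `!` in Jupyter/IPython. A small pure-Python sketch of the f-string side, with the variable renamed to `directory` purely for illustration:

```python
# Called `dir` in the thread; renamed here to avoid shadowing Python's built-in dir()
directory = "path/to/files"
file = "file.qza"

# Three ways of writing the same path as a Python string
literal = "path/to/files/file.qza"
half = f"{directory}/file.qza"   # {directory} is replaced by its value
full = f"{directory}/{file}"     # both parts come from variables

print(literal == half == full)   # prints True
```

Without the `f` prefix, `'{directory}/{file}'` would stay exactly as typed, curly braces included.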

asajoh · March 2, 2023, 2:49pm

Aah, thank you so much! I think I get it now
