FASTQ Retrieval for NCBI SRA Upload


I kept running into a small problem: when wrapping up a manuscript, I would start uploading my raw FASTQ files to NCBI SRA. Because these FASTQ files often come from multiple runs, and many samples are filtered out before the final analysis, it was hard to pick out only the FASTQ files actually used in the study. To help with this, I wrote this short script, which uses your final mapping file (in my case I retrieved it from my phyloseq object in R) to filter demux.qza files until you are left with only your desired samples.

cd into whichever directory your Q2 files are stored in. I have a directory for each sequencing run. My imported sequences are always named "demux.qza".

cd mydirectory/

Before running this, make sure you have substituted in the name of your imported, demultiplexed sequence file (for me it is demux.qza, but change it to whatever you have called yours).
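Since the mapping file here came out of another program (phyloseq in R) rather than QIIME 2, it can be worth checking its first column header before filtering with it. The following is only a minimal sketch: `check_id_header` is a hypothetical helper name, it assumes a tab-separated file, and the list of accepted headers reflects my understanding of the QIIME 2 metadata documentation rather than anything in the original script.

```shell
# Sketch: check that the first column header of a mapping file is one QIIME 2 accepts.
# check_id_header is a hypothetical helper, not a QIIME 2 command.
check_id_header() {
    local file="$1"
    local header
    # First line, first tab-separated field, with any Windows carriage return removed
    header=$(head -n 1 "$file" | cut -f 1 | tr -d '\r')
    case "$header" in
        # Headers matched exactly (case-sensitive)
        "#SampleID"|"#Sample ID"|"#OTUID"|"#OTU ID"|"sample_name")
            echo "OK: $header"; return 0 ;;
    esac
    # Headers matched regardless of case
    case "$(printf '%s' "$header" | tr '[:upper:]' '[:lower:]')" in
        "id"|"sampleid"|"sample id"|"sample-id"|"featureid"|"feature id"|"feature-id")
            echo "OK: $header"; return 0 ;;
    esac
    echo "Not a recognised ID header: $header"
    return 1
}
```

It would be called as `check_id_header /absolute/path/to/mapping_file.tsv`, where the path is a placeholder for your own file.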


source activate "your qiime2 environment";

#if you are going into multiple directories, use the absolute path to your mapping file so you don't have to move it into each subdirectory
#make sure this has a Q2 compatible sample ID header if you retrieved it from another program
#the path below is a placeholder; replace it with the location of your own mapping file
metadata=/absolute/path/to/mapping_file.tsv

#filter the demux.qza file using the final metadata file from all runs (the one used for your analyses). Passing --p-exclude-ids keeps only the samples NOT in your mapping file. We will use this to build a secondary mapping file, which lets us pull only the FASTQ files we want for the SRA upload later
#note: --p-exclude-ids is a boolean flag, so it is written without a TRUE value
qiime demux filter-samples --i-demux demux.qza --m-metadata-file "$metadata" --o-filtered-demux demux_filtered_pt1.qza --p-exclude-ids
#generate visualization of this filtered file which will include a metadata file with sample names listed
qiime demux summarize --i-data demux_filtered_pt1.qza --o-visualization demux_filtered_pt1.qzv
#export this visualization to retrieve a .tsv file listing those samples; we can filter with it the same way as before, this time keeping only the desired files
qiime tools export --input-path demux_filtered_pt1.qzv --output-path demux
#now use this file to filter the original demux.qza again; this time the excluded samples are the ones NOT in our analysis, so only the samples IN our analysis are kept
qiime demux filter-samples --i-demux demux.qza --o-filtered-demux demux_filtered_SRA_submission.qza --m-metadata-file demux/per-sample-fastq-counts.tsv --p-exclude-ids
#export the filtered demux.qza file to generate a directory containing your desired FASTQ files for SRA upload
qiime tools export --input-path demux_filtered_SRA_submission.qza --output-path SRA_FASTQs
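If you have one directory per sequencing run, the steps above can be wrapped in a function and repeated for each run. This is a rough sketch rather than a tested pipeline: `filter_run_for_sra` and the `runs/*/` layout in the usage comment are hypothetical names, it assumes every run directory contains a file called demux.qza, and `--p-exclude-ids` is written as a bare flag on the assumption that your QIIME 2 release treats booleans that way.

```shell
# Sketch only: filter_run_for_sra is a hypothetical wrapper, not a QIIME 2 command.
filter_run_for_sra() {
    local run_dir="$1"     # directory holding one run's demux.qza
    local metadata="$2"    # absolute path to the final mapping file
    (
        # Work inside the run directory; the subshell leaves the caller's directory unchanged
        cd "$run_dir" || exit 1
        # Step 1: keep only the samples NOT listed in the mapping file
        qiime demux filter-samples --i-demux demux.qza --m-metadata-file "$metadata" \
            --o-filtered-demux demux_filtered_pt1.qza --p-exclude-ids &&
        # Step 2: summarize those samples and export the per-sample table
        qiime demux summarize --i-data demux_filtered_pt1.qza \
            --o-visualization demux_filtered_pt1.qzv &&
        qiime tools export --input-path demux_filtered_pt1.qzv --output-path demux &&
        # Step 3: exclude those samples from the original file, leaving the analysed ones
        qiime demux filter-samples --i-demux demux.qza \
            --m-metadata-file demux/per-sample-fastq-counts.tsv \
            --o-filtered-demux demux_filtered_SRA_submission.qza --p-exclude-ids &&
        # Step 4: export the remaining samples as FASTQ files
        qiime tools export --input-path demux_filtered_SRA_submission.qza \
            --output-path SRA_FASTQs
    )
}

# Example use, assuming one subdirectory per run under runs/ (hypothetical layout):
# for run in runs/*/; do filter_run_for_sra "$run" /absolute/path/to/mapping_file.tsv; done
```

The mapping file path needs to be absolute here because the function changes directory before calling qiime.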

I then do a quick mv command to compile all of these into a single directory.

mkdir -p FASTQs_for_SRA_upload
mv */*/SRA_FASTQs/*.gz FASTQs_for_SRA_upload/

All of the FASTQs for the samples included in your final metadata file will be in the SRA_FASTQs folder of each run (or in FASTQs_for_SRA_upload after the mv above). The important thing is that you can go into any run folder and more or less "search" only for files that are listed in your metadata file. It doesn't matter if your metadata file lists more samples than can be found in an individual run folder.
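Once everything is in one folder, a quick cross-check against the mapping file can confirm that no sample was missed. The sketch below is an illustration under a few assumptions: `missing_fastqs` is a hypothetical helper name, the mapping file is tab-separated with sample IDs in the first column, and the exported files follow the `<sample ID>_<barcode>_L<lane>_R<1 or 2>_001.fastq.gz` naming pattern.

```shell
# Sketch: print mapping-file sample IDs that have no FASTQ in a folder.
# missing_fastqs is a hypothetical helper, not part of QIIME 2.
missing_fastqs() {
    local metadata="$1"
    local fastq_dir="$2"
    local found id
    found=$(mktemp)
    # Recover the sample ID from each file name by stripping the trailing
    # _<barcode>_L<lane>_R<1 or 2>_001.fastq.gz part
    ls "$fastq_dir" \
        | sed -n -E 's/^(.+)_[^_]+_L[0-9]+_R[12]_001\.fastq\.gz$/\1/p' \
        | sort -u > "$found"
    # Walk the first column of the mapping file, skipping the header row
    # and comment lines such as #q2:types
    tail -n +2 "$metadata" | cut -f 1 | tr -d '\r' | grep -v '^#' \
        | while IFS= read -r id; do
            [ -n "$id" ] || continue
            # -F fixed string, -x whole-line match, -q quiet
            grep -Fxq -- "$id" "$found" || echo "$id"
        done
    rm -f "$found"
}
```

Running `missing_fastqs "$metadata" FASTQs_for_SRA_upload` after the mv should print nothing if every sample in the mapping file has at least one FASTQ; any ID it prints is worth tracing back to its run folder.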

Not sure if this is an issue for people out there, or if there is another way of taking care of this, but I wanted to throw it out there. I am not an experienced coder, so I'm sure someone could improve on this :sweat_smile:.