Rename sample IDs within fastq.gz files

jvoelschow · June 5, 2024, 7:38pm

Hi there, I'm trying to rename sequencing files and the sample IDs within the files that we received from a vendor. The files are paired-end fastq.gz, and the vendor returned them to us with the sample IDs really messed up. I know there's a way to change all the file names and the sample IDs within the files, but I can't seem to find it.

marcosandrew · June 6, 2024, 6:49am

Hi @jvoelschow,

To rename sample IDs within FASTQ files, you can use the sed command in a shell script. For example:

zcat old_sample.fastq.gz | sed 's/old_sample/new_sample/g' | gzip > new_sample.fastq.gz

This command decompresses the file, replaces the old sample ID with the new one, and recompresses it. For more details, refer to the official GNU sed documentation.
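
One caveat: the substitution above is applied to every line, so a short ID could in principle also match inside a quality string. Here is a minimal sketch that limits the replacement to header lines, assuming standard four-line FASTQ records. The IDs and file names are placeholders, and `rename_headers` is just a helper name chosen for this example.

```shell
# Placeholder IDs; adjust to your data.
old_id="old_sample"
new_id="new_sample"

# Replace the ID only on FASTQ header lines (lines 1, 5, 9, ...),
# assuming every record is exactly four lines long.
# Note: awk treats old_id as a regular expression, so escape
# characters such as "." if they appear in your IDs.
rename_headers() {
  gzip -dc "$1" \
    | awk -v old="$old_id" -v new="$new_id" \
        'NR % 4 == 1 { gsub(old, new) } { print }' \
    | gzip > "$2"
}

# Example call (hypothetical file names):
# rename_headers old_sample_R1.fastq.gz new_sample_R1.fastq.gz
```

For paired-end data you would run this once per file (R1 and R2) with the same IDs, so the read names stay in sync between the two files.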

Thanks

colinbrislawn · June 6, 2024, 2:26pm

Hello Julie,

When fixing raw files, I like using BBTools, which you can install using conda.

conda install -c bioconda bbmap

rename.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> prefix=<>
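
For one sample pair, the call might be filled in roughly as sketched below. The file names and the prefix are made up for illustration, and only the in/in2/out/out2/prefix parameters from the usage line are used. The snippet just prints the command so you can check it before running anything.

```shell
# Hypothetical file names and prefix; replace with your own.
r1="vendor_R1.fastq.gz"
r2="vendor_R2.fastq.gz"
prefix="sampleA"

# Build the rename.sh call from the in/in2/out/out2/prefix parameters.
cmd="rename.sh in=$r1 in2=$r2 out=${prefix}_R1.fastq.gz out2=${prefix}_R2.fastq.gz prefix=$prefix"

# Dry run: print the command for review, then paste it into the
# shell once it looks right.
echo "$cmd"
```

Processing both files of a pair in one call keeps the read names in R1 and R2 consistent with each other.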

Docs here: https://github.com/BioInfoTools/BBMap/blob/master/sh/rename.sh

#!/bin/bash

usage(){
echo "
Written by Brian Bushnell
Last modified April 1, 2020

Description:  Renames reads to <prefix>_<number> where you specify the prefix
and the numbers are ordered.  There are other renaming modes too.
If reads are paired, pairs should be processed together; if reads are 
interleaved, the interleaved flag should be set.  This ensures that if a
read number (such as 1: or 2:) is added, it will be added correctly.

Usage:  rename.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> prefix=<>

in2 and out2 are for paired reads and are optional.
If input is paired and there is only one output file, it will be written interleaved.

Parameters:
prefix=             The string to prepend to existing read names.

This file has been truncated.

llenzi · June 7, 2024, 8:47am

Hi @jvoelschow.
A third way, valid if you will perform the analysis within QIIME 2, is to import the sequences using a manifest file. In the manifest you associate the correct sample name, in the 'sample-id' column, with the FASTQ files that carry the wrong names. So there is no need to change the files at all!
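
A minimal sketch of what such a manifest could look like for paired-end data, written from the shell. The sample IDs, directory, and vendor file names are made up. The column names follow the QIIME 2 paired-end manifest format (PairedEndFastqManifestPhred33V2), which is tab-separated and expects absolute paths. `make_manifest` is just a helper name for this example.

```shell
# Hypothetical directory; manifest paths must be absolute.
data_dir="/data/vendor_run"

# Write a tab-separated manifest: one row per sample, with the correct
# ID in the first column and the misnamed vendor files in the others.
make_manifest() {
  printf 'sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n'
  printf 'soil-01\t%s/XYZ123_R1.fastq.gz\t%s/XYZ123_R2.fastq.gz\n' "$data_dir" "$data_dir"
  printf 'soil-02\t%s/XYZ124_R1.fastq.gz\t%s/XYZ124_R2.fastq.gz\n' "$data_dir" "$data_dir"
}

make_manifest > manifest.tsv
```

The manifest could then be imported with something like `qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-path manifest.tsv --output-path demux.qza --input-format PairedEndFastqManifestPhred33V2`. This assumes Phred33 quality scores, so check the importing docs for your QIIME 2 version.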
Cheers

tripitakit · June 29, 2024, 6:36am

I would favor this third approach in order to keep your samples' data traceable in your analytical pipeline, especially when you have many samples and may need to review it all in the future. IMHO it is much better to be explicit at every step, so you don't waste time trying to recollect old, volatile memories.