Combining fast.gz files

d.davidburmeister · July 12, 2018, 9:46pm

I am trying to combine 2 separate runs of samples into one, and am using the following

gzcat file1.fastq.gz file2.fastq.gz | gzip > new file.fastq.gz

My question is if I can batch these somehow. I have over 150 different samples with 2 forward reads and 2 reverse reads for each sample. They are named similarly (i.e., sample1ID.fastq.gz and sample1ID2.fastq.gz). Is there a command line that can recognize the similarity, or do I have to do this individually some 300 times?

Thanks for any insight!

David

colinbrislawn · July 13, 2018, 5:53pm

Hello David,

Yes!

The great thing about .gz files is that you can combine them directly. For example
cat file1.fastq.gz file2.fastq.gz file3.fastq.gz > files1-3.fastq.gz
that will merge all three files.

Or ever better
cat *fastq.gz > all.fastq.gz
will merge every file that ends with fastq.gz.

Let me know if this works for you!
Colin

d.davidburmeister · July 16, 2018, 3:27pm

Colin,

Thanks for your reply!

I have ~600 files that represents 2 forward and reverse reads for ~150 samples. i.e., I want to marge 2 files together, but I want to do that 300 times. I don't want to merge them all together... if it helps I have pasted 5 of the file names below but have many more... for example I want to merge BB_S28_L001_R1_001.fastq.gz and BB_S28_L001_R1_001.fastq2.gz, and so on.

Maybe something like:

Cat *."sampleX.fastq.gz" *."sampleX2.fastq.gz" > sampleXcombined.fastq.gz

Let me know, and THANKS!!

David

BB_S28_L001_R1_001.fastq.gz merge with BB_S28_L001_R1_001.fastq2.gz

BB_S28_L001_R2_001.fastq.gz merge with BB_S28_L001_R2_001.fastq2.gz

RS49_S112_L001_R1_001.fastq.gz merge with RS49_S112_L001_R1_001.fastq2.gz

RS50_S113_L001_R1_001.fastq.gz merge with RS50_S113_L001_R1_001.fastq2.gz

B_S2_L001_R1_001.fastq.gz merge with B_S2_L001_R1_001.fastq2.gz

colinbrislawn · July 16, 2018, 4:59pm

Ah ok! I think we can write a bash script that does that.

First, may I ask why you have two pairs of files for each sample? If these are from different MiSeq runs, it may be helpful to process these as separate samples. You can always combine these samples after building your feature-abundance table, and you could also test and control for batch effect caused by the two MiSeq runs.

Let me know what you think we should try. There are multiple different ways we could do this.

Colin

d.davidburmeister · July 16, 2018, 5:37pm

Thanks for all of your help!

Isolated DNA from these samples were simply just run 2 times in order to increase granularity and the likelihood we could identify down to the species level. So 2 Miseq runs from the same sample. These are clinical samples that had a lot go into them ($, time, etc) and we wanted to maximize the amount of data.

So, essentially, since these samples are the same I figured to combine them on the front end. Since we already have several hundred unique samples and lots of data to consider I thought it might make it easier on the back end. But if there is some advantage to waiting I am unaware of, let me know what you think....

David

colinbrislawn · July 16, 2018, 6:00pm

Hello David,

Thanks for this additional detail.

I think combining the reps now or combining them later are both pretty easy. In the final paper, you may want to merge technical reps and only discuss biological reps, and later on you can just merge samples using this plugin.

The big advantage is that you can test and control for batch effect caused by the two MiSeq runs. While the Illumina MiSeq is pretty consistent, you can test for batch effects using the beta-group-significance plugin.

So when reviewer 3 asks "What makes you think you can merge your miseq runs?!? ", you can say, "the two miseq runs were not significantly different (permanova, 999 reps, r = 0.03, p=0.09)."

Colin

ebolyen · July 16, 2018, 6:13pm

Another important consideration would be the error profiles, which could vary between runs. If you were using DADA2 for example, you would want to merge after that step so that the independent runs get independent error profiles.

d.davidburmeister · July 16, 2018, 6:28pm

Ok... sorry for multiple question- I am a newbie for sure.

Would I need to set up any additional info in the metadata file to combine later?

colinbrislawn · July 16, 2018, 6:39pm

You are doing great and asking all of the right questions!

You could add a column like MiSeqRun with values one and two so that you can test for differences between them.

When it comes to merging, you would merge based on the unique patient_sample_ID, or whatever column contains BB_S28 information in your sample names.

I hope this helps!
Colin

d.davidburmeister · July 18, 2018, 11:56pm

This helps greatly.

Does the fastq column in the metadata file have to be exactly the name of the fastq file? Currently in that column I only have the beginning of the filename ("BB") which would not distinguish the 2 fastq runs...

David

thermokarst · July 19, 2018, 12:52pm

I don't think @colinbrislawn is proposing that you add a column of fastq filenames to the metadata file, but rather, a column with values that let you know which run that sample was found in. The fastq filenames won't be very interesting here, since the reads will be merged (which means you will have half as many samples as you did files) by the time you do anything with this metadata column (as @colinbrislawn suggested, you could run some stats between the multiple runs).

For an example of a multi-run metadata file, check out the FMT tutorial.

d.davidburmeister · July 19, 2018, 3:29pm

Ok... my confusion lies in the fact that the two runs were essentially identical (i.e., every sample was included in both runs). So in order to do that, I would have to list each sample twice, and name them something unique, yes?

Do you think that worth it? That's why my initial thought was to just merge fastqs from each of the 2 runs to have all reads from a sample together...

I hope I am making sense- Thanks again for all of your help!

David

Nicholas_Bokulich · July 19, 2018, 3:38pm

correct. So for example you might structure your metadata file like this:

#SampleID   Sample   Replicate   Run
S1A           S1         A        A
S1B           S1         B        B
S2A           S2         A        A
S2B           S2         B        B

If you do not see any differences between runs, then you are safe to merge and use your original metadata file (with sample IDs S1, S2, etc).

It is absolutely worth following @colinbrislawn's advice. Batch effects can be a serious issue; there has been a lot of talk about batch effects elsewhere on the forum... those posts will offer good rationale for the need to check for batch effects, how to test, and how to merge.

Fortunately for you, you are just trying to collect more data not compare between samples sequenced on separate runs so if worse comes to worse you could always just use one run if there are irreparable batch effects...

good luck!

d.davidburmeister · July 20, 2018, 12:17pm

Nicholas- thanks for your help.

Will do- I have experienced batch effect before.

So I should summarize a feature table while excluding sequences based on sampling depth, then test for the batch effect. If there is none, I should combine the files based from the 2 runs, then tabulate the table to give the DNA sequences for each sample.

Does that sound reasonable? If so, I may reach out to troubleshoot the command line for doing so...

Seriously- thank you and everyone else for your help!

David

Nicholas_Bokulich · July 20, 2018, 2:53pm

See the search results I linked to above — there are lots of ideas that @colinbrislawn and others have put forth on other forum posts. As a general batch-effect-testing protocol:

keep all samples separate during processing through to feature table. Differences in sampling depth between runs are not important, so long as they are appropriately normalized (e.g., rarefied to even depth for alpha and beta diversity analyses)
run qiime diversity core-metrics (or core-metrics-phylogenetic) then beta-group-significance (for the PERMANOVA test @colinbrislawn mentioned) and alpha-group-significance. For both of these use will be testing to make sure that there are no differences related to sequencing run.
Depending on the structure of your data, you may also want to use methods like qiime longitudinal pairwise-distance and pairwise-difference to test for paired differences between runs; this may be useful for detecting subtle batch effects (e.g., directional shifts in beta/alpha diversity that are overwhelmed by other factors).

merge away

sounds great

d.davidburmeister · July 27, 2018, 12:57am

All,

I am truly sorry to keep bugging you. I promise I tried to troubleshoot this.

I have put all fastq.gz files into one folder, and renamed them with an "A" and a "B".

example- AA_S27_L001_R1_001.fastq was the name of both files (from both runs)- I made it
AAA_S27_L001_R1_001.fastq.gz (first run)and
AAB_S27_L001_R1_001.fastq.gz (second run)

I put them into the metadata file, with 2 new columns (run number, and fastq_id which would be "AA" for the above) to look at batch effect, and to combine them later, respectively.

I am now not able to even import the files (which should have nothing to do with the metadata file, correct?) . I am inputting-

qiime tools import
--type 'SampleData[PairedEndSequencesWithQuality]'
--input-path CombinedRuns/
--source-format CasavaOneEightSingleLanePerSampleDirFmt
--output-path demux-paired-end.qza

And the error I am getting is:

Unrecognized file (CombinedRuns/RS76B_S158_L001_R1_001.fastq2.gz) for CasavaOneEightSingleLanePerSampleDirFmt

It is not always the same file unrecognized, but it is always from the second run ("B"). I am thinking it has to do with the barcode identifier or lane number?

Please help this newbie, and thanks for your patience!

David