I am trying to combine 2 separate runs of samples into one, and am using the following
gzcat file1.fastq.gz file2.fastq.gz | gzip > new file.fastq.gz
My question is if I can batch these somehow. I have over 150 different samples with 2 forward reads and 2 reverse reads for each sample. They are named similarly (i.e., sample1ID.fastq.gz and sample1ID2.fastq.gz). Is there a command line that can recognize the similarity, or do I have to do this individually some 300 times?
I have ~600 files that represents 2 forward and reverse reads for ~150 samples. i.e., I want to marge 2 files together, but I want to do that 300 times. I don’t want to merge them all together… if it helps I have pasted 5 of the file names below but have many more… for example I want to merge BB_S28_L001_R1_001.fastq.gz and BB_S28_L001_R1_001.fastq2.gz, and so on.
Ah ok! I think we can write a bash script that does that.
First, may I ask why you have two pairs of files for each sample? If these are from different MiSeq runs, it may be helpful to process these as separate samples. You can always combine these samples after building your feature-abundance table, and you could also test and control for batch effect caused by the two MiSeq runs.
Let me know what you think we should try. There are multiple different ways we could do this.
Isolated DNA from these samples were simply just run 2 times in order to increase granularity and the likelihood we could identify down to the species level. So 2 Miseq runs from the same sample. These are clinical samples that had a lot go into them ($, time, etc) and we wanted to maximize the amount of data.
So, essentially, since these samples are the same I figured to combine them on the front end. Since we already have several hundred unique samples and lots of data to consider I thought it might make it easier on the back end. But if there is some advantage to waiting I am unaware of, let me know what you think…
I think combining the reps now or combining them later are both pretty easy. In the final paper, you may want to merge technical reps and only discuss biological reps, and later on you can just merge samples using this plugin.
The big advantage is that you can test and control for batch effect caused by the two MiSeq runs. While the Illumina MiSeq is pretty consistent, you can test for batch effects using the beta-group-significance plugin.
So when reviewer 3 asks "What makes you think you can merge your miseq runs?!? ", you can say, “the two miseq runs were not significantly different (permanova, 999 reps, r = 0.03, p=0.09).”
Another important consideration would be the error profiles, which could vary between runs. If you were using DADA2 for example, you would want to merge after that step so that the independent runs get independent error profiles.
Does the fastq column in the metadata file have to be exactly the name of the fastq file? Currently in that column I only have the beginning of the filename (“BB”) which would not distinguish the 2 fastq runs…
I don’t think @colinbrislawn is proposing that you add a column of fastq filenames to the metadata file, but rather, a column with values that let you know which run that sample was found in. The fastq filenames won’t be very interesting here, since the reads will be merged (which means you will have half as many samples as you did files) by the time you do anything with this metadata column (as @colinbrislawn suggested, you could run some stats between the multiple runs).
For an example of a multi-run metadata file, check out the FMT tutorial.
Ok… my confusion lies in the fact that the two runs were essentially identical (i.e., every sample was included in both runs). So in order to do that, I would have to list each sample twice, and name them something unique, yes?
Do you think that worth it? That’s why my initial thought was to just merge fastqs from each of the 2 runs to have all reads from a sample together…
I hope I am making sense- Thanks again for all of your help!
correct. So for example you might structure your metadata file like this:
#SampleID Sample Replicate Run
S1A S1 A A
S1B S1 B B
S2A S2 A A
S2B S2 B B
If you do not see any differences between runs, then you are safe to merge and use your original metadata file (with sample IDs S1, S2, etc).
It is absolutely worth following @colinbrislawn’s advice. Batch effects can be a serious issue; there has been a lot of talk about batch effects elsewhere on the forum… those posts will offer good rationale for the need to check for batch effects, how to test, and how to merge.
Fortunately for you, you are just trying to collect more data not compare between samples sequenced on separate runs so if worse comes to worse you could always just use one run if there are irreparable batch effects…
So I should summarize a feature table while excluding sequences based on sampling depth, then test for the batch effect. If there is none, I should combine the files based from the 2 runs, then tabulate the table to give the DNA sequences for each sample.
Does that sound reasonable? If so, I may reach out to troubleshoot the command line for doing so…
Seriously- thank you and everyone else for your help!
(Matthew Ryan Dillon)
See the search results I linked to above — there are lots of ideas that @colinbrislawn and others have put forth on other forum posts. As a general batch-effect-testing protocol:
keep all samples separate during processing through to feature table. Differences in sampling depth between runs are not important, so long as they are appropriately normalized (e.g., rarefied to even depth for alpha and beta diversity analyses)
run qiime diversity core-metrics (or core-metrics-phylogenetic) then beta-group-significance (for the PERMANOVA test @colinbrislawn mentioned) and alpha-group-significance. For both of these use will be testing to make sure that there are no differences related to sequencing run.
Depending on the structure of your data, you may also want to use methods like qiime longitudinal pairwise-distance and pairwise-difference to test for paired differences between runs; this may be useful for detecting subtle batch effects (e.g., directional shifts in beta/alpha diversity that are overwhelmed by other factors).