Hello! I have a problem that might be dumb, but I can't wrap my head around the best way to merge two runs together for denoising/analysis, whether I should merge after denoising, or really where to start.
I am analyzing data for a single study. The lab that prepared the 16S library had more samples than unique index primers, so they split the samples into two runs. Run 1 uses custom index primers #1 through #116, and Run 2 reuses the same index primers, #1 through #39. Because they only had 116 unique index primers, they sequenced the data this way, and it's paired-end data (read 1 and read 2 for each sample).
Here's an example: this is a sample from patient A01 that uses index primer #1 and was part of run 1 (the first library). Fastq files from run 1 are named as follows: RUN1-001-A02_S001_L001_R1_001.fastq.gz and RUN1-001-A02_S001_L001_R2_001.fastq.gz
Below is the sample from patient C49, which was also tagged with the same index primer #1 but was part of run 2 (the second library). Fastq files from run 2 are named: RUN2-01-C49_S01_L001_R1_001.fastq.gz and RUN2-01-C49_S01_L001_R2_001.fastq.gz
My question is when can I make a folder that has all the joined reads from library 1 and library 2 together, so that the samples can be analyzed as one study? I did cutadapt and joined the reads for the libraries separately, but should I deblur it all together or join the libraries after deblur? How will I know which files belong to which patient if they have the same index primers?
Thank you so much in advance!
Once the samples are demultiplexed, it is not important which index primers were used to produce them. So, if you are planning to use Deblur to denoise them, you can work with all the samples at the same time.
If you use a manifest file to import the samples into QIIME 2, you will associate a sample name with each pair of sequence files; all the information you need for each sample ID will then be found in the metadata file (patient of origin, age, sex, treatment and so on).
As a side note, I would keep a metadata column to track which run each sample came from; it will be important for evaluating possible run biases.
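In case it is useful, here is a minimal Python sketch of building such a manifest together with a starter metadata file that has a run column. It assumes the file names follow the pattern of the two examples in the question, and it uses everything before the first underscore (e.g. RUN1-001-A02) as the sample ID; the function names and output paths are made up for illustration. The manifest header is the one used by the paired-end V2 manifest format.

```python
import csv
import re
from pathlib import Path

# Assumed naming pattern, taken from the two example files in the question:
#   RUN1-001-A02_S001_L001_R1_001.fastq.gz
FASTQ_RE = re.compile(
    r"^(?P<sample>(?P<run>RUN\d+)-\d+-[^_]+)_S\d+_L\d+_R(?P<read>[12])_001\.fastq\.gz$"
)


def collect_pairs(fastq_dir):
    """Map sample ID -> {"run": ..., "R1": path, "R2": path} (hypothetical helper)."""
    samples = {}
    for path in sorted(Path(fastq_dir).glob("*.fastq.gz")):
        m = FASTQ_RE.match(path.name)
        if m is None:
            continue  # skip files that do not follow the assumed pattern
        entry = samples.setdefault(m["sample"], {"run": m["run"]})
        entry["R" + m["read"]] = str(path.resolve())
    return samples


def write_manifest_and_metadata(samples, manifest_path, metadata_path):
    """Write a tab-separated manifest and a metadata file with a 'run' column."""
    with open(manifest_path, "w", newline="") as mf, \
         open(metadata_path, "w", newline="") as md:
        manifest = csv.writer(mf, delimiter="\t", lineterminator="\n")
        metadata = csv.writer(md, delimiter="\t", lineterminator="\n")
        manifest.writerow(
            ["sample-id", "forward-absolute-filepath", "reverse-absolute-filepath"]
        )
        metadata.writerow(["sample-id", "run"])
        for sample_id, entry in sorted(samples.items()):
            if "R1" not in entry or "R2" not in entry:
                raise ValueError(f"missing mate file for {sample_id}")
            manifest.writerow([sample_id, entry["R1"], entry["R2"]])
            metadata.writerow([sample_id, entry["run"]])
```

Because the run prefix is part of the sample ID, the two samples that share index primer #1 stay distinct (RUN1-001-A02 and RUN2-01-C49). The remaining metadata columns (patient, age, sex, treatment) would still need to be added by hand or joined in from the lab's records.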
Hope it helps
Thank you for the explanation, that helps tremendously! I will keep separate metadata files in addition to one combined file that includes the metadata for both runs. I do have one quick follow up question!
I'm following QIIME 2 scripts left behind by a former post-doc (unfortunately they're not annotated with the reasoning). When I import the raw .fastq.gz files (read 1 and read 2) for each sample, the first script runs these steps in this order: qiime tools import, cutadapt trim-paired to trim out primers, qiime vsearch to join pairs, and qiime quality-filter q-score. The final output is a joined.filtered.stats.qza that is demux summarized into a joined.filtered.qzv.
My question is - should I perform this on each run separately and THEN bring them together for the next script (which is the deblur denoise 16s)? Once I've made an artifact (several are made in this script) how do I add more samples to the artifact? I'm not sure where I bring the reads of each run together to make one big artifact with everything?
Because you are planning to use Deblur to denoise them, you can load all the samples into the initial artifact. Deblur was created to support co-analysis of samples from different runs. As I mentioned earlier, keep the run of origin as metadata information, so you will be able to evaluate possible biases due to the different runs.
For the record, if you would like to use DADA2 for the denoising step, you have to denoise each run separately, and then use the qiime feature-table merge command to merge all the feature tables into a final artifact (there is a merge-seqs command to merge the sequences too!).
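As a rough sketch of that merge step, the snippet below builds the two command lines (qiime feature-table merge for the tables, qiime feature-table merge-seqs for the representative sequences) and can optionally run them. The .qza file names are placeholders, not files from this thread, and the helper names are made up for illustration.

```python
import subprocess


def merge_commands(tables, rep_seqs,
                   merged_table="merged-table.qza",
                   merged_seqs="merged-rep-seqs.qza"):
    """Build the argv lists for merging per-run feature tables and sequences."""
    table_cmd = ["qiime", "feature-table", "merge"]
    for table in tables:
        table_cmd += ["--i-tables", table]  # one --i-tables per run
    table_cmd += ["--o-merged-table", merged_table]

    seqs_cmd = ["qiime", "feature-table", "merge-seqs"]
    for seqs in rep_seqs:
        seqs_cmd += ["--i-data", seqs]  # one --i-data per run
    seqs_cmd += ["--o-merged-data", merged_seqs]
    return [table_cmd, seqs_cmd]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder file names for the per-run denoising outputs.
    for cmd in merge_commands(
        ["run1-table.qza", "run2-table.qza"],
        ["run1-rep-seqs.qza", "run2-rep-seqs.qza"],
    ):
        print(" ".join(cmd))
```

Printing the commands first, as the example does, makes it easy to check them before calling run_all inside an activated QIIME 2 environment.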
Now, in principle you could use Deblur to denoise each run separately and then merge the results, but that should not give different results from denoising everything at the same time with Deblur. However, the only way to be sure is to compare the results of the two approaches, ideally with the help of a known sample (e.g. a positive control).
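One possible way to make that comparison concrete, sketched in Python: for every sample present in both results, compute the Jaccard similarity of the sets of detected features. This assumes each feature table has already been exported and loaded into a plain dict of sample ID to {feature ID: count}; that loading step, the function names, and the choice of Jaccard similarity are assumptions for illustration, not something prescribed by Deblur or QIIME 2.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of feature IDs."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def compare_tables(together, separate):
    """Per-sample Jaccard similarity of detected features.

    Both arguments map sample ID -> {feature ID: count}; only samples
    present in both tables are compared, and zero counts are ignored.
    """
    report = {}
    for sample in sorted(set(together) & set(separate)):
        seen_together = {f for f, n in together[sample].items() if n > 0}
        seen_separate = {f for f, n in separate[sample].items() if n > 0}
        report[sample] = jaccard(seen_together, seen_separate)
    return report
```

A value of 1.0 for a sample means both approaches detected exactly the same features in it; lower values for the positive control would be the first thing to look at.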