Merging Reads from 2 lanes of the same sample

(Alicia R) #1


I have about 500 samples that were run on HiSeq. Because of a weird naming issue they didn’t have lane #s and thus were not easily uploaded via the Casava 1.8 format. The sequencing facility was able to fix this, but now each sample has two forward reads (one from lane1 and one from lane 2) and 2 reverse reads. They’re the same samples, but not duplicates of each other so truly need to have the data merged to have the full info for each sample. I’m wondering what the best practice for merging this type of data. I did see a previous topic (Best way to merge or group runs/samples) that was similar mention using the function ‘feature-table group’ however, I see that this function is no longer current. Is there a newer, better option, or is this still the best method.

Thank you!
Alicia Reigel

(Colin Brislawn) #2

Hello Alicia,

I’m glad you found that old post of mine. I hope your 2019 is off to a good start.

Turns out that function is still there:
(The qiime documentation breaks old links, but many of the functions still work.)

My advice from 2018 still holds: I recommend keeping all reps separate so you can measure batch effects, then merging reps if you want.

Let me know what you find!

(Alicia R) #3

Thank you! Thats good to know that the function still works.

I may have a question about the metadata files, but I need to take some time to look at the feature-table group documentation first. But I’m curious of having 2 R1 and 2 R2 for each sample will change the way I have to make my metadata file?

(Nicholas Bokulich) #4

It sounds like you want to use merge, not group, since your samples are on two different sequencing runs.

If the same samples used the same barcodes in each run, then you can use the same sample metadata file to demultiplex each run, then denoise them, then use qiime feature-table merge to merge the tables (and merge-seqs for the sequences).

If each run used different barcodes, then you will need to make two separate sample metadata files that reflect the different barcodes. After merging you can just keep using one of these since the sample IDs should be the same and the barcode does not matter after that.

(Alicia R) #5


Thank you for your response. I don’t know how this affects anything, but my samples were not run on two different runs. They’re from the exact same run but because of a weird naming issue they had to be divided up. They should definitely have the same barcodes.

Does merge still seem like the appropriate options since there shouldn’t be any batch effects?


(Nicholas Bokulich) #6

No, @colinbrislawn’s advice to use group is still best… I just wanted to check since your mention of two files for each sample made me think you may have multiple runs. merge will not work, since your reads are in the same run, will be processed together, and will produce a single feature table.

So you will need to make two metadata files: one containing the full list of samples (including duplicates), and the real sample names that you want to give all of these duplicates using the group command. Second, you will make a metadata file that has the correct names (that will be applied by the group command) and sample metadata you will use for analysis.

(Alicia R) #7

Ok, thanks. I’m going to be messing around with those metadata files as soon as I can get the below problem with the importing step fixed.

This is the command I used.

qiime tools import --type ‘SampleData[PairedEndSequencesWithQuality]’ --input-path /project/mhellbe/areige1/RawReadsJan31_2019/GraysReef/reads --source-format CasavaOneEightSingleLanePerSampleDirFmt --output-path demux-paired-end.qza

And I got this error:
There was a problem importing /project/mhellbe/areige1/RawReadsJan31_2019/GraysReef/reads:

/project/mhellbe/areige1/RawReadsJan31_2019/GraysReef/reads is not a(n) CasavaOneEightSingleLanePerSampleDirFmt:

Duplicate samples in forward reads: {‘GRTS-1’, ‘GRTS-5’, ‘GRTS-6’, ‘GRTS-2’, ‘GRTS-4’, ‘GRTS-3’}

It makes sense that there are duplicate reads because I do have two of each forward and reverse, but with different lane #s. So is it the CasavaOneEightSingleLanePerSampleDirFmt that is wrong? I just assumed that the Lane # difference in their labels would be enough to assume them as different samples.

(Nicholas Bokulich) #8

Yeah different lane #s is not enough to avoid name clashes, since the sample name is being pulled from the prefix of each filename. You could:

  1. Use manifest format
  2. rename those files so they are still Casava format but have different names, e.g., name one pair “GRTS-1A”.
  3. What I would personally do: concatenate the fastq files, since these are in fact duplicates of the same sample and barcode.

(Alicia R) #9

Yes, I’ve been thinking about concatenating them as well. Its all much more hassle to have them as separate files.

(Alicia R) #10

Hi Nicholas,

So as a test I used just a single sample and I concatenated the fw and rv reads. That ran through the import fine. I did get an usual comment on the interactive quality plot (see file attached). The comment was in red and said “The plot at position 157 was generated using a random sampling of 5071 out of 1475320 sequences without replacement. This position (157) is greater than the minimum sequence length observed during subsampling (58 bases). As a result, the plot at this position is not based on data from all of the sequences, so it should be interpreted with caution when compared to plots for other positions. Outlier quality scores are not shown in box plots for clarity.” Any help on the dummy version of what this is saying?

Secondly, when I tried to use dada2 denoise on this sequence just as a test I got an error that I don’t quite understand either. It is the following and its long so I apologize for that:

MY COMMAND: qiime dada2 denoise-paired --i-demultiplexed-seqs demux-paired-end.qza --p-trim-left-f 0 --p-trunc-len-f 200 --p-trim-left-r 0 --p-trunc-len-r 200 --o-representative-sequences rep-seqs-dada2.250.qza --o-table table-dada2.250.qza --p-n-threads 16 --o-denoising-stats denoising-stats-dada2-250.qza --verbose

R version 3.4.1 (2017-06-30)
Loading required package: Rcpp
DADA2 R package version: 1.6.0

  1. Filtering Error in add(bin) : internal: buf !=
    Execution halted
    Traceback (most recent call last):
    File “/usr/local/packages/qiime2/2018.4/lib/python3.5/site-packages/q2_dada2/”, line 229, in denoise_paired
    File “/usr/local/packages/qiime2/2018.4/lib/python3.5/site-packages/q2_dada2/”, line 36, in run_commands, check=True)
    File “/usr/local/packages/qiime2/2018.4/lib/python3.5/”, line 398, in run
    output=stdout, stderr=stderr)
    subprocess.CalledProcessError: Command ‘[‘run_dada_paired.R’, ‘/tmp/tmpsad_b3h3/forward’, ‘/tmp/tmpsad_b3h3/reverse’, ‘/tmp/tmpsad_b3h3/output.tsv.biom’, ‘/tmp/tmpsad_b3h3/track.tsv’, ‘/tmp/tmpsad_b3h3/filt_f’, ‘/tmp/tmpsad_b3h3/filt_r’, ‘200’, ‘200’, ‘0’, ‘0’, ‘2.0’, ‘2’, ‘consensus’, ‘1.0’, ‘16’, ‘1000000’]’ returned non-zero exit status 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/usr/local/packages/qiime2/2018.4/lib/python3.5/site-packages/q2cli/”, line 274, in call
results = action(**arguments)
File “”, line 2, in denoise_paired
File “/usr/local/packages/qiime2/2018.4/lib/python3.5/site-packages/qiime2/sdk/”, line 231, in bound_callable
output_types, provenance)
File “/usr/local/packages/qiime2/2018.4/lib/python3.5/site-packages/qiime2/sdk/”, line 366, in callable_executor
output_views = self._callable(**view_args)
File “/usr/local/packages/qiime2/2018.4/lib/python3.5/site-packages/q2_dada2/”, line 244, in denoise_paired
" and stderr to learn more." % e.returncode)
Exception: An error was encountered while running DADA2 in R (return code 1), please inspect stdout and stderr to learn more.

Plugin error from dada2:

An error was encountered while running DADA2 in R (return code 1), please inspect stdout and stderr to learn more.

See above for debug info.
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_paired.R /tmp/tmpsad_b3h3/forward /tmp/tmpsad_b3h3/reverse /tmp/tmpsad_b3h3/output.tsv.biom /tmp/tmpsad_b3h3/track.tsv /tmp/tmpsad_b3h3/filt_f /tmp/tmpsad_b3h3/filt_r 200 200 0 0 2.0 2 consensus 1.0 16 1000000

Are these two errors somehow related? If they are any idea what might be causing them? This is just from a single sample, single fw and single rv fastq. I did concatenate them by hand, but that shouldn’t be a huge issue I don’t think. I was pretty careful when I did it.

Any advice on how to get this fixed?

demux.qzv (266.9 KB)

(Nicholas Bokulich) #11

Hi @reige012,
It sounds like concatenation (or some other modification) is causing both problems.

This indicates that all but 5071 sequences are shorter than that length. This appears at all bases > 1 nt, so something is definitely wrong.

We have only seen this error one other time, and that was similarly when some sort of pre-importing modification was being used (though that was never clarified nor resolved): Quality-filter paired end illumina

I agree, concatenation should not be a big issue, but clearly something is going wrong — make sure you do not have blank lines or special characters in your sequences. In the end, using a manifest format to import your sequences and grouping duplicates later on may just be a more hassle-free approach… while I would personally concatenate the sequences before importing, it is not worth it if it makes QIIME 2 unhappy! The more general rule I preach and observe is to not modify my data prior to importing, to avoid headaches downstream.