Importing fastq.gz files (Casava 1.8)

Hello,

I have what might be a basic question about file names when importing data into QIIME 2 via the Casava 1.8 importing procedure. These paired-end reads were already demultiplexed with bcl2fastq, filtered for low-quality reads, and adapter-trimmed by the sequencing facility. We received 2 fastq.gz files per sample, named like…

SO_7139_5_105_S46_R1_001.fastq.gz
SO_7139_5_105_S46_R2_001.fastq.gz

(Definitions - SO_7139 = project ID, 5_105 = sample ID, S46 = “internal nomenclature”, R1= forward read, R2=reverse read, and 001 = “internal nomenclature”)

Do these need to be renamed to match the pattern below (examples from the Casava 1.8 paired-end data importing tutorial)?
e.g., L2S357_15_L001_R1_001.fastq.gz
e.g., L2S357_15_L001_R2_001.fastq.gz
(sampleID_barcode_lanenumber_read_setnumber.fastq.gz)
If so, are the barcode identifier and lane number arbitrary? I don’t have this particular info from the sequencing company.

Many thanks from a sequence data novice :slight_smile:

Hello!
I believe you are correct in that the files need to be renamed using the nomenclature you mentioned. In particular, the L2S357 would refer to the sampleID in your metadata. Below I have shown an example of how my files are named:

sampleID_01_L001_R1_001.fastq.gz
sampleID_01_L001_R2_001.fastq.gz

Hope that helps!
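
For anyone who would rather script the renaming than do it by hand, here is a minimal sketch. The `casava_name` helper is hypothetical, not part of QIIME 2. It assumes placeholder values are acceptable wherever the facility did not supply a barcode or lane (the S-number or `0` for the barcode field, `L001` for the lane), which is worth checking against the importing tutorial. It only prints the proposed name and renames nothing.

```shell
# Hypothetical helper: print a Casava 1.8-style name for a vendor file name.
# Assumptions: the lane is unknown, so L001 is a placeholder; the barcode
# field is filled with the S-number when present, otherwise with "0".
casava_name() {
  base=${1%.fastq.gz}
  base=${base%_001}            # drop the trailing set number if present
  read_dir=${base##*_}         # R1 or R2
  base=${base%_*}              # what remains is the sample part
  case ${base##*_} in
    S[0-9]*) barcode=${base##*_}; base=${base%_*} ;;
    *)       barcode=0 ;;
  esac
  printf '%s_%s_L001_%s_001.fastq.gz\n' "$base" "$barcode" "$read_dir"
}

# Print the proposed names for the two naming styles seen in this thread.
casava_name SO_7139_5_105_S46_R1_001.fastq.gz
casava_name SO_7139_5_111_R2.fastq.gz
```

Printing the old and new names side by side before running any `mv` makes it easier to spot a forward/reverse mix-up.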

Thanks @Sirtaj-Singh! :tada:

@slh277, if you don’t want to rename your files, your other option is to use one of the manifest formats, which lets you define a text-file manifest of what files map to what samples. Take your pick — either approach will work, it just comes down to your personal preference in this case!

Good luck and keep us posted! :t_rex:

@Sirtaj-Singh @thermokarst Thank you so much for your help, and sorry for my slow response! I tried renaming the files as a first step – if that didn’t help I would then try a manifest format which I have yet to do. Renaming seemed to do the trick. Thank you!!!

~Sam

Hi, I am having trouble again -

I got to the DADA2 step, but came up with the following error message:

Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_paired.R /var/folders/8x/rqqs6m296k7g3trbdz559cnm0000gp/T/tmpcjpv4que/forward /var/folders/8x/rqqs6m296k7g3trbdz559cnm0000gp/T/tmpcjpv4que/reverse /var/folders/8x/rqqs6m296k7g3trbdz559cnm0000gp/T/tmpcjpv4que/output.tsv.biom /var/folders/8x/rqqs6m296k7g3trbdz559cnm0000gp/T/tmpcjpv4que/filt_f /var/folders/8x/rqqs6m296k7g3trbdz559cnm0000gp/T/tmpcjpv4que/filt_r 200 200 0 0 2.0 2 consensus 1.0 0 1000000

R version 3.3.2 (2016-10-31) 
Loading required package: Rcpp
There were 50 or more warnings (use warnings() to see the first 50)
DADA2 R package version: 1.4.0 
1) Filtering ........................................Error in fastqPairedFilter(c(unfiltsF[[i]], unfiltsR[[i]]), c(filteredFastqF,  : 
  Mismatched forward and reverse sequence files: 713220, 789948.
Execution halted
Traceback (most recent call last):
  File "/Users/sm939/miniconda3/envs/qiime2-2017.11/lib/python3.5/site-packages/q2_dada2/_denoise.py", line 179, in denoise_paired
    run_commands([cmd])
  File "/Users/sm939/miniconda3/envs/qiime2-2017.11/lib/python3.5/site-packages/q2_dada2/_denoise.py", line 35, in run_commands
    subprocess.run(cmd, check=True)
  File "/Users/sm939/miniconda3/envs/qiime2-2017.11/lib/python3.5/subprocess.py", line 398, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['run_dada_paired.R', '/var/folders/8x/rqqs6m296k7g3trbdz559cnm0000gp/T/tmpcjpv4que/forward', '/var/folders/8x/rqqs6m296k7g3trbdz559cnm0000gp/T/tmpcjpv4que/reverse', '/var/folders/8x/rqqs6m296k7g3trbdz559cnm0000gp/T/tmpcjpv4que/output.tsv.biom', '/var/folders/8x/rqqs6m296k7g3trbdz559cnm0000gp/T/tmpcjpv4que/filt_f', '/var/folders/8x/rqqs6m296k7g3trbdz559cnm0000gp/T/tmpcjpv4que/filt_r', '200', '200', '0', '0', '2.0', '2', 'consensus', '1.0', '0', '1000000']' returned non-zero exit status 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/sm939/miniconda3/envs/qiime2-2017.11/lib/python3.5/site-packages/q2cli/commands.py", line 218, in __call__
    results = action(**arguments)
  File "<decorator-gen-354>", line 2, in denoise_paired
  File "/Users/sm939/miniconda3/envs/qiime2-2017.11/lib/python3.5/site-packages/qiime2/sdk/action.py", line 220, in bound_callable
    output_types, provenance)
  File "/Users/sm939/miniconda3/envs/qiime2-2017.11/lib/python3.5/site-packages/qiime2/sdk/action.py", line 355, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/Users/sm939/miniconda3/envs/qiime2-2017.11/lib/python3.5/site-packages/q2_dada2/_denoise.py", line 194, in denoise_paired
    " and stderr to learn more." % e.returncode)
Exception: An error was encountered while running DADA2 in R (return code 1), please inspect stdout and stderr to learn more.

I’m not sure why the files are mismatched? Or what the issue is, exactly? Any help would be very appreciated on this. Thank you so much!

Hi @slh277!

Mismatched forward and reverse sequence files: 713220, 789948.

Those numbers show a pretty big difference in read counts. If I was a gamblin’ man, I would put my money on a typo when you were renaming your files! This is pretty easy to do, so maybe go back and double-check your work there to make sure you didn’t mislabel something. Keep us posted! :t_rex:

Thanks for pointing that out! In case renaming was the problem, I tried importing with the manifest format so that the files were not renamed, but I still got the same error when running qiime dada2 (below). Could it be something in the code I used?

Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_paired.R /var/folders/8x/rqqs6m296k7g3trbdz559cnm0000gp/T/tmp7dm69qo3/forward /var/folders/8x/rqqs6m296k7g3trbdz559cnm0000gp/T/tmp7dm69qo3/reverse /var/folders/8x/rqqs6m296k7g3trbdz559cnm0000gp/T/tmp7dm69qo3/output.tsv.biom /var/folders/8x/rqqs6m296k7g3trbdz559cnm0000gp/T/tmp7dm69qo3/filt_f /var/folders/8x/rqqs6m296k7g3trbdz559cnm0000gp/T/tmp7dm69qo3/filt_r 250 250 0 0 2.0 2 consensus 1.0 0 1000000

R version 3.3.2 (2016-10-31) 
Loading required package: Rcpp
There were 50 or more warnings (use warnings() to see the first 50)
DADA2 R package version: 1.4.0 
1) Filtering ...................................Error in fastqPairedFilter(c(unfiltsF[[i]], unfiltsR[[i]]), c(filteredFastqF,  : 
  **Mismatched forward and reverse sequence files: 713220, 789948.**
Execution halted
Traceback (most recent call last):
  File "/Users/sm939/miniconda3/envs/qiime2-2017.11/lib/python3.5/site-packages/q2_dada2/_denoise.py", line 179, in denoise_paired
    run_commands([cmd])
  File "/Users/sm939/miniconda3/envs/qiime2-2017.11/lib/python3.5/site-packages/q2_dada2/_denoise.py", line 35, in run_commands
    subprocess.run(cmd, check=True)
  File "/Users/sm939/miniconda3/envs/qiime2-2017.11/lib/python3.5/subprocess.py", line 398, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['run_dada_paired.R', '/var/folders/8x/rqqs6m296k7g3trbdz559cnm0000gp/T/tmp7dm69qo3/forward', '/var/folders/8x/rqqs6m296k7g3trbdz559cnm0000gp/T/tmp7dm69qo3/reverse', '/var/folders/8x/rqqs6m296k7g3trbdz559cnm0000gp/T/tmp7dm69qo3/output.tsv.biom', '/var/folders/8x/rqqs6m296k7g3trbdz559cnm0000gp/T/tmp7dm69qo3/filt_f', '/var/folders/8x/rqqs6m296k7g3trbdz559cnm0000gp/T/tmp7dm69qo3/filt_r', '250', '250', '0', '0', '2.0', '2', 'consensus', '1.0', '0', '1000000']' returned non-zero exit status 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/sm939/miniconda3/envs/qiime2-2017.11/lib/python3.5/site-packages/q2cli/commands.py", line 218, in __call__
    results = action(**arguments)
  File "<decorator-gen-354>", line 2, in denoise_paired
  File "/Users/sm939/miniconda3/envs/qiime2-2017.11/lib/python3.5/site-packages/qiime2/sdk/action.py", line 220, in bound_callable
    output_types, provenance)
  File "/Users/sm939/miniconda3/envs/qiime2-2017.11/lib/python3.5/site-packages/qiime2/sdk/action.py", line 355, in _callable_executor_
    output_views = self._callable(**view_args)
  File "/Users/sm939/miniconda3/envs/qiime2-2017.11/lib/python3.5/site-packages/q2_dada2/_denoise.py", line 194, in denoise_paired
    " and stderr to learn more." % e.returncode)
Exception: An error was encountered while running DADA2 in R (return code 1), please inspect stdout and stderr to learn more.

Code I use for importing:

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path Manifest.txt \
  --output-path demux-paired-end.qza \
  --source-format PairedEndFastqManifestPhred33

After importing, the data look like this, regardless of whether I use the Casava 1.8 or manifest format:

[Image not preserved: read quality plots]

The DADA2 code I used is here, for 53 paired-end samples (106 files total) covering the V3-V4 region:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trim-left-f 0 \
  --p-trunc-len-f 250 \
  --p-trim-left-r 0 \
  --p-trunc-len-r 250 \
  --o-representative-sequences demux-paired-end.qza \
  --p-n-threads 0 \
  --o-table table-dada2.qza

If anyone can shed some light on what I might be doing wrong, that would be amazing.

I think you might want to double-check the names of the files themselves — you mentioned above that you renamed them to make them work with the Casava1.8 format — if you made a mistake while renaming them, that same mistake would hold for the fastq-manifest import, too.

The only real difference between these two formats is that the fastq manifest is general-purpose: it imposes no conventions on the file names. The Casava 1.8 format has pretty strict naming requirements, but it is also very common output from Illumina sequencing products, so it is intended as a “convenience” import for Illumina users.
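
Since the manifest puts no constraints on file names, it can be generated from the files as they are instead of typed by hand. The following is a rough sketch, not an official QIIME 2 tool: `make_manifest` is a hypothetical helper, the header is the one used in this thread, and it assumes every forward file contains `_R1` followed by `_` or `.` and that its mate differs only in that token. The sample ID is simply the file name up to `_R1`; adjust it to match your metadata.

```shell
# Hypothetical helper: print a paired-end manifest for every *_R1*.fastq.gz
# file in directory $1. Assumes the mate's name differs only in _R1 vs _R2.
make_manifest() {
  dir=$(cd "$1" && pwd)        # the manifest column expects absolute paths
  echo "sample-id,absolute-filepath,direction"
  for r1 in "$dir"/*_R1*.fastq.gz; do
    [ -e "$r1" ] || continue   # nothing matched the glob
    name=${r1##*/}
    r2="$dir/$(printf '%s\n' "$name" | sed 's/_R1\([_.]\)/_R2\1/')"
    id=${name%%_R1*}           # file name up to the read token
    echo "$id,$r1,forward"
    echo "$id,$r2,reverse"
  done
}

# Demo on empty placeholder files in a scratch directory.
demo=$(mktemp -d)
touch "$demo/SO_7139_5_105_S46_R1_001.fastq.gz" "$demo/SO_7139_5_105_S46_R2_001.fastq.gz"
make_manifest "$demo"
```

Deriving the reverse path from the forward one means a forward file can never be listed twice by accident, although the sketch does not check that the mate actually exists.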

Thank you so much for all of your help @thermokarst!

I’m not sure what is going on, then: for the manifest file, I am using the original file names sent by the company rather than my renamed files. For the original ‘raw’ data, the company says they already trimmed off the adapter sequences, filtered out low-quality reads, and demultiplexed, and what they sent us is just read 1 and read 2. So I thought that after importing it should be ready for DADA2. Am I missing a step? Is there any other reason that mismatching would occur aside from file renaming?

My manifest file is below:

sample-id,absolute-filepath,direction
sample-1,$PWD/Genotypic_09232017/SO_7139_5_105_S46_R1_001.fastq.gz,forward
sample-1,$PWD/Genotypic_09232017/SO_7139_5_105_S46_R2_001.fastq.gz,reverse
sample-2,$PWD/Genotypic_09232017/SO_7139_5_111_R1.fastq.gz,forward
sample-2,$PWD/Genotypic_09232017/SO_7139_5_111_R2.fastq.gz,reverse
sample-3,$PWD/Genotypic_09232017/SO_7139_5_112_S45_R1_001.fastq.gz,forward
sample-3,$PWD/Genotypic_09232017/SO_7139_5_112_S45_R2_001.fastq.gz,reverse
sample-4,$PWD/Genotypic_09232017/SO_7139_5_113_S43_R1_001.fastq.gz,forward
sample-4,$PWD/Genotypic_09232017/SO_7139_5_113_S43_R2_001.fastq.gz,reverse
sample-5,$PWD/Genotypic_09232017/SO_7139_5_117_S44_R1_001.fastq.gz,forward
sample-5,$PWD/Genotypic_09232017/SO_7139_5_117_S44_R2_001.fastq.gz,reverse
sample-6,$PWD/Genotypic_09232017/SO_7139_5_120_S21_R1_001.fastq.gz,forward
sample-6,$PWD/Genotypic_09232017/SO_7139_5_120_S21_R2_001.fastq.gz,reverse
sample-7,$PWD/Genotypic_09232017/SO_7139_5_132_S34_R1_001.fastq.gz,forward
sample-7,$PWD/Genotypic_09232017/SO_7139_5_132_S34_R2_001.fastq.gz,reverse
sample-8,$PWD/Genotypic_09232017/SO_7139_5_135_S1_R1_001.fastq.gz,forward
sample-8,$PWD/Genotypic_09232017/SO_7139_5_135_S1_R2_001.fastq.gz,reverse
sample-9,$PWD/Genotypic_09232017/SO_7139_5_137_S42_R1_001.fastq.gz,forward
sample-9,$PWD/Genotypic_09232017/SO_7139_5_137_S42_R2_001.fastq.gz,reverse
sample-10,$PWD/Genotypic_09232017/SO_7139_5_138_S40_R1_001.fastq.gz,forward
sample-10,$PWD/Genotypic_09232017/SO_7139_5_138_S40_R2_001.fastq.gz,reverse
sample-11,$PWD/Genotypic_09232017/SO_7139_5_139_S49_R1_001.fastq.gz,forward
sample-11,$PWD/Genotypic_09232017/SO_7139_5_139_S49_R2_001.fastq.gz,reverse
sample-12,$PWD/Genotypic_09232017/SO_7139_5_144_R1.fastq.gz,forward
sample-12,$PWD/Genotypic_09232017/SO_7139_5_144_R2.fastq.gz,reverse
sample-13,$PWD/Genotypic_09232017/SO_7139_5_145_S6_R1_001.fastq.gz,forward
sample-13,$PWD/Genotypic_09232017/SO_7139_5_145_S6_R2_001.fastq.gz,reverse
sample-14,$PWD/Genotypic_09232017/SO_7139_5_151_S2_R1_001.fastq.gz,forward
sample-14,$PWD/Genotypic_09232017/SO_7139_5_151_S2_R2_001.fastq.gz,reverse
sample-15,$PWD/Genotypic_09232017/SO_7139_5_156_S7_R1_001.fastq.gz,forward
sample-15,$PWD/Genotypic_09232017/SO_7139_5_156_S7_R2_001.fastq.gz,reverse
sample-16,$PWD/Genotypic_09232017/SO_7139_5_157_S9_R1_001.fastq.gz,forward
sample-16,$PWD/Genotypic_09232017/SO_7139_5_157_S9_R2_001.fastq.gz,reverse
sample-17,$PWD/Genotypic_09232017/SO_7139_5_159_S6_R1_001.fastq.gz,forward
sample-17,$PWD/Genotypic_09232017/SO_7139_5_159_S6_R2_001.fastq.gz,reverse
sample-18,$PWD/Genotypic_09232017/SO_7139_5_160_S4_R1_001.fastq.gz,forward
sample-18,$PWD/Genotypic_09232017/SO_7139_5_160_S4_R2_001.fastq.gz,reverse
sample-19,$PWD/Genotypic_09232017/SO_7139_5_163_S8_R1_001.fastq.gz,forward
sample-19,$PWD/Genotypic_09232017/SO_7139_5_163_S8_R2_001.fastq.gz,reverse
sample-20,$PWD/Genotypic_09232017/SO_7139_5_165_S41_R1_001.fastq.gz,forward
sample-20,$PWD/Genotypic_09232017/SO_7139_5_165_S41_R2_001.fastq.gz,reverse
sample-21,$PWD/Genotypic_09232017/SO_7139_5_166_R1.fastq.gz,forward
sample-21,$PWD/Genotypic_09232017/SO_7139_5_166_R2.fastq.gz,reverse
sample-22,$PWD/Genotypic_09232017/SO_7139_6_103_S4_R1_001.fastq.gz,forward
sample-22,$PWD/Genotypic_09232017/SO_7139_6_103_S4_R2_001.fastq.gz,reverse
sample-23,$PWD/Genotypic_09232017/SO_7139_6_107_S10_R1_001.fastq.gz,forward
sample-23,$PWD/Genotypic_09232017/SO_7139_6_107_S10_R2_001.fastq.gz,reverse
sample-24,$PWD/Genotypic_09232017/SO_7139_6_115_S3_R1_001.fastq.gz,forward
sample-24,$PWD/Genotypic_09232017/SO_7139_6_115_S3_R2_001.fastq.gz,reverse
sample-25,$PWD/Genotypic_09232017/SO_7139_6_120_R1.fastq.gz,forward
sample-25,$PWD/Genotypic_09232017/SO_7139_6_120_R2.fastq.gz,reverse
sample-26,$PWD/Genotypic_09232017/SO_7139_6_123_S39_R1_001.fastq.gz,forward
sample-26,$PWD/Genotypic_09232017/SO_7139_6_123_S39_R2_001.fastq.gz,reverse
sample-27,$PWD/Genotypic_09232017/SO_7139_6_138_S36_R1_001.fastq.gz,forward
sample-27,$PWD/Genotypic_09232017/SO_7139_6_138_S36_R2_001.fastq.gz,reverse
sample-28,$PWD/Genotypic_09232017/SO_7139_6_139_S12_R1_001.fastq.gz,forward
sample-28,$PWD/Genotypic_09232017/SO_7139_6_139_S12_R2_001.fastq.gz,reverse
sample-29,$PWD/Genotypic_09232017/SO_7139_6_140_S7_R1_001.fastq.gz,forward
sample-29,$PWD/Genotypic_09232017/SO_7139_6_140_S7_R2_001.fastq.gz,reverse
sample-30,$PWD/Genotypic_09232017/SO_7139_6_146_S15_R1_001.fastq.gz,forward
sample-30,$PWD/Genotypic_09232017/SO_7139_6_146_S15_R2_001.fastq.gz,reverse
sample-31,$PWD/Genotypic_09232017/SO_7139_6_150_R1.fastq.gz,forward
sample-31,$PWD/Genotypic_09232017/SO_7139_6_150_R2.fastq.gz,reverse
sample-32,$PWD/Genotypic_09232017/SO_7139_6_157_S32_R1_001.fastq.gz,forward
sample-32,$PWD/Genotypic_09232017/SO_7139_6_157_S32_R2_001.fastq.gz,reverse
sample-33,$PWD/Genotypic_09232017/SO_7139_6_158_S38_R1_001.fastq.gz,forward
sample-33,$PWD/Genotypic_09232017/SO_7139_6_158_S38_R2_001.fastq.gz,reverse
sample-34,$PWD/Genotypic_09232017/SO_7139_6_159_S33_R1_001.fastq.gz,forward
sample-34,$PWD/Genotypic_09232017/SO_7139_6_159_S33_R2_001.fastq.gz,reverse
sample-35,$PWD/Genotypic_09232017/SO_7139_6_160_R1.fastq.gz,forward
sample-35,$PWD/Genotypic_09232017/SO_7139_6_160_R1.fastq.gz,reverse <-- I just noticed this typo, but would this one thing cause 50 errors?
sample-36,$PWD/Genotypic_09232017/SO_7139_6_161_S37_R1_001.fastq.gz,forward
sample-36,$PWD/Genotypic_09232017/SO_7139_6_161_S37_R2_001.fastq.gz,reverse
sample-37,$PWD/Genotypic_09232017/SO_7139_6_162_R1.fastq.gz,forward
sample-37,$PWD/Genotypic_09232017/SO_7139_6_162_R2.fastq.gz,reverse
sample-38,$PWD/Genotypic_09232017/SO_7139_6_163_S28_R1_001.fastq.gz,forward
sample-38,$PWD/Genotypic_09232017/SO_7139_6_163_S28_R2_001.fastq.gz,reverse
sample-39,$PWD/Genotypic_09232017/SO_7139_6_164_R1.fastq.gz,forward
sample-39,$PWD/Genotypic_09232017/SO_7139_6_164_R2.fastq.gz,reverse
sample-40,$PWD/Genotypic_09232017/SO_7139_6_165_S13_R1_001.fastq.gz,forward
sample-40,$PWD/Genotypic_09232017/SO_7139_6_165_S13_R2_001.fastq.gz,reverse
sample-41,$PWD/Genotypic_09232017/SO_7139_6_166_R1.fastq.gz,forward
sample-41,$PWD/Genotypic_09232017/SO_7139_6_166_R2.fastq.gz,reverse
sample-42,$PWD/Genotypic_09232017/SO_7139_7_102_S20_R1_001.fastq.gz,forward
sample-42,$PWD/Genotypic_09232017/SO_7139_7_102_S20_R2_001.fastq.gz,reverse
sample-43,$PWD/Genotypic_09232017/SO_7139_7_104_S24_R1_001.fastq.gz,forward
sample-43,$PWD/Genotypic_09232017/SO_7139_7_104_S24_R2_001.fastq.gz,reverse
sample-44,$PWD/Genotypic_09232017/SO_7139_7_105_S27_R1_001.fastq.gz,forward
sample-44,$PWD/Genotypic_09232017/SO_7139_7_105_S27_R2_001.fastq.gz,reverse
sample-45,$PWD/Genotypic_09232017/SO_7139_7_106_S31_R1_001.fastq.gz,forward
sample-45,$PWD/Genotypic_09232017/SO_7139_7_106_S31_R2_001.fastq.gz,reverse
sample-46,$PWD/Genotypic_09232017/SO_7139_7_107_S22_R1_001.fastq.gz,forward
sample-46,$PWD/Genotypic_09232017/SO_7139_7_107_S22_R2_001.fastq.gz,reverse
sample-47,$PWD/Genotypic_09232017/SO_7139_7_108_S30_R1_001.fastq.gz,forward
sample-47,$PWD/Genotypic_09232017/SO_7139_7_108_S30_R2_001.fastq.gz,reverse
sample-48,$PWD/Genotypic_09232017/SO_7139_7_109_R1.fastq.gz,forward
sample-48,$PWD/Genotypic_09232017/SO_7139_7_109_R2.fastq.gz,reverse
sample-49,$PWD/Genotypic_09232017/SO_7139_7_110_R1.fastq.gz,forward
sample-49,$PWD/Genotypic_09232017/SO_7139_7_110_R2.fastq.gz,reverse
sample-50,$PWD/Genotypic_09232017/SO_7139_7_111_S14_R1_001.fastq.gz,forward
sample-50,$PWD/Genotypic_09232017/SO_7139_7_111_S14_R2_001.fastq.gz,reverse
sample-51,$PWD/Genotypic_09232017/SO_7139_8_101_S16_R1_001.fastq.gz,forward
sample-51,$PWD/Genotypic_09232017/SO_7139_8_101_S16_R2_001.fastq.gz,reverse
sample-52,$PWD/Genotypic_09232017/SO_7139_8_102_S18_R1_001.fastq.gz,forward
sample-52,$PWD/Genotypic_09232017/SO_7139_8_102_S18_R2_001.fastq.gz,reverse
sample-53,$PWD/Genotypic_09232017/SO_7139_8_107_R1.fastq.gz,forward
sample-53,$PWD/Genotypic_09232017/SO_7139_8_107_R2.fastq.gz,reverse
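
A manifest this long is easy to mistype, so a quick automated check may help. The sketch below is only an illustration: `check_manifest` is a hypothetical helper, and it assumes forward files carry `_R1` and reverse files carry `_R2` followed by `_` or `.`, as they do in the manifest above. It flags any path that is listed more than once and any row whose direction disagrees with the read token in the file name.

```shell
# Hypothetical helper: flag suspicious rows in a paired-end manifest ($1).
# Assumes forward files contain _R1 and reverse files contain _R2.
check_manifest() {
  awk -F, 'NR > 1 {
    split($3, d, " "); dir = d[1]      # ignore any note after the direction
    if ($2 in owner)
      printf "duplicate path: %s (%s %s)\n", $2, $1, dir
    owner[$2] = $1
    if (dir == "forward" && $2 !~ /_R1[_.]/)
      printf "forward row without _R1: %s\n", $2
    if (dir == "reverse" && $2 !~ /_R2[_.]/)
      printf "reverse row without _R2: %s\n", $2
  }' "$1"
}

# Demo on a four-row excerpt of the manifest above, including the sample-35 rows.
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
sample-id,absolute-filepath,direction
sample-35,$PWD/Genotypic_09232017/SO_7139_6_160_R1.fastq.gz,forward
sample-35,$PWD/Genotypic_09232017/SO_7139_6_160_R1.fastq.gz,reverse
sample-36,$PWD/Genotypic_09232017/SO_7139_6_161_S37_R1_001.fastq.gz,forward
sample-36,$PWD/Genotypic_09232017/SO_7139_6_161_S37_R2_001.fastq.gz,reverse
EOF
check_manifest "$manifest"
```

On this excerpt the sample-35 reverse row is reported twice, once as a duplicate path and once as a reverse row without `_R2`. A clean manifest prints nothing.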

Just a side note here, @slh277. I noticed that your truncation parameters are very lenient considering the quality graphs you showed. Truncating at 250 might be OK for your forward reads, but the reverse reads look like they would suffer greatly from it: their quality seems to drop sharply around position 120. I’m not entirely sure how DADA2 would handle such a poor-quality set, but in my experience of playing around with poor-quality paired-end reads, it doesn’t lead to useful results. It may even lead to few or none of the read pairs joining. Way better to just use the forward reads. Someone with a better understanding of this can comment further!

Hi @slh277! Can you run the following one-liner in your data directory:

$ for f in *.fastq.gz; do r=$(( $(gunzip -c "$f" | wc -l | tr -d '[:space:]') / 4 )); echo "$r $f"; done

If this thing works, you should see something like this:

2 sample_a_AAAA_L001_R1_001.fastq.gz
1 sample_b_GGGG_L001_R1_001.fastq.gz
3 sample_c_CCCC_L001_R1_001.fastq.gz

This will give you the record count for each file (the number of lines in the file divided by 4, since there are 4 lines per record in a fastq file). These counts should be identical within each forward/reverse pair, so the first thing I would check is whether any pair mismatches.
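
With 106 files, comparing the counts by eye gets tedious, so here is a sketch that does the pairing and prints only the pairs that disagree. The function names are hypothetical, and it assumes mates differ only in `_R1` versus `_R2` (followed by `_` or `.`). The demo builds a few tiny made-up fastq files in a scratch directory so that the sketch can be tried without touching real data.

```shell
# Count fastq records in a gzipped file: number of lines divided by 4.
count_records() {
  echo $(( $(gunzip -c "$1" | wc -l) / 4 ))
}

# Hypothetical helper: print each R1/R2 pair in directory $1 whose record
# counts differ. Assumes mates differ only in the _R1 / _R2 token.
report_mismatched_pairs() {
  for r1 in "$1"/*_R1*.fastq.gz; do
    [ -e "$r1" ] || continue   # nothing matched the glob
    name=${r1##*/}
    r2="$1/$(printf '%s\n' "$name" | sed 's/_R1\([_.]\)/_R2\1/')"
    if [ ! -e "$r2" ]; then
      echo "missing mate for $name"
      continue
    fi
    n1=$(count_records "$r1")
    n2=$(count_records "$r2")
    [ "$n1" -eq "$n2" ] || echo "MISMATCH $n1 $n2 $name"
  done
}

# Demo: pair "a" has 2 forward records but 1 reverse record; pair "b" matches.
demo=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n' | gzip > "$demo/a_R1.fastq.gz"
printf '@r1\nACGT\n+\nIIII\n' | gzip > "$demo/a_R2.fastq.gz"
printf '@r1\nACGT\n+\nIIII\n' | gzip > "$demo/b_S1_R1_001.fastq.gz"
printf '@r1\nACGT\n+\nIIII\n' | gzip > "$demo/b_S1_R2_001.fastq.gz"
report_mismatched_pairs "$demo"
```

For real data, call `report_mismatched_pairs` on the directory that holds the fastq.gz files. No output means every pair has matching counts.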

Keep me posted, we will get to the bottom of this! :t_rex:

Thanks for pointing that out @Mehrbod_Estaki! @slh277, when we get your data squared away, I think the points @Mehrbod_Estaki brings up are really good ones - you will want to experiment with settings that make the most sense for your analysis. Thanks!

Hi @thermokarst, I tried this but it came back with " -bash: syntax error near unexpected token `do' ". Do you know why?

Thanks!

@Mehrbod_Estaki Thank you so much for pointing this out. I will definitely look for more guidance regarding this aspect of the data. Many thanks!

Did you accidentally copy-and-paste the leading $? That is typically used to denote a prompt, so the code you want to copy and paste is:

for f in *.fastq.gz; do r=$(( $(gunzip -c "$f" | wc -l | tr -d '[:space:]') / 4 )); echo "$r $f"; done

I tested this in bash3.2 and zsh5.2, FWIW!

Ah yes, that is what I did, now I feel silly. Thanks :slight_smile: the code is working now!

@thermokarst I saw one mismatch:
713220 SO_7139_6_166_R1.fastq.gz
789948 SO_7139_6_166_R2.fastq.gz

When you have a chance, can you guide me as to what the next step to look into this more would be? Thank you!

Hey, those numbers look familiar:

I would start by double-checking your data files. Did something get munged when renaming? For example, are there any other files with those same read counts? If so, double-check that all of those are the samples they claim to be. If not, or if that check turns up nothing, I would contact the sequencing facility to see if they have any thoughts. Another option is to remove that sample from your study, but obviously that comes with a host of other considerations related to your specific study.
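
One way to look for other files with the same read counts is to save the output of the counting one-liner and group it by count. This is only a sketch: `shared_counts` is a hypothetical helper, and it assumes mates differ only in `_R1` versus `_R2`. It reports a count only when it appears in more than one sample, so ordinary matching pairs stay quiet. The first two demo lines are the counts reported in this thread; the others are made up for illustration.

```shell
# Hypothetical helper: read "count filename" lines on stdin and print every
# count that occurs in more than one sample, followed by the files involved.
shared_counts() {
  awk '{
    stem = $2
    sub(/_R[12](_001)?\.fastq\.gz$/, "", stem)   # strip read token and suffix
    key = $1 SUBSEP stem
    if (!(key in seen)) { seen[key] = 1; nstems[$1]++ }
    files[$1] = files[$1] " " $2
  }
  END { for (c in nstems) if (nstems[c] > 1) print c ":" files[c] }'
}

# Demo: the SO_7139_6_166 counts are from this thread; the SO_0000_9_999 and
# "other" lines are hypothetical, added only to show a shared count.
shared_counts <<'EOF'
713220 SO_7139_6_166_R1.fastq.gz
789948 SO_7139_6_166_R2.fastq.gz
789948 SO_0000_9_999_R1.fastq.gz
789948 SO_0000_9_999_R2.fastq.gz
500 other_R1.fastq.gz
500 other_R2.fastq.gz
EOF
```

If a count from the mismatched pair shows up under a second sample, that is a hint the file may be a copy of, or a mislabeled mate from, that other sample.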

Keep us posted! :t_rex:
