File Format for Importing Sequencing Data

IBakeCake · October 31, 2022, 11:31pm

Hi,

I ran into an issue during the denoising and merging step with dada2; however, I believe the root cause is the way I imported my sequencing files. These sequencing files come from a collaborator and were run on an Illumina MiSeq in the 2x300bp configuration.

I originally imported the sequencing files using the manifest file format PairedEndFastqManifestPhred33V2 with the following code:

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest_file.txt \
  --output-path Omni_paired-end-demux.qza \
  --input-format PairedEndFastqManifestPhred33V2 &
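
For reference, a PairedEndFastqManifestPhred33V2 manifest is a tab-separated file with the header columns sample-id, forward-absolute-filepath and reverse-absolute-filepath. The following is only a minimal sketch of generating one; the function name make_manifest and the `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` naming are assumptions for illustration and not the naming used in this thread.

```shell
# Sketch: write a PairedEndFastqManifestPhred33V2 manifest to stdout.
# ASSUMPTION (hypothetical naming): files are <sample>_R1.fastq.gz and
# <sample>_R2.fastq.gz inside the directory given as the first argument.
make_manifest() {
    # Resolve to an absolute path, since the manifest requires absolute file paths.
    dir=$(cd "$1" && pwd) || return 1
    printf 'sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n'
    for fwd in "$dir"/*_R1.fastq.gz; do
        [ -e "$fwd" ] || continue
        rev=${fwd%_R1.fastq.gz}_R2.fastq.gz
        sample=$(basename "$fwd" _R1.fastq.gz)
        if [ -e "$rev" ]; then
            printf '%s\t%s\t%s\n' "$sample" "$fwd" "$rev"
        else
            # Report unpaired samples on stderr and leave them out of the manifest.
            echo "warning: no reverse file for $sample" >&2
        fi
    done
}
```

Usage would look like `make_manifest /path/to/fastq_dir > manifest_file.txt`. Skipping samples without a reverse file keeps forward and reverse reads from being mismatched silently.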

The command completes and I get a demux.qza; however, when I try to validate my demultiplexed QIIME 2 artifact, I receive this message:

"/tmp/qiime2/kmt137/data/e1fb0b38-eb52-4391-b782-eec817cdadaa/data/18627_7_L001_R1_001.fastq.gz is not a(n) FastqGzFormat file:

Quality score length doesn't match sequence length for record beginning on line 526137"
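
The check behind this message can be approximated outside of QIIME 2. The following is a minimal sketch, assuming standard 4-line FASTQ records in a gzip-compressed file; the function name find_length_mismatches is made up for illustration, and this is not QIIME 2's own validation code.

```shell
# Sketch: report every FASTQ record whose quality string length differs
# from its sequence length. Assumes 4 lines per record (header, sequence,
# "+" separator, quality) and a gzip-compressed input file.
find_length_mismatches() {
    gzip -dc "$1" | awk '
        NR % 4 == 2 { seq_len = length($0) }          # sequence line
        NR % 4 == 0 && length($0) != seq_len {        # quality line
            printf "record starting at line %d: sequence %d, quality %d\n",
                NR - 3, seq_len, length($0)
        }'
}
```

Running `find_length_mismatches 18627.fastq.gz` on the original input file would list every affected record, not only the first one the validator stops at.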

I am not sure why the annotation of the file suggests a Casava 1.8 format, but the name of the actual files are 18627.fastq.gz. When I try to use the Casava 1.8 importing command:

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path /home/kmt137/Omni_Samples \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path demux-paired-end.qza

I just received this: There was a problem importing casava-18-paired-end-demultiplexed:
Missing one or more files for CasavaOneEightSingleLanePerSampleDirFmt: '.+_.+_L[0-9][0-9][0-9]_R[12]_001\.fastq\.gz'
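
The pattern in that message is the file name layout the Casava format expects. As a rough sketch (the function name list_non_casava_names is hypothetical), the file names in a directory can be tested against the same pattern copied from the error message:

```shell
# Sketch: print the .fastq.gz file names in a directory that do NOT match
# the Casava 1.8 pattern quoted in the import error above.
list_non_casava_names() {
    for f in "$1"/*.fastq.gz; do
        [ -e "$f" ] || continue
        name=$(basename "$f")
        printf '%s\n' "$name" |
            grep -Eq '^.+_.+_L[0-9][0-9][0-9]_R[12]_001\.fastq\.gz$' ||
            printf '%s\n' "$name"
    done
}
```

A name such as 18627.fastq.gz would be listed, while 18627_7_L001_R1_001.fastq.gz would not. Files that fail this check can still be imported through a manifest file, which does not depend on the file names.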

Downstream, when I tried to run dada2 with the demux.qza imported under the Phred33V2 format, I received these error messages:

error in names(answer) <- names1 :
'names' attribute [84] must be the same length as the vector [12]
Execution halted
Traceback (most recent call last):
File "/home/kmt137/.conda/envs/qiime2-2022.2/lib/python3.8/site-packages/q2_dada2/_denoise.py", line 279, in denoise_paired
run_commands([cmd])
File "/home/kmt137/.conda/envs/qiime2-2022.2/lib/python3.8/site-packages/q2_dada2/_denoise.py", line 36, in run_commands
subprocess.run(cmd, check=True)
File "/home/kmt137/.conda/envs/qiime2-2022.2/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['run_dada_paired.R', '/tmp/tmpjhwvkaw/forward', '/tmp/tmpjhwvkaw/reverse', '/tmp/tmpjhwvkaw_/output.tsv.biom', '/tmp/tmpjhwvkaw_/track.tsv', '/tmp/tmpjhwvkaw_/filt_f', '/tmp/tmpjhwvkaw_/filt_r', '294', '244', '6', '8', '2.0', '2.0', '2', '12', 'independent', 'consensus', '1.0', '60', '1000000']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/kmt137/.conda/envs/qiime2-2022.2/lib/python3.8/site-packages/q2cli/commands.py", line 339, in call
results = action(**arguments)
File "", line 2, in denoise_paired
File "/home/kmt137/.conda/envs/qiime2-2022.2/lib/python3.8/site-packages/qiime2/sdk/action.py", line 245, in bound_callable
outputs = self.callable_executor(scope, callable_args,
File "/home/kmt137/.conda/envs/qiime2-2022.2/lib/python3.8/site-packages/qiime2/sdk/action.py", line 391, in callable_executor
output_views = self._callable(**view_args)
File "/home/kmt137/.conda/envs/qiime2-2022.2/lib/python3.8/site-packages/q2_dada2/_denoise.py", line 292, in denoise_paired
raise Exception("An error was encountered while running DADA2"
Exception: An error was encountered while running DADA2 in R (return code 1), please inspect stdout and stderr to learn more.

Any help would be really appreciated. I think this may be a file naming issue but just wanted to confirm before asking the collaborator to fix the naming of all files.

Thanks!

Cake

gregcaporaso · November 1, 2022, 9:33pm

Hi @Cake,
I suspect what might be happening here is that one of your input files is corrupted, for example due to an interrupted download. Is that possible? Could you try to obtain your 18627.fastq.gz file again, and then redo the import with the manifest file?

Also, have you tried to generate a visual summary of your demux-paired-end.qza file? You can do that with qiime demux summarize, and that should let us get a view of the data to see if it's looking like what you expect. If you haven't already, could you generate that file and then share it on this post if you're ok with sharing the file (it won't contain any sequence data).
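
A minimal sketch of that summarize step, wrapped in a small function so the file names are explicit. The function name summarize_demux is just for illustration, and the .qzv output name in the usage line is an assumption.

```shell
# Sketch: build a visual summary (.qzv) from a demultiplexed artifact (.qza).
# $1 = input artifact, $2 = output visualization.
summarize_demux() {
    qiime demux summarize \
        --i-data "$1" \
        --o-visualization "$2"
}

# Example (not run here):
#   summarize_demux demux-paired-end.qza demux-paired-end.qzv
```

The resulting .qzv can be opened with `qiime tools view` or at view.qiime2.org.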

"I am not sure why the annotation of the file suggests a Casava 1.8 format, but the name of the actual files are 18627.fastq.gz."

On import, QIIME 2 standardizes the names of the files internally in the QIIME 2 artifact (inside your demux-paired-end.qza file). For this data type, it standardizes as the Casava-formatted file names. This doesn't modify your input files or their names at all, so it really doesn't have any impact on your use of the data.

IBakeCake · November 1, 2022, 11:40pm

Hi Greg,

Thank you for taking the time to respond to my post; I greatly appreciate it.

Yes, I tried to redownload the files and import them using the manifest file again, based on what I read in a previous forum thread, but the dada2 denoising step failed again with the same error messages. Haha, if it helps, I can usually correct my errors using the information I find in existing forum threads.

No problem, please see the demux-paired-end.qzv file attached for this data set. Even though the demux-paired-end.qza cannot be validated, I still was able to generate the visualization file.

I appreciate any insight you may be able to provide. The forward sequences look fine; however, the reverse reads seem to have significantly poorer quality based on the Q-scores. Does this mean the files were not fully uploaded?
Omni_paired-end-demux.qzv (309.8 KB).

If it means anything, the fastq files were provided through Google Cloud, which is different from what I am used to. I usually just use SFTP to retrieve raw fastq files, but I do not see how this would be the source of the issue.

Best, Kevin

gregcaporaso · November 2, 2022, 11:41pm

Hi @IBakeCake,
Thanks for sharing the .qzv - I don't see any issues there. The run looks to be very high quality, and the decrease in quality on the reverse reads is totally normal.

Let's take a look at the 18627.fastq.gz file that you're providing as input to see if the quality score is shorter than the read, as the error message is suggesting. That would be an error in the file. If the file was corrupted in download, I would expect that to be the last record. Could you try the following:

gunzip 18627.fastq.gz
tail 18627.fastq

If it looks like the quality score for the last record in that file (should be the last line) is shorter than the associated sequence (should be the third-to-last line), that's the issue. I would try to download the file again, or check to see if the provider can give you an md5 sum for the file that you could compare against the md5 sum of your local file. (This article looks like a reasonable discussion of checking md5 sums, if that isn't something you've done before.)
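
A minimal sketch of that md5 comparison, assuming a Linux system where the GNU md5sum tool is available (on macOS the equivalent tool is md5). The function name check_download is hypothetical, and the expected checksum has to come from the provider. The check should be run on the .fastq.gz file exactly as downloaded, before gunzip replaces it.

```shell
# Sketch: verify a downloaded .fastq.gz against a provider-supplied MD5 sum.
# $1 = path to the file as downloaded, $2 = expected MD5 hex string.
check_download() {
    file=$1
    expected=$2
    # gzip -t tests the compressed stream; a truncated download usually fails here.
    if ! gzip -t "$file" 2>/dev/null; then
        echo "gzip integrity test failed: $file"
        return 1
    fi
    actual=$(md5sum "$file" | cut -d ' ' -f 1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MD5 mismatch: $file"
        return 1
    fi
}
```

Note that a matching checksum only shows the download is identical to the provider's copy; it does not rule out a problem in the file as it was generated.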

If the last record looks ok, let's try checking the specific line that is referred to in that error message. The following should work:

head -n 526157 18627.fastq | tail -n 20

The record we're getting the error about should be the first one that is printed in that output.
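
One detail worth noting: `head -n 526157 | tail -n 20` prints lines 526138 through 526157, so the header on line 526137 falls one line above that window. As an alternative sketch (the function name show_fastq_record is made up here), sed can print exactly the four lines of a record given its starting line number:

```shell
# Sketch: print the 4 lines of the FASTQ record that starts on a given line.
# $1 = uncompressed FASTQ file, $2 = line number of the record's header line.
show_fastq_record() {
    file=$1
    start=$2
    sed -n "${start},$((start + 3))p" "$file"
}

# Example (not run here):
#   show_fastq_record 18627.fastq 526137
```

In the output, the second line (sequence) and the fourth line (quality) should have the same length.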

Follow up with the output from these commands if you need any help interpreting.

Greg

IBakeCake · November 3, 2022, 11:09pm

Hi @gregcaporaso

No problem. Thank you for your help troubleshooting this issue. The output from the commands below was the same. I have never had to do this step so any interpretations or materials that you feel could help I would greatly appreciate. When looking at the image, it looks like the third set of bp is missing but I have the phred33 scores. Maybe the phred scores downloaded but the bps were cut during the upload?

gunzip 18627.fastq.gz
tail 18627.fastq

and

head -n 526157 18627.fastq | tail -n 20

See image

Thank you for the article, I will contact the provider and take a look into the md5 sum files. I did not receive them originally.

All the best, Cake

gregcaporaso · November 5, 2022, 5:56pm

@IBakeCake, That actually looks fine to me - I don't think the bases or quality scores are cut off on that last one (the [kmt... that you're seeing is your command prompt coming back - is that what you're referring to?).

Hmmm... I'm at a bit of a loss here as to what could be going on. I think getting the md5 sums from the provider and checking those is a good next step. Let me know if you have questions in that process.

IBakeCake · November 6, 2022, 10:04am

Hi @gregcaporaso

I think I found the issue; do you think the presence of some adapter sequences may be interfering? In some fastq files about 10% of the sequences correspond to adapters, and in some samples it is as high as 20%.

I usually only run fastQC, cutadapt, trimmomatic, etc. on RNA sequencing data. This is the first time I had to do this for a 16S sequencing pipeline.

I think besides the adapters I should be good. I will try to fix the problematic files.

IBakeCake · November 6, 2022, 10:04am

@gregcaporaso

Thank you for your help; I am also at a loss here. Yeah, I realized I read it wrong; the sequence lengths and quality scores look fine to me as well. I will ask for the md5 sums to help address the issue.

In the meantime, I’m running the sequences through fastqc to check for any adaptor sequences that may be left in some of the forward or reverse reads. Do you have any other recommendations? I can share some fastqc outputs?

Thank you so much for the help.

gregcaporaso · November 8, 2022, 6:04pm

@IBakeCake, if you have quality scores for the adapter sequence, I don't think it would cause the issue that you're seeing (though it would still make sense to remove those adapters). Is there an issue related to Per base Sequence Quality that FastQC is identifying? That sounds like it could align with the issue you're running into.

gregcaporaso · November 8, 2022, 6:19pm

I just found a thread on the DADA2 issue tracker that suggests that this error could come up if paired end reads aren't merging. I'm not sure why that would be, but it's possible the reads aren't merging for many of your samples. Would you mind sharing your manifest_file.txt file - I just want to take a look to see if the forward and reverse read files seem to be associated correctly.

Alternatively, perhaps given the length of your reads and that you're finding adapters in the reads, maybe you're sequencing through the end of the amplicon and into adapter, and that's causing the reads to not merge. You could do an experiment here by passing, say, 200 for the two truncation length parameters (--p-trunc-len-f and --p-trunc-len-r) to qiime dada2 denoise-paired. That would trim the reads such that if you are sequencing through the reverse adapters they'll get trimmed off. If this turns out to fix the issue, ultimately you'd want to redo the adapter trimming with qiime cutadapt trim-paired, as that will let you specifically trim off adapters (where my suggestion for trimming based on length is pretty crude, but allows for a quick test).
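
A minimal sketch of that experiment, with a small helper for the arithmetic behind it. It assumes DADA2's default minimum overlap of 12 bases and ignores any trim-left settings. The function names and the output file names are made up for illustration.

```shell
# Sketch: longest amplicon that two truncated reads can still span while
# keeping the minimum overlap needed for merging.
# $1 = forward truncation length, $2 = reverse truncation length,
# $3 = minimum overlap (ASSUMPTION: defaults to 12, DADA2's default).
max_mergeable_length() {
    trunc_f=$1
    trunc_r=$2
    min_overlap=${3:-12}
    echo $((trunc_f + trunc_r - min_overlap))
}

# Sketch: the quick length-based test, truncating both reads at 200.
# $1 = demultiplexed artifact; output names are hypothetical.
run_trunc_experiment() {
    qiime dada2 denoise-paired \
        --i-demultiplexed-seqs "$1" \
        --p-trunc-len-f 200 \
        --p-trunc-len-r 200 \
        --o-table table-trunc200.qza \
        --o-representative-sequences rep-seqs-trunc200.qza \
        --o-denoising-stats stats-trunc200.qza
}

# Example (not run here):
#   max_mergeable_length 200 200    # prints 388
#   max_mergeable_length 294 244    # prints 526
#   run_trunc_experiment demux.qza
```

If the amplicon is longer than 388 bases, truncating both reads at 200 would itself stop them from merging, so a failure in that case would not say much about adapters.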

Finally, it may also be worth trying to run qiime dada2 denoise-single to see what happens then. You can run that on the same artifact (demux.qza) you're passing to qiime dada2 denoise-paired (the method will just ignore the reverse reads in that case).
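
As a sketch of that single-end test (the function name and output file names are hypothetical, and the truncation length is left as an argument because the right value depends on the quality plots):

```shell
# Sketch: denoise using forward reads only, on the same paired-end artifact.
# $1 = demultiplexed artifact, $2 = truncation length for the forward reads.
run_single_end_test() {
    qiime dada2 denoise-single \
        --i-demultiplexed-seqs "$1" \
        --p-trunc-len "$2" \
        --o-table table-single.qza \
        --o-representative-sequences rep-seqs-single.qza \
        --o-denoising-stats stats-single.qza
}

# Example (not run here):
#   run_single_end_test demux.qza 200
```

If this succeeds where denoise-paired fails, that points toward the merging step or the reverse reads rather than the forward reads.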

IBakeCake · November 8, 2022, 9:37pm

Hi @gregcaporaso,
Thank you!

Sure, please see the manifest file attached. For some sequencing files in the manifest file, I had to modify the fastq file (remove adapters, trim), while others were okay, so I left them zipped.

Yes, there were problems identified by FastQC and I used TrimGalore! to resolve them.

I think I found the major issue though. When I used: head -n 2808240 1868616S.fastq | tail -n 20 (as you suggested) in Linux, I could not see any issues with the sequences and their respective quality scores. When I used Microsoft Visual Studio Code to review the same line in the entire fastq file, I noticed that some sequences do not have any quality scores. (see image below)
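
Since a 20-line window can miss problems elsewhere in the file, a whole-file scan may help here. This is a minimal sketch assuming an uncompressed FASTQ with 4 lines per record; the function name fastq_structure_report is hypothetical. It also counts carriage returns, which are only one possible reason a file can look different in an editor than in a terminal.

```shell
# Sketch: summarize structural problems across an entire uncompressed FASTQ.
# Counts headers not starting with "@", separators not starting with "+",
# empty quality lines, and lines ending in a carriage return.
fastq_structure_report() {
    awk '
        /\r$/ { cr++ }
        NR % 4 == 1 && $0 !~ /^@/ { bad_header++ }
        NR % 4 == 3 && $0 !~ /^\+/ { bad_sep++ }
        NR % 4 == 0 && length($0) == 0 { empty_qual++ }
        END {
            printf "lines=%d complete_records=%s bad_headers=%d bad_separators=%d empty_quality=%d carriage_returns=%d\n",
                NR, (NR % 4 == 0 ? "yes" : "no"),
                bad_header, bad_sep, empty_qual, cr
        }' "$1"
}
```

Any non-zero count, or complete_records=no, suggests the file was malformed before import rather than damaged by QIIME 2.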

It makes sense that I was able to produce the demux-paired.qza since the sequences were present, but that validation of the demux-paired.qza with qiime tools failed since quality scores were missing. Thank you for the alternative suggestions. I asked the sequencing facility to fix this issue before I work on the analysis, but let me know if you think I should still try qiime dada2 denoise-paired before the Q-score issues are resolved.
OmniActive_manifest_file.txt (4.8 KB)

gregcaporaso · November 8, 2022, 9:43pm

@IBakeCake, glad you were able to make some progress with this! Getting it sorted out with the sequencing center makes sense to do first - the other suggestions were assuming that you didn't have issues like this in the file.

Don't hesitate to reach out again if you need help, and welcome to the QIIME 2 Forum!