Bizarre sequence quality from a Nature paper's NCBI data

gcagle1 · November 27, 2019, 9:02pm

Greetings forum,

Perhaps a common scenario: there is a paper published in Nature related to your area of research with publicly available sequencing data you want to further investigate. But have you ever downloaded the sequencing data and found it was of unusually poor quality?

I downloaded published 16S V4 amplicon sequences off of the NCBI database for the ~220 samples from the experiment. I created a manifest file and imported the sequences into qiime2 (version 2018.4 due to availability on our HPC). I ran demux summarize and got the weirdest quality plots I have ever seen.

I used grep to identify that the 515/806 primers are located 2 bases from the left end of both F and R reads. Therefore I'm assuming these reads are 150bp chemistry (the paper doesn't say).

My impression is that the reverse read is essentially useless because it would not overlap with the forward read after removing the low quality end. I tried a variety of trimming and truncating parameters with dada2 to no avail. The paper reports using a software called PEAR to pair the reads...

Proceeding with the forward read only I am able to get a decent number of sequences per sample from dada2 by setting --p-trunc-len 0. If I truncate the forward reads at 172 or 162 (to get a 150 or 140bp read, respectively, after removing the primer) none pass the filter. Interestingly, trimming the primers with dada2 instead of cutadapt results in ~28,000 more reads overall.

Based on this, I have a few general questions I would appreciate input on:

Could these quality scores possibly indicate that the fastq files might have corrupted during transfer to/from NCBI?
Would you trust results based on sequencing data with these quality scores?
Is there any metric after demux summarize and passing a denoising step (e.g. dada2) that you assess the quality of your data by?

Here is my process:

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest-HPC.csv \
  --output-path paired-end-demux.qza \
  --source-format PairedEndFastqManifestPhred33

Use cutadapt to remove primers followed by dada2

qiime cutadapt trim-paired \
	--i-demultiplexed-sequences paired-end-demux.qza \
	--p-cores 64 \
	--p-front-f GTGCCAGCMGCCGCGGTAA \
	--p-front-r GGACTACHVGGGTWTCTAAT \
	--o-trimmed-sequences trimmed-seqs.qza 

qiime dada2 denoise-single \
	--i-demultiplexed-seqs trimmed-seqs.qza  \
	--p-n-threads 0 \
	--output-dir dada2-single-trimmed-trunc-0 \
	--p-trunc-len 0 \
	--p-trim-left 0

Alternatively, trim the primers with dada2 (results in more reads passing)

qiime dada2 denoise-single \
	--i-demultiplexed-seqs paired-end-demux.qza \
	--p-n-threads 0 \
	--output-dir dada2-single-trunc-0-trim-22 \
	--p-trunc-len 0 \
	--p-trim-left 22

ben · November 27, 2019, 9:07pm

Yes, very strange, I don't know enough about the chemistry that results in the patterns in the forward/reverse quality charts.

From uploading these sequences into SRA myself, I don't think the process would corrupt the files? Most of what you can do is take the demultiplexed files straight from Illumina Basespace and upload them to NIH SRA. I agree with you though, I don't think this would past DADA2's quality filters.

Did you try reaching out to the manuscript authors to see if you can get a fastq.gz to test them yourself? Can you link the Bioproject?

ben · November 27, 2019, 9:14pm

By the way, staring at a ton of these MISEq quality graphs, these are very atypical for V4 runs. Usually, the V4 forward and reverse are great.