Perhaps a common scenario: there is a paper published in Nature related to your area of research with publicly available sequencing data you want to further investigate. But have you ever downloaded the sequencing data and found it was of unusually poor quality?
I downloaded published 16S V4 amplicon sequences off of the NCBI database for the ~220 samples from the experiment. I created a manifest file and imported the sequences into qiime2 (version 2018.4 due to availability on our HPC). I ran
demux summarize and got the weirdest quality plots I have ever seen.
grep to identify that the 515/806 primers are located 2 bases from the left end of both F and R reads. Therefore I'm assuming these reads are 150bp chemistry (the paper doesn't say).
My impression is that the reverse read is essentially useless because it would not overlap with the forward read after removing the low quality end. I tried a variety of trimming and truncating parameters with dada2 to no avail. The paper reports using a software called PEAR to pair the reads...
Proceeding with the forward read only I am able to get a decent number of sequences per sample from dada2 by setting
--p-trunc-len 0. If I truncate the forward reads at 172 or 162 (to get a 150 or 140bp read, respectively, after removing the primer) none pass the filter. Interestingly, trimming the primers with dada2 instead of cutadapt results in ~28,000 more reads overall.
Based on this, I have a few general questions I would appreciate input on:
- Could these quality scores possibly indicate that the fastq files might have corrupted during transfer to/from NCBI?
- Would you trust results based on sequencing data with these quality scores?
- Is there any metric after
demux summarizeand passing a denoising step (e.g. dada2) that you assess the quality of your data by?
Here is my process:
qiime tools import \ --type 'SampleData[PairedEndSequencesWithQuality]' \ --input-path manifest-HPC.csv \ --output-path paired-end-demux.qza \ --source-format PairedEndFastqManifestPhred33
Use cutadapt to remove primers followed by dada2
qiime cutadapt trim-paired \ --i-demultiplexed-sequences paired-end-demux.qza \ --p-cores 64 \ --p-front-f GTGCCAGCMGCCGCGGTAA \ --p-front-r GGACTACHVGGGTWTCTAAT \ --o-trimmed-sequences trimmed-seqs.qza qiime dada2 denoise-single \ --i-demultiplexed-seqs trimmed-seqs.qza \ --p-n-threads 0 \ --output-dir dada2-single-trimmed-trunc-0 \ --p-trunc-len 0 \ --p-trim-left 0
Alternatively, trim the primers with dada2 (results in more reads passing)
qiime dada2 denoise-single \ --i-demultiplexed-seqs paired-end-demux.qza \ --p-n-threads 0 \ --output-dir dada2-single-trunc-0-trim-22 \ --p-trunc-len 0 \ --p-trim-left 22