Losing high percentage of reads with good quality scores

ZHY · March 30, 2022, 8:14am

I have the same problem. I tried increasing the value of max-ee, but it did not improve anything, and I don't know what else to try. My data have already been preprocessed.

Keegan-Evans · April 5, 2022, 4:49pm

@ZHY,

Can you post all of the steps that you have performed on your data, as well as the command you are using to denoise? It would also be helpful if you could create and post some quality plots of your demultiplexed data using demux summarize (docs).
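For anyone following along, here is a minimal sketch of producing those quality plots. `summarize_demux` is a hypothetical wrapper and the file names in the example call are placeholders; the `--i-data` and `--o-visualization` options belong to `qiime demux summarize`.

```shell
# Hypothetical helper: summarize a demultiplexed artifact into a .qzv
# visualization containing per-sample read counts and quality plots.
#   $1 = demultiplexed artifact (.qza)
#   $2 = output visualization (.qzv)
summarize_demux() {
  qiime demux summarize \
    --i-data "$1" \
    --o-visualization "$2"
}

# Example call (placeholder file names):
# summarize_demux demux.qza demux.qzv
```

The resulting .qzv can be opened with `qiime tools view` or attached to a forum post.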

Wieneke · April 6, 2022, 7:12am

My data came demultiplexed. This is the import command and the result:
qiime tools import --type 'SampleData[PairedEndSequencesWithQuality]' --input-format PairedEndFastqManifestPhred33V2 --input-path ./HBSmanifest.tsv --output-path ./HBSpaired-end-demux.qza

HBSpaired-end-demux.qzv (316.8 KB)
and the denoising:
qiime dada2 denoise-paired --i-demultiplexed-seqs HBSpaired-end-demux.qza --p-trim-left-f 18 --p-trim-left-r 21 --p-trunc-len-f 250 --p-trunc-len-r 232 --o-table tableHBS.qza --o-representative-sequences rep-seqs-HBS.qza --o-denoising-stats denoising-statsHBS.qza --verbose

Wieneke · April 6, 2022, 7:31am

Sorry, I see now that it was @ZHY you asked for the data. I don't know how to delete this post.

ZHY · April 6, 2022, 8:01am

LH-7 is my single sample:
LH_7_dada2_data.qzv (345.2 KB)
rdl_data contains all samples:
rdl_data.qzv (345.9 KB)

ZHY · April 6, 2022, 8:01am

To retain more reads, I tried keeping the chimeric sequences, but it did not help.
qiime dada2 denoise-single \
  --i-demultiplexed-seqs /home/rendongliang/dada2_data.qza \
  --p-trim-left 0 \
  --p-trunc-len 0 \
  --p-max-ee 20 \
  --p-chimera-method none \
  --o-table dada2/table.qza \
  --o-representative-sequences dada2/rep-seqs.qza \
  --o-denoising-stats dada2/denoising-stats.qza

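When I process my single sample on its own, about half of its reads are retained, but when it is processed together with all samples, far fewer survive. Denoising stats for LH-7 (columns: sample-id, input, filtered, percentage of input passed filter, denoised, non-chimeric, percentage of input non-chimeric):
Chimeric sequences kept (all samples):
LH-7 21580 21554 99.88 2079 2079 9.63
Chimeric sequences kept (single sample):
LH-7 21580 19280 89.34 11995 11995 55.58
Chimeric sequences removed (single sample):
LH-7 21580 19280 89.34 11995 6928 32.1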
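The percentage columns in these rows are simply the corresponding read counts divided by the input count, which makes it easy to see at which stage reads are lost. A small sketch that recomputes them, assuming the numeric columns are input, filtered, percentage passed filter, denoised, non-chimeric, percentage non-chimeric; `stats_percentages` is a hypothetical helper, not part of QIIME 2.

```shell
# Recompute the share of input reads surviving each DADA2 stage.
# Arguments: input filtered denoised non_chimeric (all read counts).
stats_percentages() {
  awk -v input="$1" -v filtered="$2" -v denoised="$3" -v nonchim="$4" 'BEGIN {
    printf "passed filter: %.2f%%  denoised: %.2f%%  non-chimeric: %.2f%%\n",
           100 * filtered / input, 100 * denoised / input, 100 * nonchim / input
  }'
}

stats_percentages 21580 21554 2079 2079    # LH-7, all samples, chimeras kept
stats_percentages 21580 19280 11995 6928   # LH-7, single sample, chimeras removed
```

In the all-samples run nearly every read passes the filter, yet fewer than 10% survive denoising, so the loss happens in the denoising step rather than in filtering or chimera removal.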

Keegan-Evans · April 11, 2022, 5:14pm

@ZHY,

Could you post your import steps as well? Looking at the visualizations, it looks like the PHRED scores are not being processed correctly, which I think we will be able to fix during import.
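One way to sanity-check the PHRED encoding before importing is to look at the ASCII range of the quality characters in a raw FASTQ file. Phred+33 qualities start at '!' (ASCII 33), while Phred+64 qualities start at '@' (ASCII 64), so a minimum code below 59 cannot come from a +64 encoding. The sketch below assumes uncompressed, four-line FASTQ records; `phred_range` is a hypothetical helper, not a QIIME 2 command.

```shell
# Print the lowest and highest ASCII codes found in the quality lines of a
# FASTQ file. Assumes uncompressed, four-line records, so the quality string
# sits on every 4th line.
phred_range() {
  awk '
    BEGIN {
      # Lookup table from printable character to its ASCII code.
      for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
      min = 127; max = 0
    }
    NR % 4 == 0 {
      for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (!(c in ord)) continue   # skip stray characters such as \r
        if (ord[c] < min) min = ord[c]
        if (ord[c] > max) max = ord[c]
      }
    }
    END { printf "min=%d max=%d\n", min, max }
  ' "$1"
}

# Example call (placeholder file name):
# phred_range sample.fastq
```

Gzipped files need to be decompressed first, for example with `gzip -dc`.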

ZHY · April 14, 2022, 9:12am

I did not do much processing on my data, because the sequencing company had already processed it before I imported it.
So my import command is as follows:
qiime tools import \
  --input-path manifest.csv \
  --type SampleData[SequencesWithQuality] \
  --input-format SingleEndFastqManifestPhred33 \
  --output-path rdl_data.qza

Keegan-Evans · April 14, 2022, 7:01pm

@ZHY,

Try importing again, first with the --input-format set to SingleEndFastqManifestPhred33V2 and if that does not work try SingleEndFastqManifestPhred64V2.

ZHY · April 15, 2022, 7:46am

I tried both formats, but neither worked. SingleEndFastqManifestPhred33 is the only import format that works.

Keegan-Evans · April 15, 2022, 3:33pm

@ZHY,

Gahh, I forgot that the V2 signifies how the manifest file is built. Try keeping everything else the same but changing the 33 to 64. It still may not work, but let's try all the options here.
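In practice that means swapping only the format name, SingleEndFastqManifestPhred33 for SingleEndFastqManifestPhred64, and leaving the manifest untouched. A sketch of how the two imports could be run side by side; `import_single_end` is a hypothetical wrapper and the output file name in the example call is a placeholder.

```shell
# Hypothetical wrapper around `qiime tools import` for single-end reads, so
# that only the input format changes between attempts.
#   $1 = manifest file
#   $2 = input format name
#   $3 = output artifact (.qza)
import_single_end() {
  qiime tools import \
    --type 'SampleData[SequencesWithQuality]' \
    --input-path "$1" \
    --input-format "$2" \
    --output-path "$3"
}

# Example call (placeholder output name):
# import_single_end manifest.csv SingleEndFastqManifestPhred64 rdl_data_phred64.qza
```

Quoting the `--type` value keeps the shell from treating the square brackets as a glob pattern.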

ZHY · April 18, 2022, 11:19am

Yes, it did not work either, and it gave the error message shown in the picture.

I think the underlying reason may be that my data are simply not good.

ZHY · April 19, 2022, 11:47am

Thanks for your help. In my data, I can see multiple sequences with at least 90% similarity, and some of these sequences with good quality scores differ by only a single nucleotide. But I am not sure this is the main reason.

lizgehret · April 21, 2022, 6:56pm

An off-topic reply has been split into a new topic: Losing high percentage of reads in dada2 denoise-paired

Please keep replies on-topic in the future.

Keegan-Evans · April 28, 2022, 11:22pm

@ZHY,

I was able to take a closer look at your data, and based on the values of the PHRED scores present, it looks like it may not have been collected on an Illumina machine. The scores in your data span a wider range of values than any single Illumina variant would produce on its own, and the reads are longer than would be expected for standard Illumina reads. Do you know what technology was used to sequence your data? If not, could you ask your sequencing center?
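Read length is easy to check outside QIIME 2 by summarizing the sequence lines of a raw FASTQ file. This is only a sketch, not necessarily how the check above was done; it assumes uncompressed, four-line FASTQ records, and `read_length_summary` is a hypothetical helper.

```shell
# Report the number of reads and the min / max / mean read length of a FASTQ
# file. Assumes uncompressed, four-line records, so the sequence sits on
# lines 2, 6, 10, ...
read_length_summary() {
  awk '
    NR % 4 == 2 {
      len = length($0)
      if (n == 0 || len < min) min = len
      if (len > max) max = len
      total += len
      n++
    }
    END {
      if (n == 0) { print "no reads found"; exit }
      printf "reads=%d min=%d max=%d mean=%.1f\n", n, min, max, total / n
    }
  ' "$1"
}

# Example call (placeholder file name):
# read_length_summary sample.fastq
```

Illumina reads rarely exceed a few hundred bases, so lengths well beyond that point to a long-read platform.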

ZHY · May 3, 2022, 2:13am

Hi, thanks for your reply. I asked the sequencing center about the sequencing method. The company used PacBio SMRT to sequence my data.