'--p-chimera-method none' Still removing almost all reads due to being chimeric - 16s PacBio full length

Hi, I am having some issues with the majority of my 16s full length PacBio reads being reported as chimeric by the dada2 denoise-ccs step.

I am using Qiime2 v2023.7 and the command I am using is below.
qiime dada2 denoise-ccs --p-chimera-method none --p-min-fold-parent-over-abundance 99999999 --p-min-len 1000 --p-max-len 1600 --i-demultiplexed-seqs demux-pacbio.qza --p-front AGRGTTYGATYMTGGCTCAG --p-adapter RGYTACCTTGTTACGACTT --p-n-threads 12 --o-representative-sequences rep-seqs_pacbio_mod.qza --o-denoising-stats denoising-stats_pacbio_mod.qza --o-table table_pacbio_mod.qza --verbose > denoising_error_mod.txt 2>&1

As you can see, while the number of reads that pass the QC step is high, the number reported as chimeric is often >=90%. This is a rumen sample which I believe can play a part due to the DNA quality and environment complexity, however.. The below results are what I get after using the '--p-chimera-method none' parameter. Why is chimera detection still being done with this parameter turned off? I even tried with ' --p-min-fold-parent-over-abundance' set to 999999999 as seen in other guides.

sample-id       input   primer-removed  percentage of input primer-removed      filtered        percentage of input passed filter       denoised        non-chimeric    percentage of input non-chimeric
#q2:types       numeric numeric numeric numeric numeric numeric numeric numeric
m64209e_240116_115502.bc1005--bc1033    477     420     88.05   272     57.02   27      27      5.66
m64209e_240116_115502.bc1005--bc1035    1615    1477    91.46   1075    66.56   177     177     10.96
m64209e_240116_115502.bc1005--bc1044    17829   13198   74.03   4791    26.87   388     388     2.18
m64209e_240116_115502.bc1005--bc1045    28027   19317   68.92   7248    25.86   575     575     2.05

This is the end of the denoising_error_mod.txt log file.

Read in 48398, output 36145 (74.7%) filtered sequences.
9471 sequences out of 23581 are being reverse-complemented.
Read in 23581, output 17010 (72.1%) filtered sequences.
13004 sequences out of 30001 are being reverse-complemented.
Read in 30001, output 23159 (77.2%) filtered sequences.
............................................
2) Filtering ............................................
3) Learning Error Rates
368911829 total bases in 253424 reads from 44 samples will be used for learning the error rates.
4) Denoise samples 
............................................
5) Remove chimeras (method = none)
6) Report read numbers through the pipeline
7) Write output
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada.R --input_directory /var/folders/y_/p9fxs7c92t93sf12wkkm7nt40000gq/T/qiime2/nicholas/data/66c54186-b1d0-4e4b-8c70-9c613eaed452/data --output_path /var/folders/y_/p9fxs7c92t93sf12wkkm7nt40000gq/T/tmpdxa3wjgo/output.tsv.biom --output_track /var/folders/y_/p9fxs7c92t93sf12wkkm7nt40000gq/T/tmpdxa3wjgo/track.tsv --removed_primer_directory /var/folders/y_/p9fxs7c92t93sf12wkkm7nt40000gq/T/tmpdxa3wjgo/nop --filtered_directory /var/folders/y_/p9fxs7c92t93sf12wkkm7nt40000gq/T/tmpdxa3wjgo/filt --forward_primer AGRGTTYGATYMTGGCTCAG --reverse_primer RGYTACCTTGTTACGACTT --max_mismatch 2 --indels False --truncation_length 0 --trim_left 0 --max_expected_errors 2.0 --truncation_quality_score 2 --min_length 1000 --max_length 1600 --pooling_method independent --chimera_method none --min_parental_fold 99999999 --allow_one_off False --num_threads 12 --learn_min_reads 1000000 --homopolymer_gap_penalty NULL --band_size 32

Thank you.

Hi @Nicholas_Dimonaco ,
Welcome to the forum!

If I read your report correctly, there are not any reads being removed as chimera, as would be expected when this is set as "none". The "percentage of input non-chimeric" should rather be interpreted as the number of output reads / input reads * 100.

You appear to be losing a large number of reads at the filtering step, so adjusting your trimming and filtering parameters may be one solution.

Most reads appear to be lost at the denoising step, however. This is usually seen when there is either 1. a bad error model, or 2. low input sequence quality. I would start with #2, since it sounds like the filtering parameters are suboptimal already. You should check your read quality profiles to select appropriate trimming lengths. If you still see many reads lost at the denoising step, you should inspect the error models, which cannot be done with QIIME 2 at the moment but could be done with the dada2 R package.

Good luck!

1 Like

Hi @Nicholas_Bokulich .
Thanks for the quick response. I took a few days to dig around my results from different samples to make sure I was getting similar output and that I understood it all correctly.

It looks like depending on the type of sample and possibly extraction kits, while I do get some variation, the number of sequences removed due to what I now understand to be because of the denoising step, is extremely high. Some samples are around 50% and others are ~90-99%.
The 'better' performing samples were reported from the sequencer with a PHRED score of >=Q36 and the lower performing samples were all ~Q25 which may explain the results I am getting.

I have read other posts that mention similarly high levels of 'sequence-removal' due to denoising but there is not too much information specific to PacBio 16s sequences. Do you have any suggestions on parameter changes I could try?
I tried to increase '--p-max-ee' to 5,10,20 with little-to-no changes.

Obviously, I need to be careful when weighing up the proportion of reads I allow through and their quality but the samples that have >=90% of their reads removed seem to be unusable at this stage.

Any suggestions are welcome!

Thanks.

Hi @Nicholas_Dimonaco ,

I would start with more stringent read trimming/filtering, as the reads are probably being tossed out at the denoising step due to poor read quality. At what length are you trimming your reads? Could you share your demux summarize QZV here?

Increasing max-ee would probably lead to more reads being lost at denoising (instead of filtering) because you are passing more low-quality reads to the model training step. Again, you need to trim or filter out the low-quality sequences before they reach the denoising step.

The other option is to run dada2 directly in R, which allows you to inspect the model results. This is not yet an option in q2-dada2. I believe the dada2 R tutorial gives some explanation of the model result plots and how to interpret them. If more stringent filtering does not do the trick, I recommend trying this next.

1 Like

Hi @Nicholas_Bokulich,
I have been in contact with PacBio and inspected the output from our sequencing runs and it does definitely look like that for a few samples we have low-quality reads (Average of Q25).
The below figure is from our worse sample and shows we have a few 'double length' sequences with very low error but still from this figure I can't see why our average PHRED score is Q25 as when I remove those reads, the rest look good according to this graph at least!


reads_pacbio_summary.qzv (345.2 KB)
denoising-stats_tab.qzv (1.2 MB)

I reran the denoise-ccs step with multiple different parameter variations of the following options and got no real discernable differences in the number of reads reaching the chimera detection step. Often ~99% of my reads are removed... It does look like we do have some issues with our sequencing data.

--p-indels
--p-trim-left 20/50/200
--p-trun-q 10/20/30
--min-len 1000/1200
--p-max-len 1500/16000

I took one of our 'good' samples (different biological environment) and ran it through the Illumina single-end denoiser and the ccs denoiser and found that fewer reads are returned using the ccs option (Initial numbers are different because the adapter trimming is done before the dada2 step for the Illumina run). I guess this is ok though as the ccs denoising should be returning more 'appropriate' reads even if a few less <- Does this sound right?

PacBio ccs						
sample-id	input	filtered	percentage of input passed filter	denoised	non-chimeric	percentage of input non-chimeric
#q2:types	numeric	numeric	numeric	numeric	numeric	numeric
m64209e_230821_160440.bc1005--bc1033	18584	11958	64.35	11637	11419	61.45
m64209e_230821_160440.bc1005--bc1035	22476	14010	62.33	13711	13351	59.4
m64209e_230821_160440.bc1005--bc1044	10667	5493	51.5	4707	4691	43.98
m64209e_230821_160440.bc1005--bc1045	8916	5938	66.6	5431	5394	60.5
m64209e_230821_160440.bc1005--bc1054	13616	7467	54.84	6321	6217	45.66
m64209e_230821_160440.bc1005--bc1056	13469	6119	45.43	4503	4465	33.15
m64209e_230821_160440.bc1005--bc1057	22680	13697	60.39	13209	12571	55.43
m64209e_230821_160440.bc1005--bc1059	10187	7696	75.55	7366	7366	72.31
Illumina Single-end						
sample-id	input	filtered	percentage of input passed filter	denoised	non-chimeric	percentage of input non-chimeric
#q2:types	numeric	numeric	numeric	numeric	numeric	numeric
m64209e_230821_160440.bc1005--bc1033.fastq.gz	15225	13224	86.86	12802	12538	82.35
m64209e_230821_160440.bc1005--bc1035.fastq.gz	17857	15510	86.86	15093	14400	80.64
m64209e_230821_160440.bc1005--bc1044.fastq.gz	8688	6621	76.21	5726	5706	65.68
m64209e_230821_160440.bc1005--bc1045.fastq.gz	7624	6459	84.72	6090	6021	78.97
m64209e_230821_160440.bc1005--bc1054.fastq.gz	11598	8827	76.11	7133	6998	60.34
m64209e_230821_160440.bc1005--bc1056.fastq.gz	11582	7863	67.89	5316	5312	45.86
m64209e_230821_160440.bc1005--bc1057.fastq.gz	18438	15334	83.17	14560	13732	74.48
m64209e_230821_160440.bc1005--bc1059.fastq.gz	9610	8431	87.73	8030	8030	83.56

I will look into using dada2 via R for more fine-tuning but as we do have some 'good' samples as shown above, I think we may be hitting a limit of what is possible with this dataset :frowning:

Thank you again for the clarity given,
Nick.

1 Like

Hi @Nicholas_Dimonaco ,
Yes it looks like the quality is a bit mixed and more trimming/filtering would be needed if you want to try getting a larger number of shorter reads passing. So there are tradeoffs.

The subset of good samples that you show in the latest report do look very good. Many reads are lost, but this is of course the whole point — dada2 is doing its job :grin:

Good luck!

1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.