Hi, I am having some issues with the majority of my 16s full length PacBio reads being reported as chimeric by the dada2 denoise-ccs step.
I am using Qiime2 v2023.7 and the command I am using is below.
qiime dada2 denoise-ccs --p-chimera-method none --p-min-fold-parent-over-abundance 99999999 --p-min-len 1000 --p-max-len 1600 --i-demultiplexed-seqs demux-pacbio.qza --p-front AGRGTTYGATYMTGGCTCAG --p-adapter RGYTACCTTGTTACGACTT --p-n-threads 12 --o-representative-sequences rep-seqs_pacbio_mod.qza --o-denoising-stats denoising-stats_pacbio_mod.qza --o-table table_pacbio_mod.qza --verbose > denoising_error_mod.txt 2>&1
As you can see, while the number of reads that pass the QC step is high, the number reported as chimeric is often >=90%. This is a rumen sample which I believe can play a part due to the DNA quality and environment complexity, however.. The below results are what I get after using the '--p-chimera-method none' parameter. Why is chimera detection still being done with this parameter turned off? I even tried with ' --p-min-fold-parent-over-abundance' set to 999999999 as seen in other guides.
sample-id input primer-removed percentage of input primer-removed filtered percentage of input passed filter denoised non-chimeric percentage of input non-chimeric
#q2:types numeric numeric numeric numeric numeric numeric numeric numeric
m64209e_240116_115502.bc1005--bc1033 477 420 88.05 272 57.02 27 27 5.66
m64209e_240116_115502.bc1005--bc1035 1615 1477 91.46 1075 66.56 177 177 10.96
m64209e_240116_115502.bc1005--bc1044 17829 13198 74.03 4791 26.87 388 388 2.18
m64209e_240116_115502.bc1005--bc1045 28027 19317 68.92 7248 25.86 575 575 2.05
This is the end of the denoising_error_mod.txt log file.
Read in 48398, output 36145 (74.7%) filtered sequences.
9471 sequences out of 23581 are being reverse-complemented.
Read in 23581, output 17010 (72.1%) filtered sequences.
13004 sequences out of 30001 are being reverse-complemented.
Read in 30001, output 23159 (77.2%) filtered sequences.
............................................
2) Filtering ............................................
3) Learning Error Rates
368911829 total bases in 253424 reads from 44 samples will be used for learning the error rates.
4) Denoise samples
............................................
5) Remove chimeras (method = none)
6) Report read numbers through the pipeline
7) Write output
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.
Command: run_dada.R --input_directory /var/folders/y_/p9fxs7c92t93sf12wkkm7nt40000gq/T/qiime2/nicholas/data/66c54186-b1d0-4e4b-8c70-9c613eaed452/data --output_path /var/folders/y_/p9fxs7c92t93sf12wkkm7nt40000gq/T/tmpdxa3wjgo/output.tsv.biom --output_track /var/folders/y_/p9fxs7c92t93sf12wkkm7nt40000gq/T/tmpdxa3wjgo/track.tsv --removed_primer_directory /var/folders/y_/p9fxs7c92t93sf12wkkm7nt40000gq/T/tmpdxa3wjgo/nop --filtered_directory /var/folders/y_/p9fxs7c92t93sf12wkkm7nt40000gq/T/tmpdxa3wjgo/filt --forward_primer AGRGTTYGATYMTGGCTCAG --reverse_primer RGYTACCTTGTTACGACTT --max_mismatch 2 --indels False --truncation_length 0 --trim_left 0 --max_expected_errors 2.0 --truncation_quality_score 2 --min_length 1000 --max_length 1600 --pooling_method independent --chimera_method none --min_parental_fold 99999999 --allow_one_off False --num_threads 12 --learn_min_reads 1000000 --homopolymer_gap_penalty NULL --band_size 32
Thank you.
Hi @Nicholas_Dimonaco ,
Welcome to the forum!
If I read your report correctly, there are not any reads being removed as chimera, as would be expected when this is set as "none". The "percentage of input non-chimeric" should rather be interpreted as the number of output reads / input reads * 100.
You appear to be losing a large number of reads at the filtering step, so adjusting your trimming and filtering parameters may be one solution.
Most reads appear to be lost at the denoising step, however. This is usually seen when there is either 1. a bad error model, or 2. low input sequence quality. I would start with #2, since it sounds like the filtering parameters are suboptimal already. You should check your read quality profiles to select appropriate trimming lengths. If you still see many reads lost at the denoising step, you should inspect the error models, which cannot be done with QIIME 2 at the moment but could be done with the dada2 R package.
Good luck!
Hi @Nicholas_Bokulich .
Thanks for the quick response. I took a few days to dig around my results from different samples to make sure I was getting similar output and that I understood it all correctly.
It looks like depending on the type of sample and possibly extraction kits, while I do get some variation, the number of sequences removed due to what I now understand to be because of the denoising step, is extremely high. Some samples are around 50% and others are ~90-99%.
The 'better' performing samples were reported from the sequencer with a PHRED score of >=Q36 and the lower performing samples were all ~Q25 which may explain the results I am getting.
I have read other posts that mention similarly high levels of 'sequence-removal' due to denoising but there is not too much information specific to PacBio 16s sequences. Do you have any suggestions on parameter changes I could try?
I tried to increase '--p-max-ee' to 5,10,20 with little-to-no changes.
Obviously, I need to be careful when weighing up the proportion of reads I allow through and their quality but the samples that have >=90% of their reads removed seem to be unusable at this stage.
Any suggestions are welcome!
Thanks.
Hi @Nicholas_Dimonaco ,
I would start with more stringent read trimming/filtering, as the reads are probably being tossed out at the denoising step due to poor read quality. At what length are you trimming your reads? Could you share your demux summarize QZV here?
Increasing max-ee would probably lead to more reads being lost at denoising (instead of filtering) because you are passing more low-quality reads to the model training step. Again, you need to trim or filter out the low-quality sequences before they reach the denoising step.
The other option is to run dada2 directly in R, which allows you to inspect the model results. This is not yet an option in q2-dada2. I believe the dada2 R tutorial gives some explanation of the model result plots and how to interpret them. If more stringent filtering does not do the trick, I recommend trying this next.
Hi @Nicholas_Bokulich,
I have been in contact with PacBio and inspected the output from our sequencing runs and it does definitely look like that for a few samples we have low-quality reads (Average of Q25).
The below figure is from our worse sample and shows we have a few 'double length' sequences with very low error but still from this figure I can't see why our average PHRED score is Q25 as when I remove those reads, the rest look good according to this graph at least!
reads_pacbio_summary.qzv (345.2 KB)
denoising-stats_tab.qzv (1.2 MB)
I reran the denoise-ccs step with multiple different parameter variations of the following options and got no real discernable differences in the number of reads reaching the chimera detection step. Often ~99% of my reads are removed... It does look like we do have some issues with our sequencing data.
--p-indels
--p-trim-left 20/50/200
--p-trun-q 10/20/30
--min-len 1000/1200
--p-max-len 1500/16000
I took one of our 'good' samples (different biological environment) and ran it through the Illumina single-end denoiser and the ccs denoiser and found that fewer reads are returned using the ccs option (Initial numbers are different because the adapter trimming is done before the dada2 step for the Illumina run). I guess this is ok though as the ccs denoising should be returning more 'appropriate' reads even if a few less <- Does this sound right?
PacBio ccs
sample-id input filtered percentage of input passed filter denoised non-chimeric percentage of input non-chimeric
#q2:types numeric numeric numeric numeric numeric numeric
m64209e_230821_160440.bc1005--bc1033 18584 11958 64.35 11637 11419 61.45
m64209e_230821_160440.bc1005--bc1035 22476 14010 62.33 13711 13351 59.4
m64209e_230821_160440.bc1005--bc1044 10667 5493 51.5 4707 4691 43.98
m64209e_230821_160440.bc1005--bc1045 8916 5938 66.6 5431 5394 60.5
m64209e_230821_160440.bc1005--bc1054 13616 7467 54.84 6321 6217 45.66
m64209e_230821_160440.bc1005--bc1056 13469 6119 45.43 4503 4465 33.15
m64209e_230821_160440.bc1005--bc1057 22680 13697 60.39 13209 12571 55.43
m64209e_230821_160440.bc1005--bc1059 10187 7696 75.55 7366 7366 72.31
Illumina Single-end
sample-id input filtered percentage of input passed filter denoised non-chimeric percentage of input non-chimeric
#q2:types numeric numeric numeric numeric numeric numeric
m64209e_230821_160440.bc1005--bc1033.fastq.gz 15225 13224 86.86 12802 12538 82.35
m64209e_230821_160440.bc1005--bc1035.fastq.gz 17857 15510 86.86 15093 14400 80.64
m64209e_230821_160440.bc1005--bc1044.fastq.gz 8688 6621 76.21 5726 5706 65.68
m64209e_230821_160440.bc1005--bc1045.fastq.gz 7624 6459 84.72 6090 6021 78.97
m64209e_230821_160440.bc1005--bc1054.fastq.gz 11598 8827 76.11 7133 6998 60.34
m64209e_230821_160440.bc1005--bc1056.fastq.gz 11582 7863 67.89 5316 5312 45.86
m64209e_230821_160440.bc1005--bc1057.fastq.gz 18438 15334 83.17 14560 13732 74.48
m64209e_230821_160440.bc1005--bc1059.fastq.gz 9610 8431 87.73 8030 8030 83.56
I will look into using dada2 via R for more fine-tuning but as we do have some 'good' samples as shown above, I think we may be hitting a limit of what is possible with this dataset 
Thank you again for the clarity given,
Nick.
Hi @Nicholas_Dimonaco ,
Yes it looks like the quality is a bit mixed and more trimming/filtering would be needed if you want to try getting a larger number of shorter reads passing. So there are tradeoffs.
The subset of good samples that you show in the latest report do look very good. Many reads are lost, but this is of course the whole point — dada2 is doing its job 
Good luck!