Hi, I am having some issues with the majority of my 16s full length PacBio reads being reported as chimeric by the dada2 denoise-ccs step.
I am using Qiime2 v2023.7 and the command I am using is below. qiime dada2 denoise-ccs --p-chimera-method none --p-min-fold-parent-over-abundance 99999999 --p-min-len 1000 --p-max-len 1600 --i-demultiplexed-seqs demux-pacbio.qza --p-front AGRGTTYGATYMTGGCTCAG --p-adapter RGYTACCTTGTTACGACTT --p-n-threads 12 --o-representative-sequences rep-seqs_pacbio_mod.qza --o-denoising-stats denoising-stats_pacbio_mod.qza --o-table table_pacbio_mod.qza --verbose > denoising_error_mod.txt 2>&1
As you can see, while the number of reads that pass the QC step is high, the number reported as chimeric is often >=90%. This is a rumen sample which I believe can play a part due to the DNA quality and environment complexity, however.. The below results are what I get after using the '--p-chimera-method none' parameter. Why is chimera detection still being done with this parameter turned off? I even tried with ' --p-min-fold-parent-over-abundance' set to 999999999 as seen in other guides.
This is the end of the denoising_error_mod.txt log file.
Read in 48398, output 36145 (74.7%) filtered sequences.
9471 sequences out of 23581 are being reverse-complemented.
Read in 23581, output 17010 (72.1%) filtered sequences.
13004 sequences out of 30001 are being reverse-complemented.
Read in 30001, output 23159 (77.2%) filtered sequences.
............................................
2) Filtering ............................................
3) Learning Error Rates
368911829 total bases in 253424 reads from 44 samples will be used for learning the error rates.
4) Denoise samples
............................................
5) Remove chimeras (method = none)
6) Report read numbers through the pipeline
7) Write output
Running external command line application(s). This may print messages to stdout and/or stderr.
The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.
Command: run_dada.R --input_directory /var/folders/y_/p9fxs7c92t93sf12wkkm7nt40000gq/T/qiime2/nicholas/data/66c54186-b1d0-4e4b-8c70-9c613eaed452/data --output_path /var/folders/y_/p9fxs7c92t93sf12wkkm7nt40000gq/T/tmpdxa3wjgo/output.tsv.biom --output_track /var/folders/y_/p9fxs7c92t93sf12wkkm7nt40000gq/T/tmpdxa3wjgo/track.tsv --removed_primer_directory /var/folders/y_/p9fxs7c92t93sf12wkkm7nt40000gq/T/tmpdxa3wjgo/nop --filtered_directory /var/folders/y_/p9fxs7c92t93sf12wkkm7nt40000gq/T/tmpdxa3wjgo/filt --forward_primer AGRGTTYGATYMTGGCTCAG --reverse_primer RGYTACCTTGTTACGACTT --max_mismatch 2 --indels False --truncation_length 0 --trim_left 0 --max_expected_errors 2.0 --truncation_quality_score 2 --min_length 1000 --max_length 1600 --pooling_method independent --chimera_method none --min_parental_fold 99999999 --allow_one_off False --num_threads 12 --learn_min_reads 1000000 --homopolymer_gap_penalty NULL --band_size 32
If I read your report correctly, there are not any reads being removed as chimera, as would be expected when this is set as "none". The "percentage of input non-chimeric" should rather be interpreted as the number of output reads / input reads * 100.
You appear to be losing a large number of reads at the filtering step, so adjusting your trimming and filtering parameters may be one solution.
Most reads appear to be lost at the denoising step, however. This is usually seen when there is either 1. a bad error model, or 2. low input sequence quality. I would start with #2, since it sounds like the filtering parameters are suboptimal already. You should check your read quality profiles to select appropriate trimming lengths. If you still see many reads lost at the denoising step, you should inspect the error models, which cannot be done with QIIME 2 at the moment but could be done with the dada2 R package.
Hi @Nicholas_Bokulich .
Thanks for the quick response. I took a few days to dig around my results from different samples to make sure I was getting similar output and that I understood it all correctly.
It looks like depending on the type of sample and possibly extraction kits, while I do get some variation, the number of sequences removed due to what I now understand to be because of the denoising step, is extremely high. Some samples are around 50% and others are ~90-99%.
The 'better' performing samples were reported from the sequencer with a PHRED score of >=Q36 and the lower performing samples were all ~Q25 which may explain the results I am getting.
I have read other posts that mention similarly high levels of 'sequence-removal' due to denoising but there is not too much information specific to PacBio 16s sequences. Do you have any suggestions on parameter changes I could try?
I tried to increase '--p-max-ee' to 5,10,20 with little-to-no changes.
Obviously, I need to be careful when weighing up the proportion of reads I allow through and their quality but the samples that have >=90% of their reads removed seem to be unusable at this stage.
I would start with more stringent read trimming/filtering, as the reads are probably being tossed out at the denoising step due to poor read quality. At what length are you trimming your reads? Could you share your demux summarize QZV here?
Increasing max-ee would probably lead to more reads being lost at denoising (instead of filtering) because you are passing more low-quality reads to the model training step. Again, you need to trim or filter out the low-quality sequences before they reach the denoising step.
The other option is to run dada2 directly in R, which allows you to inspect the model results. This is not yet an option in q2-dada2. I believe the dada2 R tutorial gives some explanation of the model result plots and how to interpret them. If more stringent filtering does not do the trick, I recommend trying this next.
Hi @Nicholas_Bokulich,
I have been in contact with PacBio and inspected the output from our sequencing runs and it does definitely look like that for a few samples we have low-quality reads (Average of Q25).
The below figure is from our worse sample and shows we have a few 'double length' sequences with very low error but still from this figure I can't see why our average PHRED score is Q25 as when I remove those reads, the rest look good according to this graph at least!
I reran the denoise-ccs step with multiple different parameter variations of the following options and got no real discernable differences in the number of reads reaching the chimera detection step. Often ~99% of my reads are removed... It does look like we do have some issues with our sequencing data.
I took one of our 'good' samples (different biological environment) and ran it through the Illumina single-end denoiser and the ccs denoiser and found that fewer reads are returned using the ccs option (Initial numbers are different because the adapter trimming is done before the dada2 step for the Illumina run). I guess this is ok though as the ccs denoising should be returning more 'appropriate' reads even if a few less <- Does this sound right?
I will look into using dada2 via R for more fine-tuning but as we do have some 'good' samples as shown above, I think we may be hitting a limit of what is possible with this dataset
Hi @Nicholas_Dimonaco ,
Yes it looks like the quality is a bit mixed and more trimming/filtering would be needed if you want to try getting a larger number of shorter reads passing. So there are tradeoffs.
The subset of good samples that you show in the latest report do look very good. Many reads are lost, but this is of course the whole point — dada2 is doing its job