DADA2 pairwise alignments parameter tuning

salias · May 16, 2024, 2:45pm

I'm using DADA2 plugin in fungal ITS amplicon data. Reading the DADA2 paper I found that, in the first part of the algorithm (pairwise alignments) there are two options (KDIST_CUTOFF and BAND_SIZE) that control the heuristics of the alignments. In the paper they say that "default values should be re-examined if the algorithm is applied to genetic regions with significantly different characteristics, such as the indel-rich ITS region".

My question is: is there any way in QIIME2 to provide DADA2 with my own values of those two parameters?

colinvwood · May 17, 2024, 5:22pm

Hello @salias,

Unfortunately, it doesn't look like those parameters are available through qiime2. If you think there's a strong argument for exposing these parameters you can open an issue here. The dada2 tool has so many parameters spread across its pipeline that when it was wrapped in qiime2 I think that there was a balance that needed to be struck between configurability and ease of use/interpretability. If you are comfortable with R, you can use the package directly and have control over any and all parameters that dada2 provides.

salias · May 20, 2024, 7:42am

Hi @colinvwood ,

Thank you for your response. Yes, I think the same: there are so many parameters and exposing all of them would me a headache for the user. I really don't think these parameters are "important" enough for them to be available in q2-dada2 (because for example for 16S they seem to be sensible values already with the defaults), so I'll follow your advice and work with DADA2 in R and them importing to qiime2.

Nicholas_Bokulich · May 21, 2024, 10:57am

hi @salias hi @colinvwood ,

On the other hand, if these parameters are important for ITS or other non-16S targets then it may be important to expose these to users but set them to reasonable defaults. Many users are using QIIME 2 to analyze ITS and other non-16S targets, so we need to consider their needs as well. So let's not rule this out just yet...

This sounds quite hypothetical — they should be re-examined, but it sounds like the appropriate settings are not known and would need to be explored. @salias if you plan to do so with your ITS data, perhaps you could report your findings back here? Then we could decide if these parameters should be exposed in q2-dada2. Ideally, some benchmarking should be done with a ground truth (e.g., mock community or other positive control) to tune these, if possible.

Thanks!

salias · May 21, 2024, 11:36am

Hello @Nicholas_Bokulich ,

Thank you for your response.

That's true (I could serve as a example). The thing is that, as you say:

So maybe there is no need to fine-tune these parameters and we are just overthinking it. The only way to know if it is worth it is, as you say:

So what I could do is:

Get mock data e.g. from mockrobiota, as they do in the Fungal ITS analysis tutorial
Follow tutorial until the denoising step.
Export sequences, then use DADA2 in R and try combinations of a range of values of KDIST_CUTOFF and BAND_SIZE.
Go back to QIIME2, and do taxonomic classificiation for each test
Evaluate accuracy and see if best combinations are different enough from default values

I'm currently focusing in my ITS QIIME2 Snakemake pipeline but I can spend some time to do the benchmarking and then share my findings here. If we spot some improvements by changing the default values of those parameters, I could even try to do a pull request to the q2-dada2 GitHub repository, although I would need to do some research on plugin creation, structure and philosophy.

Best wishes

Nicholas_Bokulich · May 21, 2024, 12:03pm

Sounds great @salias , when you find the time to test this please let us know what you find!

salias · May 31, 2024, 10:58am

Hi again,

I'm doing the benchmarking and facing a problem. I'm using the mock ITS community of Bakker, 2018 (3 replicates of a Even mock community, 3 replicates of a Staggered A and 3 replicates of a Staggered B, Standard PCR conditions). I manually created a TSV (attached, expected-taxonomy.tsv) for the expected taxonomic composition using tables of the paper and the taxonomy as in UNITE last version (sh_general_release_dynamic_s_all_04.04.2024.fasta). I could annotate until species level in all species except for one (Candida apicola), because I did not find it in the UNITE file and for which I only put information until g__Candida level. I converted TSV to BIOM and BIOM to QZA following this part of the fungal ITS tutorial:

Fungal ITS analysis tutorial

Convert to biom and import

biom convert \
  -i expected-taxonomy-mod.tsv \
  -o expected-taxonomy.biom \
  --table-type="OTU table" \
  --to-json
qiime tools import \
 --type FeatureTable[RelativeFrequency] \
 --input-path expected-taxonomy.biom \
 --input-format BIOMV100Format \
 --output-path expected-taxonomy.qza

Then I used the DADA2 ITS tutorial (the only difference with the general DADA2 tutorial is the use of Cutadapt to remove primers and their reverse complements to prevent read-through). Once I ran Cutadapt and DADA2, I assigned taxonomy direcly with the native Naive Bayes implementation of DADA2. For that I used the exact same FASTA file I used for manually creating the expected composition (sh_general_release_dynamic_s_all_04.04.2024.fasta). Then I exported the ASV table as BIOM and the taxonomy as a text file, and converted both to QZA format following instructions in the tutorial Importing dada2 and Phyloseq objects to QIIME 2.

Then I followed the rest of the Fungal ITS analysis tutorial: I collapsed the table to the species level, and converted it to relative frequencies. But the evaluate-composition step keeps failing:

Plugin error from quality-control:

  min() arg is an empty sequence

Looking for someone facing the same error in the forum, I found e.g. here that they got the error because the expected composition table was not correctly built, and also because they used a different database for the observed and expected. However, I think my expected table is correctly built, and the database I used for the taxonomic assignation and for building the expected table is the same. I don't know where I am failing here. Any ideas? Thanks in advance

Best
expected-taxonomy.qza (8.8 KB)
expected-taxonomy.tsv (4.6 KB)
feature-table-relative.qza (27.1 KB)

Nicholas_Bokulich · May 31, 2024, 12:23pm

Hi @salias ,

I think this should be Starmarella apicola, which is present in UNITE.

Your error is possibly because your expected taxonomy contains uneven levels (i.e., because of the Candida you listed). If fixing that does not fix this, please use the --verbose flag and report the full error message here.

Good luck!

salias · May 31, 2024, 2:32pm

Hello ,

I corrected all my genus level annotations with the corresponding synonyms present in UNITE. I thought uneven levels could be used because there are two genus level taxonomies in the file used in the ITS tutorial, but now I'm seeing the sed parameter that corrects that. My bad, I should have scrolled the code box before

My new expected taxonomy file is attached. I converted it to QZA and sadly I'm facing the same error. The verbose output:

Traceback (most recent call last):
  File "/path/to/conda/env/lib/python3.8/site-packages/q2cli/commands.py", line 520, in __call__
    results = self._execute_action(
  File "/path/to/conda/env/lib/python3.8/site-packages/q2cli/commands.py", line 581, in _execute_action
    results = action(**arguments)
  File "<decorator-gen-642>", line 2, in evaluate_composition
  File "/path/to/conda/env/lib/python3.8/site-packages/qiime2/sdk/action.py", line 342, in bound_callable
    outputs = self._callable_executor_(
  File "/path/to/conda/env/lib/python3.8/site-packages/qiime2/sdk/action.py", line 615, in _callable_executor_
    ret_val = self._callable(output_dir=temp_dir, **view_args)
  File "/path/to/conda/env/lib/python3.8/site-packages/q2_quality_control/quality_control.py", line 77, in evaluate_composition
    results = _evaluate_composition(
  File "/path/to/conda/env/lib/python3.8/site-packages/q2_quality_control/_utilities.py", line 127, in _evaluate_composition
    score_plot = _pointplot_multiple_y(
  File "/path/to/conda/env/lib/python3.8/site-packages/q2_quality_control/_utilities.py", line 281, in _pointplot_multiple_y
    sns.pointplot(data=results, x=xval, y=score, ax=axes, color=color)
  File "/path/to/conda/env/lib/python3.8/site-packages/seaborn/categorical.py", line 2839, in pointplot
    plotter = _PointPlotter(x, y, hue, data, order, hue_order,
  File "/path/to/conda/env/lib/python3.8/site-packages/seaborn/categorical.py", line 1603, in __init__
    self.establish_colors(color, palette, 1)
  File "/path/to/conda/env/lib/python3.8/site-packages/seaborn/categorical.py", line 707, in establish_colors
    lum = min(light_vals) * .6
ValueError: min() arg is an empty sequence

Plugin error from quality-control:

  min() arg is an empty sequence

See above for debug info.

I attach the files I modified (the feature table is the same as before)

expected-taxonomy-corrected.tsv (4.6 KB)
expected-taxonomy-corrected.qza (8.8 KB)

Nicholas_Bokulich · May 31, 2024, 2:50pm

This error is coming from seaborn, the plotting package used by q2-qc. It is implying that a color scheme cannot be generated, which can happen if there are no values to plot.

Why would there be no values to plot? I suspect that there are no matches between the taxonomies OR the sample IDs.

So I took a closer look at the taxonomies, and even though there are matches at higher taxonomic levels, it looks like the species labels are reported differently, as if you used different databases for this. The expected taxonomy table has the genus and species reported at the species rank, but the observed taxonomy shows only the species name. So this is important and you should fix it, but it probably is not big enough to cause an error like you are seeing.

Bigger issue: the sample IDs do not match. E.g., you have "FMockA1_1.fastq.gz" in the observed, but "FMockA1" in the expected.

salias · May 31, 2024, 4:04pm

Good news @Nicholas_Bokulich :

It works now!

Sorry, it should have occurred to me to stop and look inside the observed feature table before coming here just to paste errors. Not very bioinformatic on my part

The samples did not match because I assumed that the standalone DADA2 trimmed the sample name from the FASTQ file name. Strange thing to assume, I know, but the thing is that until now I have always used DADA2 inside QIIME2, so I have always imported my sequences with a manifest file where I explicitly named the samples that way. Again, my mistake.

What I don't quite understand is the taxonomy issue. I mean, I understand the issue itself (the command is not going to crash because the file format is valid, but I'm not going to have species-level matches). What I don't understand is why the DADA2 taxonomy doesn't include it. I have used the same file (I only have one UNITE FASTA file on this machine), and if I search for the species with grep (which is how I have made the expected file: searching, copying and pasting), I see this format:

$ cat sh_general_release_dynamic_s_all_04.04.2024.fasta | grep s__Starmerella_apicola | cut -d "|" -f 5 | sort | uniq

Output (with genus in species):

k__Fungi;p__Ascomycota;c__Saccharomycetes;o__Saccharomycetales;f__Trichomonascaceae;g__Starmerella;s__Starmerella_apicola

Maybe DADA2's assignTaxonomy() function removes the genus part in the species section under the hood. I think that could be possible since DADA2 recognizes that my FASTA file is from UNITE (the function writes to output this: "UNITE fungal taxonomic reference detected"). Anyway, I corrected the expected file by removing the gender from the s__ part.

I attach the QZV just in case you want to see. That one is from a test run where I used default parameters: KDIST_CUTOFF = 0.42 and BAND_SIZE = 16. I still need to figure out what the output metrics mean, but I'll study them while the rest of the benchmarking is running. My plan is:

Reading the documentation, figure out if parameters should be increased or decreased for ITS, and how much.
Select, apart from defaults, 3-4 more values for each one.
Run all possible combinations.
Do all the above changing maxEE and truncQ parameters from 2 defaults to 8, since Rolling et al., 2022 state that those values are better for ITS. I want to take the opportunity to see how those parameters interact with the others.

Finally, possible limitations of my benchmarking based on the expected composition table (so everyone can take this into account when I finish):

Where the paper says Candida apicola, I say Starmerella apicoa (I think this is not going to be an issue)
Where the paper says Mortierella verticillata (basionym of Podila verticillata) I say Mortierella sp (neither Mortierella verticillata or Podila verticillata was found in my UNITE file, Mortierella at least had a Genus sp entry)
Where the paper says Neosartorya fischeri, I say its homotypic synonym Aspergillus fischeri (in this case my UNITE file has a Neosartorya sp entry but I think the synonym is more accurate)

When I finish the benchmarking I'll come back to post the results

Again, thank you so much

evaluation.qzv (367.0 KB)

Nicholas_Bokulich · May 31, 2024, 6:10pm

Glad you could get it working @salias !

Yes sounds like that must be the case

Hm I am skeptical of this. I would expect the number of expected errors to depend more on the run quality and sequencing platform than on the marker gene itself, so it will be interesting to see how this compares to your results with a different mock community on a different sequencing run. Another thing to try benchmarking... do not fall into a hole

Good luck with the next steps! I am curious to see what you find!

salias · June 11, 2024, 10:04am

Hi,

I finished with the benchmarking. I was very hesitant about how to present the results, and in the end I thought that the easiest thing to do for now would be to simply put all the summary graphs in an Excel file (that made me think: perhaps it is worthwhile for me to sit down, think carefully about these results and actually write a manuscript on this benchmarking?). I attach the Excel file at the end of my reply (ended in .qza because otherwise the upload tool complains), but now some comments on results :

First of all, what is BAND_SIZE and KDIST_CUTOFF? According to DADA2 documentation:

BAND_SIZE: when set, banded Needleman-Wunsch alignments are performed. Banding restricts the net cumulative number of insertion of one sequence relative to the other. The default value of BAND_SIZE is 16. If DADA is applied to sequencing technologies with high rates of indels, such as 454 sequencing, the BAND_SIZE parameter should be increased. Setting BAND_SIZE to a negative number turns off banding (i.e. full Needleman-Wunsch).
KDIST_CUTOFF: A 5-mer distance screen is performed prior to performing each pairwise
alignment, and if the 5mer-distance is greater than KDIST_CUTOFF, no alignment is performed. The default value of 0.42 was chosen to screen pairs of sequences that differ by >10%, and was calibrated on Illumina sequenced 16S amplicon data. The assumption is that sequences that differ by such a large amount cannot be linked by amplicon errors (i.e. if you sequence one, you won’t get a read of other) and so careful (and costly) alignment is unnecessary.

What I understood from there: BAND_SIZE in theory should be increased for ITS (I wasn't sure for KDIST_CUTOFF). So, the values I finally benchmarked (default values are highlighted):

BAND_SIZE: 12, 16, 20, 24, 28
KDIST_CUTOFF: 0.34, 0.42, 0.50

All 15 possible combinations were run, and then another 15 changing maxEE and truncQ values from 2 (default) to 8.

What I can tell visually from the plots:

TAR and TDR values are relatively good with defaults.
Nearly always, changing the defaults makes Plantae kingdom to appear (O/E taxa is 2 at Kingdom level).
For equal values of BAND_SIZE, reducing KDIST_CUTOFF has no effect in TAR and TDR (except the case where BAND_SIZE is the default, where TAR and TDR are worse in general, but in Species level is better).
For equal values of BAND_SIZE, increasing KDIST_CUTOFF shows same behaviour as in previous point.
For equal values of KDIST_CUTOFF, decreasing BAND_SIZE doesn't show improvements for last taxonomic levels.
For equal values of KDIST_CUTOFF, increasing BAND_SIZE doesn't show improvements either.
Changing maxEE and truncQ from 2 to 8 has no effect.

Maybe I should look for something more specific in order to extract conclusions, but it's my first time evaluating accuracy so I don't really know which things I should give more importance to when interpreting the results.

Sorry for the large amount of information

Best wishes

mock_results.xlsx.qza (1012.1 KB)

Nicholas_Bokulich · June 11, 2024, 10:21am

Hi @salias ,
Thanks for sharing these interesting results!

It is difficult to compare in the plots you sent, as separate plots are shown per setting instead of per metric. Per metric would make side-by-side comparison easier.

I would add Bray-Curtis as a metric and drop observed/expected taxa.

But so far it looks like the defaults are maybe already more or less optimal? Though maxEE=2, band_size=12, kdist=0.34 is looking interesting.

salias · June 11, 2024, 10:35am

Hi @Nicholas_Bokulich ,

Yes, it looks that they are more or less optimal for ITS too, so maybe there is no need to make them custom through q2-dada2 (that was the original issue I had).

Yes I agree, I simply used plots already in the QZV because I don't have much time, but maybe if I decide to make this a full paper I should re-arrange plots like you say. For now I need to focus on other things but I'll come back with this side project as soon as I find more time.

This surprised me a lot, since reducing BAND_SIZE should actually be worse for indel-rich amplicons according to DADA2 manual. Maybe I'm missing something here...

Again, thanks a lot for all the support

Sergio

benjjneb · June 12, 2024, 1:04am

Hi @salias
Thanks for your investigations in this thread. I can add a little bit of additional information:

DADA2 is trying to model the error process, not the evolutionary process. The ITS region is indel-rich because there is much less evolutionary constraint against indels. But that doesn't bother DADA2 necessarily -- evolutionary differences by a more complex (than substitutions) indel process are still differences, and so should result in different ASVs. Indels can become an issue for DADA2 when they are introduced by PCR or sequencing, as those are the errors DADA2 is trying to correct to the true sequences. That is why pyrosequencing has that recommendation of increased alignment band width, to allow for cases where large indel errors are introduced through the homopolymer issues associated with pyrosequencing.

In practice, we still find our default alignment heuristics to perform well with all current sequencing technologies we have tested. For ITS as well, because the DADA2 algorithm only needs a good alignment between ITS amplicon reads if they are similar enough that it is conceivable one is a PCR/sequencing error of the other. Reads that are more different than that don't need to be aligned well -- they will be split into different ASVs anyway.

salias · June 12, 2024, 11:29am

Hello @benjjneb ,

Thank you so much for your comment. It seems that I didn't fully understand the way DADA2 corrects the sequences, but now it's clearer to me.

So, trying to summarize this really long thread :

The purpose of the parameters is to help modelling the error process, so it is a misconception to modify them to account for evolutionary changes. In this regard, we could say DADA2 is marker-gene agnostic.
Although DADA2 paper says that default values (of BAND_SIZE and KDIST_CUTOFF) should be re-examined for genetic regions such as the ITS, you (DADA2 team) found that defaults perform well for ITS. So we can confidently go with defaults.
As @Nicholas_Bokulich said, maxEE and truncQ values setting should also depend more on the sequencing process itself - not the marker gene. I assume that when you say:

You are not only referring to the alignment parameters, but also to maxEE and truncQ (that were also tested in my benchmarking). Am I correct?

Just for reference, the part of the thread where I discuss that point:

Again, thank you so much.

Best wishes!

Sergio

benjjneb · June 13, 2024, 5:02pm

That isn't what I meant there, but yes, we still find that our defaults work well across sequencing technologies.

system · July 14, 2024, 11:02pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.