Impact of number of bootstraps on RDP classification

Hello all,

A colleague and I were comparing differences in taxa recovery from two different extraction kits on diatom/algal communities in rivers. Samples were processed in duplicate (not ideal, I understand, but this is actually a reanalysis of existing DNA extracts), one with each kit, and the two sets were processed separately but identically in terms of wet-lab and bioinformatics processes.

As part of the bioinformatic process (briefly, a probably familiar cutadapt/dada2/RDP (assignTaxonomy) pipeline), each dataset had its taxonomy assigned separately and the results were then compared. The two kits differed more than we initially expected (we had previously compared 22 samples as a trial), so we tried different confidence levels and noticed that, even at the same confidence threshold, taxonomic assignments (mainly the level at which a taxon was assigned) sometimes fluctuated between runs.

We dug into what was happening here, and essentially it seems to come down to the low number of bootstraps (100) used during taxonomic assignment. We ran some tests against a troublesome sequence, assigning taxonomy to it 500 times using 100, 1,000, and 10,000 bootstraps. We found that the confidence at 100 bootstraps fluctuated by about ±5 around the consensus value obtained from 10,000 bootstraps, whereas at 1,000 bootstraps it fluctuated by about ±1. Of course this isn't a lot, but it could change the final taxonomic assignment if minBoot is set near the consensus confidence. Additionally, this didn't appear to take much longer to run, although I am going to test this on a larger dataset!
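The size of that fluctuation is roughly what you'd expect if the reported confidence is a bootstrap proportion, whose run-to-run spread shrinks like 1/sqrt(B). This is my own back-of-envelope reasoning rather than anything from the dada2 source, but a quick simulation with a hypothetical worst-case agreement rate of p = 0.5 reproduces spreads close to what we observed:

```python
import random
import statistics

def simulate_confidence(p, n_boot, n_runs, rng):
    """Repeat a bootstrap confidence estimate n_runs times.

    Each run draws n_boot Bernoulli(p) 'bootstrap agreements' and
    reports the percentage that agreed, mimicking an RDP-style
    bootstrap confidence score.
    """
    runs = []
    for _ in range(n_runs):
        agreed = sum(rng.random() < p for _ in range(n_boot))
        runs.append(100.0 * agreed / n_boot)
    return runs

rng = random.Random(42)
p = 0.5  # hypothetical true agreement rate (worst case for variance)
for n_boot in (100, 1000, 10000):
    runs = simulate_confidence(p, n_boot, n_runs=500, rng=rng)
    # theoretical SD in percentage points: 100 * sqrt(p*(1-p)/n_boot)
    print(n_boot, round(statistics.stdev(runs), 2))
```

With these settings the standard deviation of the score comes out near 5 points at 100 bootstraps and near 1.6 points at 1,000, in line with the ±5 and ±1 ranges we saw.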

TL;DR: More bootstraps seem to be better, so why is the default 100?

For some context, I work as part of a research team where one of my main focuses is investigating how/where we can use DNA monitoring tools for regulatory use, and as such my tolerance for non-repeatability may be somewhat lower than most!


As a quick update, it seems the time to execute is not the same (although this is unsurprising!) and scales with the number of bootstraps run, on our test data (algae/diatoms against diat.barcode).

On the test workstation (AMD 5975 w/ 512GB RAM) it took:
100 bootstraps = 2.8 minutes
1,000 bootstraps = 24.28 minutes
10,000 bootstraps = 275 minutes (~4.6 hours)
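Those timings are consistent with a roughly constant per-bootstrap cost, i.e. close to linear scaling. A quick sanity check on the numbers above:

```python
# Runtime per bootstrap should be roughly constant if scaling is linear.
timings = {100: 2.8, 1000: 24.28, 10000: 275.0}  # bootstraps -> minutes
rates = {b: m / b for b, m in timings.items()}
for b, r in sorted(rates.items()):
    print(f"{b:>6} bootstraps: {r * 1000:.1f} min per 1,000 bootstraps")
```

All three rates land in a narrow band (roughly 24-28 minutes per 1,000 bootstraps on this workstation), so extrapolating cost to other bootstrap counts should be straightforward.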

I'm going to run each 500 times (I'm not sure I'll be patient enough for 10,000...) and compare the variance. Hopefully I can find a happy medium between the bootstrap confidence threshold and the number of bootstraps for our data.
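One way to pick that happy medium, using standard binomial reasoning rather than anything dada2-specific: if you want the reported confidence stable to within a given number of percentage points, you can solve for the bootstrap count that drives the score's standard deviation below that tolerance. A minimal sketch, assuming the score behaves like a binomial proportion with worst-case agreement rate p = 0.5:

```python
import math

def bootstraps_for_tolerance(tol_points, p=0.5):
    """Bootstraps needed so the SD of the reported confidence (in
    percentage points) is at most tol_points, assuming the score is a
    binomial proportion; p = 0.5 is the worst case for variance."""
    return math.ceil(p * (1 - p) * 100**2 / tol_points**2)

print(bootstraps_for_tolerance(5))  # -> 100
print(bootstraps_for_tolerance(1))  # -> 2500
```

Read that way, the default of 100 bootstraps buys you stability to about ±5 points, and getting to ±1 costs roughly 2,500, which also matches the spreads we measured at 100 versus 1,000 bootstraps.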
