Hello all,
A colleague and I were comparing differences in taxon recovery between two different extraction kits on diatom/algal communities in rivers. Samples were processed in duplicate (not ideal, I understand, but this is actually a reanalysis of existing DNA extracts), one with each kit, and the duplicates were processed separately but identically in terms of wet-lab and bioinformatic processes.
As part of the bioinformatic process (briefly, a probably familiar cutadapt/dada2 pipeline with RDP-style classification via assignTaxonomy), each dataset had its taxonomy assigned separately and the results were then compared. The two kits were more different than we initially expected (we had previously compared 22 samples as a trial), so we tried different confidence thresholds and noticed that, sometimes even at the same threshold, the taxonomic assignment (mainly the rank at which a sequence was assigned) fluctuated between runs.
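In case it's useful, this is roughly how we compared thresholds. It's a minimal sketch rather than our actual script: the sequence table object, the reference FASTA path, and the 50/80 thresholds are placeholders. Running assignTaxonomy with outputBootstraps = TRUE keeps the per-rank bootstrap confidences, so different minBoot values can be applied afterwards without re-running the classifier:

```r
library(dada2)

## ASVs from the standard dada2 workflow; "seqtab.nochim" and the reference
## path are placeholders for our actual objects/files
seqs <- getSequences(seqtab.nochim)
ref  <- "diatom_reference_trainset.fa.gz"

## Keep the per-rank bootstrap confidences so thresholds can be compared post hoc
assigned <- assignTaxonomy(seqs, ref, minBoot = 0,
                           outputBootstraps = TRUE, multithread = TRUE)

## Re-apply a minBoot threshold after the fact: truncate the assignment at the
## first rank whose bootstrap confidence falls below the threshold
apply_minboot <- function(assigned, minBoot) {
  tax <- assigned$tax
  for (i in seq_len(nrow(tax))) {
    fail <- which(assigned$boot[i, ] < minBoot)
    if (length(fail) > 0) tax[i, min(fail):ncol(tax)] <- NA
  }
  tax
}

## Compare how deep the assignments go under two different thresholds
deepest_rank <- function(tax) apply(tax, 1, function(x) sum(!is.na(x)))
table(minBoot50 = deepest_rank(apply_minboot(assigned, 50)),
      minBoot80 = deepest_rank(apply_minboot(assigned, 80)))
```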
We dug into what was happening here and essentially it seems to come down to the low number of bootstrap replicates (100) used during taxonomic assignment. We ran some tests against a troublesome sequence, assigning taxonomy to it 500 times using 100, 1,000, and 10,000 bootstraps. The confidence at 100 bootstraps fluctuated by roughly ±5 around the consensus value obtained from 10,000 bootstraps, whereas at 1,000 bootstraps it fluctuated by about ±1. Of course this isn't a lot, but it could change the final taxonomic assignment if minBoot is set near the consensus confidence. Additionally, the extra bootstraps didn't appear to take much longer to run, although I am going to test this on a larger dataset!
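For reference, this is the shape of the repeatability test on the troublesome sequence, again just a sketch with placeholder objects (the ASV index, the reference path, and the choice of the Genus rank are illustrative). As far as I can tell the number of bootstrap replicates is fixed inside the package rather than exposed as an argument, so the 1,000 and 10,000 bootstrap runs were done against a locally modified copy; the sketch below shows the check at the default 100:

```r
library(dada2)

problem_seq <- getSequences(seqtab.nochim)[123]   # the troublesome ASV (placeholder index)
ref <- "diatom_reference_trainset.fa.gz"          # placeholder reference database

## Assign taxonomy to the same sequence repeatedly and record the bootstrap
## confidence at the rank of interest (assumes the default taxLevels; slow,
## since the reference is re-read on every iteration)
n_reps <- 500
genus_boot <- vapply(seq_len(n_reps), function(i) {
  res <- assignTaxonomy(problem_seq, ref, minBoot = 0, outputBootstraps = TRUE)
  res$boot[1, "Genus"]
}, numeric(1))

summary(genus_boot)   # spread of the confidence across the repeated runs
sd(genus_boot)
```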
TL;DR: More bootstraps seem to be better, so why is the default 100?
For some context, I work as part of a research team where one of my main focuses is investigating how and where we can use DNA monitoring tools for regulatory purposes, and as such my tolerance for non-repeatability may be a little lower than most!