I work in a pathology lab, and we have generated a large environmental DNA (eDNA) dataset over a period of a year and across multiple sites. We have been screening this data for regulated plant pathogens and have found quite a number of them, which is interesting.
However, being a pathology lab, we are less interested in counts and more interested in presence/absence data for these regulated pathogens. High or low read counts between samples and sites really don't matter to us, because their presence at any level is a concern. Instead, we wish to examine the discrete presence and absence of these pathogens.
Rather than feeding count data into ANCOM, can we transform the feature table to 1s (for presence) and 0s (for absence) and run an ANCOM on that instead? I have read online that this might affect some of the assumptions used in ANCOM.
Can anyone advise if this would be an appropriate use of ANCOM?
My recommendation would be not to use ANCOM-BC for this application. I've used a Poisson regression in the past (Panel 1e of this paper and check the methods) and that might suit your needs better.
In my large community datasets, I tend to define a threshold for detection (e.g. at least 1/1000 reads, usually based on my rarefaction depth) and then look at the relative risk of carriage. If you don't like setting the depth, you could also adjust your Poisson regression for the sequencing depth, so your observation is conditioned on how many reads were sequenced. I haven't compared the two, though.
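To make the threshold-then-relative-risk idea concrete, here is a minimal sketch in plain Python. It is not the Poisson regression itself, just a two-group relative risk with a Wald interval on the log scale. The function names (`call_presence`, `relative_risk`), the two sites, and the read counts are all made up for illustration; only the 1/1000 cutoff comes from the post above.

```python
import math

def call_presence(samples, min_rel_abundance=1e-3):
    """Return a 1/0 presence call per sample.

    `samples` is a list of (taxon_reads, total_reads) tuples. A taxon is
    called present when its relative abundance meets the threshold
    (1/1000 reads here, as an example cutoff).
    """
    calls = []
    for taxon_reads, total_reads in samples:
        rel = taxon_reads / total_reads if total_reads > 0 else 0.0
        calls.append(1 if rel >= min_rel_abundance else 0)
    return calls

def relative_risk(calls_a, calls_b, z=1.96):
    """Relative risk of carriage in group A vs group B, with a Wald CI.

    Uses the standard error of log(RR) for two independent proportions.
    Raises if either group has zero positives, where log(RR) is undefined.
    """
    a, n1 = sum(calls_a), len(calls_a)
    c, n2 = sum(calls_b), len(calls_b)
    if a == 0 or c == 0:
        raise ValueError("relative risk is undefined with zero positives in a group")
    rr = (a / n1) / (c / n2)
    se = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    lower = math.exp(math.log(rr) - z * se)
    upper = math.exp(math.log(rr) + z * se)
    return rr, lower, upper

# Hypothetical (taxon_reads, total_reads) per sample at two sites.
site_a = [(12, 10000), (0, 10000), (25, 10000), (3, 10000)]
site_b = [(0, 10000), (11, 10000), (2, 10000), (0, 10000)]

calls_a = call_presence(site_a)
calls_b = call_presence(site_b)
rr, lower, upper = relative_risk(calls_a, calls_b)
print(calls_a, calls_b, rr, lower, upper)
```

With only four samples per site the interval is very wide, so treat this as a way to see the mechanics. For real data with covariates (site, season, sequencing depth as an offset), a regression model is the better tool.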
I actually think the statistical approach is the easy part. I think defining "present" can be hard, particularly given the noise in microbiome data. I tend to define "present" as a minimum relative abundance; if you define it as a single read (or a number of reads), make sure you're considering well-to-well splash-over and reagent contamination.
You may also want to be careful with your taxonomic assignments and database. Making sure you have the right outgroups is important, and you may want to BLAST hits to make sure you're finding what you're looking for. Of particular note, E. coli is indistinguishable from Shigella in 16S rRNA sequencing (and much of metagenomics), so making sure your database is consistent enough to detect those correctly is important.
We had been looking at a Poisson regression for this data, so it is good to have confirmation that this is an appropriate statistical test.
Regarding the definition of what is present, this is a challenge, and we are going to caveat the eDNA findings as being the first step in an investigative process rather than an end result.
We can't and wouldn't take immediate action on these results alone; our diagnostic standard requires proof, and we would need to perform follow-up testing, which would be physical examination and further diagnostics. We are also going to caveat the taxonomic assignments.
There is a large amount of work to be done to validate this eDNA approach before it can be widely used in our area.