I'm collapsing my feature table to the family level before running ANCOM-BC. I generated the QZV from the collapsed QZA table and found an overly general taxonomic assignment with a very high frequency, in fact the second highest: k__Fungi;__;__;__;__.
(To clarify: I'm using ITS sequencing data, and I assigned taxonomy with the UNITE database.)
This is not the only overly general taxonomic assignment in the table: I also spotted things like Unassigned;__;__;__;__ and k__Fungi;p__Ascomycota;__;__;__ (but those have lower frequencies, so I'm less worried about them).
I'm not sure what the best way to handle these general assignments is. For now, the options I'm considering are:
1. Perform ANCOM-BC on the collapsed table directly.
2. Remove every overly general taxonomic assignment I find.
3. Remove overly general taxonomic assignments only if they have low frequency. In my case, that would mean keeping k__Fungi;__;__;__;__ and excluding the rest.
Any suggestions? I believe there is a consensus on how to deal with overly general taxonomy, but I cannot find this exact issue discussed in the forum.
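To make the difference between options 2 and 3 concrete, here is a minimal Python sketch on a toy family-level table. The frequencies, the `MIN_FREQ` cutoff and the helper `assigned_depth` are all hypothetical, and the sketch assumes that unresolved ranks appear as `__` (or as a bare prefix such as `f__`); labels such as `p__unidentified` are not handled.

```python
# Toy family-level collapsed table: label -> total frequency.
# All numbers are made up for illustration.
collapsed = {
    "k__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Pleosporaceae": 5200,
    "k__Fungi;__;__;__;__": 4100,
    "k__Fungi;p__Ascomycota;__;__;__": 300,
    "Unassigned;__;__;__;__": 150,
}

FAMILY_DEPTH = 5   # kingdom, phylum, class, order, family
MIN_FREQ = 1000    # hypothetical cutoff for option 3


def assigned_depth(label):
    """Count leading ranks that carry a real name (not '__' or 'Unassigned')."""
    depth = 0
    for rank in label.split(";"):
        rank = rank.strip()
        if not rank or rank.endswith("__") or rank == "Unassigned":
            break
        depth += 1
    return depth


# Option 2: drop every label that is not resolved down to family.
option2 = {
    label: freq
    for label, freq in collapsed.items()
    if assigned_depth(label) >= FAMILY_DEPTH
}

# Option 3: drop general labels only when they are also rare.
option3 = {
    label: freq
    for label, freq in collapsed.items()
    if assigned_depth(label) >= FAMILY_DEPTH or freq >= MIN_FREQ
}

for label, freq in collapsed.items():
    print(assigned_depth(label), freq, label)
```

With these toy numbers, option 2 keeps only the family-resolved label, while option 3 additionally keeps k__Fungi;__;__;__;__ because it is abundant.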
I think you have outlined three good options. I would like to suggest a 4th.
4. Run ANCOM-BC at the ASV level (and add taxonomy information later).
This may be more difficult bioinformatically, but means the ANCOM results will not be biased by taxonomy. Because taxonomy can be hard for ITS, this is super helpful.
Thank you so much. Yes, in fact my workflow runs ANCOM-BC at ASV, species, genus and family levels. For ASV level I'll load the taxonomy in R and map hashes to taxonomies prior to plotting.
For now I'll try that without removing the general taxonomies and see what happens. Maybe they are really abundant in all samples but not differentially abundant, so they won't bother me. If they do, or if I run into any issue when running ANCOM-BC on filtered tables, I'll most likely come back here.
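The hash-to-taxonomy mapping described above can be sketched as follows (shown in Python rather than R, standard library only). The feature IDs, the results records with `id` and `lfc` fields, and the helper `deepest_name` are hypothetical; the three-column layout (`Feature ID`, `Taxon`, `Confidence`) is an assumption about the exported taxonomy table, so adjust it to your own files.

```python
import csv
import io

# Assumed layout of an exported taxonomy table (tab-separated).
taxonomy_tsv = (
    "Feature ID\tTaxon\tConfidence\n"
    "a1b2c3d4\tk__Fungi;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;"
    "f__Pleosporaceae;g__Alternaria\t0.98\n"
    "e5f6a7b8\tk__Fungi;p__Ascomycota\t0.81\n"
    "c9d0e1f2\tUnassigned\t0.55\n"
)

# Hypothetical differential abundance results keyed by ASV hash.
results = [
    {"id": "a1b2c3d4", "lfc": 1.7},
    {"id": "e5f6a7b8", "lfc": -0.4},
    {"id": "c9d0e1f2", "lfc": 0.2},
]


def deepest_name(taxon):
    """Return the most specific named rank in a taxonomy string."""
    ranks = [r.strip() for r in taxon.split(";")]
    named = [r for r in ranks if r and not r.endswith("__")]
    return named[-1] if named else "Unassigned"


reader = csv.DictReader(io.StringIO(taxonomy_tsv), delimiter="\t")
taxonomy = {row["Feature ID"]: row["Taxon"] for row in reader}

# Keep a short piece of the hash in the label so that two ASVs with the
# same taxonomy remain distinguishable in a plot.
labels = {
    fid: f"{deepest_name(taxon)} ({fid[:6]})"
    for fid, taxon in taxonomy.items()
}

for row in results:
    row["label"] = labels.get(row["id"], row["id"])
    print(row["label"], row["lfc"])
```

Appending part of the hash is a design choice: it keeps poorly classified ASVs (for example several that only reach p__Ascomycota) from being plotted under one identical name.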
Hi @salias ,
I would go with your option 2 (removing poorly classified sequences) after checking a few with NCBI BLAST to see what else they may hit.
Most ITS primers can amplify non-fungal eukaryotes, and if these are not represented in your database you will often get these poor classifications. So it is often a good idea to use the UNITE fungi + eukaryotes database to detect these non-target hits.
Presumably you don't want non-fungi in your survey, so I would just remove them (after confirming that they are non-fungal, junk, etc.).
Keeping them might not be a good idea either, because they would then still be included in alpha and beta diversity and other measurements, where non-target sequences could lead to misleading results. But that depends on your biological question and methods...
Yes, I use the UNITE version with all eukaryotes and remove all the non-fungi I come across (k__Alveolata, k__Metazoa, k__Viridiplantae, etc.). I was especially worried about the Unassigned and the overly general Fungi annotations, so I kept those.
Okay, I'll check with BLAST, and if I find overly general fungal matches I'll remove them. If I find something specific, maybe it's time to RESCRIPt.
Yes, currently my diversity analyses include those too general fungal results. I do diversity analysis directly on ASVs (because as far as I understood ASVs, collapsing before diversity would mean losing the advantages of using ASVs). So I suppose what I should do is use the taxa barplot visualization to make sure the uncollapsed table is okay before running diversity.
Apart from those overly general kingdom-level assignments, I also found entries that are only classified down to phylum level (e.g. k__Fungi;p__Ascomycota;__), class level, order level and so on. Should I keep those when, for example, I want to run ANCOM-BC on a family-collapsed table? Or should I be permissive up to one rank above (e.g. allow nothing more general than order-level assignments when using a family-collapsed table)?
In the past I have found that anything that does not classify to at least class or order level is usually junk as well, e.g., non-target sequences. So I would also check the ASVs that only classify to phylum level to confirm, but they are probably something that should be removed too...
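As a rough illustration of that rule of thumb, the following Python sketch flags ASVs that are not classified to at least a chosen rank and writes them out as FASTA text for a manual BLAST check. The ASV IDs, sequences, the `MIN_RANK` choice and the helper `deepest_rank_index` are hypothetical, and flagged ASVs are candidates for inspection, not for automatic removal.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]
MIN_RANK = "class"  # hypothetical threshold; "order" would be stricter

# Toy ASV-level taxonomy and sequences (made up for illustration).
taxonomy = {
    "asv1": "k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Hypocreales",
    "asv2": "k__Fungi;p__Ascomycota",
    "asv3": "k__Fungi",
    "asv4": "Unassigned",
}
sequences = {
    "asv1": "ACGTTGCAAC",
    "asv2": "TTGACCGTAA",
    "asv3": "GGCATTACGT",
    "asv4": "CCATGGTTAA",
}


def deepest_rank_index(taxon):
    """Index into RANKS of the deepest named rank, or -1 if none is named."""
    idx = -1
    for i, rank in enumerate(taxon.split(";")):
        rank = rank.strip()
        if not rank or rank.endswith("__") or rank == "Unassigned":
            break
        idx = i
    return idx


threshold = RANKS.index(MIN_RANK)
needs_review = [
    fid for fid, taxon in taxonomy.items()
    if deepest_rank_index(taxon) < threshold
]

# FASTA text that can be pasted into a BLAST search.
fasta = "".join(f">{fid}\n{sequences[fid]}\n" for fid in needs_review)
print(fasta)
```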
Sorry for bringing this back again, but I've been thinking about this:
I think that, maybe, those sequences with only a fungal phylum assigned (or simply k__Fungi;__;__;__;__) that do not match anything meaningful in NCBI BLAST could actually be uncultured / uncharacterised fungi that are not present in either UNITE or the BLAST database. So if the biological question is about searching for as yet undescribed fungi, we should keep them. And when performing e.g. ANCOM-BC, it should be done at the ASV level (because if we collapse by taxonomy, two different uncultured fungi annotated as k__Fungi;p__Ascomycota;__;__;__ may be treated as the same one).
Maybe I'm overthinking it but in my head this approach sounds nice.
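The concern about collapsing can be illustrated with a small Python sketch. The table, the sample names and the `collapse` helper are hypothetical; the helper only imitates a taxonomic collapse by padding missing ranks with `__` and summing counts per label, and it is not the QIIME 2 implementation.

```python
from collections import defaultdict

# Two toy ASVs with the same poor classification but opposite patterns
# between a control and a treatment group (made-up counts).
table = {
    "asv_a": {"ctrl_1": 90, "ctrl_2": 80, "trt_1": 10, "trt_2": 20},
    "asv_b": {"ctrl_1": 10, "ctrl_2": 20, "trt_1": 90, "trt_2": 80},
}
taxonomy = {
    "asv_a": "k__Fungi;p__Ascomycota",
    "asv_b": "k__Fungi;p__Ascomycota",
}


def collapse(table, taxonomy, level):
    """Sum counts per taxonomy label truncated or padded to `level` ranks."""
    collapsed = defaultdict(lambda: defaultdict(int))
    for fid, counts in table.items():
        ranks = [r.strip() for r in taxonomy[fid].split(";")]
        ranks += ["__"] * (level - len(ranks))
        label = ";".join(ranks[:level])
        for sample, n in counts.items():
            collapsed[label][sample] += n
    return {label: dict(c) for label, c in collapsed.items()}


family = collapse(table, taxonomy, level=5)
for label, counts in family.items():
    print(label, counts)
```

In this toy case both ASVs end up under k__Fungi;p__Ascomycota;__;__;__ with 100 counts in every sample, so the opposite trends visible at the ASV level cancel out after collapsing.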
Well, it depends on your samples and question. How likely is it to find a novel fungal phylum in your samples? Discovering a totally novel phylum is certainly possible, but unlikely, with the degree of unlikelihood depending on where you look.
And even if it were a new phylum, you would still expect this to hit other fungi in the nr database with NCBI BLAST, just with a poor alignment. So inspect the alignments, and if it hits nothing it's questionable (more likely junk than a new phylum). Make sure to search the full nr database with these, do not restrict to a specific group.
Yes indeed, I would always do differential abundance testing at the ASV level (even if taxonomic levels are tested separately), because some individual ASVs could be differentially abundant, and that is interesting. But I would only do this on valid ASVs, as differential abundance of, e.g., non-target DNA is probably not useful (though this also depends on your experimental goals, question, system, etc.).
Okay, I understand now. I was thinking that e.g. k__Fungi;p__Ascomycota;__;__;__ could be assigned by the classifier because the sequence is equally likely to belong to two species of different classes (e.g. class c__A and class c__B), when in real life the species belongs to class A but is not sufficiently well described. But you are right, in those cases the most likely scenario is that the sequence is junk (or a novel class, but that would be very very very rare).
Thank you so much, and sorry for cluttering the forum with my questions.
Your first inference is one possible scenario; the other possibility is that the sequence simply does not resemble any class with a sufficiently high degree of probability.
On the contrary, thanks for the lively discussion! These are great questions, and you are not alone in searching for answers...