Unclassified at the phylum level

microbiome_25 · August 28, 2024, 11:01am

I would like to ask about unclassified bacteria at the phylum level.

I am using Greengenes2 (2022.10 full length sequences) for taxonomic classification, using "qiime feature-classifier classify-sklearn" on QIIME2 version 2023.9.
I got d__Bacteria;__ and d__Bacteria;p__ at the phylum level.
They seem to be treated as different phyla in the classification, but should I merge them into one phylum category (unclassified bacteria)?
I have seen discussions in this forum that it is better to discard bacteria that are not classified at the phylum level.
However, I could not find published papers mentioning this in their method sections.
Could you please give me some insight on this as well?

Thank you very much!

salias · August 28, 2024, 4:28pm

Hello!

Functionally, they are the same (lack of) phylum. The difference is that, for d__Bacteria;__ the classifier couldn't assign any taxonomy beyond the domain level; whereas for d__Bacteria;p__ the classifier found a match but that match is not annotated at phylum level. So, long story short: yes, they can be understood as the same on a practical level. This post addresses the same issue.

Yes, you could merge them into one "Unclassified" phylum, although I personally prefer to get rid of these too general taxonomic annotations. I came to this conclusion when I asked this question a couple of months ago.
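If you go the merging route, the idea can be sketched in a few lines of plain Python. This is only an illustration, not a QIIME 2 API: the function names `phylum_label` and `collapse_by_phylum`, the feature IDs, and the counts are all made up, and it assumes taxonomy strings are semicolon-separated with the phylum in the second position.

```python
from collections import Counter


def phylum_label(taxonomy: str) -> str:
    """Return the phylum label, or 'Unclassified' when no phylum name is given."""
    ranks = [r.strip() for r in taxonomy.split(";")]
    # The phylum is the second rank; the string may stop at the domain.
    phylum = ranks[1] if len(ranks) > 1 else ""
    # "__" (nothing assigned) and "p__" (empty reference label) both
    # leave an empty name once the rank prefix is removed.
    name = phylum.split("__", 1)[-1]
    return phylum if name else "Unclassified"


def collapse_by_phylum(counts, taxonomy):
    """Sum per-feature counts into per-phylum counts."""
    merged = Counter()
    for feature_id, n in counts.items():
        merged[phylum_label(taxonomy.get(feature_id, ""))] += n
    return dict(merged)


# Hypothetical feature IDs and counts, for illustration only.
taxonomy = {
    "asv1": "d__Bacteria; p__Firmicutes_A; c__Clostridia",
    "asv2": "d__Bacteria;__",
    "asv3": "d__Bacteria;p__",
    "asv4": "d__Bacteria; p__Firmicutes_A",
}
counts = {"asv1": 50, "asv2": 5, "asv3": 7, "asv4": 10}

print(collapse_by_phylum(counts, taxonomy))
# {'p__Firmicutes_A': 60, 'Unclassified': 12}
```

Both `d__Bacteria;__` and `d__Bacteria;p__` end up in the same "Unclassified" bin, which is the practical equivalence described above.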

Sadly, there is a lot of bioinformatic work out there whose Methods section does not allow replication because the explanations are too shallow (not to mention work with no shared code at all...). I suppose this is because a lot of people still think of bioinformatic tools as a black box, and they follow default steps with default variables that they assume they don't need to mention in Methods. Anyway, if you want an example of a Methods section where this filtering is stated, here you have one. From its Methods section:

The table and sequences were filtered to exclude any ASV without phylum-level annotation or which could not be inserted into the phylogenetic tree.
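As a rough sketch of what that kind of filtering amounts to (not the code used in the cited paper, and not a QIIME 2 command), the following plain-Python example keeps only features that have a named phylum and, optionally, appear among the tips of a tree. The names `has_phylum`, `filter_features`, and `tree_tips` are hypothetical, and the table is assumed to be a simple `{sample: {feature_id: count}}` mapping.

```python
def has_phylum(taxonomy: str) -> bool:
    """True when the second rank of the taxonomy string carries a name."""
    ranks = [r.strip() for r in taxonomy.split(";")]
    return len(ranks) > 1 and bool(ranks[1].split("__", 1)[-1])


def filter_features(table, taxonomy, tree_tips=None):
    """Drop features without a phylum or (optionally) absent from the tree.

    Features that have no taxonomy entry at all are dropped as well.
    """
    keep = {
        feature_id
        for feature_id, tax in taxonomy.items()
        if has_phylum(tax) and (tree_tips is None or feature_id in tree_tips)
    }
    return {
        sample: {fid: n for fid, n in counts.items() if fid in keep}
        for sample, counts in table.items()
    }


# Hypothetical data, for illustration only.
taxonomy = {
    "asv1": "d__Bacteria; p__Firmicutes_A",
    "asv2": "d__Bacteria;__",
    "asv3": "d__Bacteria; p__Proteobacteria",
}
table = {"sample1": {"asv1": 10, "asv2": 3, "asv3": 4}}

print(filter_features(table, taxonomy))
print(filter_features(table, taxonomy, tree_tips={"asv1", "asv2"}))
```

The same set of retained feature IDs would then be used to subset the representative sequences, so that the table and sequences stay in sync.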

I hope this is useful for you.

Best wishes!

Sergio

microbiome_25 · September 2, 2024, 5:13am

Hello @salias
Thank you very much for your answer.
I have five more questions.

1. Could you clarify what "too general taxonomic annotations" are?
2. Does this mean taxa that are not annotated at the phylum level when analyzing 16S rRNA genes at the phylum and genus level?

Thanks for providing this information.
3. It appears that the taxa with too general annotations were excluded before the relative abundance of each taxon was calculated, and these taxa were not included in the taxa abundance table. Is this correct?
4. When you construct a phylogenetic tree and calculate alpha and beta diversity metrics on QIIME2, do you use the filtered table and sequences?
5. Is it correct that filtering sequences based on annotations is important when comparing taxonomic abundances, not when calculating diversity metrics?

Thank you very much.

salias · September 2, 2024, 2:37pm

Hello again!

1. They are basically poorly classified sequences: those that are annotated only down to, e.g., phylum level. For example, things like these (Fungi example because I'm a Fungi guy):

Unassigned;__;__;__;__

k__Fungi;p__Ascomycota;__;__;__

2. The most likely explanation for these is that they are non-target sequences. You can BLAST a few of them if you want to make sure before discarding them.

3. Yes, it is.

4. Yes.

5. I would filter for both differential abundance and diversity; otherwise you may get misleading results.
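To see why the order of operations matters for relative abundance, here is a small worked example in plain Python. The phylum names and counts are invented; the only point is that dropping the unclassified reads before normalizing changes the denominator, and therefore every other proportion.

```python
def relative_abundance(counts):
    """Convert raw counts to proportions of the sample total."""
    total = sum(counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in counts}
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical per-phylum counts for one sample.
sample = {"p__Firmicutes": 60, "p__Bacteroidota": 20, "Unclassified": 20}

# Normalizing first keeps the unclassified reads in the denominator.
unfiltered = relative_abundance(sample)

# Filtering first removes them before the proportions are computed.
filtered = relative_abundance(
    {taxon: n for taxon, n in sample.items() if taxon != "Unclassified"}
)

print(unfiltered["p__Firmicutes"])  # 0.6
print(filtered["p__Firmicutes"])    # 0.75
```

With these made-up numbers, the same 60 reads are 60% of the sample before filtering and 75% after, so it is worth stating in Methods at which step the filtering happened.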

Cheers,

Sergio

roachjm-unc · September 3, 2024, 12:35pm

It depends a little bit on what community you are actually looking at the sequencing results for, but in my experience, these classifications tend to correspond to eukaryotes / host (i.e. human, mouse, plant, whatever). The 16S universal primers will amplify mitochondrial and chloroplast DNA, so depending on your situation you may be getting some (or possibly a lot) of reads corresponding to one or more of those.

Check out the BLAST results in the representative sequences qzv for the OTUs / ASVs / sOTUs that get the unclassified / poorly resolved taxa classifications. Then, if they do correspond to human or mouse or plant or whatever, you can do an alignment against that reference (bowtie2 / bwa whatever) to remove those. Then process the remaining reads as you ordinarily would for 16S.
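If you would rather BLAST a handful of sequences by hand, one way to pick them is to take the most abundant poorly classified features and write them out as FASTA. The sketch below is plain Python with hypothetical names (`named_ranks`, `blast_candidates`) and made-up data; it assumes you have already exported the taxonomy strings, total counts per feature, and representative sequences into dictionaries keyed by feature ID.

```python
def named_ranks(taxonomy: str) -> int:
    """Count ranks that carry a name (so '__' and 'p__' count as empty)."""
    return sum(
        1 for rank in taxonomy.split(";") if rank.strip().split("__", 1)[-1]
    )


def blast_candidates(taxonomy, totals, sequences, n=5, min_ranks=2):
    """FASTA text for the n most abundant features with < min_ranks named ranks."""
    ids = [
        fid
        for fid, tax in taxonomy.items()
        if named_ranks(tax) < min_ranks and fid in sequences
    ]
    # Most abundant first, so the sequences checked cover the most reads.
    ids.sort(key=lambda fid: totals.get(fid, 0), reverse=True)
    return "".join(f">{fid}\n{sequences[fid]}\n" for fid in ids[:n])


# Hypothetical data, for illustration only.
taxonomy = {
    "asv1": "d__Bacteria; p__Firmicutes_A; c__Clostridia",
    "asv2": "d__Bacteria;__",
    "asv3": "Unassigned",
}
totals = {"asv1": 500, "asv2": 5, "asv3": 40}
sequences = {"asv1": "ACGT", "asv2": "CCCC", "asv3": "AAAA"}

print(blast_candidates(taxonomy, totals, sequences, n=2), end="")
```

The default `min_ranks=2` flags anything without a named phylum; raising it (for example to 3) would also flag features that stop at phylum level.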

I've found that Kraken2 does a slightly better job at identifying and removing the host reads, but depending on your host, building a new Kraken2 database that includes the host may be more work than you are looking for.

salias · September 4, 2024, 2:29pm

Hello @roachjm-unc and @microbiome_25

While I have little experience with Kraken2, I know that other mods like @colinbrislawn also suggest using Kraken2 for this purpose (see line 0083 here).

Also (thanks @SoilRotifer for noting that!), if your number of poorly classified sequences is relatively large, you may want to check whether your reads are in mixed orientation. You can look for more info on mixed orientation in the forum. While q2-feature-classifier needs the reads in the same orientation as the classifier, tools like vsearch or kraken are fine with any read orientation, so they will provide you with a "correct" taxonomy in either case. And if you are going to do something else with your sequences (e.g. a phylogenetic tree), you don't want your reads to be in mixed orientation.
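A quick, rough way to check for mixed orientation is to look for the forward primer in each read and in its reverse complement. The sketch below is plain Python and rests on two assumptions that may not hold for your data: that the primers have not been trimmed off yet, and that the forward primer is the one shown (it is only a placeholder; substitute your own). The function names are hypothetical.

```python
import re
from collections import Counter

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (IUPAC codes supported)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def orientation(read: str, forward_primer: str) -> str:
    """Classify a read as 'forward', 'reverse', or 'unknown' by primer match."""
    pattern = re.compile("".join(IUPAC[base] for base in forward_primer.upper()))
    if pattern.search(read.upper()):
        return "forward"
    if pattern.search(reverse_complement(read)):
        return "reverse"
    return "unknown"


# Placeholder primer (assumption): replace with the one used in your study.
PRIMER = "GTGYCAGCMGCCGCGGTAA"

# Made-up reads, for illustration only.
forward_read = "GTGCCAGCAGCCGCGGTAAACGTACGT"
reads = [forward_read, reverse_complement(forward_read), "AAAAAAAA"]

print(Counter(orientation(read, PRIMER) for read in reads))
```

If a sizeable share of reads comes back as "reverse", that is a hint the data are in mixed orientation; reads reported as "unknown" simply did not contain the primer and say nothing either way.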