Many unclassified taxa

Benedict · March 29, 2019, 5:21am

Dear experienced users,

I recently ran taxonomy classification using the pre-trained naive Bayes classifier (Greengenes) and found many unranked taxa at the deeper ranks; most are classified to class level only. In addition, some taxa have odd names (e.g., A00456) rather than a proper scientific name. Is this a common issue?

I also ran classify-consensus-blast against the NCBI 16S reference dataset and found many unranked taxa beyond the order level, although the nomenclature is much better than with Greengenes (no numeric taxon names). Does this mean consensus was only reached at the order level, so the deeper ranks could not be annotated because agreement among the hits fell below the min-consensus threshold?

Apart from that, could you explain what consensus taxonomy means?

Regards,
Benedict

Nicholas_Bokulich · March 29, 2019, 12:37pm

Sounds like you are getting similar results with both. Unless you are using the wrong database for both (e.g., wrong primer target), it sounds like your sequences are just too short and do not contain enough information to be classified at deeper ranks. When both the naive Bayes classifier and consensus classification do poorly, that is the most likely explanation.

Could you use qiime metadata tabulate with your taxonomy results and sequences as inputs and share the results here? This numeric name does not make sense — there is nothing in the database called "A00456" so the classifier should not supply that classification. There are many other orders/taxa that have numeric names, usually indicating that they are an uncultured/unnamed clade, so weird names at order level are not too surprising.

Try the help docs:

$ qiime feature-classifier classify-consensus-blast --help
Usage: qiime feature-classifier classify-consensus-blast [OPTIONS]

  Assign taxonomy to query sequences using BLAST+. Performs BLAST+ local
  alignment between query and reference_reads, then assigns consensus
  taxonomy to each query sequence from among maxaccepts hits, min_consensus
  of which share that taxonomic assignment. Note that maxaccepts selects the
  first N hits with > perc_identity similarity to query, not the top N
  matches. For top N hits, use classify-consensus-vsearch.
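
To make the idea of a consensus taxonomy concrete, here is a minimal Python sketch of the general approach: walk down the ranks and stop at the first rank where fewer than a min_consensus fraction of the accepted hits agree. This is an illustration only, not the code that q2-feature-classifier runs; the function name, the example lineages, and the 0.51 threshold are assumptions made for the example.

```python
from collections import Counter


def consensus_taxonomy(hit_taxonomies, min_consensus=0.51):
    """Return the deepest lineage shared by at least min_consensus of the hits.

    hit_taxonomies: list of semicolon-delimited lineage strings, one per hit.
    Hypothetical helper for illustration; not part of QIIME 2.
    """
    if not hit_taxonomies:
        return "Unassigned"
    # Split each lineage string into a list of rank labels.
    lineages = [[r.strip() for r in t.split(";")] for t in hit_taxonomies]
    n_hits = len(lineages)
    consensus = []
    depth = 0
    while True:
        # Only hits that match the consensus so far, and that are annotated
        # at this depth, can vote for the next rank.
        prefixes = [
            tuple(lin[: depth + 1])
            for lin in lineages
            if len(lin) > depth and lin[:depth] == consensus
        ]
        if not prefixes:
            break
        label, count = Counter(prefixes).most_common(1)[0]
        # The fraction is taken over all hits, so disagreement lowers it.
        if count / n_hits < min_consensus:
            break
        consensus = list(label)
        depth += 1
    return "; ".join(consensus) if consensus else "Unassigned"


# Three made-up hits that agree down to order but split at family.
hits = [
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae",
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae",
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Veillonellaceae",
]
print(consensus_taxonomy(hits))
```

In this toy case the assignment stops at o__Clostridiales, because no single family is shared by more than half of the hits. That is the kind of situation in which a consensus classifier leaves the deeper ranks unannotated.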

Benedict · April 2, 2019, 9:04am

I don't think the database is the cause, as I was using the pre-trained Greengenes classifier obtained from the QIIME 2 resources page. As for sequence length, all of my sequences are around 400 bp, which corresponds to the length of the V3-V4 region. Is that considered short?
May I know whether the Greengenes database has many unranked taxa at the deeper ranks?

Nicholas_Bokulich · April 2, 2019, 12:03pm

Aha, so it is the first problem I listed, wrong primer target. Your sequences are longer than the V4 amplicons used to train that pre-trained classifier, and so that classifier will not work. The pre-trained full-length 16S classifier should work for you, or else you will need to train your own classifier on V3-V4.

There are many sequences assigned ambiguous labels at family, genus, and species level — this indicates that sequences in that OTU cluster cannot be definitively resolved at that taxonomic rank.
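
As a rough way to see how deep each assignment goes, the sketch below reports the deepest rank that actually carries a name in a Greengenes-style lineage string. It assumes the usual "k__; p__; c__; ..." prefix format, where an unresolved rank is left as a bare prefix such as "o__"; the function names are hypothetical, and the bracket check reflects the bracketed-name convention (e.g., [Thermi]) raised in this thread.

```python
RANK_NAMES = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def deepest_named_rank(taxon):
    """Return (rank_name, label) for the deepest rank that carries a name.

    Assumes Greengenes-style labels such as "c__Clostridia", where an
    unresolved rank appears as a bare prefix like "o__".
    """
    deepest = ("unassigned", "")
    for rank_name, part in zip(RANK_NAMES, taxon.split(";")):
        # Everything after the "x__" prefix is the actual name (may be empty).
        _, _, name = part.strip().partition("__")
        if not name:
            break
        deepest = (rank_name, name)
    return deepest


def is_bracketed(name):
    """True for names written in square brackets, e.g. "[Thermi]"."""
    return name.startswith("[") and name.endswith("]")


example = "k__Bacteria; p__Firmicutes; c__Clostridia; o__; f__; g__; s__"
print(deepest_named_rank(example))
```

Tallying the first element of the returned tuple across all features gives a quick count of how many assignments stop at class, order, family, and so on.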

Benedict · April 3, 2019, 9:10am

I was using the classifier pre-trained by Sir Mehrbod_Estaki, from the post "Available: Pre-trained classifier of V3-V4 (341F, 805R) region with gg_99". So, I think my target region is correct.
I am sharing my taxonomy classification results for a clearer picture:
gg-rel-class.tsv (14.7 KB)
gg-rel-order.tsv (28.8 KB)
gg-rel-family.tsv (48.6 KB)
gg-rel-genus.tsv (76.4 KB)
There are numerous unranked taxa from class level down to genus level.

Apart from that, I have questions about the Greengenes naming shown in the pictures below. What do TM7, TM6, BH180-139, OD1, and GN02 stand for? What about phylum names in brackets (e.g., [Thermi])? Should I filter out those numeric names for publication?
gg-phyla gg-phyla-2

jwdebelius · April 3, 2019, 9:21am

Hi @Benedict,

Glad to see you figured out your classifier!

This is a common problem in 16S sequencing for a variety of reasons. They may have to do with the environment/sample type, database limitations, or the biological reality of the sample type. For instance, we don't have the resolution to detect the difference between E. coli and Shigella flexneri in 16S sequencing because they're too evolutionarily close, and the definition of a "species" in bacteria is complicated anyway. (Is it a species if it behaves differently? What about horizontal gene transfer via viruses? Is that independent reproduction? Bacterial sex is complicated!)

I would retain the numeric names because those are the phylum names. The brackets indicate that the taxonomic name is contested, but that is not a reason to filter those either. This has to do with the fact that morphology-based taxonomy doesn't always align with phylogeny, and so as we understand more about molecular phylogeny the taxonomy needs to be discussed.

That said, I'd suggest two exceptions to the "keep them" suggestion. The first is that I recommend collapsing rare clades for a bar plot. I typically only keep the first 6 phyla, and maybe the first 10 or 11 families, because I can't find a colormap that lets my color blind friends resolve the difference in the bar chart at any shallower resolution. (Note that this is my opinion, and not everyone subscribes to it.) Second, if you're doing a feature-based analysis, I would filter out low abundance ASVs. If members of those phyla are at low abundance (for your definition of "low"), I'd exclude them.
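
One way to sketch the "keep the top N, collapse the rest" idea for a bar plot is shown below. It is a plain-Python illustration under stated assumptions, not a QIIME 2 command: the table is assumed to be a nested dict of relative abundances keyed by sample and then taxon, taxa are ranked by mean relative abundance across samples, and the function name, the "Other" label, and the toy values are all made up for the example.

```python
def collapse_to_top_n(table, n=6, other_label="Other"):
    """Keep the n taxa with the highest mean abundance; sum the rest as Other.

    table: {sample_id: {taxon: relative_abundance}}
    Returns a new dict with the same layout plus an other_label entry.
    """
    samples = list(table)
    taxa = {t for row in table.values() for t in row}
    # Mean relative abundance of each taxon across all samples.
    mean_abund = {
        t: sum(table[s].get(t, 0.0) for s in samples) / len(samples) for t in taxa
    }
    # Rank by descending mean; ties are broken alphabetically so the
    # result is deterministic.
    keep = set(sorted(taxa, key=lambda t: (-mean_abund[t], t))[:n])
    collapsed = {}
    for s in samples:
        row = {t: v for t, v in table[s].items() if t in keep}
        row[other_label] = sum(v for t, v in table[s].items() if t not in keep)
        collapsed[s] = row
    return collapsed


# Toy relative abundances for two samples and four made-up taxa.
toy = {
    "s1": {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125},
    "s2": {"A": 0.25, "B": 0.5, "C": 0.25, "D": 0.0},
}
collapsed = collapse_to_top_n(toy, n=2)
print(collapsed)
```

Because the same ranking rule is applied to every taxon, nothing is hand-picked; the collapsed table can then be handed to whichever plotting program you prefer.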

Best,
Justine

Benedict · April 3, 2019, 10:03am

May I know which script you use to collapse the rare clades and retain the top phyla or families in QIIME 2?

I did apply this strategy at 0.01-0.05%, but some phyla that are present only in my treated samples, and not in the controls, were lost. Do you know of any method to retain those low-abundance yet significant taxa while still removing the ones that fall below the threshold?

jwdebelius · April 3, 2019, 10:47am

I typically pull my taxa table over to my favorite plotting program (excel, matplotlib, ggplot, take your pick) and generate the plot there. I like what might be an obnoxious amount of control over my figures.

If the phylum was present at less than 0.01% on average, it may well only be present in your control samples. But, depending on your sample size, it's probably too low abundance to be meaningful in a statistical test. You could try filtering by minimum sample frequency instead (for example, a taxon has to be present in at least 10% of samples). But it's not appropriate to keep only the taxa you want when you filter. You either do all or nothing; you don't get to select just the ones you think are significant, because that will bias your results.
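
The prevalence-based filter described above could be sketched as follows. This is a minimal illustration that assumes the table is a dict mapping each taxon to its list of per-sample counts; the function name and the toy counts are hypothetical. The point is that one rule is applied uniformly to every taxon, with no hand-picked exceptions.

```python
def filter_by_prevalence(counts_by_taxon, min_fraction=0.10):
    """Keep taxa observed (count > 0) in at least min_fraction of samples.

    counts_by_taxon: {taxon: [count_in_sample_1, count_in_sample_2, ...]}
    The same threshold is applied to every taxon.
    """
    kept = {}
    for taxon, counts in counts_by_taxon.items():
        # Prevalence = fraction of samples in which the taxon was seen.
        prevalence = sum(1 for c in counts if c > 0) / len(counts)
        if prevalence >= min_fraction:
            kept[taxon] = counts
    return kept


# Toy counts across 20 samples for two made-up taxa.
toy_counts = {
    "taxon_X": [5] * 10 + [0] * 10,  # present in 10 of 20 samples (50%)
    "taxon_Y": [3] + [0] * 19,       # present in 1 of 20 samples (5%)
}
filtered = filter_by_prevalence(toy_counts, min_fraction=0.10)
print(sorted(filtered))
```

Here taxon_Y is dropped because it appears in only 5% of samples, regardless of which group those samples belong to.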

Best,
Justine