Questions about collapsed abundance of taxa barplot !

Hi, I’m a biginner in QIIME2 world. (16S rRNA seqs)
I have questions at barplot with collapsed ASVs (by q2-taxa collapse).

Below are detailed information about our analysis.
In the beginning, I got pair-end fastq files with illumina miseq (v3-v4), and performed a standard pipeline in the moving pictures with q2-dada2 (forward right trim 280, forward left trim 21, reverse right trim 220, reverse left trim 17) program. After then, I used scikit-learn classification with naive-bayes classifier (database: 99% greengene). Also, the feature table was collapsed with q2-taxa collapse.
I often used the .csv files in a bar plot at qiime2 viewer, but I have some questions about relative abundance value in this barplot with collapsed ASVs.

I have two questions about the process where features in FeatureData [Taxonomy] become ASVs in taxa barplot csv file.

  1. In FeatureData [Taxonomy], there exist two kinds of features (rows) that are not identified at species level. For example, multiple rows belong to the genus Neisseria, but without species information. Those show either of two different annotations:
    Feature 1) … g__Neisseria.s__
    Feautre 2) … g__Neisseria
    How do the two annotations differ? Is there any difference in the meaning of feature 1 and 2?

  2. FeatureData [Taxonomy] contains multiple features that belong to feature1 and 2. However, when I downloaded taxa barplot as csv file and opened it, I saw only two ASVs:

ASV 1) … g__Neisseria.s__
ASV 2) … g__Neisseria.__

Therefore I guessed features that belong to feature 1 merged into a single ASV (ASV 1) and feature 2s also merged into a single ASV (ASV 2).

Can we regard ASV 1 and 2 as a single component when drawing stacked barplot (composition) at species level?

To elavorate, there, too, exist species-level ASVs such as;

ASV 3) … g__Neisseria.s__mucosa

Is it fair to compare ASV1, 2 with ASV 3 at species level? Or is it better not to perform species-level comparisons on composition when using 16S rRNA data?

These questions first arose when we downloaded taxa barplot data as csv file, and tried to draw composition plot at species level based on it. I’m greatly concerned that these "species-level unassigned"s might bias our interpretation on relative abundance data.

Above were two questions on ASVs, and I have a short question on representative sequences.

  1. I’m not sure about the meaning of representative sequences given for each ID in FeatureData [Sequence]. Are the sequences showing the representatives drawn from the data made from our analysis, or are they the representatives of each item from the database?

I know questions are quite long. Thanks in advance for your kind help.


Great Questions, KKN!
Thanks for the detailed post. In future, you might consider opening a separate post for each unrelated question, just to make them more digestible. (Question 3 should probably be a standalone in future, because it’s very distinct from the others.)

There’s a great discussion of those semantics here. In short, greengenes contains features that are annotated like Feature 1, with “empty” annotations at lower taxonomic levels. If your match looks like Feature 1, it means you got a species-level match with a feature that Greengenes has only annotated to the genus level. If your match looks like Feature 2, I believe that means you only got a match to the genus level. Clear as mud? Complicated semantics like this are one challenge with allowing “empty” taxonomic annotations.

Unless I’m gravely mistaken, QIIME 2’s taxonomy barplots plot taxon frequency per sample, and don’t directly treat Amplicon Sequence Variants (ASVs). As you mentioned, one taxon may represent multiple ASVs. My neighbor and I are both Homo sapiens sapiens (AFAIK), but we have distinct genomes. A barplot of neighborhoods in my town would group my neighbor and I in ...g__Homo.s__sapiens, because it is plotting the frequency of taxons per sample, not the ASVs per sample.

This makes me nervous. Without digging into the source code, I’m not sure exactly what’s going on, but I’m uncomfortable with the semantics here. ASV1 != ASV2, even if they happen to have been classified the same way. If your downloaded CSV is actually grouping ASVs by taxonomic annotation but labeling them with FeatureIDs rather than labeling them taxonomically, that might be worth reporting as a bug. Do you have the ability to confirm whether that’s what’s going on?

Perhaps, but this might be overgeneralizing - if your classified 16s data consistently has good annotation to the species or subspecies level in your database, why not use it? If it doesn’t, you’ll have to make a judgement call.

Generally, these taxonomic bar plots are intended as high-level diagnostic tools. You probably want to consider dedicated differential abundance tools (ANCOM could be a good starting place) if you’re interested in understanding differential abundance across samples.

Are your repseqs a dada2 result? If so, you probably haven’t used a database at that point in the analysis, and your FeatureData[Sequence] contains all unique ASVs produced during denoising, with corresponding FeatureIDs.

CK :pig_nose:

Oh, I’m so sorry to confuse you with the word “ASV”. :sneezing_face:
Taxon 1) … g__Neisseria.s__
Taxon 2) … g__Neisseria.__
I had to use the word “Taxon” here. I realized a huge problem from your kind answer… :slight_smile:

If I want to analyze abundance data at species level, is it correct to have all the unassigned species counts, such as “Taxon 2”? (I think Taxon 2 is an unknown ‘species’ in nature that has been annotated only down to the ‘genus’ level in the database… is it right?)

Thank you!

I haven’t forgotten about you, @KKN, I’ve just been under the weather. I’ll get back to this by Monday latest.

Hi @KKN,

@ChrisKeefe tagged me in for species opinions, and I hope it’s okay that I write a slightly long commentary about it. Take this with a grain of salt, and possibly read with a cup of tea :tea:, but here we go…

I tend to be pretty anti species level comparisons except in really specific circumstances. There are a few issues…

  1. Enviroment. There are some enviroments where you want or need species level resolution because it just makes a difference. The human vagina has major differences in closely related, known organisms and the ability to accurately distinguish between those species makes a difference. On the other hand, in the gut, you may find that species level resolution doesn’t really matter because the communites are complex, and species level annotation isn’t always good and so it’s not helpful or not accurate to provide species.

  2. Naming conventions and phylogeny. Taxonomy is often a question of what an organism looks like or how it behaves and not necessarily where it sits in the phylogenetic tree. There’s a bias in species level identification toward organisms that are culturable, and even there, the characterization wasn’t always good enough to allow full resolution. So, like, you may find a lot of names for members of class Gammaproteobacteria, because those are pretty easy to culture and name, but fewer for members of class Clostridia, because even thought the organisms are biologically relevant, they dont grow well in captivity.

  3. Phylogenetic Resolution. Sometimes we can’t make a distinction between organisms because of their biology. A classic example of this is in genuses Escherichia and Shigella, where it’s really hard to identify the genus, let alone the species, based on the 16S sequence. There tends to be a bias in the literature toward species names for cultured organisms. So, I think if you’re going to talk about species you need to be careful.

  4. Technical resolution. Robert Edgar wrote a paper a couple of years ago where he concluded that you need single nucleotide specificity for sequences if you’re working wtih short amplicons. (And like 99% identity for full length sequences.) I think because you ran DADA2 this is less of a concern, but it’s certainly something to think about when you work with your own or other’s data.

  5. Database Quality. If you use a general database, you may want to be more circumspect about presenting species information. I tend to be pretty careful wtih Silva (there are a few organsisms where I 100% accept species) but it’s pretty rare. Silva just pulls their species identifiers down and doesnt necessarily curate them further. On the other hand, an environment specific database (i.e. HOMD, Optivag, etc) may be better at species level resolution.

  6. Classifier quality. If you can’t swing an environment specific database, you will probably still get better or more accurate species level resolution with a higher quality classifier. You can get a bespoke classifier (i.e. clawback) that emphasizes environment specific organisms and improve your actual classification by trimming. … That way you at least get better species level resolution when it shows up.

  7. Does the name matter? I think we have this overall tendency as humans to want to name things, like if we can just label them, we can understand them. And to some degree, it’s true that having a name might let us understand the thing better. I think thats probably more true in places where we know what the name means, the distinction matters, and we have the ability to be accurate and precise. I)n my work (:poop:, :tongue:), accuracy has been more important than precision and I’m often willing to scarifice a species name for something that’s probably right-ish. Or me, the ASV often matters more than the name I stick to in. (Cue Shakespearean reference :rose:)

So, ultimately, when I work in my favorite complex environments, I tend to work at the ASV level and occasionally the genus or family level when people need me to. I tend to describe the features as something like "ASV from genus Neisseria (ASV 1) and then people can look at the centroid in a supplemental file if they think it’s relevant to them. I like to think it’s accurate, percise, but not necessarily in the naming.

…There are probably a lot of other opinions here, but that’s my two cents :money_with_wings:.



I think Taxon 1 represents an “unknown” that has been annotated only down to the genus level in the database. Your query sequence was a very good match for Taxon 1’s sequence, so it gave you “species-level” data.

I think your query sequence for Taxon 2 was not a good enough match for the classifier to choose a “species-level” match from the database. (Maybe there were multiple possible species-level matches for it, for example). As a result, you don’t get a species-level annotation from the reference database.

I’m not entirely sure how to interpret this question. It’s OK to work with taxa which have not been assigned species-level annotations in your database of choice, but I would caution you that if you’re performing statistical analysis on species grouped by taxonomic annotation, you could introduce some bias. (If many different species are assigned .s__, it’s not really fair to compare the count of many species with the counts of individually-annotated species.) Whether this is a problem in practice may depend on your database and what features you’re most interested in.

Justine, above, has given some great examples of challenges with species-level classification in 16s, and some good advice on how to work at the species level if you need to. The last little thing I’ll throw out there, is that I would probably recommend performing quantitative analysis on ASVs whenever you can, and then annotating them taxonomically once you have your results (e.g. for publication figures). This lets you take full advantage of your data. Clustering to OTUs at some percent identity is another option. In both of these cases, your statistical results are based on the sequence data itself, and may not be as dependent on the accuracy of your taxonomic classification.

And one little side-note for posterity - the .-delimited annotations seen here aren’t typical of actual QIIME 2 results, as QIIME 2 requires ;-delimited taxonomic annotations.


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.