Standard vocabulary for Qiime

colinbrislawn · April 14, 2018, 6:45pm

In this forum post, a user did not know the meaning of some Qiime terminology.

Great! We can finally get rid of 'metadata mapping files'.

Is this controlled vocabulary written down somewhere?

If 'frequency' causes confusion, what if we changed the wording to 'counts' and 'proportions'?

Help Qiime devs! Arguing about what to call things is one of my favorite pastimes!

Colin

Nicholas_Bokulich · April 16, 2018, 1:21pm

Hi @colinbrislawn,
Great point about terminology! We do have a short glossary that needs expansion (it does not contain the terms that you've listed), and could probably benefit from greater visibility. Let's see what others say though — there may be another list of these terms somewhere.

If you have ideas for how to make this glossary more visible, we'd love to hear. And if you'd like to contribute via a docs PR, please jump right on in!

ebolyen · April 16, 2018, 10:43pm

We tried to match entry-level statistics in this case, so the terminology was pretty intentional. Ultimately almost any term we use seems to be pretty well overloaded. Sometimes I do feel weird about Frequency though, as it's more like FrequencyForDepthMakingStatisticsReallyHard. The same problem holds for Counts too.

We definitely need more glossary terms, and probably something like a "QIIME 2 in 2 minutes" page.

mortonjt · April 16, 2018, 11:13pm

I actually like the term Frequency; imho it is fairly intuitive. In addition, it is commonly used in multiple other fields such as natural language processing, electrical engineering, ...

If anything, this could serve as a bridge to other disciplines. So I vote to keep this as is.

colinbrislawn · April 17, 2018, 4:28pm

Thanks for sharing the initial glossary!

I like the idea of having the glossary as a technical document for developers, so that we can write documentation and tutorials that implicitly teach standard vocabulary. Qiime 2 in 2 minutes is a perfect place to do this. What a great idea!

On Frequency

The use of Frequency is not standard, see this ABA guide, and the main wiki page which defines Frequency as observations over time... which could be analogous to observations of one feature over all observations in a sample... BecauseAllOurCountsAreBasicallyRelativeAnyways.

I was hoping to find a clear word, but they all have sticky connotations. I guess AbsoluteFrequency and RelativeFrequency would avoid this, but we know that our counts are not necessarily Absolute, so this is wrong for other reasons.

Thanks for talking this over with me.

Colin

ebolyen · April 17, 2018, 4:53pm

I mean, I guess that's fair, but I don't really expect people to confuse signal processing with microbial ecology.

As to the behaviorist journal, it seems like the discussion centers more on what represents the statistical sampling event, e.g. it's a rate for applied behavior analysis. This seems pretty consistent with the statistical definition, so I'm less certain what their aim is.

In all of these cases, frequency is still a number of observations per unit, whether that unit is time, area, population, or PCR+sequencing.
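
As a rough illustration of "observations per unit" versus proportions, here is a minimal sketch in plain Python. The sample and feature names and all of the numbers are made up for illustration, and this is not QIIME 2's API. Each sample is one unit of PCR+sequencing, its frequencies are the raw counts observed in that unit, and the relative frequencies divide out the sampling depth.

```python
# Hypothetical feature table: observed counts (frequencies) per feature,
# keyed by sample. All names and values are invented for illustration.
table = {
    "sample_a": {"feature_1": 30, "feature_2": 10, "feature_3": 60},
    "sample_b": {"feature_1": 3, "feature_2": 1, "feature_3": 6},
}


def relative_frequency(counts):
    """Convert one sample's counts into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        # An empty sample has no defined proportions; report zeros.
        return {feature: 0.0 for feature in counts}
    return {feature: n / total for feature, n in counts.items()}


# Sampling depth: total observations per unit (here, per sample).
depths = {sample: sum(counts.values()) for sample, counts in table.items()}

# Relative frequencies: the same data with depth divided out.
proportions = {
    sample: relative_frequency(counts) for sample, counts in table.items()
}

for sample in table:
    print(sample, "depth:", depths[sample], "proportions:", proportions[sample])
```

The two samples differ tenfold in depth, so their frequencies are not directly comparable, yet their relative frequencies are essentially identical. That is the "per unit" caveat in a nutshell.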

mortonjt · April 17, 2018, 5:18pm

That definition of frequency still applies -- here we are talking about the number of microbial occurrences within a single sample. It follows the same rationale for why we can apply algorithms such as DESeq2 to our data in the first place (otherwise we couldn't apply the Poisson to our data, since it, well, only operates on a single time unit).

I think a more solid example is to look at how the NLP folks use the term word frequencies: Word list - Wikipedia
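
For readers unfamiliar with the NLP usage, a tiny sketch of word frequencies follows, using only Python's standard library. The example sentence is invented. The corpus plays the role of the sample, and each word's frequency is its number of occurrences within that corpus.

```python
from collections import Counter

# A made-up, very small "corpus"; in the analogy this is one sample.
corpus = "the cat sat on the mat the end"

# Word frequency: number of occurrences of each word within the corpus.
word_frequency = Counter(corpus.split())

# Corpus size plays the role of sampling depth.
corpus_size = sum(word_frequency.values())

for word, count in word_frequency.most_common():
    print(f"{word}: {count} of {corpus_size} words")
```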

colinbrislawn · April 17, 2018, 5:34pm

Well said! I like how 'observations' implies discrete counts and 'per unit' hints at the unequal sampling effort of PCR+sequencing.

Similar wording shows up in Jamie's link

frequency of occurrence within some given text corpus

Frequency meaning counts per unit makes sense to me now. Thanks for helping me clear this up!
Colin