We have an issue with a reviewer who still doesn’t accept the ASV naming and keeps asking:
Introducing the very long QIIME ASV names is not suitable for reading. This should be changed into understandable names.
What should we reply to this comment? I am not keen to rename the ASVs to OTUx just to please the reviewer.
Any hint is much appreciated.
Hi @sghignone ,
Welcome to the forum, and thanks for moving your question over here.
First: this is not unique to QIIME 2. The denoising methods wrapped in QIIME 2 (e.g., dada2) also use md5 hashes for “short” uniquely identifiable sequence names, so this is not a convention introduced by QIIME 2.
Second: using the md5 hash makes these IDs unique and interoperable, so any alternative will either not be short (e.g., using the full sequence!) or not be unique/interoperable (arbitrary “OTU IDs” are meaningless and cannot be compared between studies).
Something we have discussed elsewhere on the forum is using shortened md5 hashes (e.g., the first 6 characters of the md5 hash) so that plots etc. can be labeled with short, readable names. A table can then be provided in the supplement that maps each short ASV ID to the full ASV ID, the assigned taxonomy, and optionally the sequence itself.
That might be a fair compromise between the reviewer's readability concerns and common sense (i.e., not assigning arbitrary IDs).
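A minimal sketch of such a supplement table in Python, in case it helps. It assumes the full feature IDs are md5 hex digests of the sequence text; the function names, column names, and the two example sequences and taxonomy strings are made up for illustration and do not come from any QIIME 2 API.

```python
import csv
import hashlib
import io


def md5_id(sequence: str) -> str:
    """Return the 32-character md5 hex digest of a sequence string."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


def mapping_rows(taxonomy_by_sequence: dict, length: int = 6) -> list:
    """Build one row per ASV: short ID, full ID, taxonomy, sequence.

    `taxonomy_by_sequence` is a hypothetical input: sequence -> taxonomy string.
    """
    rows = []
    for sequence, taxonomy in taxonomy_by_sequence.items():
        full_id = md5_id(sequence)
        rows.append({
            "short_id": full_id[:length],  # label used in figures
            "full_id": full_id,            # interoperable ID
            "taxonomy": taxonomy,
            "sequence": sequence,
        })
    return rows


def to_tsv(rows: list) -> str:
    """Serialize the rows as tab-separated text for a supplement file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["short_id", "full_id", "taxonomy", "sequence"],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up example data, only to show the shape of the table.
example = {
    "ACGTACGTAC": "g__Bacteroides",
    "TTGACCGTAA": "f__Lachnospiraceae",
}
rows = mapping_rows(example)
print(to_tsv(rows))
```

In a real analysis the sequences and taxonomy would come from your exported representative sequences and taxonomy table rather than from a hand-written dictionary.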
Thanks, Nicholas, for the fast reply!
My only concern is: what is the probability of finding duplicates when shortening the md5 hash to its first 6 characters?
The probability is higher, but this can of course be tested. Those shortened hashes would not be unique or comparable between studies, but within an individual study there is a low likelihood of namespace clashes (i.e., shortening to 6 characters will still probably yield IDs that are unique within a single study). So I recommend trimming, then checking for duplicates, and lengthening if needed.
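The trim, check, and lengthen loop can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the example IDs are invented strings rather than real md5 hashes.

```python
from collections import Counter


def duplicated_prefixes(full_ids, length):
    """Return the sorted prefixes of `length` characters shared by 2+ IDs."""
    counts = Counter(full_id[:length] for full_id in full_ids)
    return sorted(prefix for prefix, n in counts.items() if n > 1)


def shortest_unique_length(full_ids, start=6):
    """Return the smallest prefix length >= start that keeps all IDs distinct."""
    ids = list(full_ids)
    if len(set(ids)) != len(ids):
        raise ValueError("full IDs must be unique before trimming")
    max_len = max((len(full_id) for full_id in ids), default=start)
    for length in range(start, max_len + 1):
        if not duplicated_prefixes(ids, length):
            return length
    return max_len  # fall back to the untrimmed IDs


# Invented IDs: the first two clash at 6 characters but not at 7.
ids = ["abcdef12", "abcdef34", "123456ab"]
print(duplicated_prefixes(ids, 6))
print(shortest_unique_length(ids, start=6))
```

Starting at 6 and growing one character at a time keeps the labels as short as the dataset allows while guaranteeing uniqueness within the study.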
This is also the reason to have a supplement table that maps the short IDs to the full IDs: the full IDs uniquely map to ASVs (extremely low likelihood of namespace clash), whereas the shortened IDs would theoretically have namespace clashes when compared across all possible sequence space.
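For a rough feel of how likely a clash is, the standard birthday-problem approximation can be applied. The sketch below assumes the hex characters of an md5 digest behave as if uniformly random, so a prefix of `length` characters has 16**length possible values; the function name and the ASV count of 1000 are illustrative choices, not measurements.

```python
import math


def collision_probability(n_ids: int, length: int) -> float:
    """Approximate chance that at least two of n_ids share a hex prefix.

    Uses the birthday approximation 1 - exp(-n(n-1) / (2 * space)),
    where space = 16**length is the number of possible prefixes.
    """
    space = 16 ** length
    if n_ids > space:
        return 1.0  # more IDs than prefixes: a clash is guaranteed
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))


# Compare a few prefix lengths for a hypothetical study with 1000 ASVs.
for length in (4, 6, 8):
    print(length, collision_probability(1000, length))
```

Under these assumptions, 1000 ASVs trimmed to 6 characters clash with a probability on the order of a few percent, while 4 characters make a clash almost certain, which is why checking the actual IDs is still worthwhile.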
I experimented this way with a fairly large dataset containing many ASVs and found that the first 8 characters are enough to keep the names unique. For figures, 4-6 characters should be enough.
I also noticed that some people rename ASV hashes to ‘ASV1’, ‘ASV2’, and so on, but that looks a little strange to me.
I prefer to use the lowest taxonomic rank to which the ASV is assigned, followed by the first 4-6 characters of the hash.
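A small Python sketch of that labeling scheme. It assumes taxonomy strings are semicolon-separated with rank prefixes such as `g__`, and that an unassigned rank appears as a bare prefix (e.g., `g__`); both function names and the underscore separator are hypothetical choices, so adjust them to your taxonomy format.

```python
def lowest_assigned_taxon(taxonomy: str) -> str:
    """Return the deepest rank that has a name, e.g. 'g__Lactobacillus'."""
    for level in reversed(taxonomy.split(";")):
        level = level.strip()
        name = level.split("__", 1)[-1]  # text after the rank prefix, if any
        if name:
            return level
    return "Unassigned"


def figure_label(taxonomy: str, full_id: str, length: int = 6) -> str:
    """Combine the lowest assigned taxon with the first characters of the hash."""
    return f"{lowest_assigned_taxon(taxonomy)}_{full_id[:length]}"


# Example with the md5 digest of an empty string as a stand-in ID.
print(figure_label(
    "k__Bacteria; p__Firmicutes; g__Lactobacillus; s__",
    "d41d8cd98f00b204e9800998ecf8427e",
))
```

The hash suffix keeps two ASVs assigned to the same taxon distinguishable in a figure, while the taxon name gives the reader something meaningful to read.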