How are ASV IDs generated?

schillebeeckx · September 18, 2018, 2:47pm

I'm curious how the IDs assigned to ASVs (e.g. 00137a07c917bab6cfcb89a796f69a38) are generated, as I'm trying to merge multiple DADA2 feature tables together.

Is it some sort of hash of the representative sequence?
Can the ID be guaranteed to be the same for independent DADA2 runs assuming the representative sequence is the same between those runs?
Can I assume the reverse-complement of the representative sequence is always guaranteed to yield the same ID?
Is it safe to assume the representative sequence cannot be "reverse engineered" if one only has the ASV ID?

Nicholas_Bokulich · September 18, 2018, 3:00pm

Hi @schillebeeckx,
Good questions!

Yes

No, I believe the reverse complement will generate a different ID.

Correct. Which is why a FeatureData[Sequence] artifact is still produced by dada2, to map the IDs to their sequences.

So if you are having trouble merging, make sure:

the sequences are in the same orientation
the sequences are trimmed/truncated to the same length and sites.
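
A rough sketch of why both points matter when merging. It assumes the feature ID is the MD5 hex digest of the sequence string (as stated later in this thread) and uses plain dicts as a stand-in for feature tables; the helper `md5_id`, the sequence, and the counts are all made up for illustration.

```python
import hashlib
from collections import Counter


def md5_id(sequence: str) -> str:
    """Hypothetical helper: hex MD5 digest of a sequence string."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


full = "ACGTACGTACGTACGTACGT"  # made-up 20 nt sequence

# Toy "feature tables": {feature ID: count}
run_a = {md5_id(full): 10}       # truncated at 20 nt
run_b = {md5_id(full[:18]): 7}   # same organism, truncated at 18 nt
run_c = {md5_id(full): 5}        # truncated at 20 nt, like run A

# Counter addition sums counts for matching IDs
merged_ab = Counter(run_a) + Counter(run_b)
merged_ac = Counter(run_a) + Counter(run_c)

print(len(merged_ab))  # 2: different truncation, so the IDs do not match
print(len(merged_ac))  # 1: identical sequences collapse into one feature
```

In this toy case the same biological sequence ends up as two separate features purely because the runs were truncated differently.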

I hope that helps!

colinbrislawn · September 18, 2018, 11:47pm

Hello @schillebeeckx,

Long time no see!

To be specific, these IDs are all the md5 hash of the read (see the relevant code).
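
A minimal sketch of that idea in Python, for anyone following along. It assumes the ID is the hex MD5 digest of the sequence string encoded as UTF-8; the function name `asv_id` and the example sequence are made up here, so check the linked code for the exact behaviour.

```python
import hashlib


def asv_id(sequence: str) -> str:
    """Hypothetical helper: hex MD5 digest of a sequence string."""
    # Assumption: the sequence is hashed exactly as written,
    # so case, length, and orientation all affect the result.
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


seq = "TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGG"  # made-up example read

print(asv_id(seq))                 # a 32-character hex string
print(asv_id(seq) == asv_id(seq))  # True: same sequence, same ID
```

Because the digest depends only on the input string, two independent runs that report the identical sequence would get the identical ID under this assumption.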

So many of your questions are about md5 hashes and collisions.

Essentially yes. MD5 collisions are extremely rare, but theoretically possible.

Yes; it's always going to be the md5 hash of the Reverse Complement read (which of course is going to be different than the forward non-complement read).
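
To make the orientation point concrete, here is a small sketch assuming the ID is a plain MD5 hex digest of the sequence string. The helpers `reverse_complement` and `md5_id` and the short sequence are illustrative only, and the complement table handles just A, C, G, and T.

```python
import hashlib

# Complement table for unambiguous DNA bases only (assumption for this sketch)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Complement each base, then reverse the string."""
    return sequence.translate(COMPLEMENT)[::-1]


def md5_id(sequence: str) -> str:
    """Hypothetical helper: hex MD5 digest of a sequence string."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


forward = "ACGTTGCAAT"  # made-up sequence
reverse = reverse_complement(forward)

print(reverse)                             # ATTGCAACGT
print(md5_id(forward) == md5_id(reverse))  # False: different string, different ID
```

So if two runs report the same amplicon in opposite orientations, one set of sequences would need to be reverse-complemented and re-hashed before the IDs line up.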

Sure it's safe to assume that... unless...
Finding the sequence from only the md5 hash is technically possible from an information security standpoint. Don't consider this header to be a security feature!

Colin

P.S. If you find two normal amplicon sequences that have the same md5 or sha hash, I would LOVE to know about it. I'm sure all the devs who use these hash functions would too!

schillebeeckx · September 19, 2018, 12:13pm

Hey Colin, happy to be hearing from you. Thanks for confirming my thoughts; I'll let you know if I find any clashes!

Mehrbod_Estaki · September 19, 2018, 10:27pm

I say we start a pool to guess the taxa origin of the first ASV hash clash. Winner takes all.
$50 one will be from Lachnospiraceae

John_Chase · September 20, 2018, 6:52pm

Is it safe to assume the representative sequence cannot be “reverse engineered” if one only has the ASV ID?

I would argue that it is not safe to assume that the sequence can't be found. But this question really depends on what level of privacy is needed. If one were so inclined, it would not be too hard to create hashes for different 16S regions from a database such as Greengenes and compare those against a data set to recover the sequence from the hash. Of course this would not work on sequences that were not in a reference database, and someone would have to have a lot of time on their hands and be really interested in what you were doing.
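
A toy sketch of that lookup approach, assuming the IDs are MD5 hex digests of the sequence strings. The "reference" entries below are invented placeholders, not real database records, and a real attempt would also need to reproduce the same primers, trimming, and orientation as the original data set.

```python
import hashlib


def md5_id(sequence: str) -> str:
    """Hypothetical helper: hex MD5 digest of a sequence string."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


# Made-up stand-in for amplicon regions extracted from a reference database
reference_regions = {
    "ref_taxon_A": "ACGTACGTACGTAAAC",
    "ref_taxon_B": "TTGACCGGTTAACCGG",
}

# Precompute hash -> (reference name, sequence)
lookup = {md5_id(seq): (name, seq) for name, seq in reference_regions.items()}

# IDs as they might appear in a shared feature table (sequences withheld)
observed_ids = [
    md5_id("ACGTACGTACGTAAAC"),  # happens to be in the reference
    md5_id("GGGGCCCCAAAATTTT"),  # novel sequence, not in the reference
]

# Only IDs whose sequence is in the reference can be recovered
recovered = {fid: lookup[fid][1] for fid in observed_ids if fid in lookup}

print(len(recovered))  # 1
```

The novel sequence stays hidden in this example, which matches the caveat above: the lookup only works for sequences that are already in the reference.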

This question has actually come up in my work, and I am curious about others' thoughts as well.

colinbrislawn · September 21, 2018, 6:12pm

I agree. Well said.

We need to unambiguously state that the md5 hash is too fast to be cryptographically secure, and assume that folks can figure out the source sequence. This should not be taken as a privacy or security measure.

Colin
