Are Feature IDs hashed from reads?

Gil_Sharon · October 26, 2017, 8:20pm

Hi All,

This might be a silly thing to ask (or a suggestion): are Feature IDs (as in the name of an sOTU) assigned to deblurred or DADA2-denoised data in a pseudorandom manner, or are they hashed from the sequence (and thus contain that information)? They're definitely long enough to contain long sequences. I'm just wondering if there's a way to use them as a sanity check when crossing platforms (and reporting data).

Thanks!
Gil

jakereps · October 26, 2017, 8:55pm

You are correct! Feature IDs by default, for q2-dada2 and q2-deblur, are the md5 hash of the representative sequence itself. Hashing is optional in both plugins (it defaults to True) and can be changed with the hashed_feature_ids parameter for either method (--p-hashed-feature-ids / --p-no-hashed-feature-ids if you use the command line interface).
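For anyone who wants to check this by hand, here is a minimal sketch of computing an md5-based ID from a sequence in Python. The function name `feature_id` and the example sequence are made up for illustration, and the sketch assumes the sequence is hashed exactly as written (same letter case, no whitespace or newline), so treat it as a sanity check rather than the plugins' actual code.

```python
import hashlib


def feature_id(sequence: str) -> str:
    """Return the md5 hex digest of a sequence string.

    Hypothetical helper: assumes the ID is the md5 of the raw sequence
    text, encoded as UTF-8, with no trailing newline.
    """
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


# Made-up example sequence, not taken from any real dataset.
example = "TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAG"
print(feature_id(example))
```

The digest is always 32 hexadecimal characters regardless of sequence length, which is why the IDs look long but cannot hold the sequence itself.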

Gil_Sharon · October 26, 2017, 10:36pm

Cool!

Is there a way to decrypt hashed Feature IDs? As in a qiime tools unhash md5 (I couldn't find an online dictionary for ATCG-type words)?

thermokarst · October 26, 2017, 10:53pm

Hi @Gil_Sharon! Unfortunately there is not a way to reverse the md5 hash (MD5 is a cryptographic hash algorithm --- it is intended to be a one-way transformation for cryptographic purposes --- we use it in QIIME 2 because it is fast and relatively cheap to compute). As @jakereps mentioned, you can toggle hashing at runtime with q2-dada2 and q2-deblur, but this would require you to re-run your analyses from this step.

As far as using the hashed IDs for comparison purposes, the md5 sum of a given sequence is always the same (a hash function is deterministic), so if you see the same feature ID across datasets, it most likely comes from the same sequence. Technically hash collisions exist in the MD5 space, which means multiple sequences can hash to the same md5 sum, but in practice this isn't a problem for our purposes. You can run qiime feature-table tabulate-seqs to tabulate your seqs, which displays each feature ID alongside the actual sequence. This is pretty helpful if you want to get back to the original sequences for investigation purposes.
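To illustrate the idea, here is a rough Python sketch of both points: getting from an ID back to its sequence with a lookup table (since the hash itself can't be reversed), and comparing IDs between two datasets. All names (`md5_id`, `build_lookup`, `shared_ids`) and the short sequences are hypothetical, and it assumes the IDs are the md5 of the raw sequence text.

```python
import hashlib


def md5_id(sequence: str) -> str:
    """md5 hex digest of a sequence (assumed to match the hashed feature ID)."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


def build_lookup(sequences):
    """Map each hashed ID back to the sequence it was computed from."""
    return {md5_id(seq): seq for seq in sequences}


def shared_ids(ids_a, ids_b):
    """IDs present in both datasets, i.e. (most likely) identical sequences."""
    return set(ids_a) & set(ids_b)


# Made-up representative sequences for two hypothetical datasets.
dataset_a = ["ACGTACGT", "GGGTTTAA", "TTTTCCCC"]
dataset_b = ["GGGTTTAA", "CCCCAAAA"]

lookup_a = build_lookup(dataset_a)
lookup_b = build_lookup(dataset_b)

# Print the sequence behind every ID the two datasets have in common.
for fid in shared_ids(lookup_a, lookup_b):
    print(fid, lookup_a[fid])
```

The lookup only works for sequences you already have on hand, which is the same reason keeping your representative sequences around matters.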
