I have a quick question about the md5 hash assignment for feature IDs in rep-seqs files. I recently read this post (Feature-table merge and merge-seqs does not eliminate duplicate sequences) after running into the same issue: I need to merge rep-seqs files from different runs, where one run has md5 hashes as the feature IDs while the other two runs have the sequences themselves as the feature IDs.
My question is: does the md5 hash have some sort of meaning? That is, is a specific hash assigned based on the sequence, or is it a random generator that would give a different set of feature IDs for each rep-seqs file? I am trying to decide whether it is worth deblurring again with the --p-no-hashed-feature-ids parameter to generate the md5 hashes, or writing a python script that makes unique feature IDs consistent across the runs (I could also generate md5 hashes this way, as described in the other post).
If I were to make a python script, which may be quicker, do you know if this would affect downstream analysis in any way?
Yes, the md5sum will be "unique" for each unique sequence: a given sequence will always produce the same md5sum. (I put "unique" in scare quotes because there is an extremely small possibility that two different text strings will produce the same md5sum, i.e., a collision.)
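To make that concrete, here is a minimal sketch using Python's standard `hashlib` (the exact string QIIME 2 feeds to md5, e.g. its encoding or case handling, is an assumption here; the point is only that equal inputs always yield equal digests):

```python
import hashlib

def md5_feature_id(sequence: str) -> str:
    """Return the md5 hex digest of a sequence string.

    The digest depends only on the input bytes, so the same
    sequence gives the same ID on any machine, in any run.
    """
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()

seq = "ACGTACGTACGT"
# Deterministic: hashing the same sequence twice gives the same 32-char ID.
assert md5_feature_id(seq) == md5_feature_id(seq)
print(md5_feature_id(seq))
```

Because the digest depends only on the sequence, two rep-seqs files hashed this way independently will assign the same ID to the same sequence, which is what makes merging safe.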
A python script would be ideal — and something you could contribute as a method that others could use in QIIME 2! The bash script I wrote here will do the trick, but this issue has come up for QIIME 2 users a few times now so a formal QIIME 2 solution would be ideal. Please do write that script and submit a pull request to add it as a QIIME 2 method!
It should not, as long as features are relabeled correctly.
I have been working on this python script and am close to having a working version. However, I have only been learning python for the last few months, so it probably isn't at the level required to submit as a pull request.
I do have another question for you (when you get the chance; I know everyone is busy, so no rush). Is dereplication of the rep-seqs even necessary? That is, will a duplicated sequence be counted twice, resulting in errors in downstream analysis, or does QIIME 2 account for this in some way?
That’s exciting! How about this — you can submit this code as a pull request, with the expectation that the PR will not be merged (immediately, at least). Instead, we can use that as a place to comment on the code… hopefully this will be useful to you as you learn python, and it will help you work on and improve that code. Eventually the PR will be merged once the code is in good shape. It is your call entirely! The one caveat is that I cannot make any guarantees about how quickly any of the QIIME 2 devs can review and comment on your PR…
(similarly, you are always welcome to tackle any open issues in any of the QIIME 2/plugin repositories… we mark good beginner issues with "help wanted" and "good first issue" tags, so this is always a good way to get involved and learn some python in the process!)
Could you please clarify? Do you mean the post you linked above, about dereplicating duplicate rep-seqs that really are the same sequence but have different feature IDs? It is important that these get merged appropriately; otherwise they will be treated as unique features even though they are in fact the same sequence, which will completely skew any kind of, e.g., diversity analysis.

For example, take a single set of samples, process them identically, but label the rep-seq feature IDs in two different ways (md5 hashes vs. arbitrary IDs). After merging, those samples will appear 100% dissimilar merely because they contain "unique" features. Depending on the methods you are using in QIIME 2 this might be less of an issue in theory (phylogenetic diversity methods should not be as sensitive?), but many other methods would be very sensitive.
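For what it's worth, the relabeling idea can be sketched in a few lines of plain Python. This assumes simple FASTA text as input (`relabel_to_md5` is a hypothetical helper name; a proper QIIME 2 method would operate on the .qza artifact via the plugin framework instead), and it collapses records whose sequences are identical so they end up as a single feature:

```python
import hashlib

def relabel_to_md5(fasta_text: str) -> str:
    """Rewrite each FASTA record's ID as the md5 hex digest of its
    sequence, so identical sequences from different runs receive
    identical feature IDs. Original labels are discarded."""
    sequences = []            # one joined sequence string per record
    seq_lines = []
    for line in fasta_text.strip().splitlines():
        if line.startswith(">"):
            if seq_lines:
                sequences.append("".join(seq_lines))
            seq_lines = []
        else:
            seq_lines.append(line.strip())
    if seq_lines:
        sequences.append("".join(seq_lines))
    # Deduplicate while preserving order: identical sequences collapse
    # to a single record under the same md5 ID.
    seen = {}
    for seq in sequences:
        seen.setdefault(hashlib.md5(seq.encode("utf-8")).hexdigest(), seq)
    return "\n".join(f">{fid}\n{seq}" for fid, seq in seen.items())
```

Since identical sequences map to identical md5 IDs, merging files relabeled this way no longer produces the spurious 100% dissimilarity described above.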