Where does HashID get assigned in DADA2, Deblur, and Vsearch?

(Devon O'rourke) #1

I’m trying to compare the resulting sequences of a few different filtering pipelines. I’ve found that the same sequence shares the same HashIDs (good!) when comparing sequences filtered through Deblur and DADA2, but I can’t seem to find the same HashIDs in the Vsearch output.

I can confirm that both the Vsearch and Deblur and DADA2 representative sequences are trimmed to the same length, and that they were both given the same raw input fastq files to work with. What’s strange is that the number of characters of the HashIDs are different between Vsearch, and Deblur/DADA2 rep seqs. DADA2 and Deblur are always 32 characters, while Vsearch is always 40. I also noticed that the representative sequences in Vsearch’s fasta file are wrapped at 80 characters, but the DADA2 and Deblur fasta sequences are unwrapped.

I’m curious where exactly these labels are assigned - thanks for your comments!

(Evan Bolyen) #2

(Evan Bolyen) #3

Hey @devonorourke,

Great question, I had never thought to look at this before now!

Looking at the code for q2-vsearch, it looks like we are passing the argument --relable_sha1. SHA1 is a different hash that what we use for DADA2/Deblur which is MD5. I think this explains the discrepancy in length of the hash and why none of the hashes match.

There’s no mechanism to re-hash the sequences, but since it sounds like you are doing a comparison of techniques, it wouldn’t be hard to programmatically reprocess the FASTA file.

This is likely a result of matching usearch’s API exactly than anything very intentional. In any case, it shouldn’t have an effect here :slight_smile:

(Evan Bolyen) #4

(Devon O'rourke) #5

Thanks @ebolyen,
Could you point me to the code where the HashIDs are processed for Deblur/DADA2? I’m hoping there is something where I can just feed in the Vsearch-produced sequences and generate an MD5 hash’d label to replace the existing sha1 with.
If it’s not terribly difficulty, I could see how this might be helpful as an additional argument within the current vsearch plugin (so even after vsearch does it’s sha1 assignment, maybe add a layer that takes the resulting repseqs.qza fasta artifact and then reformat it into MD5). Something like:

qiime vsearch dereplicate-sequences \
  --i-sequences input.seqs.qza \
  --output-dir $DIRECTORY \

where the --p-makeMD5 is just a wrapper for whatever function you use in DADA2/Deblur…

(Evan Bolyen) #6

Of course! An option would make sense, and it should be straight forward to conditionally reprocess the fasta file with MD5.

Here’s the location of the sha1 flag:

To relabel the IDs, you’ll need to modify the dereplicated_sequences object, this is just a fancy reference toi the file somewhere on disk (call str(dereplicated_sequences) for the filepath, or .path for the newer pathlib style object).

You’ll also want to conditionally change the id_map that way the table agrees on the new ID names.

(Devon O'rourke) #7

Awesome @ebolyen,
Unfortunately I don’t speak Python :sob: so I’m fighting my way through the syntax now.
What I was thinking was that rather than modifying any sort of sha1 tag, I’d retain what is produced in the resulting output, then append it with the md5 tab. I think I see that in the deblur code you do it with just a few lines.
Could I just plug in a bit of code like that to the existing vsearch Python script?

My current brute force approach is going to be to just manually use vsearch to relabel the Deblur and DADA2 fasta files into the Sha1 style so that everything matches.

Thanks for the quick reply!

(Devon O'rourke) #8

Hold on a sec…
Maybe this was what you were saying all along…??

There’s a -relabel_md5 function in Vsearch!

Couldn’t I just substitute the '--relabel_sha1' command (line 126 of the script you shared) with --relabel_md5 and then everything would match across all sequencing filtering pipelines?

(Colin Brislawn) #9

Good morning!

I’m interested in supporting the q2-vsearch plugin, and this thread brought up some interesting points. (Feel free to split this into the developer discussion.)

  • Which hash should we use for unique features?
    Is there a recommended hash? We could easily switch vsearch over to md5 to match dada2, or we could expose a new option to let users choose a hash, as suggested by @devonorourke.
  • How long should lines be?
    Vsearch can make unwrapped lines by passing --fasta_width 0, but I’m not sure what is the convention for Qiime 2.

Correct! Or maybe we could patch q2-vsearch so it matches by default :wink:


(Devon O'rourke) #10

Come on @colinbrislawn - I’ve got a paper to write and a dissertation to finish up - no more ideas, just solutions! :smiley:

Granted, I know nothing about the merits of one hashing algorithm over another, but assuming they’re both equal in their fidelity I’d opt for the one that produces the shorter string (which seems to be MD5). I’m sure the considerations aren’t that simple, but that’s my 2 cents.

Adding in the --fasta_width 0 parameter would help too, but I’m glad that the reason my hash IDs were different between vsearch and DADA2/Deblur had nothing to do with the number of lines of the sequence string!

(Evan Bolyen) #11

That’s great! I had assumed that there was no such option, as it seemed weird to use SHA1 where MD5 is used elsewhere…


I like this option the best I think.

QIIME 2 has to handle either situation, so it doesn’t ultimately matter, but fixed-width fasta files are harder to parse and manipulate with unix tools, and just add extra bytes to the file for no reason.
I’m pretty sure line-wrapping has been available in terminals since the VT100, so in my personal opinion, fixed-width fasta is an abomination :smiley:

(Colin Brislawn) #13

Cool cool.

Plot twist: these issues are already on github, and past-Colin volunteered to work on them 6 months ago. :scream_cat:

Issue #55 for different hash IDs:

Issue #48 for fasta wrapping:

Let’s continue this conversation on Github so Devon can get back to his dissertation. :writing_hand: