I’m trying to compare the resulting sequences of a few different filtering pipelines. I’ve found that the same sequence shares the same HashIDs (good!) when comparing sequences filtered through Deblur and DADA2, but I can’t seem to find the same HashIDs in the Vsearch output.
I can confirm that both the Vsearch and Deblur and DADA2 representative sequences are trimmed to the same length, and that they were both given the same raw input fastq files to work with. What’s strange is that the number of characters of the HashIDs are different between Vsearch, and Deblur/DADA2 rep seqs. DADA2 and Deblur are always 32 characters, while Vsearch is always 40. I also noticed that the representative sequences in Vsearch’s fasta file are wrapped at 80 characters, but the DADA2 and Deblur fasta sequences are unwrapped.
I’m curious where exactly these labels are assigned - thanks for your comments!
Great question, I had never thought to look at this before now!
Looking at the code for q2-vsearch, it looks like we are passing the argument --relable_sha1. SHA1 is a different hash that what we use for DADA2/Deblur which is MD5. I think this explains the discrepancy in length of the hash and why none of the hashes match.
There’s no mechanism to re-hash the sequences, but since it sounds like you are doing a comparison of techniques, it wouldn’t be hard to programmatically reprocess the FASTA file.
This is likely a result of matching usearch’s API exactly than anything very intentional. In any case, it shouldn’t have an effect here
Could you point me to the code where the HashIDs are processed for Deblur/DADA2? I’m hoping there is something where I can just feed in the Vsearch-produced sequences and generate an MD5 hash’d label to replace the existing sha1 with.
If it’s not terribly difficulty, I could see how this might be helpful as an additional argument within the current vsearch plugin (so even after vsearch does it’s sha1 assignment, maybe add a layer that takes the resulting repseqs.qza fasta artifact and then reformat it into MD5). Something like:
Of course! An option would make sense, and it should be straight forward to conditionally reprocess the fasta file with MD5.
Here’s the location of the sha1 flag:
To relabel the IDs, you’ll need to modify the dereplicated_sequences object, this is just a fancy reference toi the file somewhere on disk (call str(dereplicated_sequences) for the filepath, or .path for the newer pathlib style object).
You’ll also want to conditionally change the id_map that way the table agrees on the new ID names.
Unfortunately I don’t speak Python so I’m fighting my way through the syntax now.
What I was thinking was that rather than modifying any sort of sha1 tag, I’d retain what is produced in the resulting output, then append it with the md5 tab. I think I see that in the deblur code you do it with just a few lines.
Could I just plug in a bit of code like that to the existing vsearch Python script?
My current brute force approach is going to be to just manually use vsearch to relabel the Deblur and DADA2 fasta files into the Sha1 style so that everything matches.
I’m interested in supporting the q2-vsearch plugin, and this thread brought up some interesting points. (Feel free to split this into the developer discussion.)
Which hash should we use for unique features?
Is there a recommended hash? We could easily switch vsearch over to md5 to match dada2, or we could expose a new option to let users choose a hash, as suggested by @devonorourke.
How long should lines be?
Vsearch can make unwrapped lines by passing --fasta_width 0, but I’m not sure what is the convention for Qiime 2.
Correct! Or maybe we could patch q2-vsearch so it matches by default
Come on @colinbrislawn - I’ve got a paper to write and a dissertation to finish up - no more ideas, just solutions!
Granted, I know nothing about the merits of one hashing algorithm over another, but assuming they’re both equal in their fidelity I’d opt for the one that produces the shorter string (which seems to be MD5). I’m sure the considerations aren’t that simple, but that’s my 2 cents.
Adding in the --fasta_width 0 parameter would help too, but I’m glad that the reason my hash IDs were different between vsearch and DADA2/Deblur had nothing to do with the number of lines of the sequence string!
That’s great! I had assumed that there was no such option, as it seemed weird to use SHA1 where MD5 is used elsewhere…
I like this option the best I think.
QIIME 2 has to handle either situation, so it doesn’t ultimately matter, but fixed-width fasta files are harder to parse and manipulate with unix tools, and just add extra bytes to the file for no reason.
I’m pretty sure line-wrapping has been available in terminals since the VT100, so in my personal opinion, fixed-width fasta is an abomination
Hey @colinbrislawn and @ebolyen,
I was trying my best to update this code but it doesn’t seem like it’s taking any effect.
I went into the _cluster_sequences.py script and made the following changes (comments to the right of the two lines with changes:
def dereplicate_sequences(sequences: QIIME1DemuxDirFmt,
derep_prefix: bool = False
)-> (biom.Table, DNAFASTAFormat):
dereplicated_sequences = DNAFASTAFormat()
with tempfile.NamedTemporaryFile(mode='w+') as out_uc:
seqs_fp = '%s/seqs.fna' % str(sequences)
cmd = ['vsearch',
'--relabel_md5', '--relabel_keep', ##changed to `_md5` from `sha1`
'--fasta_width 0', ## added this parameter to textwrap fasta string
But nothing seemed to happen. After I saved the file, I then started up the Conda environment where the script was associated with, and it started up with the generic QIIME is cashing your current deployment… message. I then wran my vsearch command:
Yet the resulting dna-sequences.fasta object neither hand the appropriate md5 header nor was it text wrapped.
@ebolyen mentioned that I need to change the id_map? Is there something else I should be editing?
I thought maybe it was because of the --relabel_keep argument being left in… but I tested that and it didn’t seem to make a difference if it was included or not (and at very lest, the fasta file should have been wrapped)
How would I go about finding that out?
I think with my version 2018.8 I possible had installed the version I pulled from your Git repo because months ago you patched the problem where vsearch/blast lacked the consensus length parameter; in the 2018.11 version I’m pretty sure that was made generally available. I tried updating the code in both versions and got the same result though…