Where does HashID get assigned in DADA2, Deblur, and Vsearch?

devonorourke · January 8, 2019, 4:30pm

I'm trying to compare the resulting sequences of a few different filtering pipelines. I've found that the same sequence shares the same HashIDs (good!) when comparing sequences filtered through Deblur and DADA2, but I can't seem to find the same HashIDs in the Vsearch output.

I can confirm that both the Vsearch and Deblur and DADA2 representative sequences are trimmed to the same length, and that they were both given the same raw input fastq files to work with. What's strange is that the number of characters of the HashIDs are different between Vsearch, and Deblur/DADA2 rep seqs. DADA2 and Deblur are always 32 characters, while Vsearch is always 40. I also noticed that the representative sequences in Vsearch's fasta file are wrapped at 80 characters, but the DADA2 and Deblur fasta sequences are unwrapped.

I'm curious where exactly these labels are assigned - thanks for your comments!

ebolyen · January 8, 2019, 5:19pm

Hey @devonorourke,

Great question, I had never thought to look at this before now!

Looking at the code for q2-vsearch, it looks like we are passing the argument --relable_sha1. SHA1 is a different hash that what we use for DADA2/Deblur which is MD5. I think this explains the discrepancy in length of the hash and why none of the hashes match.

There's no mechanism to re-hash the sequences, but since it sounds like you are doing a comparison of techniques, it wouldn't be hard to programmatically reprocess the FASTA file.

This is likely a result of matching usearch's API exactly than anything very intentional. In any case, it shouldn't have an effect here

devonorourke · January 8, 2019, 5:27pm

Thanks @ebolyen,
Could you point me to the code where the HashIDs are processed for Deblur/DADA2? I'm hoping there is something where I can just feed in the Vsearch-produced sequences and generate an MD5 hash'd label to replace the existing sha1 with.
If it's not terribly difficulty, I could see how this might be helpful as an additional argument within the current vsearch plugin (so even after vsearch does it's sha1 assignment, maybe add a layer that takes the resulting repseqs.qza fasta artifact and then reformat it into MD5). Something like:

qiime vsearch dereplicate-sequences \
  --i-sequences input.seqs.qza \
  --output-dir $DIRECTORY \
  --p-makeMD5

where the --p-makeMD5 is just a wrapper for whatever function you use in DADA2/Deblur...

ebolyen · January 8, 2019, 5:32pm

Of course! An option would make sense, and it should be straight forward to conditionally reprocess the fasta file with MD5.

Here's the location of the sha1 flag:

github.com

qiime2/q2-vsearch/blob/9d7aa9dd4689a19e3bb2f9a0631208920b9aaeb8/q2_vsearch/_cluster_sequences.py#L126


      
          
          def dereplicate_sequences(sequences: QIIME1DemuxDirFmt,
                                    derep_prefix: bool = False
                                    )-> (biom.Table, DNAFASTAFormat):
              dereplicated_sequences = DNAFASTAFormat()
              with tempfile.NamedTemporaryFile(mode='w+') as out_uc:
                  seqs_fp = '%s/seqs.fna' % str(sequences)
                  cmd = ['vsearch',
                         '--derep_fulllength', seqs_fp,
                         '--output', str(dereplicated_sequences),
                         '--relabel_sha1', '--relabel_keep',
                         '--uc', out_uc.name,
                         '--qmask', 'none',
                         '--xsize']
                  if derep_prefix:
                      cmd[1] = '--derep_prefix'
                  run_command(cmd)
                  out_uc.seek(0)
                  table = _parse_uc(out_uc)
              id_map = {e.metadata['description']: e.metadata['id']
                        for e in skbio.io.read(str(dereplicated_sequences),

To relabel the IDs, you'll need to modify the dereplicated_sequences object, this is just a fancy reference toi the file somewhere on disk (call str(dereplicated_sequences) for the filepath, or .path for the newer pathlib style object).

You'll also want to conditionally change the id_map that way the table agrees on the new ID names.

devonorourke · January 8, 2019, 5:47pm

Awesome @ebolyen,
Unfortunately I don't speak Python so I'm fighting my way through the syntax now.
What I was thinking was that rather than modifying any sort of sha1 tag, I'd retain what is produced in the resulting output, then append it with the md5 tab. I think I see that in the deblur code you do it with just a few lines.
Could I just plug in a bit of code like that to the existing vsearch Python script?

My current brute force approach is going to be to just manually use vsearch to relabel the Deblur and DADA2 fasta files into the Sha1 style so that everything matches.

Thanks for the quick reply!

devonorourke · January 8, 2019, 5:52pm

Hold on a sec...
Maybe this was what you were saying all along..??

There's a -relabel_md5 function in Vsearch!

Couldn't I just substitute the '--relabel_sha1' command (line 126 of the script you shared) with --relabel_md5 and then everything would match across all sequencing filtering pipelines?

colinbrislawn · January 8, 2019, 5:58pm

Good morning!

I'm interested in supporting the q2-vsearch plugin, and this thread brought up some interesting points. (Feel free to split this into the developer discussion.)

Which hash should we use for unique features?
Is there a recommended hash? We could easily switch vsearch over to md5 to match dada2, or we could expose a new option to let users choose a hash, as suggested by @devonorourke.
How long should lines be?
Vsearch can make unwrapped lines by passing --fasta_width 0, but I'm not sure what is the convention for Qiime 2.

Correct! Or maybe we could patch q2-vsearch so it matches by default

Colin

devonorourke · January 8, 2019, 6:02pm

Come on @colinbrislawn - I've got a paper to write and a dissertation to finish up - no more ideas, just solutions!

Granted, I know nothing about the merits of one hashing algorithm over another, but assuming they're both equal in their fidelity I'd opt for the one that produces the shorter string (which seems to be MD5). I'm sure the considerations aren't that simple, but that's my 2 cents.

Adding in the --fasta_width 0 parameter would help too, but I'm glad that the reason my hash IDs were different between vsearch and DADA2/Deblur had nothing to do with the number of lines of the sequence string!

ebolyen · January 8, 2019, 7:11pm

That's great! I had assumed that there was no such option, as it seemed weird to use SHA1 where MD5 is used elsewhere...

I like this option the best I think.

QIIME 2 has to handle either situation, so it doesn't ultimately matter, but fixed-width fasta files are harder to parse and manipulate with unix tools, and just add extra bytes to the file for no reason.
I'm pretty sure line-wrapping has been available in terminals since the VT100, so in my personal opinion, fixed-width fasta is an abomination

colinbrislawn · January 8, 2019, 11:08pm

Cool cool.

Plot twist: these issues are already on github, and past-Colin volunteered to work on them 6 months ago.

Issue #55 for different hash IDs:
https://github.com/qiime2/q2-vsearch/issues/55

Issue #48 for fasta wrapping:
https://github.com/qiime2/q2-vsearch/issues/48

Let's continue this conversation on Github so Devon can get back to his dissertation.
Colin

devonorourke · February 6, 2019, 7:55pm

Hey @colinbrislawn and @ebolyen,
I was trying my best to update this code but it doesn't seem like it's taking any effect.
I went into the _cluster_sequences.py script and made the following changes (comments to the right of the two lines with changes:

def dereplicate_sequences(sequences: QIIME1DemuxDirFmt,
                          derep_prefix: bool = False
                          )-> (biom.Table, DNAFASTAFormat):
    dereplicated_sequences = DNAFASTAFormat()
    with tempfile.NamedTemporaryFile(mode='w+') as out_uc:
        seqs_fp = '%s/seqs.fna' % str(sequences)
        cmd = ['vsearch',
               '--derep_fulllength', seqs_fp,
               '--output', str(dereplicated_sequences),
               '--relabel_md5', '--relabel_keep',  ##changed to `_md5` from `sha1` 
               '--fasta_width 0',  ## added this parameter to textwrap fasta string
               '--uc', out_uc.name,

But nothing seemed to happen. After I saved the file, I then started up the Conda environment where the script was associated with, and it started up with the generic QIIME is cashing your current deployment... message. I then wran my vsearch command:

qiime vsearch cluster-features-de-novo \
--i-sequences p41.vsrch.nonchimera.seqs.qza \
--i-table p41.vsrch.postChimera.table.qza \
--p-perc-identity 0.97 \
--p-threads 4 \
--o-clustered-table p41.vsrch.postChimera.p97clust.table.qza \
--o-clustered-sequences p41.vsrch.postChimera.p97clust.seqs.qza

qiime tools export --input-path p41.vsrch.postChimera.p97clust.seqs.qza --output-path p41.p97clust.seqs

Yet the resulting dna-sequences.fasta object neither hand the appropriate md5 header nor was it text wrapped.

@ebolyen mentioned that I need to change the id_map? Is there something else I should be editing?

I thought maybe it was because of the --relabel_keep argument being left in... but I tested that and it didn't seem to make a difference if it was included or not (and at very lest, the fasta file should have been wrapped)

Nicholas_Bokulich · February 7, 2019, 6:52pm

@devonorourke do you have q2-vsearch installed in development mode?

devonorourke · February 7, 2019, 7:18pm

How would I go about finding that out?
I think with my version 2018.8 I possible had installed the version I pulled from your Git repo because months ago you patched the problem where vsearch/blast lacked the consensus length parameter; in the 2018.11 version I'm pretty sure that was made generally available. I tried updating the code in both versions and got the same result though...

Nicholas_Bokulich · February 7, 2019, 10:25pm

Install like so (skip first step obviously if you have already cloned the repository):

git clone https://github.com/nbokulich/q2-vsearch.git
cd q2-diversity
pip install -e .

Make your changes to the script in that directory (if you have not done so already), and all will be well.

thermokarst · February 7, 2019, 11:03pm

You can also run make dev in all of our plugin repos (this runs the same pip command that @Nicholas_Bokulich mentioned above).

system · March 11, 2019, 5:03am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.