Computing Feature hash ID

WeedCentipede · September 1, 2021, 9:49pm

Hello there QIIME 2 forum,

I have a question about how sequence IDs are generated. I ran vsearch-dereplicate to obtain a feature table and sequences from Sanger amplicon FASTQ files (each amplicon comes from a specific strain with its own code), which gave each strain an md5 hash. Now I'm trying to assign the strain codes to each of the dereplicated sequences. I tried the following code on the original sequences:

from Bio import SeqIO
import hashlib

for record in SeqIO.parse("B.fa", "fasta"):
    print(record.id + "-" + hashlib.md5(str(record.seq)).hexdigest())

This returned a different md5 hash than QIIME 2, e.g. for the same sequence:

md5sum with python script: 2464604af7c0679fa3d0725dc35aa2bb
md5sum with qiime2: 7aab8b62bccc1fd45db47a97aaa5a0aedfc3d944

Am I computing the md5 hash correctly? I followed the code referenced here: How are ASV IDs generated?

Cheers,
Luis Alfonso.

timanix · September 2, 2021, 8:08am

Hi @WeedCentipede,
Could you try encoding the sequence as UTF-8 first?

Like this:

hashlib.md5(str(record.seq).encode('utf-8'))
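
Here is a minimal sketch of that suggestion, assuming Python 3, where the `hashlib` functions take bytes rather than `str`. It also prints the SHA-1 digest for comparison, because an MD5 hex digest is 32 characters long and a SHA-1 hex digest is 40, which is the length of the QIIME 2 ID quoted above. The helper name `sequence_digests` and the example sequence are made up for illustration, and upper-casing the sequence before hashing is an assumption to check against your own data.

```python
import hashlib


def sequence_digests(sequence: str) -> dict:
    """Return MD5 and SHA-1 hex digests of a nucleotide sequence."""
    # Assumption: hash the upper-cased sequence without surrounding whitespace.
    data = sequence.strip().upper().encode("utf-8")
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
    }


digests = sequence_digests("ACGTACGTTAGC")  # hypothetical sequence
for name, value in digests.items():
    print(f"{name}: {value} ({len(value)} hex characters)")
```

If the ID you are trying to reproduce has 40 hex characters, comparing it against the `sha1` value rather than the `md5` value may be worth a try.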

WeedCentipede · September 6, 2021, 8:56pm

Hi @timanix

I tried it and it didn't work... I think it might be the way I'm parsing the FASTA file.
I took the shortcut and used vsearch --search_exact instead, and it worked like a charm.
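
For anyone who wants to stay in Python, the same idea (exact full-length matching between the original strain sequences and the dereplicated features) can be sketched with a dictionary lookup. This is not how vsearch works internally; it only assumes that a strain sequence and its feature are identical after upper-casing. The names `read_fasta` and `map_strains_to_features` and the inline records are hypothetical.

```python
from collections import defaultdict
from io import StringIO


def read_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA file handle."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # Keep only the first whitespace-delimited token as the ID.
            record_id, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def map_strains_to_features(strain_handle, feature_handle):
    """Map each feature ID to the strain codes with an identical sequence."""
    feature_by_seq = {seq.upper(): fid for fid, seq in read_fasta(feature_handle)}
    mapping = defaultdict(list)
    for strain, seq in read_fasta(strain_handle):
        feature_id = feature_by_seq.get(seq.upper())
        if feature_id is not None:
            mapping[feature_id].append(strain)
    return dict(mapping)


# Hypothetical in-memory FASTA content standing in for real files.
strains = StringIO(">strainA\nACGT\nACGT\n>strainB\nTTTT\n>strainC\nacgtacgt\n")
features = StringIO(">feat1\nACGTACGT\n>feat2\nTTTT\n")
strain_map = map_strains_to_features(strains, features)
print(strain_map)
```

With real data you would pass open file handles for the original FASTA and the exported dereplicated sequences instead of the `StringIO` objects.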

Cheers,
Luis.
