Hello there forum,
I have a question about how sequence IDs are generated. I ran vsearch-dereplicate to obtain a feature table and sequences from Sanger amplicon FASTQ files (each amplicon comes from a specific strain with its own code), which gave each strain an MD5 hash. Now I'm trying to assign the strain codes to each of the dereplicated sequences. I tried the following code on the original sequences:
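from Bio import SeqIO
import hashlib

# Print each record ID followed by the MD5 hex digest of its sequence
for record in SeqIO.parse("B.fa", "fasta"):
    # hashlib needs bytes in Python 3, so encode the sequence string first
    print(record.id + "-" + hashlib.md5(str(record.seq).encode("utf-8")).hexdigest())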
This returned a different hash than QIIME 2 did, e.g. for the same sequence:
md5sum with python script: 2464604af7c0679fa3d0725dc35aa2bb
md5sum with qiime2: 7aab8b62bccc1fd45db47a97aaa5a0aedfc3d944
Am I computing the MD5 hash correctly? I followed the code referenced here: How are ASV IDs generated?
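One thing I noticed while comparing the two: the ID from my script is 32 hex characters, while the QIIME 2 ID is 40, which is the length of a SHA-1 digest rather than an MD5 one. Below is a minimal sketch (no Biopython needed) that computes both digests for the same string so the lengths can be compared. The helper name `sequence_ids` and the "ACGT" input are just placeholders for illustration, and whether vsearch-dereplicate really labels sequences with SHA-1 is only my guess from the length.

```python
import hashlib


def sequence_ids(seq):
    """Return (md5_hex, sha1_hex) for a sequence string."""
    data = seq.encode("utf-8")  # hashlib operates on bytes, not str
    return hashlib.md5(data).hexdigest(), hashlib.sha1(data).hexdigest()


# "ACGT" is a made-up placeholder, not one of my real sequences
md5_id, sha1_id = sequence_ids("ACGT")
print(len(md5_id), len(sha1_id))  # 32 40
```

If the SHA-1 of one of my original sequences matched its QIIME 2 ID, that would explain the mismatch, although differences in case or trailing whitespace in the sequence would also change either digest.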
Cheers,
Luis Alfonso.