How to convert DADA2 --p-no-hashed-feature-ids file to --p-hashed-feature-ids file?

Dear QIIME2,

I have analyzed my study sequence files (from 2 runs) with QIIME2 using “--p-hashed-feature-ids”. I wanted to compare my results with another important published study of the same population. I obtained the sequence files from ENA, but the ENA sequence files had a lot of broken extra lines in almost all of the individual files and thus could not be imported into QIIME2. Therefore, I used the DADA2 R package and generated table.qza and rep-seqs.qza as instructed here. As indicated there, this table cannot be merged with other tables because it is equivalent to a feature_table.qza produced with the --p-no-hashed-feature-ids flag. Now, how can I convert this “--p-no-hashed-feature-ids” table.qza to a “--p-hashed-feature-ids” table.qza so that I can merge these tables? Also, what modification do I need to make to the rep-seqs.qza file?

Regards,
Nazmul


Hi @nhuda6,

There might be a tool to "clean up" fastq files so that QIIME 2 has a better time consuming the data.

Are the primers between your study and the ENA files the same? And are the trim/truncation parameters used in DADA2 the same, meaning that the reads in both datasets have the chance to turn into the exact same ASVs (in length and content)?

If so, then the good news is that our --p-hashed-feature-ids are just MD5 sums of the sequences. That means if you have the same sequence, you will always get the same MD5 sum. I'm not super familiar with R programming, but MD5 is such a widely used algorithm that you should be able to do some processing in R before writing it to a file.
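If you would rather do the relabeling after the fact, here is a minimal sketch using the QIIME 2 Artifact API (Python rather than R). It assumes your QIIME 2 version has the usual transformers for viewing a feature table as a pandas DataFrame and rep-seqs as a pandas Series; md5_id is just a helper I made up for this example, and the output file names are illustrative:

import hashlib

import pandas as pd
from qiime2 import Artifact

def md5_id(sequence):
    # Assumed to match what --p-hashed-feature-ids does: the MD5 hex
    # digest of the bare sequence string (no trailing newline).
    return hashlib.md5(sequence.encode('utf-8')).hexdigest()

# Feature table: with hashing off, the column labels are the ASV sequences.
table = Artifact.load('table.qza').view(pd.DataFrame)
table.columns = [md5_id(seq) for seq in table.columns]
Artifact.import_data('FeatureTable[Frequency]', table).save('table-hashed.qza')

# Representative sequences: the values are the sequences themselves, and
# the index holds the same sequence-based IDs, so re-key it the same way.
rep_seqs = Artifact.load('rep-seqs.qza').view(pd.Series)
rep_seqs.index = [md5_id(str(seq)) for seq in rep_seqs]
Artifact.import_data('FeatureData[Sequence]', rep_seqs).save('rep-seqs-hashed.qza')

Since the table and the rep-seqs are re-keyed with the same digest, downstream steps that join on feature IDs should still line up.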

It may also be easier to re-run the QIIME2 DADA2 step with --p-no-hashed-feature-ids. The only reason we use the hashed version is that the IDs are a bit shorter and easier to keep track of. Otherwise there is (usually) no difference to the computer which variety of ID you use.
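For reference, --p-no-hashed-feature-ids is spelled hashed_feature_ids=False in the Artifact API. A sketch, where the file names and truncation values are placeholders and other parameter names may vary a bit between releases:

from qiime2 import Artifact
from qiime2.plugins.dada2.methods import denoise_paired

# With hashing off, the feature IDs will be the full sequences,
# matching what the DADA2 R workflow produces.
result = denoise_paired(
    demultiplexed_seqs=Artifact.load('demux.qza'),  # illustrative name
    trunc_len_f=250, trunc_len_r=200,               # illustrative values
    hashed_feature_ids=False,
)
result.table.save('table.qza')
result.representative_sequences.save('rep-seqs.qza')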

Hello @ebolyen

Thank you so much for your reply. I have the following follow-up questions.

I did search for a tool that can clean up a number of fastq files in a folder but could not find any. Since I am not an expert, I believe there may be some tools to do it. Please recommend one if you have any in mind.

Yes, the primers are the same.

No, I used different trim/truncation parameters for all 3 analyses (2 runs from my study [which have overlapping sample IDs] and 1 from ENA). After processing, the lengths are different, but I believe the content is equivalent.

Please correct me if I am wrong. My understanding is that DADA2 does the error correction on a single lane at a time. We may need to use different trim/truncation parameters for different runs depending on the quality of each Illumina run. I think QIIME2 addresses this issue brilliantly with the --p-hashed-feature-ids option, which allows us to merge multiple feature tables from multiple runs with different quality, different trim/truncation parameters, and overlapping sample IDs.

Thank you again for developing QIIME2 and your kind support.

Regards,
Nazmul

Hi @nhuda6!

That is all true; however, the trim/truncation parameters need to be carefully selected to make sure that your reads have the same opportunity to become the same ASVs between runs.

The hashing is unfortunately not quite so magical: since it is computed from the sequence, it inherits the sequence's sensitivity to length. Observe the following:

$ echo ACGT | md5sum
58ce66d7df0a1cf9b360cabf43da3ea5  -
$ echo AACGT | md5sum
64457308e4632188163a6bd53a9385d8  -

Here I've taken the md5sum of two sequences, ACGT and AACGT. They differ only by one extra leading base — exactly what a different trim position would produce — yet the MD5 algorithm assigns them wildly different values. (Strictly speaking, echo appends a trailing newline, so the digests above won't exactly match the IDs QIIME 2 would assign to the bare sequences, but the point stands.) The exact same algorithm is used when you pass --p-hashed-feature-ids, so it will have the same difficulty when your trim/trunc parameters aren't consistent.

Now there is a bit of subtlety here. If you have paired-end data, you really only need to synchronize the trim-left parameters between runs. The truncation length can be variable, because the forward and reverse reads will be merged, which means it doesn't matter quite as much where you truncate things (the read-pair will be dropped if it fails to overlap). However you still need to make sure that your sequences all start at the same position on the forward reads and reverse reads.

As an example, if I had 3 runs, I might look at all of them and decide that setting a trim-left-f of 10 made sense. I would pass that same value for each run. I would also look at the reverse reads and decide that a trim-left-r of 0 made sense, and again pass that parameter for each run. The trunc-len can vary because these reads will be merged. The end result is that every final ASV starts at the exact same position with respect to your primer pair and is therefore comparable between runs, meaning your data can be merged in a meaningful way.
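To make that concrete, here is a rough Artifact API sketch of the 3-run example. The demux file names and trunc-len values are made up, and the merge signatures assume a reasonably recent QIIME 2 release:

from qiime2 import Artifact
from qiime2.plugins.dada2.methods import denoise_paired
from qiime2.plugins.feature_table.methods import merge, merge_seqs

# Same trim-left for every run; trunc-len chosen per run from its quality plot.
runs = {
    'run1-demux.qza': (240, 200),
    'run2-demux.qza': (250, 190),
    'run3-demux.qza': (245, 210),
}

tables, seqs = [], []
for demux_file, (trunc_f, trunc_r) in runs.items():
    result = denoise_paired(
        demultiplexed_seqs=Artifact.load(demux_file),
        trim_left_f=10, trim_left_r=0,              # identical across runs
        trunc_len_f=trunc_f, trunc_len_r=trunc_r,   # may differ per run
    )
    tables.append(result.table)
    seqs.append(result.representative_sequences)

# Because every ASV starts at the same position, the hashed IDs agree
# across runs and the tables can be merged meaningfully.
merge(tables=tables).merged_table.save('merged-table.qza')
merge_seqs(data=seqs).merged_data.save('merged-rep-seqs.qza')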

If you have single-end data, then the trunc-len must also match between runs, as your reads are not bounded by your reverse primer the way they are in the paired-end situation.

Sorry for all the text, I hope that was helpful.

Thank you so much. It helps. Now I understand the hashing better.

Regards,
Nazmul

