Merging DADA2 feature-tables. What about likely duplicate features in different tables?

lca123 · November 11, 2019, 2:37pm

Hi there,
I am about to filter and retain a lot of samples from several feature tables and then merge them into a single merged table.
These are soil samples; the sequencing runs used the same primers and conditions, and the same DADA2 processing. The only difference is that the data in one of those tables was processed with --p-maxee 0.5, while for the others I set --p-maxee 2. However, I don’t see a problem here.
What I am thinking is the following: I am merging different feature tables carrying different samples, but I am sure there are ASVs shared between those tables (not sure how many), because many samples came from very close locations. However, as they were processed independently, they’ll likely have different labels.
So, in the merging process, is there any “checking” for ASVs that are equal in terms of DNA sequence but have different labels? And if not, could someone give me a hint on how to deal with it? I am afraid of ending up with unrealistic abundance and diversity results because of it.
I am considering aligning all the features from the merged table and treating those with 100% identity as the same, but I’m not sure if this is OK.

All the best,

yanxianl · November 11, 2019, 3:31pm

Hi Leo, the same ASV will always have the same hashed feature ID when you denoise your reads in QIIME2. So you don’t need to worry about producing different labels for the same ASV when merging feature tables from different runs.
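
As a rough illustration of why the IDs line up across runs, here is a minimal Python sketch. It assumes the hashed feature ID is simply the MD5 hex digest of the ASV sequence string; `feature_id` is a hypothetical helper for illustration, not a QIIME 2 function.

```python
import hashlib


def feature_id(sequence):
    # Assumption: the hashed ID is the MD5 hex digest of the sequence text.
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


# The same ASV sequence recovered independently from two sequencing runs.
run1_id = feature_id("CATCATCAT")
run2_id = feature_id("CATCATCAT")

print(run1_id == run2_id)  # True: the ID depends only on the sequence
print(len(run1_id))        # 32 hexadecimal characters
```

Because the ID is a pure function of the sequence, no run-specific information goes into the label.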

For more information on merging feature table and representative sequences, check the qiime feature-table merge and qiime feature-table merge-seqs. There’s also a qiime2 tutorial on merging feature tables.

Yanxian

jwdebelius · November 11, 2019, 3:54pm

Hi @yanxianl and @lca123,

It’s worth noting that the MD5 hash is almost always unique, but not guaranteed to be, so if you want to be absolutely sure about the merge, don’t hash your feature IDs. It’s a bit more obnoxious to deal with the longer strings, but at least you know for certain that you didn’t lose any features.
(This, by the way, does assume that your ASVs are the same length and start from the same primer position).
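
To make that caveat concrete, here is a small Python sketch. It assumes the feature ID is the MD5 hex digest of the sequence string (`md5_id` is a hypothetical helper), and the toy sequences are made up. Reads covering the same region but trimmed or truncated differently count as different sequences, so they get unrelated IDs and would not be combined in a merge.

```python
import hashlib


def md5_id(sequence):
    # Assumption: feature ID = MD5 hex digest of the sequence string.
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


run1_asv = "CATCATCAT"
run2_asv = "ATCATCAT"  # same region, but the read starts one base later
run3_asv = "CATCATCA"  # same region, but truncated one base shorter

print(md5_id(run1_asv) == md5_id(run2_asv))  # False
print(md5_id(run1_asv) == md5_id(run3_asv))  # False
```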

Best,
Justine

lca123 · November 11, 2019, 4:33pm

Thank you guys!
I am still a bit confused here: a given ASV will likely have the same hash ID even though it came from different sequencing runs, right? Given that the length, primers, etc. are the same… Which means DADA2 somehow uses sequence composition to assign a hash ID and “knows” which hash ID to assign? (That’s my interpretation, but I should go back to the DADA2 paper and read a lot more.)

I am fine if the MD5 hash is almost always unique, so I could just merge the tables with any of the --p-overlap-method options, given that my sample IDs are unique and the hash IDs are likely unique. I’ll then get a table where ASVs tend not to be repeated, in terms of DNA sequence.
Is that right?

jwdebelius · November 11, 2019, 4:51pm

Hi @lca123,

Yep, if I have an ASV that is CATCATCAT in sequencing run 1 and I have CATCATCAT in sequencing run 2, then they will be merged as a single CATCATCAT ASV. The hash ID is an MD5 hash, which is a property of the ASV sequence and is, in fact, agnostic to the denoiser. I don’t think the math really matters, beyond the fact that it’s a 32-character hexadecimal string (a 128-bit digest) derived from your full ASV sequence. I’m also not actually sure if it’s a DADA2 implementation or a layer added in QIIME 2. But, to use my example, whether I got my CATCATCAT ASV from Deblur or DADA2, or even from clustering in vsearch, the hash ID for the sequence is 7a15ab6f986bdd5d39bcc3f2b6997d53.

Yes! Everything that mapped to 7a15ab6f986bdd5d39bcc3f2b6997d53 in each run will be retained, and the feature ID won’t be repeated.
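
For anyone who wants to see the idea in code, here is a toy Python sketch of merging by feature ID. It is not how `qiime feature-table merge` is implemented; the nested-dict table layout, the `merge_tables` and `md5_id` helpers, and the sample names and counts are all made up for illustration. It assumes feature IDs are MD5 hex digests of the sequences and that sample IDs do not overlap between runs.

```python
import hashlib


def md5_id(sequence):
    # Assumption: feature ID = MD5 hex digest of the sequence string.
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


def merge_tables(*tables):
    """Merge {feature_id: {sample_id: count}} tables with disjoint samples."""
    merged = {}
    seen_samples = set()
    for table in tables:
        samples = {s for counts in table.values() for s in counts}
        overlap = seen_samples & samples
        if overlap:
            # Refuse to guess how to combine counts for a repeated sample.
            raise ValueError(f"Overlapping sample IDs: {sorted(overlap)}")
        seen_samples |= samples
        for fid, counts in table.items():
            # Same feature ID from different runs lands in the same row.
            merged.setdefault(fid, {}).update(counts)
    return merged


shared = md5_id("CATCATCAT")
run1 = {shared: {"soil_A": 10}, md5_id("GGGTTTAAA"): {"soil_A": 3}}
run2 = {shared: {"soil_B": 7}, md5_id("ACGTACGTA"): {"soil_B": 5}}

merged = merge_tables(run1, run2)
print(len(merged))     # 3 features, not 4
print(merged[shared])  # {'soil_A': 10, 'soil_B': 7}
```

The shared ASV ends up as one row with counts from both runs, which is why abundance and diversity are not inflated by duplicate labels.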

Best,
Justine

lca123 · November 11, 2019, 4:58pm

Thank you again, Justine.
That is awesome.
All the best,

yanxianl · November 11, 2019, 6:52pm

I wasn’t aware that the same hashed feature ID may encode more than one ASV. Justine is right that the exact sequence is a better unique identifier for the ASV, which is what you get when you run DADA2 in R. If you prefer not to use hashed feature IDs for merging feature tables, hashing can be disabled when you denoise your reads using DADA2 or Deblur in QIIME 2.
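
If you keep hashed IDs and want some reassurance, one option is to check the representative sequences from all runs for collisions before merging. The Python sketch below is only an illustration: `find_collisions` is a hypothetical helper, the sequences are toy data, and it assumes the ID is the MD5 hex digest of the sequence string.

```python
import hashlib
from collections import defaultdict


def find_collisions(sequences):
    """Return {feature_id: set_of_sequences} for IDs shared by >1 distinct sequence."""
    by_id = defaultdict(set)
    for seq in sequences:
        fid = hashlib.md5(seq.encode("utf-8")).hexdigest()
        by_id[fid].add(seq)
    return {fid: seqs for fid, seqs in by_id.items() if len(seqs) > 1}


run1_seqs = ["CATCATCAT", "GGGTTTAAA"]
run2_seqs = ["CATCATCAT", "ACGTACGTA"]

collisions = find_collisions(run1_seqs + run2_seqs)
print(collisions)  # {} means no two distinct sequences share an ID
```

An empty result means every hashed ID maps to exactly one sequence, so merging on those IDs loses nothing.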
