Where the output name comes from after merging feature tables and sequences

Nicholas_Bokulich · November 12, 2018, 7:42pm

Aha, but it is that simple!

Feature tables are merged on the sample/feature IDs — so if you have two feature tables that share feature IDs and sample IDs, those features and samples will be merged into one. The merge method does not even look at the sequences themselves, only the ID.
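To make the ID-only behavior concrete, here is a minimal sketch in plain Python. It is not the QIIME 2 implementation: the nested-dict table layout, the `merge_tables` helper, and the choice to sum counts for overlapping IDs are all assumptions made for illustration.

```python
from collections import defaultdict

def merge_tables(*tables):
    """Merge {sample_id: {feature_id: count}} tables purely by ID.

    Hypothetical helper: counts for a shared sample/feature ID pair are
    summed (an assumption), and sequences are never inspected.
    """
    merged = defaultdict(lambda: defaultdict(int))
    for table in tables:
        for sample_id, features in table.items():
            for feature_id, count in features.items():
                merged[sample_id][feature_id] += count
    # Convert back to plain dicts for easy printing and comparison.
    return {s: dict(f) for s, f in merged.items()}

# Two toy tables that share sample "s1" and feature "featA".
table_a = {"s1": {"featA": 10, "featB": 2}}
table_b = {"s1": {"featA": 5}, "s2": {"featC": 7}}

merged = merge_tables(table_a, table_b)
print(merged)
```

Anything with the same sample ID and feature ID lands in the same cell, whether or not it truly is the same sample or sequence.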

So the better question is — how do dada2 and deblur generate feature IDs that are unique identifiers for specific sequences, allowing this merge to happen? The feature IDs are just md5 hashes of the sequences themselves, as discussed here. Thus, you take an identical sequence, generate its md5 hash, and you will always get the same output. Additionally, there is an extremely low likelihood that any two sequences will have the same exact md5.
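As a rough illustration of why identical sequences always get identical IDs, the sketch below hashes a sequence with Python's standard `hashlib`. The `feature_id` helper and the choice to hash the UTF-8 bytes of the plain sequence string are assumptions for this example, not a statement of exactly what dada2 or deblur do internally.

```python
import hashlib

def feature_id(sequence: str) -> str:
    """Return the md5 hex digest of a sequence string (illustrative helper)."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()

seq = "ACGTACGTACGT"

# Hashing the same sequence in two separate "runs" gives the same ID.
id_run1 = feature_id(seq)
id_run2 = feature_id(seq)

print(id_run1)
print(id_run1 == id_run2)
```

Because the ID depends only on the sequence, two independently generated feature tables can agree on IDs without ever being compared to each other.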

That level of sensitivity is problematic if your feature tables were generated with slightly different parameters. Your sequences could be identical except one feature table's sequences are 1 bp longer... well, these identical sequences now have unique md5s and cannot be merged.
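A small sketch of the 1 bp problem, again using a hypothetical `feature_id` helper built on `hashlib` (the exact hashing input used by the denoisers is assumed here). The toy sequences are made up.

```python
import hashlib

def feature_id(sequence: str) -> str:
    """Return the md5 hex digest of a sequence string (illustrative helper)."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()

short = "ACGTACGTACGT"   # e.g. from a run with a shorter trim length
longer = short + "A"     # same read, 1 bp longer

# The extra base changes the hash, so the two IDs no longer match.
ids_differ = feature_id(short) != feature_id(longer)

# Cutting the longer sequence back to the shared length restores the match.
ids_match_after_trim = feature_id(longer[:len(short)]) == feature_id(short)

print(ids_differ, ids_match_after_trim)
```

This is why using the same trimming and truncation parameters across runs matters if you plan to merge the results.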

Merging is also a problem with, e.g., OTU clustering methods, since their feature IDs are arbitrary rather than derived from the sequences.

Now back to the merge method. It really is that simple and foolproof to merge features. Samples, however, rely entirely on the assumption that you know what you are doing. This is a problem if you recycle sample IDs and hence have duplicates shared between tables that do not actually represent the same sample (so names like "sample 1" are bad, and "eye_of_newt_named_fred" may be a little better; better yet, generate unique sample IDs to massively diminish the chances of overlap).

I hope that helps!