Where do the output names come from after merging feature tables and sequences?

This is a question about where the merged feature table and sequence names come from.
Let’s say I have 10 lanes of amplicon data. I have processed each lane of data separately to the point where I have a denoised frequency table and representative sequence set (feature table and feature sequences).

It looks like the FMT tutorial suggests that merging these tables and sequences is as simple as:

qiime feature-table merge \
  --i-tables table-1.qza \
  --i-tables table-2.qza \
  ... {more tables} \
  --i-tables table-10.qza \
  --o-merged-table table.qza

qiime feature-table merge-seqs \
  --i-data rep-seqs-1.qza \
  --i-data rep-seqs-2.qza \
  ... {more sequences} \
  --i-data rep-seqs-10.qza \
  --o-merged-data rep-seqs.qza

As I go about merging these data, I am wondering what name will be preserved for each of the representative sequences. DADA2 used to name these iSeqs, and I think newer versions renamed them ASVs. What happens when two (or more) identical sequences from two (or more) datasets, each carrying a different ASV name, get merged?

Is the documentation in this program implying that the first feature id is what is retained?

If different feature data is present for the same feature id in the inputs, the data from the first will be propagated to the result.

Thus, if some representative sequence were present in feature tables 1, 2, and 8, would the ASV name assigned in feature table 1 be the one applied to that sequence from feature tables 2 and 8 as well?

It can’t be quite that simple though, because there is the possibility that redundant ASV names are applied in each feature table, but those ASVs don’t have to represent the same sequence variant. Given that you savvy QIIMErs have solved every problem I’ve ever thought up (and more!) I’m wondering if you can help me understand the relationship between the input ASV names and the resulting merged table and sequence names. Is there no relationship?


Aha, but it is that simple!

Feature tables are merged on the sample/feature IDs — so if you have two feature tables that share feature IDs and sample IDs, those features and samples will be merged into one. The merge method does not even look at the sequences themselves, only the ID.
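To make the "IDs only" point concrete, here is a minimal sketch of an ID-based merge in plain Python. It is not QIIME 2's implementation: the nested-dict table layout, the `merge_tables` helper, and the sample and feature names are all made up for illustration, and summing counts when the same sample/feature pair appears twice is an assumption of this sketch (how the real command treats overlapping IDs depends on its overlap settings).

```python
from collections import defaultdict


def merge_tables(*tables):
    """Merge {sample_id: {feature_id: count}} tables purely by ID.

    No sequence is ever inspected; two features are "the same"
    only if their IDs are the same string.
    """
    merged = defaultdict(lambda: defaultdict(int))
    for table in tables:
        for sample_id, counts in table.items():
            for fid, count in counts.items():
                # Assumption of this sketch: overlapping entries are summed.
                merged[sample_id][fid] += count
    return {sample: dict(counts) for sample, counts in merged.items()}


# Hypothetical tables from two lanes that share one feature ID.
lane1 = {"lane1_sampleA": {"feat_x": 10, "feat_y": 3}}
lane2 = {"lane2_sampleB": {"feat_x": 7, "feat_z": 1}}

merged = merge_tables(lane1, lane2)
feature_ids = sorted({fid for counts in merged.values() for fid in counts})
print(feature_ids)  # ['feat_x', 'feat_y', 'feat_z']
```

The shared ID `feat_x` ends up as a single feature with counts in both samples, while `feat_y` and `feat_z` stay separate.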

So the better question is — how do dada2 and deblur generate feature IDs that are unique identifiers for specific sequences, allowing this merge to happen? The feature IDs are just md5 hashes of the sequences themselves, as discussed here. Thus, you take an identical sequence, generate its md5 hash, and you will always get the same output. Additionally, there is an extremely low likelihood that any two sequences will have the same exact md5.

That level of sensitivity is problematic if your feature tables were generated with slightly different parameters. Your sequences could be identical except that one feature table’s sequences are 1 bp longer… well, these otherwise identical sequences now have different md5s and cannot be merged.
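Both behaviours are easy to see with Python's standard `hashlib`. This is only a sketch: it assumes the feature ID is the hex md5 digest of the sequence string exactly as written, and the `feature_id` helper and the sequences are invented for the example.

```python
import hashlib


def feature_id(sequence: str) -> str:
    """Hex md5 digest of the sequence string (assumed ID scheme for this sketch)."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


# Hypothetical sequences, made up for illustration.
run1_seq = "ACGTTGCAACGTTGCA"
run2_seq = "ACGTTGCAACGTTGCA"   # same variant, recovered in another run
run2_longer = run2_seq + "A"    # same variant, trimmed 1 bp longer

same_id = feature_id(run1_seq) == feature_id(run2_seq)
longer_id_matches = feature_id(run1_seq) == feature_id(run2_longer)

print(same_id)            # True: identical strings always hash identically
print(longer_id_matches)  # False: one extra base gives an unrelated ID
```

So if two runs were trimmed or truncated differently, the merged table will carry the "same" variant as two separate features.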

This is also a problem with, e.g., OTU clustering methods, since the feature IDs are arbitrary.

Now back to the merge method. It really is that simple and foolproof to merge features. Samples, however, rely entirely on the assumption that you know what you are doing. This is a problem if you recycle sample IDs and hence have duplicates shared between tables that do not actually represent the same sample (so names like “sample 1” are bad, and “eye_of_newt_named_fred” may be a little better; better yet, generate unique sample IDs to massively diminish the chances of overlap).
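A quick sanity check before merging is to look for sample IDs that occur in more than one table. The sketch below works on plain lists of IDs; the `shared_sample_ids` helper and the ID lists are hypothetical, and getting the IDs out of your own tables is left to whatever tooling you already use.

```python
from collections import Counter


def shared_sample_ids(*tables_sample_ids):
    """Return the sample IDs that appear in more than one table."""
    # set(ids) so that a repeat within one table is not counted twice.
    counts = Counter(sid for ids in tables_sample_ids for sid in set(ids))
    return sorted(sid for sid, n in counts.items() if n > 1)


# Hypothetical sample IDs from two lanes.
lane1_ids = ["sample1", "sample2"]
lane2_ids = ["sample1", "eye_of_newt_named_fred"]

print(shared_sample_ids(lane1_ids, lane2_ids))  # ['sample1']
```

Anything this reports is either genuinely the same sample sequenced twice or a recycled name that should be changed before the merge.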

I hope that helps!
