taxonomy mismatch for the same feature ID

Shivani2211 · September 19, 2019, 4:59pm

Hello everyone,

I have two sets of data: group 1 (paired-end) and group 2 (single-end). After performing taxonomy analysis (Greengenes database) on both data sets independently, I found some feature IDs common to both, e.g. feature ID 3907. But the taxon reported for the same feature ID is resolved to different levels in the two cases. Why so? If the feature ID is the same, shouldn't the taxonomic resolution be the same in all cases?
[Attached images: Group 1 and Group 2 taxonomy]

Thanks so much!

colinbrislawn · September 19, 2019, 5:16pm

Good afternoon,

How did you build your feature table? It sounds like you processed these data sets totally separately, maybe using vsearch?

I ask because sometimes feature IDs should match between different runs, but other times they will be totally different. With vsearch, feature IDs are not going to match between runs, and it's a coincidence if they do. So you would expect totally different IDs and taxonomy between runs.

Colin

Shivani2211 · September 20, 2019, 2:30am

Hi Colin,

Thanks for your reply! Yes, I processed the data separately, as one group was single-end and the other was paired-end.

I understand there will be different feature IDs in different runs. But after obtaining the feature ID, why am I getting different taxonomy for the same feature ID? As attached, group 1 and group 2 have two different taxonomies for the same feature ID 3907 (Greengenes). It seems a little off here.

colinbrislawn · September 20, 2019, 4:47am

The feature IDs are made in an arbitrary order, maybe starting with feature 1 as the most common sequence. So if you have features 1, 2, 3 on runA, you will have features 1, 2, 3 on runB, but... these will be the most common three features on each separate run.
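
To make the point above concrete, here is a minimal Python sketch of abundance-ranked numbering. It only illustrates the scheme described in this reply; the function name and the toy reads are made up, and it is not the code of any particular clustering tool.

```python
from collections import Counter

def number_by_abundance(reads):
    """Assign IDs 1, 2, 3, ... to distinct sequences, most common first."""
    counts = Counter(reads)
    # most_common() sorts by count, highest first
    return {i: seq for i, (seq, _) in enumerate(counts.most_common(), start=1)}

# Two toy runs (hypothetical reads) processed independently
run_a = ["AAAA", "AAAA", "AAAA", "CCCC", "CCCC", "GGGG"]
run_b = ["TTTT", "TTTT", "TTTT", "AAAA", "AAAA", "CCCC"]

ids_a = number_by_abundance(run_a)
ids_b = number_by_abundance(run_b)

# Feature 1 exists in both runs but points at different sequences
print(ids_a[1], ids_b[1])
```

Both runs contain a feature 1, yet the ID refers to a different sequence in each, so the taxonomy attached to it can differ too.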

You would not expect people with the same birthday to have the same name! In the same way, the taxonomy / name for each feature really matters. The feature ID / birthday is a little bit random, and you can safely ignore it.

Colin

Shivani2211 · September 20, 2019, 5:35am

I understood it backwards .. I thought feature IDs are unique to every taxonomy.
If this is the case, I have two more questions:

1. If I need to compare two sets of data for their diversity, I need to merge their feature tables and rep-seqs, right? Does this merging happen based on common feature IDs or on common feature names?
2. Can you please let me know what I need to do to compare reads from single-end data and paired-end data? I have already tried (i) merging the paired-end seqs first and then denoising them like single-end reads, (ii) denoising the paired-end and single-end data individually and then merging, and (iii) both Deblur and DADA2 for denoising.

I am sorry to ask too many questions

colinbrislawn · September 20, 2019, 1:40pm

Lots of great questions! Let's dive in!

Yes, especially for beta diversity which requires shared feature IDs to have meaningful results.

Yes, this happens based on IDs (not on taxonomy), which of course is a problem because your IDs are for totally unrelated sequences.

This is what I would try first, because dada2 creates matching feature IDs for the same reads, even if the reads are on different runs!!

So try running dada2 on each run separately, then use feature-table merge to combine them and see if your features do match across runs, like they should.
https://docs.qiime2.org/2019.7/plugins/available/feature-table/merge/
(To help make sure features merge, make sure your reads are all the same length and cover the same region before joining. If the paired-end reads are longer, then they won't merge with the single-end reads.)
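
A short Python sketch of why this works, and why read length matters. It assumes the feature ID is an MD5 hash of the sequence itself, which is my understanding of the default behavior in q2-dada2 but should be treated as an assumption here; the function name and sequences are hypothetical.

```python
import hashlib

def hashed_feature_id(sequence):
    """Derive an ID from the sequence alone (assumed MD5, as a sketch)."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()

# Hypothetical reads: the same ASV seen in two runs, plus a longer variant
seq_run1 = "ACGTACGTAC"
seq_run2 = "ACGTACGTAC"
seq_run2_longer = "ACGTACGTACGG"

# Identical sequences hash to the same ID, regardless of which run they came from
same = hashed_feature_id(seq_run1) == hashed_feature_id(seq_run2)

# A longer read from the same organism is a different string, so a different ID
differs = hashed_feature_id(seq_run1) != hashed_feature_id(seq_run2_longer)

print(same, differs)
```

Because the ID depends only on the sequence, trimming both data sets to the same length and region is what lets the IDs line up when the tables are merged.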

Keep these great questions coming and let me know what you find!
Colin

Shivani2211 · September 22, 2019, 8:42am

I must say I am really grateful for the quick responses I get from you @colinbrislawn! Thanks so much!

My initial question was based on this. After separate runs of DADA2 and feature-table merge, I found a few shared feature IDs. The results are as follows:

grp-1

grp-2

grp-1+grp-2-merged

I suppose that, because grp-1 and grp-2 resolved feature ID 3907 to different taxonomic levels, the merge could not decide which level to assign and just left it as k_Bacteria. This is not a good merging case (I guess), as it should have merged to the deepest overlapping taxonomy, in this case down to g_Treponema.

Another example is:

grp-1

grp-2

grp-1+grp-2-merged

This is another poor merge, because k_Bacteria is not common to both groups. It should have been classified as "Unassigned" (I guess).

For my purpose, I am using the individual tables and merging them manually based on the taxonomy intersection. But as a user, I really hope the table-merging feature can be changed to take both the feature ID and the taxonomy classification into account, so that the final result will be more reliable (and save the user from extra coding effort).
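
The manual "taxonomy intersection" described above could look roughly like the following Python sketch. The function name is hypothetical and the lineage strings are illustrative Greengenes-style examples, not the values from the attached tables.

```python
def shared_lineage(tax_a, tax_b, sep=";"):
    """Return the ranks two lineages agree on, starting from kingdom."""
    ranks_a = [r.strip() for r in tax_a.split(sep) if r.strip()]
    ranks_b = [r.strip() for r in tax_b.split(sep) if r.strip()]
    shared = []
    for a, b in zip(ranks_a, ranks_b):
        if a != b:
            break  # stop at the first rank where the two lineages disagree
        shared.append(a)
    return "; ".join(shared) if shared else "Unassigned"

# Illustrative lineages (assumed, not taken from the screenshots)
grp1 = ("k__Bacteria; p__Spirochaetes; c__Spirochaetes; "
        "o__Spirochaetales; f__Spirochaetaceae; g__Treponema; s__")
grp2 = ("k__Bacteria; p__Spirochaetes; c__Spirochaetes; "
        "o__Spirochaetales; f__Spirochaetaceae; g__Treponema")

merged = shared_lineage(grp1, grp2)
no_overlap = shared_lineage("k__Bacteria; p__Firmicutes",
                            "k__Archaea; p__Euryarchaeota")
print(merged)
print(no_overlap)
```

This keeps the deepest rank both groups agree on and falls back to "Unassigned" when even the kingdom differs. Note that it only makes sense if the shared feature ID really refers to the same sequence in both groups.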

Thanks!

colinbrislawn · September 23, 2019, 12:27pm

Thanks for posting all of that. This is very helpful.

You are 100% correct.

Actually, it's a little worrying that the taxonomy didn't go all the way down to g_Treponema. I thought it would do that but I might be wrong...

There has got to be a more robust solution here, so let's call in the devs and see what they advise. @thermokarst, can you assign this to someone?

Colin

thermokarst · September 23, 2019, 2:23pm

Hi @Shivani2211!

How are you getting these Feature IDs? DADA2 doesn't have a way to directly produce numerical feature IDs like this. Please provide an Artifact or Visualization, if possible (that way we can look at provenance).

Merging FeatureTable[Frequency] should have no impact on the separate (but related) FeatureData[Sequence] and FeatureData[Taxonomy]. The only way this would be a problem is if you have overlapping Feature IDs in two different tables that refer to two entirely different ASVs:

Run 1

Feature ID: 4567
Representative Sequence: AAAAAA

Run 2

Feature ID: 4567
Representative Sequence: ACGTTT

These are two completely different ASVs, that happen to use the same Feature ID. This is one of the reasons why we use globally unique (or as close as possible) feature IDs. Does this make sense?
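
One way to check for this situation before merging is to compare the representative sequences behind every shared ID. The Python sketch below uses plain dictionaries as a stand-in for the rep-seqs of each run; the function name and the second feature ("1234") are hypothetical additions for illustration.

```python
def conflicting_ids(rep_seqs_1, rep_seqs_2):
    """Return shared feature IDs whose sequences differ between two runs."""
    shared = rep_seqs_1.keys() & rep_seqs_2.keys()
    return {
        fid: (rep_seqs_1[fid], rep_seqs_2[fid])
        for fid in sorted(shared)
        if rep_seqs_1[fid] != rep_seqs_2[fid]
    }

# Feature 4567 mirrors the example above; 1234 is a made-up matching feature
run_1 = {"4567": "AAAAAA", "1234": "GGGCCC"}
run_2 = {"4567": "ACGTTT", "1234": "GGGCCC"}

conflicts = conflicting_ids(run_1, run_2)
print(conflicts)
```

Any ID that shows up in the result is unsafe to merge on, because the two tables use it for different ASVs.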

So, if you can provide more information about your workflow, and how you are creating these FeatureTable[Frequency] artifacts, we can point out where things are going wrong. Thanks!

Shivani2211 · September 23, 2019, 3:33pm

Hi @thermokarst,

Thanks for your response!

So, my workflow is as follows:

DADA2 (individual groups) --> cluster to 99% individually (Greengenes) --> merge the clustered groups --> taxonomy assignment and downstream analysis

Yes, this is correct. The question is, since we have two different ASVs represented by the same Feature ID, how does the merging of the representative sequences happen? Let me show my run data:

grp-1:
- rep-seq
- taxonomy

grp-2:

grp-1+grp2-merged

As we can see, the merged data has kept the longest rep-seq (462) for the common feature ID, i.e. the one from grp-1. However, the taxonomy assigned to that same rep-seq differs between the two cases: in grp-1 it goes down to the genus level, while in the merged data it stops at the kingdom level.

Is my merging after clustering causing the issue? Right now I am trying another way: merging the ASVs first and then clustering. Waiting for the results.

Thanks,
Shivani

thermokarst · September 26, 2019, 2:10pm

Hi @Shivani2211, before I provide a detailed answer, can you please clarify, are you performing open-reference OTU clustering or closed-reference OTU clustering?

Shivani2211 · September 26, 2019, 3:24pm

Hi @thermokarst,

I have used 99% closed-reference OTU clustering against the Greengenes database.

Thanks!

thermokarst · October 2, 2019, 11:26pm

If you are performing closed-reference OTU clustering, then how can the feature IDs be the same between the two runs? I think you should take a closer look at what you have done. Feel free to share some QZAs or QZVs with us, we can use them to interrogate provenance.

system · November 3, 2019, 5:26am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.