Orders of Feature IDs in Feature Table and in Taxonomy are not the same - what are they based on?

Hi everyone,

I’m trying to produce a file that has the following components (columns):

  • Feature ID
  • Feature Frequency
  • Taxon name
  • Classifier’s confidence value

So far I managed to get the above data from two files:

  • A CSV file from Feature Table, from the first tab on the QZV file that says “Frequency per feature detail” (feature-frequency-detail.csv). This file contains Feature ID and Feature Frequency.

  • A TSV file from FeatureData[Taxonomy], which is the output of qiime feature-classifier classify-sklearn. This file contains Feature ID, Taxon, and Confidence value.

…but I found that the Feature IDs in these two files are not in the same order all the way down. Specifically, they match when each frequency value belongs to only one feature (up to row #234 in the image below), but after that, where several features share the same frequency, there does not seem to be any order (please see the image below).

Does anyone know if this is just random or if they’re sorted by some rules? If so, what are the rules? (I need to have this information for the downstream analyses).

Thank you so much for your kind help!
I’m so grateful for this forum - you guys are amazing!

@fgara, the data you’ve shared don’t look randomly sorted to me. A close look at the data will probably answer your question.

I popped open a FeatureData[Taxonomy] of mine, and it looks like it’s sorted by Feature ID. I say this because the Feature IDs are all in alphabetical order, while the Confidence values and Taxon strings are not.

Similarly, the csv you pulled from the FeatureTable[Frequency] viz appears to be sorted by frequency. (That column is in descending order, but the Feature ID column is not).

Have you tried sorting both sheets by Feature ID before combining them? Unless one of the sheets contains duplicate feature IDs (and I believe they shouldn’t), I suspect that will do it.
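
If it helps to see the idea outside a spreadsheet, here is a minimal sketch in Python. The Feature IDs, taxon labels, and numbers are made up for illustration (real Feature IDs are long hashes), and the two lists simply stand in for the exported files. It shows that two tables sorted by different columns line up once both are sorted by Feature ID.

```python
# Hypothetical rows standing in for the two exports (all values are made up).
# Like the frequency CSV: sorted by frequency, descending; ties in no particular ID order.
frequency_rows = [("c3", 120), ("e5", 45), ("a1", 45), ("b2", 7)]

# Like the taxonomy TSV: sorted by Feature ID.
taxonomy_rows = [
    ("a1", "Taxon A", 0.99),
    ("b2", "Taxon B", 0.87),
    ("c3", "Taxon C", 0.95),
    ("e5", "Taxon E", 0.71),
]

# Sort both tables on the same key (the Feature ID in position 0).
freq_sorted = sorted(frequency_rows, key=lambda row: row[0])
tax_sorted = sorted(taxonomy_rows, key=lambda row: row[0])

# After sorting, the ID columns should agree row by row.
ids_match = [f[0] for f in freq_sorted] == [t[0] for t in tax_sorted]
print(ids_match)
```

If `ids_match` comes out `False` on real data, one sheet probably has IDs the other lacks, and that is worth checking before pasting columns side by side.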

Big picture, what are you trying to accomplish by exporting and combining data? There might be an easier way to do so through QIIME 2, which won’t break your study’s provenance.

Best,
Chris

Hi @ChrisKeefe,

Thanks for your reply! :slight_smile:
Yes, I understand that each file is sorted by a different column. What puzzles me is that the Feature ID order matches between the two files only where a frequency value has a single entry; where several features share the same frequency, the order differs.

What I’m trying to do:
I want to have the following in the same file:

  • Feature ID
  • Feature Frequency
  • Taxon name
  • Classifier’s confidence value

That way I can sort them by feature frequency (and maybe by confidence value too) and get the taxon names of the features in the top 10% and the bottom 10% by frequency.

Is there a way to do this in QIIME 2?

Many thanks for your time and generosity :pray: :slight_smile:

Sorry, @fgara, I should have been more specific when I asked what your goals were. I was wondering what your big-picture goals were in generating this file. (e.g. are you using this diagnostically, to make decisions about some downstream analysis, are you trying to generate some finished data for a publication, etc.) QIIME 2 doesn’t have anything purpose-built to create a report like this, but if what you need is just this spreadsheet, we can make that work.

Here is what I would do, in any spreadsheet software. (You could also tackle this with Python, R, or any other programming language, but since you’ve got spreadsheets already, let’s roll with that for now.)

  1. First, match your data based on the Feature ID column, keeping only one copy of that column.
  2. Sort by whatever you like, after you have combined your data.

In greater detail: you know that each Feature ID is unique. Each of those unique features has three pieces of data you’re interested in (frequency, taxon, and confidence), but they are split across two different spreadsheets. You have two choices here:

  1. Sort both sheets by Feature ID (most spreadsheet programs have a Data->Sort tool), and then paste the data into one combined sheet as you’ve attempted. Because you’re sorting by Feature ID, the two Feature ID columns should match all the way down. Assuming this is the case, you can delete one of the duplicate columns, and move on to sorting your data. If it isn’t, try again carefully, or report back here. This is what I would do.
  2. Alternatively, you can use formulas to populate your new sheet. The syntax may vary based on what program you use, but you would create a new column in your FeatureData tsv, and add feature frequencies to that column based on Feature ID matches. This might require less manual checking, but would probably take more thinking, and I’m lazy. :slight_smile:
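
For anyone who prefers the programming route, the second option (looking up each frequency by Feature ID) can be sketched in Python with only the standard library. This is a rough illustration, not a tested recipe for real exports: the column headers `Feature ID`, `Frequency`, `Taxon`, and `Confidence` are assumptions, as are the made-up rows, so check the headers in your own files and swap in real file handles for the in-memory text.

```python
import csv
import io

# Made-up stand-ins for the two exported files; headers are assumed, not verified.
frequency_csv = "Feature ID,Frequency\nc3,120\ne5,45\na1,45\nb2,7\n"
taxonomy_tsv = (
    "Feature ID\tTaxon\tConfidence\n"
    "a1\tTaxon A\t0.99\n"
    "b2\tTaxon B\t0.87\n"
    "c3\tTaxon C\t0.95\n"
    "e5\tTaxon E\t0.71\n"
)

# Build a lookup table: Feature ID -> frequency.
frequency_by_id = {
    row["Feature ID"]: int(float(row["Frequency"]))
    for row in csv.DictReader(io.StringIO(frequency_csv))
}

# Walk the taxonomy rows and attach the matching frequency to each one.
# A KeyError here means a Feature ID is missing from the frequency file.
combined = []
for row in csv.DictReader(io.StringIO(taxonomy_tsv), delimiter="\t"):
    feature_id = row["Feature ID"]
    combined.append({
        "Feature ID": feature_id,
        "Frequency": frequency_by_id[feature_id],
        "Taxon": row["Taxon"],
        "Confidence": float(row["Confidence"]),
    })

for record in combined:
    print(record)
```

Because the match is done by ID rather than by row position, the original sort order of either file does not matter.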

The key here is that you have to match your data on Feature ID before you can start sorting it. Once you’ve got a spreadsheet with only your four desired columns, you can sort it however you like.
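
As a sketch of that last sorting step, here is one way to pull out the top and bottom 10% by frequency in Python once the data are combined. The 20 records below are generated placeholders (hypothetical IDs, taxa, and frequencies), and breaking ties by Feature ID is just a choice made here so the result is repeatable; which tied features land inside the 10% cutoff is otherwise arbitrary.

```python
# Placeholder combined table: 20 hypothetical features with deliberately tied frequencies.
combined = [
    {
        "Feature ID": f"feat{i:02d}",
        "Frequency": (i % 5 + 1) * 10,
        "Taxon": f"Taxon {i}",
    }
    for i in range(20)
]

# Sort by frequency (descending), breaking ties by Feature ID (ascending)
# so that the order is deterministic.
ranked = sorted(combined, key=lambda r: (-r["Frequency"], r["Feature ID"]))

# Number of features in a 10% slice (at least one).
n = max(1, len(ranked) // 10)

top = ranked[:n]      # highest-frequency features
bottom = ranked[-n:]  # lowest-frequency features

print("Top 10%:", [(r["Feature ID"], r["Taxon"]) for r in top])
print("Bottom 10%:", [(r["Feature ID"], r["Taxon"]) for r in bottom])
```

With many tied frequencies, it may be more defensible to pick a frequency threshold and keep every feature at or beyond it, rather than a fixed number of rows.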

Best,
CK

Wow, thank you so much @ChrisKeefe for your detailed reply!
I will try to follow your suggestions.
Thank you once again!
