Sample-Metadata FeatureTable Graphs - question on unlabeled column

slh277 · January 9, 2018, 2:30pm

Hello,

When looking at my feature table, I noticed that some of the graphs have an unlabeled bar/column. For example, in the plot below, it is the column on the far right:

It seems to happen if there is missing data (marked by either a "." or a blank cell), or if there is text like "NA" when the other values are 0/1, as in the table example below from the Atacama tutorial FeatureTable (tutorial metadata link here):

a) Is this because of the missing data? (I ran the Keemei add-on to Google Sheets and there were no issues with the metadata table.)
b) Is the graphical representation indicating that the sample-metadata.tsv file is not being read correctly, or is it just giving an example?

Many thanks for your help!!

colinbrislawn · January 9, 2018, 7:06pm

Good morning,

I think you are absolutely correct, and one or both of these things are happening: either data is missing and is being shown in this blank column, and/or your metadata is not being read correctly by QIIME.

Would you be able to post a link to your metadata file? I understand that this might not be possible or data could be private, and there are other ways for us to solve this problem. But I think this would be the fastest way for us to look for clues.

Thanks,
Colin

slh277 · January 9, 2018, 7:56pm

Hi @colinbrislawn,

Thank you! Is there a particular way to format missing data in the spreadsheet?

I made a smaller version (the metadata file I have is quite large and yes, it is private) with some examples of variables that read correctly (column B) and ones that do not. Column C does not have missing data but has that same unlabeled column in the graphical output. Column D (a categorical variable) and column E (a binary variable) DO have missing data, indicated as blank, and also have the unlabeled column.

The abbreviated dataset is here: https://docs.google.com/spreadsheets/d/17IRsTXmI22TpNPba1PVbYdhU7HSTxO-Vv3bKN4HoPSQ/edit?usp=sharing

For the Atacama tutorial data, it is happening on columns K through T (i.e., variables pH through Temperature), which all seem to contain both numeric and character data. Perhaps this mixing of char/num data is also an issue?

colinbrislawn · January 9, 2018, 9:41pm

These are really great questions about how QIIME handles metadata. That sheet is a great way to explore what's happening after the sheet is imported, and a good way to keep private data private.

Let's see if we can get feedback from Jai @jairideout, the QIIME developer who also helped make Keemei...

Colin

jairideout · January 11, 2018, 2:27am

Hi @slh277 and @colinbrislawn! The NA values in the Atacama tutorial metadata (and the example metadata @slh277 linked) are being interpreted as missing values. The feature-table summarize plot is displaying the missing data in an unlabeled column (rightmost bar in the plot).

Up to this point, the QIIME 2 Metadata file format didn't specify how to format missing data. It turns out that NA, among other values, is currently being interpreted as "missing data" (there's an open issue tracking this bug). In the upcoming 2018.2 release, the QIIME 2 Metadata file format will allow empty cells to represent "missing data", and all other values (including NA) will be interpreted as actual metadata values. We decided to only support the empty cell for "missing data" in order to avoid cases where NA represents a real metadata value, such as "North America", in a study.

We'll follow up here when the new Metadata file format (with "missing data" support) lands in a release!

slh277 · January 11, 2018, 4:08pm

Great, thank you for the information @jairideout and @colinbrislawn! I'll look out for the 2018 release.

Can I ask one more question? For the graph below, there are 0, 1, and empty cell/blank values in the metadata file (for n=53 samples).

However, as you can see below, only ~seven 1's show up on the graph, and the remaining forty-five values are all seen as missing, despite there being ten empty cells, thirty-three 1s, and ten 0s. Shouldn't all the values, including the blanks despite not being 'missing' per se, be graphed together (respectively)? just wondering how to get around this...

Since empty cells are not being seen as "missing data", currently, what are the ways missing data can be seen for the time being? Would it be a good idea to replace all empty/missing data cells with "NA" for the time being (as NA is not a real metadata value in my dataset)?

Side note: in the next release, could missing data be indicated by empty cells as well as periods ("."), which is how missing data are generally handled in SAS? I realize SAS may not be as commonly used in this area, but just putting it out there...

jairideout · January 11, 2018, 7:40pm

That output looks pretty suspicious... you may have found a bug in feature-table summarize. Can you please share your test data set with me so that I can try to reproduce the plot locally? I'll need the metadata file, feature table, and the exact command you ran. Thanks!

In the current version of QIIME 2, both the empty cell and NA are interpreted as missing data, so you can use either. There are some other values that will currently be interpreted as missing data, and we'll fix that in the upcoming 2018.2 release as well. These "missing data" values are the default values supported by pandas.read_csv, which is the TSV parser used to load Metadata (we won't be using pandas to parse Metadata files in the next release, and only the empty cell will represent missing data). Here is a complete list of "missing data" values that are currently supported, in addition to the empty cell:

#N/A
#N/A N/A
#NA
-1.#IND
-1.#QNAN
-NaN
-nan
1.#IND
1.#QNAN
N/A
NA
NULL
NaN
n/a
nan
null

Whew!
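To see this behavior outside of QIIME 2, here is a minimal sketch using pandas directly. The tiny table and its column names (`region`, `ph`) are made up for illustration, and this only shows how the default `pandas.read_csv` settings treat these values, not the actual QIIME 2 loading code:

```python
import io

import pandas as pd

# Hypothetical metadata table (column names are invented for this example).
tsv = (
    "sample-id\tregion\tph\n"
    "s1\tNA\t7.1\n"
    "s2\tEU\tn/a\n"
    "s3\t\t6.8\n"
)

# With default settings, "NA", "n/a" and the empty cell are all parsed as
# missing values (NaN), so "NA" cannot be used as a real category.
default = pd.read_csv(io.StringIO(tsv), sep="\t", index_col="sample-id")

# Number of cells per column that pandas considers missing.
missing_by_default = default.isna().sum().to_dict()
print(missing_by_default)
```

Here `region` ends up with two missing cells (the literal "NA" and the empty cell) and `ph` with one (the "n/a"), which is the kind of thing that shows up as the unlabeled bar in the plot.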

Thanks for the suggestion! While it would be cool to support additional "missing data" values used by SAS or other software, we're only comfortable supporting the empty cell as "missing" to avoid clashes with values that users intend to represent actual data. Since QIIME 2 doesn't enforce any standards for representing metadata, we think the empty cell is the only "safe" value we can reliably use to represent missing data, because it couldn't possibly clash with a user's "real" data. While a period is unlikely to represent actual data, we can't guarantee that for all users, and if we support SAS "missing data", that opens the door to supporting other "missing data" values used by R, pandas (see above), [insert my favorite software tool here], etc. Due to the lack of standardization in the field (actually, across all fields using delimited file formats, yikes), we're avoiding taking any stance/preference on what values represent missing data. It's restrictive and won't make all users happy, but at least it'll be predictable, easy to document and educate users about, and hopefully lead to more reliable analyses.

Thanks for bringing up these ideas! I think our discussion here will be a useful reference for other users having similar questions.

slh277 · January 11, 2018, 8:20pm

That's great to know, thanks for posting that!

... totally understandable!! I appreciate your explanation.

I think perhaps my metadata sheet is at fault. It's very big (hundreds of variables and variable types) and unfortunately I do not have permission to share it here. Just to test, I took a subgroup of some of the buggy columns to make a mini-test dataset, to see if the smaller dataset also resulted in the same errors in the graphic box plots. It does not (the box plots represent all the values in all the correct ratios). So somewhere in my big dataset there must be some weird entries (left over from data migration between servers and devices, etc.) that are throwing things off. (Initially, I had just pasted the entire dataset into the .tsv sheet to get experience in running the moving pictures tutorial code on my dataset, but I will definitely take a closer look at everything.) Sorry I did not consider this earlier before posting on the forum!!

Many many thanks for your patience and helpful instruction. I will post the dataset here if I get permission at some point - thank you!

jairideout · January 12, 2018, 6:05pm

Thanks @slh277 for the updates and discussion! Let us know if you run into any more issues with feature-table summarize. One idea to check with your full data set is how many samples in your feature table are present in your metadata file. I think feature-table summarize ignores "extra" samples that are present in your metadata but not in the feature table -- that could be throwing the numbers off in those plots.
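As a rough way to check that, here is a small sketch in plain pandas (not a QIIME 2 command). The sample IDs and the `treated` column are hypothetical; in practice the IDs would come from your feature table and from the first column of your metadata file:

```python
import pandas as pd

# Hypothetical sample IDs found in the feature table.
table_ids = pd.Index(["s1", "s2", "s3"])

# Hypothetical metadata with two "extra" samples (s4, s5) and one empty cell.
metadata = pd.DataFrame(
    {"treated": ["1", "0", None, "1", "1"]},
    index=pd.Index(["s1", "s2", "s3", "s4", "s5"], name="sample-id"),
)

# Samples present in both, and samples present only in the metadata.
shared = metadata.index.intersection(table_ids)
extra_in_metadata = metadata.index.difference(table_ids)

# Value counts (missing values included) for the whole metadata column
# versus only the samples that are actually in the feature table.
counts_all = metadata["treated"].value_counts(dropna=False)
counts_shared = metadata.loc[shared, "treated"].value_counts(dropna=False)

print(len(shared), "shared samples;", len(extra_in_metadata), "only in metadata")
print(counts_shared)
```

If the counts for the shared samples differ a lot from the counts over the whole metadata column, the "extra" samples are a likely explanation for plots that don't match the spreadsheet.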

Happy QIIMEing!

system · February 13, 2018, 12:05am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.

jairideout · February 16, 2018, 5:09pm

In the QIIME 2 2018.2 release, Metadata interprets empty cells as missing data; the other names listed above will no longer be interpreted as missing data.
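For anyone who wants to preview how a column would look under the "only empty cells are missing" rule, the following is a rough emulation in pandas. QIIME 2 2018.2 does not use pandas to parse Metadata files, so treat this purely as an approximation for checking your own spreadsheet; the table and column names are made up:

```python
import io

import pandas as pd

# Hypothetical metadata table (column names are invented for this example).
tsv = (
    "sample-id\tregion\tph\n"
    "s1\tNA\t7.1\n"
    "s2\tEU\tn/a\n"
    "s3\t\t6.8\n"
)

# Turn off the pandas default missing-value list and treat only the
# empty cell as missing; "NA" and "n/a" stay as ordinary text values.
strict = pd.read_csv(
    io.StringIO(tsv),
    sep="\t",
    index_col="sample-id",
    keep_default_na=False,
    na_values=[""],
)

# Number of cells per column that are considered missing under this rule.
missing_strict = strict.isna().sum().to_dict()
print(missing_strict)
```

With these settings only the empty `region` cell for `s3` is counted as missing, and "NA" is kept as a real value.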

There are a number of other changes to QIIME 2 Metadata in the 2018.2 release. See this forum announcement for details on what changed, as well as the updated Metadata tutorial.