Sample-Metadata FeatureTable Graphs - question on unlabeled column

jairideout · January 11, 2018, 7:40pm

That output looks pretty suspicious... you may have found a bug in feature-table summarize. Can you please share your test data set with me so that I can try to reproduce the plot locally? I'll need the metadata file, feature table, and the exact command you ran. Thanks!

In the current version of QIIME 2, both the empty cell and NA are interpreted as missing data, so you can use either. There are some other values that will currently be interpreted as missing data, and we'll fix that in the upcoming 2018.2 release as well. These "missing data" values are the default values supported by pandas.read_csv, which is the TSV parser used to load Metadata (we won't be using pandas to parse Metadata files in the next release, and only the empty cell will represent missing data). Here is a complete list of "missing data" values that are currently supported, in addition to the empty cell:

#N/A
#N/A N/A
#NA
-1.#IND
-1.#QNAN
-NaN
-nan
1.#IND
1.#QNAN
N/A
NA
NULL
NaN
n/a
nan
null

Whew!
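For anyone who wants to see this behavior for themselves, here is a minimal sketch using pandas directly. It is not QIIME 2's actual loading code: the `load_metadata` helper and the sample TSV are hypothetical, and the exact set of default "missing data" strings can differ between pandas versions. The sketch only uses a few values from the list above (`NA`, `N/A`, `null`) plus an empty cell.

```python
import io

import pandas as pd

# Hypothetical sample metadata: three "missing-looking" strings, one truly
# empty cell, and one ordinary value.
TSV = "id\tvalue\ns1\tNA\ns2\tN/A\ns3\tnull\ns4\t\ns5\tcontrol\n"


def load_metadata(text, only_empty_is_missing=False):
    """Parse TSV text with pandas.read_csv (illustrative helper, not QIIME 2 API).

    With only_empty_is_missing=True, pandas' built-in list of missing-value
    strings is disabled and only the empty cell is treated as missing.
    """
    kwargs = {"sep": "\t"}
    if only_empty_is_missing:
        kwargs.update(keep_default_na=False, na_values=[""])
    return pd.read_csv(io.StringIO(text), **kwargs)


default = load_metadata(TSV)
strict = load_metadata(TSV, only_empty_is_missing=True)

# Per-row flags: True where pandas considers the cell missing.
default_missing = default["value"].isna().tolist()
strict_missing = strict["value"].isna().tolist()

print("default parsing:", default_missing)
print("empty cell only:", strict_missing)
```

With the default settings, `NA`, `N/A`, `null`, and the empty cell should all be flagged as missing; with `keep_default_na=False` and `na_values=[""]`, only the empty cell is, and the other strings are kept as literal text.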

Thanks for the suggestion! While it would be cool to support additional "missing data" values used by SAS or other software, we're only comfortable supporting the empty cell as "missing" to avoid clashes with values that users intend to represent actual data. Since QIIME 2 doesn't enforce any standards for representing metadata, we think the empty cell is the only "safe" value we can reliably use to represent missing data, because it couldn't possibly clash with a user's "real" data. While a period is unlikely to represent actual data, we can't guarantee that for all users, and if we support SAS "missing data", that opens the door to supporting other "missing data" values used by R, pandas (see above), [insert my favorite software tool here], etc. Due to the lack of standardization in the field (actually, across all fields using delimited file formats, yikes), we're avoiding taking any stance/preference on what values represent missing data. It's restrictive and won't make all users happy, but at least it'll be predictable, easy to document and educate users about, and will hopefully lead to more reliable analyses.

Thanks for bringing up these ideas! I think our discussion here will be a useful reference for other users having similar questions.