I noticed it seems to happen if there is missing data (marked by either a "." or a blank cell), or if there is text like "NA" when the other values are 0/1, as in the table example below from the Atacama tutorial FeatureTable (tutorial metadata link here):
a) Is this because of the missing data? (I ran the Keemei add-on to Google Sheets and there were no issues with the metadata table.)
b) Is the graphical representation indicating that the sample-metadata.tsv file is not being read correctly? Or is it just giving an example?
I think you are absolutely correct, and one or both of these things is happening: missing data is being shown in this blank column, and/or your metadata is not being read correctly by QIIME.
Would you be able to post a link to your metadata file? I understand that this might not be possible or data could be private, and there are other ways for us to solve this problem. But I think this would be the fastest way for us to look for clues.
Thank you! Is there a particular way to format missing data in the spreadsheet?
I made a smaller version (the metadata file I have is quite large and yes, it is private) with some examples of variables that read correctly (column B) and ones that do not. Column C does not have missing data but has that same unlabeled column in the graphical output. Column D (a categorical variable) and column E (a binary variable) DO have missing data, indicated as blank, and also have the unlabeled column.
For the Atacama tutorial data, it is happening on columns K through T (i.e., variables pH through Temperature), which all seem to have both numeric and character data. Perhaps this mixing of char/num data is also an issue?
Hi @slh277 and @colinbrislawn! The NA values in the Atacama tutorial metadata (and the example metadata @slh277 linked) are being interpreted as missing values. The feature-table summarize plot is displaying the missing data in an unlabeled column (rightmost bar in the plot).
Up to this point, the QIIME 2 Metadata file format didn’t specify how to format missing data. It turns out that NA, among other values, is currently being interpreted as “missing data” (there’s an open issue tracking this bug). In the upcoming 2018.2 release, the QIIME 2 Metadata file format will allow empty cells to represent “missing data”, and all other values (including NA) will be interpreted as actual metadata values. We decided to only support the empty cell for “missing data” in order to avoid cases where NA represents a real metadata value such as “North America”, etc. in a study.
We’ll follow up here when the new Metadata file format (with “missing data” support) lands in a release!
Can I ask one more question? For the graph below, there are 0, 1, and empty cell/blank values in the metadata file (for n=53 samples).
However, as you can see below, only ~seven 1s show up on the graph, and the remaining forty-five values are all seen as missing, despite there being ten empty cells, thirty-three 1s, and ten 0s. Shouldn't all the values, including the blanks (despite not being 'missing' per se), be graphed in their respective groups? Just wondering how to get around this...
Since empty cells are not being seen as "missing data", currently, what are the ways missing data can be seen for the time being? Would it be a good idea to replace all empty/missing data cells with "NA" for the time being (as NA is not a real metadata value in my dataset)?
Side note: in the next release, could missing data be indicated by empty cells as well as periods ("."), the way missing data are generally handled in SAS? I realize SAS may not be as commonly used in this area, but just putting it out there...
That output looks pretty suspicious… you may have found a bug in feature-table summarize. Can you please share your test data set with me so that I can try to reproduce the plot locally? I’ll need the metadata file, feature table, and the exact command you ran. Thanks!
In the current version of QIIME 2, both the empty cell and NA are interpreted as missing data, so you can use either. There are some other values that will currently be interpreted as missing data, and we’ll fix that in the upcoming 2018.2 release as well. These “missing data” values are the default values supported by pandas.read_csv, which is the TSV parser used to load Metadata (we won’t be using pandas to parse Metadata files in the next release, and only the empty cell will represent missing data). Here is a complete list of “missing data” values that are currently supported, in addition to the empty cell:
Thanks for the suggestion! While it would be cool to support additional “missing data” values used by SAS or other software, we’re only comfortable supporting the empty cell as “missing” to avoid clashes with values that users intend to represent actual data. Since QIIME 2 doesn’t enforce any standards for representing metadata, we think the empty cell is the only “safe” value we can reliably use to represent missing data, because it couldn’t possibly clash with a user’s “real” data. While a period is unlikely to represent actual data, we can’t guarantee that for all users, and if we support SAS “missing data”, that opens the door to supporting other “missing data” values used by R, pandas (see above), [insert my favorite software tool here], etc. Due to the lack of standardization in the field (actually, across all fields using delimited file formats, yikes), we’re avoiding taking any stance/preference on what values represent missing data. It’s restrictive and won’t make all users happy, but at least it’ll be predictable, easy to document and educate users about, and hopefully lead to more reliable analyses.
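As a rough illustration of the difference discussed above, the sketch below uses pandas directly (it is not the QIIME 2 Metadata loader itself) to parse a tiny made-up table twice: once with the pandas.read_csv defaults, where NA is read as missing, and once with keep_default_na=False and na_values=[""], so that only the empty cell counts as missing. The column names and values are invented for the example, with NA standing in for “North America”.

```python
import io

import pandas as pd

# Hypothetical metadata: "NA" in the site column means "North America",
# and the empty ph cell for s2 is a genuinely missing measurement.
tsv = (
    "sample-id\tsite\tph\n"
    "s1\tNA\t7.2\n"
    "s2\tEU\t\n"
    "s3\tNA\t6.8\n"
)

# Default parsing: pandas treats "NA" (and the empty cell) as missing.
default = pd.read_csv(io.StringIO(tsv), sep="\t")

# Stricter parsing: disable the default list and mark only "" as missing.
strict = pd.read_csv(
    io.StringIO(tsv), sep="\t", keep_default_na=False, na_values=[""]
)

# Count missing cells per column under each set of rules.
default_missing = {col: int(n) for col, n in default.isna().sum().items()}
strict_missing = {col: int(n) for col, n in strict.isna().sum().items()}

print("defaults:        ", default_missing)
print("empty cell only: ", strict_missing)
```

With the defaults, both NA entries in the site column are counted as missing; with the stricter settings, only the empty ph cell is.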
Thanks for bringing up these ideas! I think our discussion here will be a useful reference for other users having similar questions.
… totally understandable!! I appreciate your explanation.
I think perhaps my metadata sheet is at fault; it’s very big (hundreds of variables and variable types) and unfortunately I do not have permission to share it here. Just to test, I took a subgroup of some of the buggy columns to make a mini-test dataset, to see if the smaller dataset also resulted in the same errors in the graphic box plots - it does not (the box plots represent all the values in all the correct ratios). So somewhere in my big dataset there must be some weird entries (that are leftover from data migration between servers and devices, etc) that are throwing things off. (Initially, I had just pasted the entire dataset into the .tsv sheet to get experience in running the moving pictures tutorial code on my dataset, but I will definitely take a closer look at everything.) Sorry I did not consider this earlier before posting on the forum!!
Many many thanks for your patience and helpful instruction. I will post the dataset here if I get permission at some point - thank you!
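For anyone hunting for stray entries like the ones described above, one possible approach is to tally the raw values in every metadata column and flag anything with stray whitespace or non-printable characters. This is only a sketch using the Python standard library; the function names column_value_counts and suspicious_values and the sample table are hypothetical, and a real metadata file may need other checks.

```python
import csv
import io
from collections import Counter


def column_value_counts(tsv_text):
    """Return {column name: Counter of raw cell values} for a TSV string."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = {name: Counter() for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            # Short rows yield None for absent cells; treat those as empty.
            counts[name][row[name] or ""] += 1
    return counts


def suspicious_values(counts):
    """Flag values with leading/trailing whitespace or non-printable characters."""
    flagged = {}
    for name, counter in counts.items():
        odd = sorted(v for v in counter if v != v.strip() or not v.isprintable())
        if odd:
            flagged[name] = odd
    return flagged


# Made-up example: the second "1" carries a trailing space.
example = "sample-id\tsmoker\ns1\t1\ns2\t1 \ns3\t\ns4\t0\n"
counts = column_value_counts(example)
flagged = suspicious_values(counts)

print(dict(counts["smoker"]))
print(flagged)
```

Looking at the per-column tallies also makes it easy to spot unexpected categories, such as a binary column that turns out to hold more than two distinct non-empty values.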
Thanks @slh277 for the updates and discussion! Let us know if you run into any more issues with feature-table summarize. One idea to check with your full data set is how many samples in your feature table are present in your metadata file. I think feature-table summarize ignores “extra” samples that are present in your metadata but not in the feature table – that could be throwing the numbers off in those plots.
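A minimal way to run that check, assuming the sample IDs from the feature table and from the metadata file have already been exported as two plain lists (how they are extracted is left open here), is to compare them as sets. The helper name compare_sample_ids and the IDs below are made up for illustration.

```python
def compare_sample_ids(table_ids, metadata_ids):
    """Split sample IDs into shared, table-only, and metadata-only groups."""
    table_ids, metadata_ids = set(table_ids), set(metadata_ids)
    return {
        "shared": sorted(table_ids & metadata_ids),
        "only_in_table": sorted(table_ids - metadata_ids),
        "only_in_metadata": sorted(metadata_ids - table_ids),
    }


# Hypothetical IDs: the metadata lists two samples absent from the table.
table_ids = ["s1", "s2", "s3"]
metadata_ids = ["s1", "s2", "s3", "s4", "s5"]

report = compare_sample_ids(table_ids, metadata_ids)
for group, ids in report.items():
    print(f"{group}: {len(ids)} {ids}")
```

If only_in_metadata is large, the counts in the metadata spreadsheet will not match what the plots show, even when every cell is parsed correctly.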