In my rush to try out the new filter-seqs command I made a bit of a headache for myself by not having a header line in my metadata and losing my first sequence. I wasn’t using a where clause, so I was just using the first column as IDs. In this case, would it be possible to test the header line to see if it is a valid ID in the data to be filtered and either retain it (perhaps too clever) or throw a warning?
That’s a really interesting idea! The tough thing with TSV is there’s not a good way to know what is or isn’t a header line. But your idea of using the presence of your header label in the data you are trying to filter sounds like a pretty good clue that something went wrong. And I think this generalizes to metadata with columns (although people are much better at labeling in that situation).
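To make the suggestion concrete, here is a minimal sketch of the check in plain Python. It assumes the metadata is a simple TSV whose first column holds IDs and that the IDs of the data being filtered are already available as a set. The function name `header_looks_like_id` and the example IDs are hypothetical, and this is not how QIIME 2 itself loads metadata.

```python
import csv
import io
import warnings


def header_looks_like_id(metadata_tsv, data_ids):
    """Return True if the first cell of the first line is an ID found in the data.

    metadata_tsv: full text of a tab-separated metadata file (assumed format).
    data_ids: set of IDs present in the data being filtered.
    """
    reader = csv.reader(io.StringIO(metadata_tsv), delimiter="\t")
    first_row = next(reader, None)
    if not first_row:
        # Empty file or blank first line: nothing to compare.
        return False
    return first_row[0].strip() in data_ids


if __name__ == "__main__":
    # Hypothetical headerless file: the first line is really a sequence ID.
    headerless = "seq1\nseq2\n"
    if header_looks_like_id(headerless, {"seq1", "seq2", "seq3"}):
        warnings.warn("The header label matches an ID in the data; "
                      "the metadata file may be missing its header line.")
```

As noted in the thread, a check like this is only a clue: a real header could coincidentally match an ID, and a headerless file whose first ID is absent from the data would go unnoticed.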
That’s a neat idea, and thanks for getting in touch @Chris_Hemmerich!! We’d need to do some work on the framework to support this, because QIIME 2 handles the loading of metadata, which is the time when the header is processed. The framework would need to somehow understand what IDs are in a given data structure (e.g. biom.Table, skbio.DistanceMatrix, FASTA/FASTQ file, etc.) for the method/visualizer being executed.
This sounds possible to implement, but it’s a sizable task. Aside from the task size itself, I’d prefer not to support this behavior because QIIME 2’s metadata file format spec purposely only includes files with headers. It’s generally not a good idea to use headerless files in analyses (metadata or otherwise), so we’re trying to encourage users to associate a header with their data.
Note: we added support for headerless taxonomy files (e.g. mapping feature IDs to taxonomic annotations) but that was only for backwards-compatibility with existing QIIME 1 taxonomy files and reference database taxonomies. Since QIIME 1 didn’t support headerless metadata files we don’t have existing metadata that needs this backwards-compatibility.
Thanks for the prompt reply. It sounds like the work involved in generating the warning would be disproportionate to the benefit (which would not be 100% sensitive or specific anyway). But I still think the ability to create an ad hoc metadata file by simply listing some IDs, without specifying a column, creates a cognitive tripping hazard: it is easy to forget to also create an ad hoc header that QIIME 2 consumes but never uses.
What about being more strict and always requiring an index or where when filtering by metadata? So this example from the filtering tutorial:
I’m not sure you want to inflict that extra typing on people at this point in development, but it would allow you to catch headerless files up front and consistently and smooth over the special case of single column metadata.
Those are interesting ideas @Chris_Hemmerich! I agree that the current behavior is a potential cognitive tripping hazard. It also doesn’t match how qiime1 works with ID-based filtering, which is confusing.
I like your idea of having the user specify the ID column to filter by, but we actually already require the first column in a metadata file to be the ID column. Adding a --p-filter-by parameter would just be extra typing at that point, but it would technically solve the issue because qiime2 could raise a warning or error in that case.
Taking your idea one step further, we could modify the qiime2 metadata file format spec to require a specific column name (or a set of reasonable names) for the ID column (first column in the file). That’d let qiime2 detect missing or invalid headers and notify users appropriately. In practice this change shouldn’t affect too many users because the qiime1 #SampleID column name still seems to be used a lot. We could also support other common permutations, such as SampleID, FeatureID, etc. Those are just my thoughts though, no formal decision has been reached yet. What do you think?
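A rough sketch of what such a header check could look like, in plain Python. The set of accepted names below is only illustrative (it uses the names mentioned in this post), and `validate_id_header` is a hypothetical helper, not part of QIIME 2.

```python
# Illustrative set of accepted ID column names; the real list is undecided.
ALLOWED_ID_HEADERS = {"#SampleID", "SampleID", "FeatureID"}


def validate_id_header(header_line):
    """Return the ID column name, or raise if the first cell is not an accepted name.

    header_line: the first line of a tab-separated metadata file.
    """
    first_cell = header_line.rstrip("\r\n").split("\t")[0]
    if first_cell not in ALLOWED_ID_HEADERS:
        raise ValueError(
            "First column header %r is not a recognized ID column name; "
            "expected one of %s. Is the header line missing?"
            % (first_cell, sorted(ALLOWED_ID_HEADERS))
        )
    return first_cell
```

With a rule like this, a headerless file fails immediately, because its first line starts with an actual ID rather than one of the expected names.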
We’re also discussing internally how best to proceed, and I’ll follow up here when we’ve had a chance to stew on it. Thanks for bringing this up, this is definitely a tripping point we hadn’t considered!
I like the idea of constraining the first column header. Well, “like” isn’t the right word. Getting an error because I didn’t pick the right incantation of header name for a column that has to be an ID is slightly annoying when learning a new tool. But I’d rather be slightly annoyed than confused. As your examples of common permutations touch on, you could also use the ID header name to type metadata files, e.g. filter-samples could croak early if you passed it a metadata file with FeatureID as the first header column.
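Something like the following is what I have in mind for typing metadata by its ID header. This is only a sketch: the mapping and the `require_metadata_kind` helper are made up for illustration and are not existing QIIME 2 behavior.

```python
# Hypothetical mapping from ID column name to the kind of metadata it implies.
ID_HEADER_KINDS = {
    "#SampleID": "sample",
    "SampleID": "sample",
    "FeatureID": "feature",
}


def require_metadata_kind(id_header, expected_kind):
    """Raise early if the ID header implies a different kind of metadata.

    id_header: name of the first column in the metadata file.
    expected_kind: "sample" or "feature", depending on the command being run.
    """
    kind = ID_HEADER_KINDS.get(id_header)
    if kind is None:
        raise ValueError("Unrecognized ID column name: %r" % id_header)
    if kind != expected_kind:
        raise ValueError(
            "Expected %s metadata, but the ID column %r indicates %s metadata."
            % (expected_kind, id_header, kind)
        )
    return kind
```

A sample-filtering command would call `require_metadata_kind(header, "sample")` before doing any work, so a file starting with FeatureID is rejected up front.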
Thanks for taking the time to discuss this with me. Even if these ideas don’t end up being a net win, the discussion has been helpful.
I agree there’s definitely potential for annoyance with this trade-off between flexibility vs explicitness. Fortunately the “fix” is easy in this case for users, because it would require changing a single cell’s value to match one of the expected column names.
That would be so awesome and would solve a whole class of user errors. I’ll bring up your idea in our discussions, thanks @Chris_Hemmerich!