Bz2 FeatureData[AlignedSequence]

Stefan · February 22, 2018, 9:05pm

Hi Qiimer,
I am importing a multiple FASTA alignment into a FeatureData[AlignedSequence] artifact from a bz2-compressed version of the file align.txt (150.4 KB). The import works fine and produces alg.bz2.qza (10.3 KB), which is handy since I don't have to decompress first:

bzip2 -z align.txt
qiime tools import --input-path align.txt.bz2 --type "FeatureData[AlignedSequence]" --output-path alg.bz2.qza

However, using this artifact as input for the q2-fragment-insertion plugin leads to an error: the file extracted temporarily from the artifact is not raw FASTA but is still bz2-compressed, and the underlying bash script cannot handle that.

If I import the uncompressed alignment instead, everything works fine with the resulting artifact:
alg.raw.qza (13.6 KB)
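
For anyone who only has the .bz2 file at hand, the workaround above (importing the uncompressed alignment) can be sketched in Python with the standard library alone. This is a minimal sketch, not part of QIIME 2: the helper name decompress_bz2 is hypothetical, and the file names are just the ones used in this post.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src, dest):
    """Stream-decompress a .bz2 file to `dest` and return the destination path.

    Hypothetical helper for illustration; it is not a QIIME 2 function.
    """
    src, dest = Path(src), Path(dest)
    # bz2.open reads the compressed stream; copyfileobj avoids loading
    # the whole alignment into memory at once.
    with bz2.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

Calling decompress_bz2("align.txt.bz2", "align.txt") would give back a plain FASTA file that can then be passed to `qiime tools import --input-path align.txt` as in the uncompressed case.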

Is that a feature or a bug? Is it the plugin's responsibility to decompress the data, or should QIIME do that?

jairideout · February 23, 2018, 5:56pm

Hi @Stefan! FeatureData[AlignedSequence] currently only supports uncompressed FASTA files. I created an issue on q2-types to add .bz2 support. When defining a file format and transformers associated with a semantic type, it's up to the developer of the format/transformers to decide what compression schemes (if any) are supported. Thus, .bz2 support would need to be added to q2-types because that's where the format and transformers are defined.

Note: the FASTQ manifest formats are an example of a format supporting both uncompressed and gzipped FASTQ files (those formats are also defined in q2-types).

If this is a feature you'd like to have in QIIME 2, pull requests are welcome! Either way, we'll follow up here when the feature makes it into a QIIME 2 release, though I don't have an ETA (it's pretty low priority for us). Thanks!

Stefan · February 23, 2018, 6:15pm

I think we have a little misunderstanding. What I did was import a bz2-compressed FASTA file into a QIIME 2 artifact, which succeeded without any complaints. The import should have rejected this file because it cannot be parsed properly.

ebolyen · February 27, 2018, 1:00am

Thanks @Stefan,

I think scikit-bio is probably involved somewhere. It automatically detects the compression used, so bz2 would be transparent to a validator/sniffer implemented with it, which leads to the issue you are seeing.
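
To illustrate the mechanism, here is a rough sketch using only the Python standard library. It is not the actual scikit-bio or q2-types code; every function name below is hypothetical. A validator that opens files through a compression-aware reader accepts a bz2 file as valid FASTA, while a strict one that checks the leading magic bytes first would reject it.

```python
import bz2
import gzip

BZ2_MAGIC = b"BZh"        # every bzip2 stream starts with these bytes
GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these bytes


def detect_compression(path):
    """Return "bz2", "gzip", or None based on the file's leading bytes."""
    with open(path, "rb") as fh:
        head = fh.read(3)
    if head.startswith(BZ2_MAGIC):
        return "bz2"
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    return None


def open_transparently(path):
    """Open a file as text, decompressing on the fly if it is compressed."""
    kind = detect_compression(path)
    if kind == "bz2":
        return bz2.open(path, "rt")
    if kind == "gzip":
        return gzip.open(path, "rt")
    return open(path, "r")


def looks_like_fasta(handle):
    """True if the first non-blank line is a FASTA header line."""
    for line in handle:
        if line.strip():
            return line.startswith(">")
    return False


def lenient_validate(path):
    """Compression is invisible here, so a .bz2 FASTA file passes."""
    with open_transparently(path) as fh:
        return looks_like_fasta(fh)


def strict_validate(path):
    """Reject compressed input up front, then check the raw text."""
    if detect_compression(path) is not None:
        return False
    with open(path, "r") as fh:
        return looks_like_fasta(fh)
```

With the lenient variant the import succeeds and the compressed bytes are stored in the artifact as-is, which matches the behavior reported above; the strict variant is one possible way an import step could refuse such a file.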

In fact I think I see the cause here.

I've raised an issue on q2-types about this. Thanks for reporting!
