Determine minimum read length of demux artifact

I have a demux object and I want to determine what is the shortest read in it-- and in general be able to see the read lengths. How can I do this (preferably using the Artifact API)?

Hi @mahermassoud,

Sorry for the delayed reply.

Awesome! I think the following should do the trick (I'm assuming single-end, but there's an equivalent paired-end format):

from q2_types.per_sample_sequences import SingleLanePerSampleSingleEndFastqDirFmt, FastqGzFormat
import skbio.io

# We don't have a lot of other kinds of views for demux, as most of our plugins just
# use the filesystem, so this isn't really as well-developed as some of the other
# artifact types (like feature-table or most feature-data artifacts)
directory = your_artifact.view(SingleLanePerSampleSingleEndFastqDirFmt)

# at this point you could just call `.path` or `str()` on `directory` to get the
# location of all your reads and do everything manually, but we can save you a whole
# step of globbing for your sequence files:

lengths = {}
# directory.sequences is a BoundFileCollection and has an API you can play around with
# we're going to use `.iter_views` to find all of its elements (and view).
for file_name, format in directory.sequences.iter_views(FastqGzFormat):
    # `format` is now another proxy for a filepath (just like `directory`) so you 
    # can call format.path or str(format) to get the real filesystem location

    # you can substitute this with any python you want, here's an example with skbio
    seqs = skbio.io.read(str(format), format='fastq', phred_offset=33)
    lengths[relative_path] = min(len(s) for s in seqs)

print(lengths)

If we had more transformations available, we could have potentially used better object representations, instead of SingleLanePerSampleSingleEndFastqDirFmt or even FastqGzFormat, alas, those are all we have to work with at this time.

It would have been really neat if we could have said something like: directory.iter_views(DNAIterator) and gotten a bunch of ready-to-go sequence iterators for every sample's fastq file.

Hope that's helpful!

1 Like

Also, something to note:

When you create a view, we can only promise it will "work" for as long as the artifact that created it is in scope. Some views get to use the mounted filepath of the artifact as their backing data, and so if the artifact goes out of scope, that mounted filepath also disappears leaving your poor view with no data for it to look at :broken_heart:

1 Like

Thanks!

I think encountered this issue a few weeks ago and I was wondering what was going on.

1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.