improving fastq import experience for Galaxy users

Done.

Just a little feedback from a recent check of the usage stats on usegalaxy.eu. The import tool has been used a bit more than 7k times (which is cool), but most other tools have been used fewer than 10 times (only a very few tools have been used a few dozen times).

In my opinion this clearly shows that the import tool is not really usable (for Galaxy users). It offers a multitude of options, whereas (IMO) almost all users just want to import fastq data (maybe with a few file naming formats). The tool could focus on that case, or make the most frequently used options the default (or the top choices).

For most of the options it also seems too complicated to find any docs.

See also Reorganize/improve import tool · Issue #37 · qiime2/galaxy-tools · GitHub

I guess it would also help to link the excellent q2 tutorial in the GTN (Link dada2 and qiime2 tutorials · Issue #4626 · galaxyproject/training-material · GitHub). But overall my feeling is that a simplified version of the import tool would help a lot.

Thanks for this input @bernt-matthias! We'll discuss this topic in our engineering meeting this week. We're definitely aware that this is one of the most challenging steps for users, both through Galaxy and through other interfaces.

@bernt-matthias, I'm splitting this off into a new Developer Discussion topic.

@Oddant1 and I spent some time on this today, and we think we have a path forward that should be much simpler for users.

Over the next few days, @Oddant1 is going to create a Galaxy-specific import tool for single-end/paired-end fastq data. We think, based on some experimentation that we did today, that it should be possible for Galaxy users to automatically import from a Collection where the identifiers are the user's sample ids, and files are annotated as forward or reverse (if paired-end).

Here's how we see this looking, with a couple of notes/questions indicated in bold.

  1. Click Upload Data, select Collection, if paired-end data change Collection Type to List of Pairs, and add the local or ftp files. Click Start, and then Build.

  2. Next, enter patterns to indicate which are the forward and reverse reads (I entered _R1_ and _R2_). At this stage, Galaxy correctly aligns the forward and reverse read data sets. Question 1: Is it possible to pair all of the data sets at this stage with a single action? The only way we were able to get that to work was to manually click each Pair these datasets button, which worked fine but could be tedious if there were (e.g.) hundreds of pairs here.

  3. After pairing all of the data sets, we change the identifier for each pair as indicated in the following screenshots.
    a. Note: at one point, these new identifiers didn't propagate into the collection, but it seems to be working this time. Are there any known issues with this process? It's also possible this was just something we did wrong - we were experimenting with a lot of different things.
    b. Feature request: We kind of stumbled on this way of changing identifiers by accident while clicking around, as it's not really obvious that that is an editable field. If these could look like unpopulated textboxes with the default identifiers in them, I think that would make this more obvious to users. Happy to create an issue somewhere if you think that's a useful feature request - just let me know where if so.
    c. Feature request: it would be amazing if there was a field like those where I have entered _R1_ and _R2_ that allowed the user to define where the identifier was in the file names. For example, in the case I have illustrated here, if the user could provide a regular expression ([^_]*), that could automatically extract the identifiers. This is what we ended up doing when we ran into my Note 3a above, using Collection Operations > Apply rules > + Column > Using a Regular Expression. This could of course be optional, but it seems like it could save users with regex experience from the potentially tedious process of manually relabeling pairs using the existing process I outlined here. (Again, happy to open an issue somewhere if that's helpful.)


    We are assuming these would all now be the identifiers in the middle column:

  4. At this point, there is a Collection of fastq files where identifiers are associated with each file, and if the data are paired-end, the read direction is also associated with each file. That's all the information we need for importing per-sample modern Illumina (i.e., Phred offset 33) fastq files, which is probably >>99% of our users' needs. So our tool will work from that Collection to import QIIME 2's SampleData[SequencesWithQuality] or SampleData[PairedEndSequencesWithQuality] without additional information required from the user, and we can expand to additional use cases from there as needed.
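
To make the logic of the steps above concrete, here is a minimal Python sketch of the pairing and identifier extraction. It is not Galaxy or q2galaxy code: the function name `build_pairs` and the example file names are hypothetical, and it assumes that the read direction can be detected with the plain substrings _R1_ and _R2_ and that the sample id is everything before the first underscore (the ([^_]*) expression mentioned in 3c).

```python
import re
from collections import defaultdict


def build_pairs(filenames, forward="_R1_", reverse="_R2_", id_pattern=r"([^_]*)"):
    """Group fastq file names by sample id and read direction.

    Hypothetical helper for illustration only; it mimics what the
    collection builder does, it does not call any Galaxy API.
    """
    id_re = re.compile(id_pattern)
    pairs = defaultdict(dict)
    for name in filenames:
        # Decide the read direction from the user-supplied substrings.
        if forward in name:
            direction = "forward"
        elif reverse in name:
            direction = "reverse"
        else:
            raise ValueError(f"no read direction found in {name!r}")
        # The first capture group, anchored at the start, is the sample id.
        match = id_re.match(name)
        sample_id = match.group(1) if match else ""
        if not sample_id:
            raise ValueError(f"no sample id found in {name!r}")
        if direction in pairs[sample_id]:
            raise ValueError(f"duplicate {direction} read for {sample_id!r}")
        pairs[sample_id][direction] = name
    # Every sample needs exactly one forward and one reverse file.
    incomplete = sorted(s for s, p in pairs.items() if len(p) != 2)
    if incomplete:
        raise ValueError(f"unpaired samples: {incomplete}")
    return dict(pairs)


# Made-up file names in a typical Illumina naming style.
files = [
    "sampleA_S1_L001_R1_001.fastq.gz",
    "sampleA_S1_L001_R2_001.fastq.gz",
    "sampleB_S2_L001_R1_001.fastq.gz",
    "sampleB_S2_L001_R2_001.fastq.gz",
]
pairs = build_pairs(files)
```

The resulting mapping from sample id to forward/reverse file is the same information a List of Pairs Collection carries, which is why no further input should be needed from the user.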

@bernt-matthias, thanks again for this feedback. We knew this was a challenging process for users, but the numbers you put on it made it clear that it's really a blocker for most or all users.

For the filters, it seems that you can already use regular expressions, e.g. L..._R1_.... It seems that the matching text will be removed from the element identifiers. This might already be handy for your example.
Apart from this, the rule builder will be your friend at the moment:

If a paired collection could be used as input, it would be a huge improvement.

I guess with all this, __q2galaxy__GUI__cond__add_ext__ will also be history? Otherwise I would have suggested replacing it completely with a text parameter (empty would mean that nothing is removed).

You are right, the dialog could be improved. I summarized some ideas here.

Question 1:

Yes, there is this link: Auto-pair. This does the trick. Could be more prominent. I will open an issue.

Are there any known issues with this process?

Not that I know of. If you find any problems, oddities, or things that could be improved, you can always open an issue at the Galaxy repo. The Galaxy team is happy to get feedback.

I will have to experiment with the other points. I did not know this myself :slight_smile:

Thanks @bernt-matthias. As we work on this tool we'll experiment with the rule-based uploader, filtering to sample identifiers, etc. and submit any feature requests/feedback on the Galaxy issue tracker. We're expecting to have this new tool in place for the 2024.2 release of QIIME 2 and q2galaxy, which should be out in mid-to-late Feb.

@Oddant1, can you comment on that?

@bernt-matthias, for this fastq-specific import tool at least, I don't think we will have any need for __q2galaxy__GUI__cond__add_ext__.