Detecting FASTQ variant types

Mehrbod_Estaki · September 15, 2017, 8:00pm

As briefly suggested here, I'm starting a discussion on whether it's possible to implement some sort of FASTQ variant detection method for cases where a user inherits a FASTQ file of unknown origin.
The closest thing I found that might help a potential developer was this Biostar thread.

ebolyen · September 18, 2017, 8:12pm

Thanks for the link!

I think one of the main hurdles to accomplishing something like this is the way QIIME 2 formats and provides data to plugins.

One of the guarantees you get when you write a plugin for QIIME 2 is that the data is well formed and known. This is exactly the wrong expectation when you are trying to identify something unknown. So if this functionality existed, it wouldn’t work as a method/visualizer, as you’d already need to know the PHRED offset to reach the point where the “detector” could run.

There is another place where something like this could occur, and we don’t support it yet, but it’s not terribly far out of reach. As we start implementing more formats and validation, it becomes possible for QIIME 2 to “test” a file to see if it matches a format. If QIIME 2 tested an unknown file against all formats, each format could indicate whether or not it looked like their kind of file.
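As a rough illustration of that idea, here is a minimal sketch in which each format exposes a small check and the unknown file is tried against all of them. This is not QIIME 2's API: the function names, the registry, and the deliberately simplistic checks are all hypothetical.

```python
def looks_like_fasta(text):
    """Hypothetical check: the first non-empty line starts with '>'."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return bool(lines) and lines[0].startswith(">")


def looks_like_newick(text):
    """Hypothetical check: the text starts with '(' and ends with ';'."""
    stripped = text.strip()
    return stripped.startswith("(") and stripped.endswith(";")


# Hypothetical registry mapping a format name to its check.
FORMAT_CHECKS = {
    "fasta": looks_like_fasta,
    "newick": looks_like_newick,
}


def matching_formats(text):
    """Return the name of every format whose check accepts the text."""
    return [name for name, check in FORMAT_CHECKS.items() if check(text)]
```

With checks this coarse, an unambiguous answer is a list with exactly one name in it; an empty list or several names means the file cannot be identified this way.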

This works pretty well when your files are very different. For example, figuring out whether something is a newick file or a fasta file is pretty straightforward, since only one of those formats will agree that the mystery file belongs to it. But for variants of a FASTQ file it definitely becomes harder, and as the Biostar thread indicates, the best you can really do is “guess”. It is entirely possible for a PHRED-64 file to stay within the range of a PHRED-33 file, so while you can certainly be suspicious when your resolved quality scores bottom out around 30 instead of 0, you can’t prove that it isn’t PHRED-33.
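To make the "guess" concrete, here is a minimal sketch of a range-based heuristic. The function name and both thresholds are assumptions for illustration: ASCII 59 (';') is the lowest character a 64-based encoding can produce if Solexa-style scores down to -5 are allowed, and ASCII 74 ('J') corresponds to Q41 with offset 33, which offset-33 Illumina data rarely exceeds.

```python
def guess_phred_offset(quality_lines):
    """Guess the offset from the ASCII range of the quality characters.

    Returns 33, 64, or None when the observed range fits both offsets.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        return None
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # No 64-based encoding can produce characters below ';'.
        return 33
    if highest > 74:
        # Above 'J' is unusual for offset-33 Illumina data (assumption).
        return 64
    # Everything falls in the overlap, so the range alone decides nothing.
    return None
```

The `None` branch is the case described above: a file whose characters all sit in the overlapping range is consistent with either offset, so the heuristic can only raise suspicion, not settle the question.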

Basically, this isn’t impossible, but it is hard. From my personal view, there also seems to be a fundamental question of what kind of reproducibility you can expect when even the sequencing instrument is unknown. There is likely other important context about that dataset that is missing as well.

In any case, thanks for getting this discussion started!

Mehrbod_Estaki · October 23, 2017, 3:55am

Recently I was browsing around in QIIME 1 and came across this, which I thought might be helpful here.
http://qiime.org/scripts/split_libraries_fastq.html

--phred_offset
The ascii offset to use when decoding phred scores (either 33 or 64). Warning: in most cases you don’t need to pass this value [default: determined automatically]

It sounds like QIIME 1 used some automatic method to distinguish between these two types, so perhaps the code there might be useful here.

jairideout · October 24, 2017, 12:09am

QIIME 1 detects the phred offset by looking at the FASTQ headers – if the headers appear to have been produced by a new enough version of the Casava software (1.8), it’ll assume phred 33, otherwise 64. QIIME 1 doesn’t look at the quality scores at all during detection. We could implement this detection – it won’t be perfect but if a user’s FASTQ headers match the expected pattern they won’t have to worry about the phred offset.
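A minimal sketch of that kind of header-based check is below. It is not QIIME 1's code: the regular expression and the function name are assumptions, written against the Casava 1.8+ header layout (`@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`).

```python
import re

# Assumed pattern for a Casava 1.8+ header line, for example:
# @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
CASAVA_18_HEADER = re.compile(
    r"^@[^:\s]+:\d+:[^:\s]+:\d+:\d+:\d+:\d+ \d:[YN]:\d+:\S*$"
)


def offset_from_header(header_line):
    """Return 33 if the header matches the Casava 1.8+ layout, else 64."""
    if CASAVA_18_HEADER.match(header_line.strip()):
        return 33
    return 64
```

Note that this never inspects the quality line, so a file with rewritten or non-Illumina headers falls through to 64 regardless of what its scores look like.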

@ebolyen what do you think about having this as part of the format detection when we have the framework to support it?

Mehrbod_Estaki · May 13, 2019, 11:15pm

Just wanted to add something to this deeply buried topic.
I just realized that vsearch already has a command that tries to guess the FASTQ variant based on the ASCII characters in the quality lines: vsearch --fastq_chars.
This could either be exposed through q2-vsearch for users who don’t know their variant type or, more ambitiously, incorporated into some sort of auto-detection on import.