Import Casava demultiplexed sequences with project name

jwdebelius · March 15, 2018, 5:43pm

I'm working with paired-end reads that were demultiplexed using Casava. The file names follow the format [project id]_[sample id with underscore]_L001_[read id]_001.fastq
For instance, p01_s01_20180315_L001_R1_001.fastq.

I can load the files as a single- or paired-end artifact, do quality filtering, and get a demultiplexing summary. But deblur throws a duplicate ID error when I try to run them through. After a fair bit of detective work, I determined that the workaround is to rename my files so they follow the format [project id].[sample id with underscore]_L001_[read id]_001.fastq
So my example now looks like p01.s01_20180315_L001_R1_001.fastq
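
For anyone wanting to script this rename, here is a minimal sketch in Python. It assumes the project id itself contains no underscore and that file names end in the _L001_R1_001.fastq style suffix shown above; the function name dot_project_prefix is made up for illustration and is not part of QIIME 2.

```python
import re

# Suffix shared by the file names in this thread, e.g. _L001_R1_001.fastq
# (an optional .gz extension is accepted as an assumption).
CASAVA_SUFFIX = re.compile(r"_L\d{3}_R[12]_001\.fastq(\.gz)?$")


def dot_project_prefix(filename: str) -> str:
    """Replace the underscore after the project id with a period.

    Hypothetical helper: assumes the project id is everything before
    the first underscore.
    """
    if not CASAVA_SUFFIX.search(filename):
        raise ValueError(f"unexpected file name: {filename}")
    project, _, rest = filename.partition("_")
    return f"{project}.{rest}"


print(dot_project_prefix("p01_s01_20180315_L001_R1_001.fastq"))
```

The new name could then be passed to os.rename or pathlib.Path.rename once you have checked the output on a copy of the data.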

I tried using p01.s01.20180315_L001_R1_001.fastq, but this could not be loaded into QIIME 2 as a Casava file. It would be really helpful if there were a flag on the qiime tools import command to specify an underscore-delimited project id.

gregcaporaso · March 16, 2018, 8:17pm

Hi @jwdebelius,
I haven't seen a project id in files that I've worked with before that were generated by Casava. Do you know if that file naming variant is covered in the Casava 1.8 user manual (or other documentation) anywhere?

This importing step is challenging for us to simplify as there is a lot of variation in how demultiplexed fastq files are named when users receive them. The Casava 1.8 format represents one way the files are named, but there are countless others. To address that issue, we defined the fastq manifest formats which have no restrictions on the individual file names, and allow the sample ids to be specified independently of the filename. I would recommend using the fastq manifest format to import your demultiplexed fastq data instead of the Casava 1.8 formats. The record in your fastq manifest file for this sample would look like:

p01.s01.20180315,$PWD/p01_s01_20180315_L001_R1_001.fastq,forward

This would specify that you want the sample id to be p01.s01.20180315 (but you could change that to anything you want - see our recommendations for creating sample ids), the filepath is $PWD/p01_s01_20180315_L001_R1_001.fastq, and that these are forward reads.
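
If there are many files, records like the one above could be generated with a short script. The following is only a sketch: it assumes every file name ends in _L001_R1_001.fastq or _L001_R2_001.fastq (optionally gzipped), that replacing underscores with periods gives acceptable sample ids, and that the manifest starts with the sample-id,absolute-filepath,direction header line. The function name manifest_lines is hypothetical.

```python
import re

# Everything before the lane field is treated as the sample id (assumption).
PATTERN = re.compile(
    r"^(?P<sample>.+)_L\d{3}_(?P<read>R[12])_001\.fastq(?:\.gz)?$"
)
DIRECTIONS = {"R1": "forward", "R2": "reverse"}


def manifest_lines(filenames, directory="$PWD"):
    """Build fastq manifest lines for the given file names.

    Hypothetical helper: file names that do not match PATTERN are skipped.
    """
    lines = ["sample-id,absolute-filepath,direction"]
    for name in sorted(filenames):
        match = PATTERN.match(name)
        if match is None:
            continue
        # p01_s01_20180315 becomes p01.s01.20180315
        sample_id = match.group("sample").replace("_", ".")
        direction = DIRECTIONS[match.group("read")]
        lines.append(f"{sample_id},{directory}/{name},{direction}")
    return lines


files = [
    "p01_s01_20180315_L001_R1_001.fastq",
    "p01_s01_20180315_L001_R2_001.fastq",
]
print("\n".join(manifest_lines(files)))
```

Writing the printed lines to a file gives a manifest that leaves the original file names untouched, so no renaming is needed.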

We ultimately hope to simplify the generation of these fastq manifest files by allowing a user to describe (maybe via some GUI) how their sample ids map to their file names.

Hope this helps!