Import from multiple GaIIx lanes

LauraMason · December 21, 2018, 9:10pm

Continuing the discussion from Import Paired End Illumina Data:

Hello,

I have paired-end, demultiplexed data from Illumina GaIIx sequencing. As in the topic above, the samples are spread across multiple lanes, and I plan to create a FeatureTable for each lane and then merge them. However, I am unsure how to import the data, as the file structure and names do not match those in the tutorials.

File structure and names: each lane folder contains a separate folder for each sample. Each sample folder contains files with names similar to s_2_1_sequence.txt, and the same file names are present for each sample (sample a in bin 1 and sample b in bin 2 have identically named files with presumably different contents). I would also really rather not rename all of these files by hand.

So overall:

What is the best way for me to import these files? (Casava?)
Do I need to rename or restructure this data?

Thanks!
Laura

Nicholas_Bokulich · December 31, 2018, 11:25pm

Casava is only used for files in that specific naming format. Use the manifest formats instead (see the importing tutorial for more information).

No, you should not need to rename or restructure anything: as long as it is fastq data it should work, and the manifest format is flexible about naming conventions (because you point to each individual file in a "manifest file").
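To avoid renaming anything by hand, the manifest could be generated with a short script. The following is only a rough sketch under several assumptions that are not confirmed in this thread: each lane folder holds one sub-folder per sample, the sub-folder name is usable as the sample ID, files follow the pattern s_<lane>_<read>_sequence.txt, and read 1 is forward while read 2 is reverse. The function name `build_manifest` is hypothetical. It writes the comma-separated layout with the header `sample-id,absolute-filepath,direction`; newer QIIME 2 releases also describe a tab-separated variant, so check the importing tutorial for your version.

```python
import csv
import re
from pathlib import Path

# Assumed file-name pattern: s_<lane>_<read>_sequence.txt, with read 1 taken
# as forward and read 2 as reverse. Adjust if your data differ.
READ_PATTERN = re.compile(r"^s_\d+_([12])_sequence\.txt$")
DIRECTIONS = {"1": "forward", "2": "reverse"}


def build_manifest(lane_dir, manifest_path):
    """Write a CSV manifest for one lane folder and return the rows written."""
    rows = []
    lane_dir = Path(lane_dir).resolve()  # the manifest needs absolute paths
    # Assumption: every sub-folder of the lane folder is one sample.
    for sample_dir in sorted(p for p in lane_dir.iterdir() if p.is_dir()):
        for fastq in sorted(sample_dir.iterdir()):
            match = READ_PATTERN.match(fastq.name)
            if match is None:
                continue  # skip files that do not fit the assumed pattern
            direction = DIRECTIONS[match.group(1)]
            rows.append((sample_dir.name, str(fastq), direction))
    with open(manifest_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample-id", "absolute-filepath", "direction"])
        writer.writerows(rows)
    return rows
```

Calling something like `build_manifest("Lane1", "lane1_manifest.csv")` once per lane (both paths are placeholders) would give one manifest per lane, which fits the plan of importing each lane separately and merging afterwards.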

Let us know if/why the manifest format is not working for you and we can help!

I hope that helps!

LauraMason · January 2, 2019, 10:10pm

Hi
Thanks for your help! However, I was mistaken: these reads are multiplexed paired-end fastq files, and do not seem to be compatible with a Fastq Manifest import. I'm considering trimming in QIIME 1 (per How to demultiplex fastq file that still includes Barcodes and LinkerPrimer? - #3 by Nicholas_Bokulich), but that seems to leave me in the same position I'm in now. (The data structure will be difficult to deal with, and it does not seem that QIIME 1 has a Fastq Manifest function?)

So: I altered my data structure, renamed all the files, concatenated them, and changed the file extensions so that the data matched the EMP protocol. The issue I'm having now is that when I try to import, I get the following error:
There was a problem importing SD_Lane1: SD_Lane1/barcodes.fastq.gz is not a(n) FastqGzFormat file: Header on line 5 is not FASTQ, records may be misaligned

I've checked the header and this is what I have:
SampleID BarcodeSequence LinkerPrimerSequence
BM1Na CGTGAT caagcagaagacggcatacgagat
BM1Fa ACATCG caagcagaagacggcatacgagat
BM2Na GCCTAA caagcagaagacggcatacgagat
BM2Fa TGGTCA caagcagaagacggcatacgagat
BM3Na CACTGT caagcagaagacggcatacgagat
BM3Fa ATTGGC caagcagaagacggcatacgagat
BM4Na GATCTG caagcagaagacggcatacgagat
BM4Fa TCAAGT caagcagaagacggcatacgagat
BM6Na CTGATC caagcagaagacggcatacgagat

Any thoughts? Should I add more info to my barcodes file?

Thank you
Laura

thermokarst · January 3, 2019, 4:44pm

This looks like QIIME 2 Metadata, not a fastq.gz file. Do you have a separate barcodes.fastq.gz file? Data with the barcodes in their own fastq file is often referred to as following the EMP Protocol. If not, are your barcodes still in the reads? If so, you could check out q2-cutadapt to demux these reads.
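A quick way to tell the two apart before importing is to look at the first few records. Below is a minimal sketch, not QIIME 2's own validation: it assumes plain four-line FASTQ records (no wrapped sequences), and the function names are made up for illustration. A tab- or space-separated mapping file like the one above would be flagged at line 1, because a FASTQ header must start with '@'.

```python
import gzip
from itertools import islice


def open_maybe_gzip(path):
    """Open a text file, handling gzip compression if the magic bytes match."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "r")


def first_fastq_problem(path, max_records=100):
    """Return a description of the first malformed record, or None.

    Only the first `max_records` four-line records are checked; an empty
    file is reported as having no problem.
    """
    with open_maybe_gzip(path) as handle:
        for index in range(max_records):
            record = [line.rstrip("\r\n") for line in islice(handle, 4)]
            if not record:
                return None  # reached the end of the file
            line_no = index * 4 + 1
            if len(record) < 4:
                return f"record starting on line {line_no} is incomplete"
            header, seq, plus, qual = record
            if not header.startswith("@"):
                return f"line {line_no} does not start with '@'"
            if not plus.startswith("+"):
                return f"line {line_no + 2} does not start with '+'"
            if len(seq) != len(qual):
                return f"record starting on line {line_no} has unequal sequence and quality lengths"
    return None
```

If this returns None for the file, the first records at least have the expected shape; any other return value points at the line to inspect by hand.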

LauraMason · January 3, 2019, 5:16pm

Hi
It looks like my barcodes are still in the reads, so I will check out the link and get back to you.
Thanks
Laura

LauraMason · January 8, 2019, 1:46pm

Gah, I was right the first time, and then misread. These are demultiplexed fastq files, and I have begun to import them following the Fastq Manifest tutorial, removing the adapter on the forward read using cutadapt as you suggested, and trimming. This all seems to be going fine.

However, it takes over 8 hours to import a lane. I think that this is because I am running an Ubuntu subsystem on Windows 10 Home, and I've read that this does not work well with QIIME 2. Is this correct? I've been using this setup on smaller data sets (a 454 project) and did not encounter this issue. I've already looked into using Docker, but that package seems to require Windows 10 Pro.

Any advice would be helpful!

Thanks
Laura

thermokarst · January 9, 2019, 4:54pm

I am not aware of any performance related issues when using the Windows Subsystem for Linux, but I haven't used it myself.

The amount of time required for steps like this is going to depend on many different factors: number of reads, number of samples, computer specs, etc.