Questions about importing multiple samples in fastq EMP PairedEnd format

Kiet_Ho · March 23, 2019, 7:47pm

Hi everyone,

I am an undergraduate student in computer science, and I am currently working on a bioinformatics project: metagenomic analysis using QIIME 2. Our 16S rRNA sequencing data is publicly available at https://www.ebi.ac.uk/ena/data/view/PRJEB26635 (in the "Submitted files (FTP)" column).

In short, our proposed workflow is as follows:
Import => Demux => dada2 (denoise and join pair) => Assign taxonomy => Phylogenetic tree => Analyse alpha and beta diversity (optional).

Unfortunately, since I lack biology knowledge (I am a computer science student), I have trouble choosing the right import settings for my files (i.e., are those files demultiplexed or multiplexed? Should I use EMP, Casava, or Fastq Manifest? Should I demux them after importing or not? etc.).

So I tried importing them in both the fastq manifest and EMP formats. I then realised that the manifest format allows me to import multiple samples (18 in total, paired end), whereas EMP allows me to import only one sample at a time (I used qiime to extract the barcodes before this import). However, as I understand it from reading the tutorial, the manifest format assumes the imported files are already demultiplexed. Thus, I tried to demux them using the .qza file obtained from the EMP import. However, I got confused, as the demux command implies that I should have a sample-metadata file that contains the barcodes of multiple samples, whereas the EMP import command only lets me import one sample at a time. Can someone please clarify these importing and demultiplexing procedures for me?

P.S. My partner and I have already read through the tutorial docs several times, so merely saying "read the tutorial" won't help much. If possible, we would prefer specific quotes from the tutorial docs so that we know what we overlooked. Thank you in advance!

jwdebelius · March 23, 2019, 8:35pm

Hi @Kiet_Ho,

My best suggestion in this situation is to talk to your sequencing center for specifics. The tutorials unfortunately can't cover the little idiosyncrasies of each sequencing center or group, so the absolute best way to know is to ask! Plus, as an informatician, it's always good to get to know the people who do the wet-lab work, because life is so much better when you can both appreciate each other's talents. Philosophy aside, they're the ones who can tell you if the data came demultiplexed or multiplexed.

If you're not able to get hold of them, and it is the data from EBI, I'm going to guess demultiplexed because, AFAIK, EBI requires demultiplexed data. In that case, I think the manifest format will be your best bet. The Casava format is really a specialized variant of the manifest format and, TBH, you get so much more control with the manifest format.
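If it helps, here is a minimal sketch of how a manifest for many paired-end samples could be generated instead of typed by hand. It assumes the downloaded files follow a hypothetical `<sample>_1.fastq.gz` / `<sample>_2.fastq.gz` naming scheme (check this against your actual filenames), and it writes the tab-separated paired-end layout with the columns `sample-id`, `forward-absolute-filepath`, and `reverse-absolute-filepath`. The function name `build_manifest` is made up for illustration.

```python
import csv
import re
from pathlib import Path

# Assumption: files are named <sample>_1.fastq.gz (forward) and
# <sample>_2.fastq.gz (reverse). Adjust the pattern to your real filenames.
PAIR_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<read>[12])\.fastq\.gz$")


def build_manifest(fastq_dir, manifest_path):
    """Write a tab-separated paired-end manifest; return the sample count."""
    pairs = {}
    for path in sorted(Path(fastq_dir).iterdir()):
        match = PAIR_PATTERN.match(path.name)
        if match:
            # Group forward/reverse files under their shared sample name.
            pairs.setdefault(match["sample"], {})[match["read"]] = path.resolve()

    # Refuse to write a manifest if any sample lacks one of its two reads.
    incomplete = [s for s, reads in pairs.items() if set(reads) != {"1", "2"}]
    if incomplete:
        raise ValueError(f"Samples missing a read file: {incomplete}")

    with open(manifest_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(
            ["sample-id", "forward-absolute-filepath", "reverse-absolute-filepath"]
        )
        for sample, reads in sorted(pairs.items()):
            writer.writerow([sample, reads["1"], reads["2"]])
    return len(pairs)
```

A manifest like this would then go to `qiime tools import` with `--type 'SampleData[PairedEndSequencesWithQuality]'` and a manifest input format such as `PairedEndFastqManifestPhred33V2`. Whether Phred33 is the right offset for these particular files is an assumption worth verifying, and older QIIME 2 releases expect a comma-separated manifest with different column names.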

The trouble you had with the EMP import also makes me think your data is already demultiplexed. (Backward sleuthing.) If you have multiplexed data (i.e., data that needs to be demultiplexed with qiime demux emp-paired), you should have a metadata file/sample sheet/whatever you call it that maps a barcode to a sample. If you don't have that information and you have multiple samples, you're somewhat sunk. But this method only works if there are multiple samples and multiple barcodes in that fastq file. So, since your data is more than likely already demultiplexed, that's likely why it didn't work.
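One rough way to check for multiple barcodes yourself is to count the distinct sequences in the barcode reads. This is only a sketch: it assumes the barcodes sit in their own FASTQ file (as in the EMP layout, or the output of a barcode-extraction step) and that every record is exactly four lines. The function name `count_barcodes` is made up for illustration.

```python
import gzip
from collections import Counter
from itertools import islice


def count_barcodes(fastq_path, max_reads=100_000):
    """Count distinct barcode sequences in the first max_reads FASTQ records."""
    # Open gzip-compressed and plain-text files the same way.
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    counts = Counter()
    with opener(fastq_path, "rt") as handle:
        # Assumption: each record is four lines, so the sequence is
        # line 2 of every group of four (index 1, 5, 9, ...).
        sequences = islice(handle, 1, None, 4)
        for seq in islice(sequences, max_reads):
            counts[seq.strip()] += 1
    return counts
```

If `count_barcodes(...)` reports essentially one dominant sequence per file, that is consistent with one sample per file, i.e. data that is already demultiplexed. Many distinct, well-populated barcodes in a single file would point toward multiplexed data that needs a barcode-to-sample mapping. Sequencing errors add a tail of rare barcodes, so look at the top few counts rather than the total number of distinct sequences.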

Sorry if that's a confusing way to get to an explanation, but double-check with your sequencing center for either the demultiplexed status or a sample sheet, and then, good luck!

Best,
Justine

Kiet_Ho · March 23, 2019, 9:23pm

Thank you so much for your help.

We will send an e-mail to EBI to get confirmation on this matter.

In the meantime, we will assume our data is demultiplexed and carry on.
