Questions on data formats and parsing

garlic · February 6, 2020, 12:27am

Hello,

I am new to microbiome analysis, though I am experienced in other sequencing approaches like RNA-seq and de novo genome assembly.

I am designing an undergraduate course in microbiome analysis, and I have a couple of naive questions that I am having some trouble finding the answer to. I am attempting to design the lab exercises, but do not have any sequence data to play with yet, and am trying to anticipate some potential problems.

I am not completely clear on what format of data I will be importing. We are using the Qiagen 16S/ITS panel kit, which is intended for amplifying and sequencing several regions of the 16S simultaneously, along with fungal ITS. Our plan is to do 2 x 300 bp PE sequencing on MiSeq, with ~20 samples multiplexed.
I have noticed a format in the QIIME docs called "Earth Microbiome Project" format, which appears to have barcodes in a separate file from the reads. What I am not exactly sure about is how to know whether I will have EMP-formatted data. Is it dependent on the kit/primers that I use? If my data don't comply with EMP, would I just use the Casava 1.8 paired-end demultiplexed fastq format?
Regarding the Qiaseq 16S/ITS panel kit, I think that it might be overkill for the students (who generally have not worked with sequence data before) to work on each variable region, so I'm contemplating reducing the dataset to include V3V4 and ITS. I'm curious if anyone has used this kit, and if there is an easy method for pulling out specific variable regions of interest. Apparently Qiagen's CLC Genomics has this ability, but we do not have funds to purchase a subscription.

Thanks for any help!

jwdebelius · February 6, 2020, 9:26am

Hi @garlic,

Welcome!

Your course sounds good, and I hope it goes well.

The EMP format is based on older assumptions about your data. My best recommendation is to use the manifest format. I think everyone tries Casava first because it sounds easy, and then they end up frustrated. So some up-front work on the manifest tends to save frustration later, IMO.
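For anyone following along, here is a minimal sketch of building a manifest in Python. It assumes demultiplexed, per-sample files named like `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz` in a single directory; the function name `build_manifest` and that naming scheme are illustrative assumptions, not something from this thread. The column headers are the ones QIIME 2 expects for its paired-end `PairedEndFastqManifestPhred33V2` format.

```python
import csv
from pathlib import Path


def build_manifest(fastq_dir, manifest_path,
                   r1_suffix="_R1.fastq.gz", r2_suffix="_R2.fastq.gz"):
    """Write a tab-separated paired-end manifest for the FASTQ pairs in fastq_dir.

    The file naming scheme (suffixes) is an assumption; adjust it to match
    whatever your demultiplexing step actually produces.
    """
    fastq_dir = Path(fastq_dir).resolve()  # the manifest needs absolute paths
    rows = []
    for r1 in sorted(fastq_dir.glob(f"*{r1_suffix}")):
        # Everything before the R1 suffix is treated as the sample ID.
        sample_id = r1.name[: -len(r1_suffix)]
        r2 = fastq_dir / f"{sample_id}{r2_suffix}"
        if not r2.exists():
            raise FileNotFoundError(f"no reverse read file for sample {sample_id}")
        rows.append((sample_id, str(r1), str(r2)))

    with open(manifest_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["sample-id",
                         "forward-absolute-filepath",
                         "reverse-absolute-filepath"])
        writer.writerows(rows)
    return rows
```

The resulting file can then be given to `qiime tools import` with `--type 'SampleData[PairedEndSequencesWithQuality]'` and `--input-format PairedEndFastqManifestPhred33V2`.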

If you have the primers, you can separate the regions using cutadapt directly. (I don't have the exact link, but if you look on their site, you can find it.) I did this recently on a mixed-region project and it seemed to work okay. (I'm still working on the processing.)
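To make the idea of separating regions by primer concrete, here is a small pure-Python illustration. This is not cutadapt and not how cutadapt is implemented: it only does exact, no-mismatch matching of a degenerate (IUPAC) primer against the start of a read, whereas cutadapt tolerates errors and handles paired files. The names `assign_region` and `primer_regex` are made up for this sketch, and the primer table is a placeholder (the V3V4 entry is the widely used 341F primer, the other is invented); the kit's real primers would have to come from the kit documentation.

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Placeholder primers for illustration only (NOT taken from the Qiagen kit).
FORWARD_PRIMERS = {
    "V3V4": "CCTACGGGNGGCWGCAG",
    "demo_region": "ACGTRYACGT",
}


def primer_regex(primer):
    """Turn a degenerate primer into a regex anchored at the read start."""
    return re.compile("^" + "".join(f"[{IUPAC[base]}]" for base in primer.upper()))


def assign_region(read, primers=FORWARD_PRIMERS):
    """Return the name of the first region whose primer starts the read, else None."""
    for region, primer in primers.items():
        if primer_regex(primer).match(read.upper()):
            return region
    return None
```

Reads that return `None` here correspond to what cutadapt would treat as untrimmed reads.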

Best,
Justine

garlic · February 6, 2020, 6:43pm

Thanks so much for your help, Justine

So, if I understand correctly (regarding using cutadapt to separate regions), I would first import the data, then demultiplex, and then use the trim-paired command with cutadapt with the primer sequences to generate artifacts specific to my regions of interest? Would I use --p-adapter-f and --p-adapter-r with my primer sequences?

Thanks again for your help!

jwdebelius · February 6, 2020, 6:51pm

Hi @garlic,

Check with your sequencing center: they will be better able to tell you whether your data is delivered demultiplexed or not. If it is not, you can do the demultiplexing in cutadapt and then remove the primers. I think they're relatively similar steps. Then, you can import the data into QIIME 2 with the manifest format.
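As a rough sketch of the primer-removal step with standalone cutadapt, the snippet below only assembles the command line for one sample and one region. It assumes the forward primer sits at the start of R1 and the reverse primer at the start of R2, which should be checked against the kit documentation. The helper name `cutadapt_region_command`, the file names, and the primer table are illustrative assumptions; the 341F/805R pair is a commonly used V3V4 pair and stands in as a placeholder, not as the kit's actual primers.

```python
import shlex

# Placeholder primer pair (forward, reverse); NOT taken from the Qiagen kit.
PRIMERS = {
    "V3V4": ("CCTACGGGNGGCWGCAG", "GACTACHVGGGTATCTAATCC"),
}


def cutadapt_region_command(sample, region, in_r1, in_r2):
    """Build a cutadapt call that keeps only read pairs carrying the region's primers."""
    fwd, rev = PRIMERS[region]
    return [
        "cutadapt",
        "-g", fwd,              # 5' primer to look for in R1
        "-G", rev,              # 5' primer to look for in R2
        "--discard-untrimmed",  # drop read pairs in which a primer was not found
        "-o", f"{sample}_{region}_R1.fastq.gz",
        "-p", f"{sample}_{region}_R2.fastq.gz",
        in_r1,
        in_r2,
    ]


cmd = cutadapt_region_command(
    "sample1", "V3V4", "sample1_R1.fastq.gz", "sample1_R2.fastq.gz"
)
# Print the command for inspection; subprocess.run(cmd, check=True) would execute it.
print(" ".join(shlex.quote(part) for part in cmd))
```

If you prefer to stay inside QIIME 2, the `trim-paired` action of the cutadapt plugin exposes the 5' primer options as `--p-front-f` and `--p-front-r`; `--p-adapter-f` and `--p-adapter-r` are meant for 3' adapters.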

As I look more at the Qiaseq kit, can I suggest that, if you're not sure you're going to work with all of it, you just start with a single region? A lot of people are trying the mixed regions, and most of them find that they're headaches! Based on that experience, I would pick a single region and work off of that.

Best,
Justine

garlic · February 6, 2020, 9:21pm

Thanks again for your advice, Justine. We are sequencing it in house (students are doing extractions and library preps), and I think that we will get multiplexed R1 and R2 files.

I agree completely about using a single region next time. This kit is giving me nightmares, and I can't find any similar datasets (from 16S panel) to test the various steps on!

jwdebelius · February 6, 2020, 9:40pm

Hi @garlic,

In your shoes, I'd ask the sequencing center to run Casava or bcl2 (or whatever it's called... sorry, I sit downstream of that step normally) for you. Otherwise, I think I'd try the dual cutadapt.

I have a suspicion of who to blame for the sudden explosion in multi-region kits. I happen to know of one multi-region 16S dataset: the example from this paper, in their repository.

Best,
Justine