Can anyone help me with data from a sequencing center?

Hello, my colleague gave me some old MiSeq sequencing data in a format I have never seen before. My own sequencing center usually gives us three files (Forward.fastq, Reverse.fastq and Index.fastq), which are easy to work with.

The old data come from another sequencing center and are split by sample: each sample has a forward.fastq.gz and a reverse.fastq.gz. Based on the QIIME 2 tutorial (Importing data, QIIME 2 2022.8.3 documentation), it looks like the "Casava 1.8 paired-end demultiplexed fastq" format. However, the sequencing center didn't provide an index file.

If my old data are in the "Casava 1.8 paired-end demultiplexed fastq" format:

1> Is it possible to figure out whether the index/barcodes have been removed or not? If the barcodes have not been removed, can I extract them? I am not sure whether the barcodes are on the F reads, the R reads, or both.

2> If I can't extract the barcodes, can I still use the QIIME 2 DADA2 workflow?

3> If I can't extract the barcodes, I won't have them. Are barcodes necessary for the QIIME 2 metadata file?

Thanks

Hello!
Some sequencing facilities send reads that are already demultiplexed. In that case the barcodes are usually removed at the demultiplexing step.
If the reads are in Casava 1.8 paired-end demultiplexed format, the barcode sequence should be indicated in the file names right after the sample ID.
Barcodes are needed for the demultiplexing step and do not have to be present in the metadata file if the reads are already demultiplexed.
Demultiplexed reads can be imported into QIIME 2 and denoised with DADA2 (primers can be removed with q2-cutadapt or trimmed in DADA2).

Best,

Hello! Thank you so much.

"If the reads are in Casava 1.8 paired end demultiplexed format, barcode sequence should be indicated in the files names right after sample ID."

Do you mean that the Casava 1.8 format must have the barcode sequence in the file name? If so, I don't think my files are named that way. Here is an example of my file names:

210623Helob01_S100_L001_R1_001.fastq.gz
210623Helob01_S100_L001_R2_001.fastq.gz

I noticed that in the QIIME 2 example (Importing data, QIIME 2 2022.8.3 documentation) there are no barcode sequences in the names of the Casava 1.8 fastq.gz files. Can you tell me whether the barcodes in Casava 1.8 data have been removed or not?

Or do you mean that the barcodes are inside the fastq file? I checked the head of the F and R read files:

head 210623Helob01_S100_L001_R1_001.fastq

@M01676:287:000000000-JV275:1:1101:15959:1387 1:N:0:ACTACGAC+CTGCGTGT

TACGAGTGCTTCGAGCGTTATCCGGAATCATTGGGCGTAAAGGGTGTGTAGGCGGCGTGGTTAGTCTTCTGTAAAATCCTTGGGCTCAACCGGGGGCTGGCGGTGGAAACGGCAGCGCTTAAGTCCGGGGGAGGTATCTGGATGTCAGGTGGTAGCGGGGAAAGGCGAGGATATCATGGGGAACACCAAAGGCGAAGGCAAGAAACTGGCCCGCTCCTGCCGCTGAGACACGAAAGCGTGGGGAGCGAATG

+

>>AA1>1>1BDFAGBE0AEEGGHGGCAFBGHHCFHGCEGEFFH/BCA/BEFGBEC/>E//?/1BBGF2F2BGFDGHHFG121BGFHHHGFG//<>//A/?/@-<<.<GD/E.C@CCC-C:.0:0009;?---9=-///9;:;9///:/;9/-9///-@-;@BE-9------/;//B/;/A-AB--;AA--/9--;-;-A9A--//;/FB-A-9=-@B/9/-;9--;-;/-B-;A-AB?BB?;--9---@;F

@M01676:287:000000000-JV275:1:1101:15442:1412 1:N:0:ACTACGAC+CTGCGTGT

CACGTAGGGCGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGACTGTCACGTCTGCTGTGAAAAGCTAGGGCTTAACCCTGGCCTTGCAGTGGATACGGGCAGACTAGAGGTAGGTAGGGGAGAGTGGAATTCCCGGTGTAGCGGTGAAATGCGCAGATATCGGGAGGAACACCGGTGGCGAAGGCGGTTCTCTGGGCCTTACCTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA

+

head 210623Helob01_S100_L001_R2_001.fastq

@M01676:287:000000000-JV275:1:1101:15959:1387 2:N:0:ACTACGAC+CTGCGTGT

CCCATTCGCGACCCACGCTTTCGTGTTTCAGCGTCAGGAGCGGGCCAGTCTCTTGCCTTCGCTTTTGGTGTTCCCCATGATATCAACGCATTTCACCGCTACACCATGAGCTCCAGAGACCTCTCCCGTCCTCAAGCGCTGCCGTTTCCTCCGCGAGCCCTCGGTTGAGCCGAAGGATTTTACAGAAGACTAACAACGCCGCCTACACACCCTTTACGCCCAATGATTCCGGATAACGCCCGAAGCACTCG

+

BBBBBFBFA2DBGGGGCGGFGGGGGHFGHHGHGGGGGHGGHGGGGGGFHFHHDGHHHHHHGG>FFHBEEFHGHFFHCGHHHHHHGHHGGGGGHHHHHGGGGGHGHHHGHGBGGHB1?CGHGHHFH0FDFDAGHHGG1.ACCEFFDGHHFFFHGGG@FAAGGGADGDFFBFFD---AFFFFF/BF/9/9F/F/BFFBE>DB@F>D.BBFFFFFFFFFB.9>DFEF/BBBFFF-A.;9;.----;;-9BFB/.

@M01676:287:000000000-JV275:1:1101:15442:1412 2:N:0:ACTACGAC+CTGCGTGT

CCTGTTCGCTCCCCACGCTTTCGCTCCTCAGCGTCAGGTAAGGCCCAGAGAACCGCCTTCGCCACCGGTGTTCCTCCCGATATCTGCGCATTTCACCGCTACACCGGGAATTCCACTCTCCCCTACCTACCTCTAGTCTGCCCGTATCCACTGCAAGGCCAGGGTTAAGCCCTAGCTTTTCACAGCAGACGTGACAGTCCGCCTACGAGCTCTTTACGCCCAATAATGCCGGACAACGCTTGCGCCCTACG

Do you think "ACTACGAC+CTGCGTGT" is the barcode?

I just need to figure out if the barcodes have been removed or not. My file name per se doesn't have any barcode information.
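
For what it's worth, in Casava 1.8-style headers the last colon-separated field after the space is, as far as I know, the index (barcode) sequence the read was assigned to at demultiplexing, with "+" separating the two indexes of a dual-indexed run. It sits in the header rather than in the read itself. Below is a minimal Python sketch that pulls this field out of one of the header lines above; the function and class names are made up for illustration.

```python
from typing import NamedTuple, Optional


class IlluminaHeader(NamedTuple):
    read_id: str                 # instrument:run:flowcell:lane:tile:x:y
    read_number: int             # 1 = forward read, 2 = reverse read
    is_filtered: str             # "Y" or "N"
    control_number: str
    index_field: Optional[str]   # e.g. "ACTACGAC+CTGCGTGT", or None if absent


def parse_illumina_header(line: str) -> IlluminaHeader:
    """Split a Casava 1.8-style FASTQ header line into its parts."""
    if not line.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    read_id, _, comment = line[1:].strip().partition(" ")
    fields = comment.split(":")
    if len(fields) < 3:
        raise ValueError("expected 'read:filtered:control[:index]' after the space")
    index_field = fields[3] if len(fields) > 3 and fields[3] else None
    return IlluminaHeader(read_id, int(fields[0]), fields[1], fields[2], index_field)


header = "@M01676:287:000000000-JV275:1:1101:15959:1387 1:N:0:ACTACGAC+CTGCGTGT"
parsed = parse_illumina_header(header)
print(parsed.index_field)              # ACTACGAC+CTGCGTGT
print(parsed.index_field.split("+"))   # ['ACTACGAC', 'CTGCGTGT']
```

If every read in a sample carries the same index pair in its header, that is consistent with the file having been demultiplexed on those indexes.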

Oh, I need to apologise: I thought that barcodes in the file name were a requirement, but now I see that you have different information there.

By default, barcodes should be removed by the sequencing facility at the demultiplexing step, unless it was asked not to do so. Since your sequences are already demultiplexed, you did not ask for the barcodes to be kept in the sequences, and the sequencing center did not give you any specific information about it, the barcodes should already be removed.

I would remove the primers with q2-cutadapt and then proceed to DADA2.

Best,

Hi, thank you very much.

1> Yes, judging from the file names (no barcode sequence in them), the barcodes are supposed to have been removed by the sequencing center. However, I was told these samples were sequenced on a MiSeq with 2x250 bp reads. I checked the raw fastq files, and all of the sequences in them are ~250 bp long.

If the adapters/index barcodes had been removed, shouldn't the sequences be shorter than 250 bp?

2>"I would remove primers by q2-cutadapt and proceed to Dada2 after."

I am a little bit confused about the workflow after the barcodes are removed. I read the QIIME 2 instructions about removing non-biological sequences (QIIME 2 for Experienced Microbiome Researchers, QIIME 2 2022.8.3 documentation).

What I understand is that the purpose of q2-cutadapt is to remove primers, and that I need to provide the primer sequences. Am I correct? Or will it automatically detect any repeated non-biological sequences without my providing them?

The instructions say "
The q2-cutadapt plugin has comprehensive methods for removing non-biological sequences from paired-end or single-end data.

If you’re going to use DADA2 to denoise your sequences, you can remove biological sequences at the same time as you call the denoising function. All of DADA2’s denoise fuctions have some sort of --p-trim parameter you can specify to remove base pairs from the 5’ end of your reads. (Deblur does not have this functionality yet.)"

I plan to use the DADA2 workflow. In that case, do I need to run q2-cutadapt before DADA2? The instructions seem to suggest not. What do you normally do: both, or just DADA2?

Thanks

In my samples barcodes were in the F reads, so after demultiplexing it looks like this:

If both the F and R reads are 250 nt long, then the barcodes have indeed not been removed.
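
To check read lengths across a whole file rather than only the first few records, a small Python sketch like the one below can tabulate them. The helper names are hypothetical and the demo reads are made up. One caveat worth keeping in mind: as far as I know, Illumina index reads are sequenced in separate index cycles and reported in the header (like the `1:N:0:ACTACGAC+CTGCGTGT` field posted earlier), so full-length 250 nt reads may not by themselves prove that a barcode is sitting inside the read.

```python
import gzip
import io
from collections import Counter
from typing import IO, Iterator


def iter_sequences(handle: IO[str]) -> Iterator[str]:
    """Yield the sequence line of every 4-line FASTQ record, ignoring blank lines."""
    non_blank = (line.strip() for line in handle if line.strip())
    for i, line in enumerate(non_blank):
        if i % 4 == 1:      # 0 = header, 1 = sequence, 2 = '+', 3 = qualities
            yield line


def length_histogram(handle: IO[str]) -> Counter:
    """Count how many reads there are of each length."""
    return Counter(len(seq) for seq in iter_sequences(handle))


def open_fastq(path: str) -> IO[str]:
    """Open a plain or gzipped FASTQ file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


# Tiny in-memory demo with made-up reads; for real data use open_fastq(path).
demo = io.StringIO(
    "@r1 1:N:0:AAAA\nACGTACGT\n+\nFFFFFFFF\n"
    "@r2 1:N:0:AAAA\nACGTAC\n+\nFFFFFF\n"
)
print(sorted(length_histogram(demo).items()))   # [(6, 1), (8, 1)]
```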

Cutadapt removes primers and barcodes from the sequences. You need to provide the corresponding sequences to it.

Trimming in DADA2 is one way of handling primers and adapters: you specify trimming parameters that cut the sequences at a certain position (so you need to choose the combined length of the primers and barcodes).

Cutadapt is a better alternative for removing primers/adapters from the sequences before DADA2.

In your case, since you don't know which barcodes were used or whether they were removed, I would suggest using cutadapt to remove the forward and reverse primers that were used in library preparation and to discard any reads without those sequences. It will also remove any sequence that precedes the primers, so the barcodes should be removed from the sequences at this step too.
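
To make the "remove the primer and whatever precedes it" idea concrete, here is a simplified Python sketch. It is not cutadapt's actual algorithm (cutadapt tolerates mismatches and partial matches, which this does not); it only does an exact, IUPAC-aware search. The primer is the forward primer from the example command in this thread, while the barcode and the read are made up.

```python
from typing import Optional

# IUPAC nucleotide codes: each primer symbol maps to the bases it may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_primer(read: str, primer: str) -> int:
    """Return the first position where the primer matches exactly, or -1."""
    for start in range(len(read) - len(primer) + 1):
        if all(read[start + i] in IUPAC[p] for i, p in enumerate(primer)):
            return start
    return -1


def trim_front(read: str, primer: str) -> Optional[str]:
    """Drop the primer and everything before it; None if the primer is absent."""
    start = find_primer(read, primer)
    if start == -1:
        return None   # comparable in spirit to discarding an untrimmed read
    return read[start + len(primer):]


primer = "CAAGRGTTHGATYMTGGCTCAG"     # forward primer from the thread's example
barcode = "ACGTACGT"                  # hypothetical in-line barcode
read = barcode + "CAAGAGTTTGATCCTGGCTCAG" + "GATGAACGCT"   # made-up read

print(trim_front(read, primer))           # GATGAACGCT
print(trim_front("GATGAACGCT", primer))   # None
```

The point of the sketch is only that the barcode does not need to be known: anything upstream of the primer goes away together with it.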

“If you’re going to use DADA2 to denoise your sequences, you can remove biological sequences at the same time as you call the denoising function. All of DADA2’s denoise fuctions have some sort of --p-trim parameter you can specify to remove base pairs from the 5’ end of your reads. (Deblur does not have this functionality yet.)"

“Cutadapt is a better alternative to remove primers/adapters from sequences before dada2.
In your case, since you don't know, which barcodes were used, and if they were removed or not, I would suggest to use cutadapt to remove forward and reverse primers that were used in library preparation and discard any reads without those sequences. It will remove also any sequence that precede the primers, so barcodes should be removed from the sequence at this step.”

Thank you very much. Just to make sure I understand this correctly:

1> I will run q2-cutadapt first. Since I know the primer sequences, I can provide them. I’ve used the DADA2 denoise --p-trim options before, but I have never used cutadapt. I suppose that, if I want to remove both the primers and the unknown barcodes, I need to check my raw data to see where the primer sequences end and tell cutadapt which position to cut at (usually the barcode is in front of the primer). Or do I only need to provide the primer sequences to q2-cutadapt?

2> If I have cleaned up all of the barcodes and primers at the q2-cutadapt step, I don’t need to use --p-trim for DADA2 denoising. Am I correct?

3> I was told these old data were sequenced on a MiSeq with 2x250 bp reads and that the primers are EMP 515F-806R (291 bp in between). I usually use 2x300 bp. As you know, the DADA2 workflow always tries to join the forward and reverse reads, and I don’t know how many bp I will end up trimming. If I keep only 200 bp of the F and R reads, would it be safe to join them? (I don’t want a lot of joining errors.)

  1. You need to provide the primer sequences to remove them.

  2. Correct.

  3. You need at least 12 nt in the overlapping region for the reads to be merged. That means that, with an amplicon length of approximately 290 nt, even 160-170 nt per read after truncation should be enough for the reads to merge.

So I would truncate around position 170 and then check the DADA2 stats to see how well it works.
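
The arithmetic behind this can be sketched in a few lines of Python. The 12 nt minimum and the ~290 nt amplicon length are the figures quoted in this thread; the function name is made up, and the calculation ignores amplicon length variation and any bases removed by primer trimming, so treat it as a rough check only.

```python
def expected_overlap(amplicon_len: int, trunc_len_f: int, trunc_len_r: int) -> int:
    """Bases shared by the two reads once each is truncated to the given length."""
    return trunc_len_f + trunc_len_r - amplicon_len


MIN_OVERLAP = 12      # minimum overlap quoted in this thread
AMPLICON_LEN = 290    # approximate amplicon length quoted in this thread

for trunc in (140, 150, 160, 170, 200):
    overlap = expected_overlap(AMPLICON_LEN, trunc, trunc)
    verdict = "ok" if overlap >= MIN_OVERLAP else "too short"
    print(f"truncate both reads at {trunc}: overlap {overlap} nt ({verdict})")
```

With these numbers, truncating both reads at 170 leaves 50 nt of overlap and truncating at 160 leaves 30 nt, while 150 leaves only 10 nt, which is below the minimum.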

Thank you very much!
"You need to provide sequences of primers to remove them." I can provide the primer sequences, but I don't know the barcodes, so there is no way I can provide the barcode sequences. You mentioned, "I would suggest to use cutadapt to remove forward and reverse primers that were used in library preparation and discard any reads without those sequences."

Is providing only the primer sequences enough to remove both the barcodes and the primers?

I checked the q2-cutadapt manual (cutadapt, QIIME 2 2022.8.3 documentation). I suppose I should use trim-paired rather than demux-paired.

trim-paired needs adapter sequences. Since I don't have adapter sequences, should I provide my primer sequences instead?

Also, is there a "real example" of a cutadapt command? The manual keeps saying 5' - 3', and I am quite confused about the orientation of the primer sequences. I think everyone writes them from 5' to 3'? Can you share a cutadapt command that you usually run to remove barcodes, primers, etc.?

Thanks,

Yes, as I already wrote, and as stated in the q2-cutadapt description, it will remove the provided primers and any preceding sequences (including barcodes in this case).

That's right: this time you should use trim-paired, since the purpose is to remove primers from already demultiplexed reads.

Here is an example from my pipeline:

qiime cutadapt trim-paired \
    --i-demultiplexed-sequences demux.qza \
    --o-trimmed-sequences trimmed.qza \
    --p-cores 6 \
    --p-front-f CAAGRGTTHGATYMTGGCTCAG \
    --p-front-r TGCTGCCTCCCGTAGGAGT \
    --p-match-adapter-wildcards \
    --p-discard-untrimmed \
    --p-match-read-wildcards

All primers are provided in the 5'-3' orientation.
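
On the orientation question, a short Python sketch may help. In a paired-end amplicon run the reverse read normally begins with the reverse primer exactly as it is written 5'-3', which is why it can be passed unchanged to --p-front-r. Its reverse complement would only show up at the far (3') end of the forward read, and only if the forward read runs all the way through the amplicon. The helper below is illustrative only and uses the reverse primer from the command above.

```python
# Complement table covering the IUPAC degenerate codes as well as A/C/G/T.
COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVN",
    "TGCAYRSWMKVHDBN",
)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a (possibly degenerate) DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


reverse_primer = "TGCTGCCTCCCGTAGGAGT"     # as given to --p-front-r, 5'-3'
print(reverse_complement(reverse_primer))  # ACTCCTACGGGAGGCAGCA
```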

PS. Since this thread is becoming too long and the discussion is already diverging from the original question, please create a new topic for new questions (after a quick search of the forum to make sure the question has not already been answered). This approach may help other users who encounter the same issues in the future.

Best,

An off-topic reply has been split into a new topic: Cutadapt primers orientation and further steps

Please keep replies on-topic in the future.