Extract the barcodes from the paired-end reads

Hello!

I am a beginner and have recently started learning QIIME 2. I received Illumina R1 and R2 files with quality information, and I want to create a sample metadata file to proceed with the next steps. Could someone share how to extract barcodes from paired-end reads using QIIME 2?

Cheers!

Hi @muluoljira!
I’m not exactly clear on what format your data is in, but there are some awesome tutorials that will probably help you with data import and some basic workflows you might build on. (Check out the Moving Pictures, Atacama, and Parkinson’s Mice tutorials for a few different flavors of basic analysis.)

Give those a read, see if you can figure things out, and let us know if you have more specific questions.

Happy :qiime2:-ing!
Chris :bird:

Hi! I have the same type of raw data, and it seems to be Casava 1.8 paired-end demultiplexed. That worked for me, but I only have one sample. I would also need to know how to extract the barcodes from the sequences for the metadata file.
Cheers!

Welcome to the forum, @Melisa_Olivelli! q2-cutadapt can be really helpful for removing barcodes from sequences. There’s plenty of info about it here on the forum (check out the :mag: search feature!), including this short-form tutorial, and the docs have a great section on available plugins that will give you the official documentation.

Have a great weekend!
Chris :partying_face:

Thank you for your awesome response. I have checked out the different flavors of the Moving Pictures, Atacama, and Parkinson’s Mice tutorials. My raw data format seems to be neither Casava 1.8 nor EMP, so I am trying to sort it out with the “Fastq manifest” formats. I created a fastq manifest file for my paired-end read data; however, I couldn’t work out how to decide whether to use ‘PairedEndFastqManifestPhred33’ or ‘PairedEndFastqManifestPhred64’.
Bests!

Sounds like you’re making progress, @muluoljira. Quality scores in FASTQ data are written using an “alphabet” of 43 characters. In this context, “33” and “64” describe the first ASCII character in the block of characters used in quality scores. (Every ASCII character has a number associated with it. 33 is ! and 64 is @, so 33-formatted quality scores are written with the characters from ! to K, while 64-formatted scores use @ to j.)
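To make the two offsets concrete, here is a minimal Python sketch. The function name `decode_quality` is made up for illustration and is not part of QIIME 2; it just subtracts the offset from each character's ASCII code.

```python
def decode_quality(quality_string, offset):
    """Convert FASTQ quality characters to numeric quality scores."""
    return [ord(char) - offset for char in quality_string]


# Build the 43-character alphabet (scores 0 to 42) for each offset.
phred33_chars = "".join(chr(33 + score) for score in range(43))
phred64_chars = "".join(chr(64 + score) for score in range(43))

print(phred33_chars[0], phred33_chars[-1])  # ! K
print(phred64_chars[0], phred64_chars[-1])  # @ j

# The same idea in reverse: decode a quality string with offset 33.
print(decode_quality("!5?K", 33))  # [0, 20, 30, 42]
```

Decoding the same string with the wrong offset gives nonsensical (often negative) scores, which is why the import format has to match your data.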

This bit on Quality Score Variants, which I found in the importing tutorial, gives a high-level overview of which machines use which variants, and is probably your best bet. If you don’t know what equipment was used in sequencing, you could ask your sequencing center.

If that’s not possible, you could preview some of your raw data (e.g. less my_data.fastq) and compare the quality-score characters to the characters in each format’s group of accepted characters. Hopefully it doesn’t come to that! :crossed_fingers::grin:
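If you do end up comparing characters by hand, a rough Python sketch like the one below can do the comparison for you. It is only a heuristic based on the two character ranges described above, not an official QIIME 2 tool: the function names are hypothetical, and it assumes standard four-line FASTQ records (no wrapped sequence lines).

```python
import itertools


def quality_lines(fastq_lines):
    """Yield the quality line (every 4th line) of each 4-line FASTQ record."""
    return itertools.islice(fastq_lines, 3, None, 4)


def guess_offset(fastq_lines, max_records=10000):
    """Guess 33 or 64 from the quality characters; None if undecidable."""
    seen = set()
    for line in itertools.islice(quality_lines(fastq_lines), max_records):
        seen.update(line.strip())
    if not seen:
        return None
    if min(seen) < "@":
        # Characters below '@' (ASCII 64) cannot occur in 64-offset data.
        return 33
    if max(seen) > "K":
        # Characters above 'K' (ASCII 75) fall outside the 33-offset range.
        return 64
    # Every character sits in the overlap between the two ranges.
    return None


# Tiny in-memory examples standing in for real files.
print(guess_offset(["@read1", "ACGT", "+", "II5!"]))  # 33
print(guess_offset(["@read1", "ACGT", "+", "hhhh"]))  # 64
print(guess_offset(["@read1", "ACGT", "+", "FFFF"]))  # None
```

To try it on real data, pass an open file handle instead of a list, e.g. `with open("my_data.fastq") as handle: print(guess_offset(handle))`. For gzipped files, `gzip.open(path, "rt")` should work the same way. A result of `None` means the sampled reads were ambiguous, so checking with your sequencing center is still the safer route.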

Best,
Chris :crab:

@muluoljira, the inimitable @Mehrbod_Estaki shared another useful tool that might help you determine which kind of data you’ve got:

vsearch is included with QIIME 2, so you should be able to use it by activating a qiime2 environment, and running the command above.

Alright, @muluoljira! I feel like I’ve been bombarding you with information here, but here’s one final take. This is the recommended approach, and doesn’t require you to mess around with outside tools or ascii tables. :tada:

Thank you for your informative responses. Once again, hearty thanks for your kind and generous explanations. I enjoyed working through them, and the problem is solved!
