Extract the barcodes from the paired-end reads

muluoljira · May 22, 2020, 5:39am

Hello!

I am a beginner and have recently started learning QIIME 2. I received Illumina R1 and R2 files with quality information. I want to create a sample metadata file so I can proceed with the next steps. Could someone share how to extract barcodes from the paired-end reads using QIIME 2?

Cheers!

ChrisKeefe · May 22, 2020, 5:58pm

Hi @muluoljira!
I'm not exactly clear on what format your data is in, but there are some awesome tutorials that will probably help you with data import and some basic workflows you might build on. (Check out the Moving Pictures, Atacama, and Parkinson's Mice tutorials for a few different flavors of basic analysis.)

Give those a read, see if you can figure things out, and let us know if you have more specific questions.

Happy :qiime2:-ing!
Chris

Melisa_Olivelli · May 22, 2020, 11:21pm

Hi! I have the same type of raw data, and it seems to be Casava 1.8 paired-end demultiplexed. Importing it that way worked for me, but I only have one sample. I would also need to know how to extract the barcodes from the sequences for the metadata file.
Cheers!

ChrisKeefe · May 22, 2020, 11:43pm

Welcome to the forum, @Melisa_Olivelli! q2-cutadapt can be really helpful for removing barcodes from sequences. There's plenty of info about it here on the forum (check out the search feature!), including this short-form tutorial, and the docs have a great section on available plugins that will give you the official documentation.

Have a great weekend!
Chris

muluoljira · May 23, 2020, 2:42am

Thank you for your awesome response. I have checked out the different flavors in the Moving Pictures, Atacama, and Parkinson’s Mice tutorials. My raw data format seems to be neither Casava 1.8 nor EMP, so I am trying to sort things out with the “Fastq manifest” formats. I created a fastq manifest file for my paired-end read data; however, I couldn't work out how to decide between 'PairedEndFastqManifestPhred33' and 'PairedEndFastqManifestPhred64'.
Best!

ChrisKeefe · May 26, 2020, 8:42pm

Sounds like you're making progress, @muluoljira. Quality scores in FASTQ data are written using an "alphabet" of 43 characters. In this context, "33" and "64" describe the first ASCII character in the block of characters used in quality scores. (Every ASCII character has a number associated with it. 33 is ! and 64 is @, so 33-formatted quality scores are written with the characters from ! to K, while 64-formatted scores use @ to j.)
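To make the offset idea concrete, here is a minimal Python sketch (not part of QIIME 2; the helper name decode_quality is made up for illustration). It only shows that a quality score is the character's ASCII code minus the offset, so the same scores look different in the two encodings.

```python
def decode_quality(quality: str, offset: int) -> list[int]:
    """Turn a FASTQ quality string into Phred scores.

    `offset` is 33 or 64, i.e. the ASCII code of the first character
    in the block used by that encoding.
    """
    return [ord(ch) - offset for ch in quality]


# The ASCII codes behind the two ranges mentioned above.
print(ord("!"), ord("K"))  # 33 75
print(ord("@"), ord("j"))  # 64 106

# The same four scores written in each encoding.
print(decode_quality("II5!", 33))  # [40, 40, 20, 0]
print(decode_quality("hhT@", 64))  # [40, 40, 20, 0]
```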

This bit on Quality Score Variants, which I found in the importing tutorial, gives a high-level overview of which machines use which variants, and is probably your best bet. If you don't know what equipment was used in sequencing, you could ask your sequencing center.

If that's not possible, you could preview some of your raw data (e.g. less my_data.fastq) and compare the quality-score characters to the characters in each format's group of accepted characters. Hopefully it doesn't come to that!
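If you do end up comparing characters by eye, a rough Python sketch like the one below could do the comparison for you. It is a heuristic, not an official QIIME 2 tool: the function names are invented for this example, it assumes plain four-line FASTQ records, and it uses the ! to K and @ to j ranges described above. Characters below @ can only belong to the 33 variant, and characters above K point to the 64 variant; anything in between is ambiguous.

```python
from itertools import islice


def quality_char_range(lines, max_records=10000):
    """Return (lowest, highest) ASCII code seen on the quality lines.

    Assumes four-line FASTQ records, so every fourth line holds qualities.
    Returns (None, None) if no quality characters were read.
    """
    lo = hi = None
    for i, line in enumerate(islice(lines, max_records * 4)):
        if i % 4 != 3:
            continue  # skip header, sequence, and '+' lines
        codes = [ord(c) for c in line.rstrip("\r\n")]
        if not codes:
            continue
        lo = min(codes) if lo is None else min(lo, min(codes))
        hi = max(codes) if hi is None else max(hi, max(codes))
    return lo, hi


def guess_offset(lo, hi):
    """Guess 33 or 64 from the observed range, or None if ambiguous."""
    if lo is None or hi is None:
        return None
    if lo < ord("@"):
        return 33  # characters below '@' cannot occur in the 64 variant
    if hi > ord("K"):
        return 64  # characters above 'K' fall outside the ! to K block
    return None  # everything lies in the overlap of the two ranges


# Tiny in-memory example; for a real file, pass an open text handle,
# e.g. open("my_data.fastq") or gzip.open("my_data.fastq.gz", "rt").
record = ["@read1\n", "ACGT\n", "+\n", "II5#\n"]
lo, hi = quality_char_range(record)
print(lo, hi, guess_offset(lo, hi))  # 35 73 33
```

A None result means the sampled reads only use characters common to both ranges, so asking the sequencing center is still the safer route.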

Best,
Chris

ChrisKeefe · May 27, 2020, 5:19pm

@muluoljira, the inimitable @Mehrbod_Estaki shared another useful tool that might help you determine which kind of data you've got:

vsearch is included with QIIME 2, so you should be able to use it by activating a qiime2 environment, and running the command above.

ChrisKeefe · May 27, 2020, 6:18pm

Alright, @muluoljira! I feel like I've been bombarding you with information here, but here's one final take. This is the recommended approach, and it doesn't require you to mess around with outside tools or ASCII tables.

muluoljira · May 28, 2020, 3:53am

Thanks, @ChrisKeefe, for your informative responses. Once again, hearty thanks for your kind and generous explanations. I enjoyed working through this and have solved it!
