Error with analysis of fastq file

Hello everyone, I am new to this forum, and after spending two days searching it for a solution to my problem, I am finally asking for help.
I should mention that I have already gone through the “Moving Pictures” and “Atacama soil microbiome” tutorials and completed them successfully.

Next, I am trying to analyze data downloaded from this run: https://www.ncbi.nlm.nih.gov/sra/SRX5598268
run file: https://sra-pub-src-2.s3.amazonaws.com/SRR8809301/HOL_UK2_S389_Prokaryotes_SR1.fastq

I downloaded the original data as a fastq file, and I am trying to follow the manifest protocol to import it as an artifact, but I am unable to do so.

I just have two questions:

1) Am I right to follow the “Fastq manifest” formats guide and use the PairedEndFastqManifestPhred33V2 format to import my data?

2) Is there any way to inspect the attributes (for example, barcodes) of the current run file?

Thank you for your help guys!

Hi @qamrq25,
I haven’t looked through the data you linked, but from the looks of it, it is a single fastq file. Is this correct? If so, the reads are not demultiplexed (unless the file holds a single sample), so the manifest import will not be applicable. Instead, go back to the Moving Pictures tutorial and follow that example by importing single-end reads. You’ll need to demultiplex these yourself, so you will need the barcode information. Hopefully that is readily available too.

Hey, thanks for the reply. It is one fastq file from the run, i.e. HOL_UK2_S389_Prokaryotes_SR1, but when I searched the study I also found another run file, namely HOL_UK2_S389_Prokaryotes_SR2.

So I am guessing there are two fastq files per run.

Regarding the barcode information, since I am using another study’s data, I don’t have it. Any workaround would be appreciated.

Hi @qamrq25,
HOL_UK2_S389_Prokaryotes_SR1 is most likely your forward reads, while HOL_UK2_S389_Prokaryotes_SR2 is the reverse reads. Unfortunately, without the barcode information there is no way to demultiplex these. Are you sure it isn’t included in the metadata file provided with the study? Also, sometimes the barcodes are placed in the header line of the sequences. There isn’t a way to demultiplex these in QIIME 2 at the moment, but if the barcodes are indeed in the headers, you should at least be able to find a tool elsewhere that can accomplish this.
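A quick way to check whether the headers carry anything barcode-like is to tally the attribute names that appear in them. The snippet below is only a rough sketch, not a QIIME 2 feature: it assumes plain 4-line fastq records and headers of the form `@read_id key=value; key=value;`, and the function names `parse_header` and `count_attribute_keys` are made up for illustration.

```python
from collections import Counter
from itertools import islice


def parse_header(line):
    """Split a fastq header into (read_id, {attribute: value}).

    Assumes the header looks like '@read_id key=value; key=value;'.
    """
    read_id, _, rest = line.rstrip("\n").lstrip("@").partition(" ")
    attrs = {}
    for field in rest.split(";"):
        key, sep, value = field.strip().partition("=")
        if sep:  # skip empty or malformed fields
            attrs[key] = value
    return read_id, attrs


def count_attribute_keys(handle, max_records=1000):
    """Count how often each attribute name occurs in the first records.

    Assumes strict 4-line fastq records, so every 4th line is a header.
    """
    counts = Counter()
    headers = islice(handle, 0, None, 4)
    for line in islice(headers, max_records):
        _, attrs = parse_header(line)
        counts.update(attrs.keys())
    return counts


# Example usage (the file name is a placeholder):
# with open("reads.fastq") as handle:
#     for key, n in count_attribute_keys(handle).most_common():
#         print(key, n)
```

If names such as a sample label or tag sequences show up in every record, that is a hint the file was already annotated per read by whoever processed it.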

Also, while going through the fastq file headers, I saw forward and reverse primers, but the primers are not included in the sequence line. Does this mean they have already been trimmed, or do files normally come like this for microbiome samples? My previous animal sequencing fastq files had the primers in the sequence line as well.

@HWI-D00104:273:C6EDRANXX:8:1101:3922:3491_CONS_SUB_SUB_CMP ali_length=104; seq_ab_match=100; tail_quality=29.1; reverse_match=cctacggctaccttgttac; seq_a_deletion=0; sample=IT646; SGL_LAB=RS; WP=3; forward_match=cctgctccttgcacacac; forward_primer=cctgctccttgcacacac; reverse_primer=cctacggctaccttgttac; EarTag=IT098990178892; forward_score=72.0; score=302.232778128; seq_a_mismatch=4; forward_tag=tgctccaa; seq_b_mismatch=0; experiment=Plate11_ArchA_R1; mid_quality=51.2936507937; avg_quality=48.5410958904; seq_a_single=21; score_norm=2.90608440508; Azienda=Bianchini; reverse_score=76.0; direction=reverse; seq_b_insertion=0; seq_b_deletion=0; SamplingDate=18.12.2014; seq_a_insertion=0; seq_length_ori=146; reverse_tag=tgctccaa; goodAli=Alignement; AnimalID=4; seq_length=87; N_LAB=282; status=full; mode=alignment; head_quality=33.3; Giro=5; seq_b_single=21;

So now the problem is to deal with barcodes. I am trying to see if I can locate barcodes in the Supplementary material but I am sure they aren’t provided.

Hi @qamrq25,
I’m honestly not sure what has happened to these reads. Your best bet is to track down the publication/group that handled this data and ask them directly for more info. Every sequencing facility does things a bit differently, so there is no single format to “expect” these to be in. For example, sometimes they may give you data with primers included and sometimes with primers removed. As you noted, for this data the primers have been removed, and it appears that they are using the headers as storage for a whole bunch of QC information.
If you can’t contact the original creators of this data, you could try extracting the information from these headers. In particular, forward_tag=tgctccaa and reverse_tag=tgctccaa might be the dual-indexed barcodes, though I’m just guessing here.
You’ll have to find a custom way of extracting that information from the headers to recreate a mapping file with barcode info. I don’t have any more recommendations on this task but you may want to ask/search around, someone out there surely has a script for handling these.
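As a starting point, here is a minimal sketch of what such a script might look like. It is built on my guess above, so treat it with caution: it assumes 4-line fastq records, that `sample=` names the sample, and that `forward_tag=` and `reverse_tag=` are the barcodes. The function names and the output column names are hypothetical, not a QIIME 2 metadata format.

```python
import csv


def parse_attributes(line):
    """Return the 'key=value;' attributes of a fastq header as a dict."""
    _, _, rest = line.rstrip("\n").partition(" ")
    attrs = {}
    for field in rest.split(";"):
        key, sep, value = field.strip().partition("=")
        if sep:  # skip empty or malformed fields
            attrs[key] = value
    return attrs


def collect_sample_tags(handle):
    """Map each sample name to the set of (forward_tag, reverse_tag) pairs.

    Assumes strict 4-line fastq records, so lines 0, 4, 8, ... are headers.
    """
    tags = {}
    for i, line in enumerate(handle):
        if i % 4 != 0:
            continue
        attrs = parse_attributes(line)
        sample = attrs.get("sample")
        if sample is None:
            continue
        pair = (attrs.get("forward_tag", ""), attrs.get("reverse_tag", ""))
        tags.setdefault(sample, set()).add(pair)
    return tags


def write_mapping(tags, out_handle):
    """Write one tab-separated row per sample and tag pair."""
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample-id", "forward-tag", "reverse-tag"])
    for sample in sorted(tags):
        for fwd, rev in sorted(tags[sample]):
            writer.writerow([sample, fwd, rev])


# Example usage (file names are placeholders):
# with open("reads.fastq") as fq, open("mapping.tsv", "w", newline="") as out:
#     write_mapping(collect_sample_tags(fq), out)
```

If a sample ends up with more than one tag pair, or several samples share the same pair, the tags alone are probably not enough to tell samples apart, and the `sample=` field may be the more useful thing to split on.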

Thank you for the input. Yes! I have asked the author of the study to share his run file information with me. Hopefully I will get a positive response from him. In the meantime, I am experimenting with Trimmomatic to see if I can extract the actual sequences for analysis.
