Error with analysis of fastq file

qamrq25 · May 17, 2020, 2:15pm

Hello everyone, I am new to this forum and after spending two days surveying this forum to solve my problem, I am finally asking for help.
I need to mention that I have already went through "Moving Picture" and "Atacama soil microbiome" tutorials and successfully completed them.

Next I am experimenting to do analysis of data downloading from this run: Shotgun 2nd generation sequencing of milking cows' rumen microbiome. - SRA - NCBI
run file: https://sra-pub-src-2.s3.amazonaws.com/SRR8809301/HOL_UK2_S389_Prokaryotes_SR1.fastq

I am downloading Original data in fastq file. And I am trying to follow manifest protocol to import the artifact but I am unable to do so.

I just have two questions: 1) Am I doing it correct to follow “Fastq manifest” formats guide and using PairedEndFastqManifestPhred33V2 commands to import my data?

Is there any way to study the attributes example barcode etc of current run file?

Thank you for your help guys!

Mehrbod_Estaki · May 18, 2020, 4:48am

Hi @qamrq25,
I haven't looked through the data you linked but from the the looks of it, it is one fastq file. Is this correct? If so, then these are not demultiplexed (unless it is a single sample) so therefore the manifest import will not be applicable. Instead, go back to the Moving Pictures tutorial and follow that example by importing single-end reads, you'll need to demultiplex these yourself and so will need the barcodes information. Hopefully those are readily available too.

qamrq25 · May 18, 2020, 5:16am

Hey thanks for the reply. it is one fastq file from the run i.e. HOL_UK2_S389_Prokaryotes_SR1 but when I searched the study I also got another run file namely HOL_UK2_S389_Prokaryotes_SR2.

So I am guessing there are two fastq files per run.

Regarding the barcode information, since I am using some other studies data I don't have it. Any work around will be appreciated.

Mehrbod_Estaki · May 18, 2020, 6:26am

Hi @qamrq25,
the HOL_UK2_S389_Prokaryotes_SR1 is most likely your forward reads, while HOL_UK2_S389_Prokaryotes_SR2 is the reverse reads. Unfortunately without the barcode information there is no way you can demultiplex these. Are you sure this isn't provided in the metadata file provided from the study? Also, sometimes the barcodes will be placed in the header line of the sequences, there isn't a way to demultiplex way these in QIIME 2 at the moment but if indeed barcodes are in the headers, at least you will be able to find a tool elsewhere that can accomplish this.

qamrq25 · May 18, 2020, 2:14pm

Also while going through the fastq file header, I saw forward and reverse primers but the primers are not included in the sequence line. Does this mean they are already trimmed or these files come like this in case of microbiome samples because my previous animal sequencing fastq files had primers also in sequence line.

@HWI-D00104:273:C6EDRANXX:8:1101:3922:3491_CONS_SUB_SUB_CMP ali_length=104; seq_ab_match=100; tail_quality=29.1; reverse_match=cctacggctaccttgttac; seq_a_deletion=0; sample=IT646; SGL_LAB=RS; WP=3; forward_match=cctgctccttgcacacac; forward_primer=cctgctccttgcacacac; reverse_primer=cctacggctaccttgttac; EarTag=IT098990178892; forward_score=72.0; score=302.232778128; seq_a_mismatch=4; forward_tag=tgctccaa; seq_b_mismatch=0; experiment=Plate11_ArchA_R1; mid_quality=51.2936507937; avg_quality=48.5410958904; seq_a_single=21; score_norm=2.90608440508; Azienda=Bianchini; reverse_score=76.0; direction=reverse; seq_b_insertion=0; seq_b_deletion=0; SamplingDate=18.12.2014; seq_a_insertion=0; seq_length_ori=146; reverse_tag=tgctccaa; goodAli=Alignement; AnimalID=4; seq_length=87; N_LAB=282; status=full; mode=alignment; head_quality=33.3; Giro=5; seq_b_single=21;

qamrq25 · May 18, 2020, 2:14pm

So now the problem is to deal with barcodes. I am trying to see if I can locate barcodes in the Supplementary material but I am sure they aren't provided.

Mehrbod_Estaki · May 18, 2020, 4:47pm

Hi @qamrq25,
I'm honestly not sure what has happened to these reads. Your best bet is to track down the publication/group that handled this data and ask them directly for more info. Every sequencing facility does things a bit differently so there is no way to "expect" these to be. For example sometimes they may give you data with primers included or removed. As you noted for this data the primers have been removed and it appears that they are using the headers as a storage for a whole bunch of QC.
If you can't contact the original creators of this data you could try maybe extracting the information from these headers. In particular forward_tag=tgctccaa and reverse_tag=tgctccaa might be the dual-indexed barcodes, though I'm just guessing here.
You'll have to find a custom way of extracting that information from the headers to recreate a mapping file with barcode info. I don't have any more recommendations on this task but you may want to ask/search around, someone out there surely has a script for handling these.

qamrq25 · May 19, 2020, 9:35am

Thank you for the input. Yes! I have asked the author of the study to share his run file information with me. Hopefully I will get a positive response from him. Meantime I am experimenting with trimmomatic and see if I can extract the actual sequence for analysis.

qamrq25 · June 3, 2020, 11:07am

Hi @Mehrbod_Estaki
So i have read some literature and some other queries stated by users on forums. I was able to import fastq files and got the following demux.qzv. I think this is not the correct sequence file. What to do you think?

Steps I have done to achieve this demux file:

I used the sratoolkit to dump fastq file from SRA run for both the forward and reverse reads ( I have not used the raw fastq files as they are giving error while data import)
Next, I used the paired-end manifest tutorial to import the above run files.
I converted the demux.qza to demux.qzv and got the results which I have shared above.

Further I am planning to denoise the data but wanted some insights from you guys.

Mehrbod_Estaki · June 3, 2020, 3:12pm

HI @qamrq25,

Can you expand on this a bit more please? What did you use exactly prior to importing these files? The exact commands and perhaps a few lines of the fastq file?

Assuming this is the same dataset that you mentioned from the beginning of this thread, these were multiplexed reads, as in you only had 1 forward and 1 reverse fastq file for all of your samples. With this format you shouldn't be using the manifest format as I mentioned earlier:

When you download the fastq files, how many files per sample are there?

Also, these quality scores look pretty odd to me, the link you provided mentions these are Illumina MiSeq reads, but they are not the regular Phred33 or Phred64 format quality scores, the values in the NCBI explorer is showing quality values of between 40-80. I'm surprised QIIME 2 didn't raise some red flags when you were importing these. I'll have to ask around a bit on this, stay tuned.

qamrq25 · June 6, 2020, 12:25pm

I noticed there were too many header lines in the raw fastq files. So I planned to use SRA files since the header is short in them as compared to the header I posted before:

@SRR8809301.1 HISEQ:123:H2CK3BCXX:1:1101:6047:20322_CONS_SUB_SUB length=250

First command I use and it downloaded one fastq files. From which I guess the SRA are single end since only one fastq file was downloaded.:

fastq-dump.exe SRR8809301

@SRR8809301.1 HISEQ:123:H2CK3BCXX:1:1101:6047:20322_CONS_SUB_SUB length=250
TACGGAGGATGCGAGCGTTATCCGGATTCATTGGGTTTAAAGGGTGCGCAGGCGGCGCC
+SRR8809301.1 HISEQ:123:H2CK3BCXX:1:1101:6047:20322_CONS_SUB_SUB length=250
KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKIIIIKooooooomidooiooooiiVoqoooooqqoo

This downloaded the fastq files. Which I used in manifest fastq format. Also following is the FastQC image. I noticed the author has done some tweaking already before uploading the data files on NCBI:

I got only one file which is strange.

Mehrbod_Estaki · June 6, 2020, 7:35pm

HI @qamrq25,
So I took a look at the links you provided and the whole thing is very confusing.
The description says these are Shotgun data using a MiSeq,

SRX5598268: Shotgun 2nd generation sequencing of milking cows' rumen microbiome.
1 ILLUMINA (Illumina MiSeq) run: 16,741 spots, 4.2M bases, 3.1Mb downloads

But then later they describe it as amplicon data:
Study: Ruminomics - 1000 cows microbiome amplicon survey

And the fastq files suggest these were in fact sequenced using a HiSeq machine.

@HISEQ:123:H2CK3BCXX:1:1101:6047:20322_CONS_SUB_SUB ali_length=190; seq_ab_match=190; tail_quality=41.0; reverse_match=ggactaccagggtatctaatc; seq_a_deletion=0; sample=UK389; reverse_tag=cgcgtaat; reverse_score=84.0; WP=3; forward_match=gccagcagccgcggtaa; forward_primer=gccagcmgccgcggtaa; reverse_primer=ggactaccmgggtatctaatc; EarTag=UK 220168302173; forward_score=68.0; score=716.912694551; seq_a_mismatch=0; forward_tag=ctgttgtg; seq_b_mismatch=0; experiment=Plate15_ProkA_R1; mid_quality=63.4551724138; avg_quality=62.0064516129; seq_a_single=60; score_norm=3.77322470816; status=full; direction=forward; seq_b_insertion=0; seq_b_deletion=0; seq_a_insertion=0; seq_length_ori=310; date=24.03.15; ID=2173; goodAli=Alignement; seq_length=250; mode=alignment; head_quality=41.0; seq_b_single=60;
tacggaggatgcgagcgttatccggattcattgggtttaaagggtgcgcaggcggcgcccaagtcagcggtcaaagttccgggctcaacccggtgatgccgttgaaactggggtgcttgagcaggtgcgacggaggcggaatgcgtggtgtagcggtgaaatgcatagatatcacgcagaaccccgattgcgaaggcagcttccgagcacccgtctgacgctgaggcacgaaagcgtgggtatcgaacag
+
KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKIIIIKooooooomidooiooooiiVoqoooooqqoooimgo_doom]VVodo__oooqmmoooooooioooooqqqqqooioooqooqooomoqoqmooomqqqomooqqqqommoqomoqqqoqooomoooommmmoomooommgqmoooommqoqooooooooqoiooqoooomoV_ViooooooKKKIIIIIKKKKKKIIKKKKKKKKKKKK

The quality scores are also pretty weird, I see ASCII characters above the 106 index limit of Phred64 (the max I would expect to see in fastq files)

Reading FASTQ file 100%
Read 16741 sequences.
Qmin 36, QMax 113, Range 78
Guess: -fastq_qmin 3 -fastq_qmax 80 -fastq_ascii 33
Guess: Illumina 1.8+ format (phred+33)

Letter N Freq MaxRun
 A    1063004  25.2%      4
 C     812188  19.3%      5
 G    1443586  34.2%      7
 N          3   0.0%      0  Q=%
 T     898362  21.3%      5
Char ASCII Freq Tails

'$' 36 0.0% 0
'%' 37 0.0% 0
'.' 46 0.0% 0
'/' 47 0.0% 0
'0' 48 0.3% 0
'7' 55 0.0% 0
'8' 56 0.0% 0
'9' 57 0.0% 0
';' 59 0.1% 0
'<' 60 0.0% 0
'=' 61 0.0% 0
'>' 62 0.3% 1
'A' 65 0.0% 0
'B' 66 0.0% 0
'C' 67 1.0% 4
'D' 68 0.1% 0
'G' 71 0.1% 0
'H' 72 0.1% 0
'I' 73 10.2% 957
'J' 74 0.0% 0
'K' 75 15.0% 6346
'M' 77 0.0% 0
'N' 78 0.4% 0
'R' 82 0.1% 0
'T' 84 2.1% 0
'V' 86 1.6% 0
'W' 87 0.2% 0
'' 92 0.3% 0
']' 93 0.8% 0
'_' 95 0.8% 0
'a' 97 0.5% 0
'b' 98 1.2% 0
'd' 100 0.9% 0
'g' 103 5.4% 0
'i' 105 4.7% 0
'm' 109 16.5% 0
'o' 111 27.7% 0
'q' 113 9.8% 0

The only explanation I can think of for these files is that these were PE reads that were merged prior to uploading onto SRA. The odd high phred scores you see there in the middle are typical when reads have been merged and re-assigned quality scores. In short I think you're right that there has been some preprocessing done with these reads prior to uploading them, which is unfortunate, but you still have options. You can either a) contact the authors and try to obtain the original unprocessed fastq files or b) just use these reads as is but you won't be able to use DADA2 for denoising. You can use Deblur or do OTU clustering with Vsearch.

qamrq25 · June 8, 2020, 2:47am

Thank you very much for taking a detailed look at the sequence. I have contacted the authors but still no reply. I will try processing the sequence with Deblur and let you guys know what happens.

system · July 9, 2020, 8:47am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.