Hello everyone, I am new to this forum and after spending two days surveying this forum to solve my problem, I am finally asking for help.
I should mention that I have already gone through the “Moving Pictures” and “Atacama soil microbiome” tutorials and completed them successfully.
I have downloaded the original data as a fastq file, and I am trying to follow the manifest protocol to import it as an artifact, but I am unable to do so.
I just have two questions:
1) Is it correct to follow the “Fastq manifest” formats guide and use PairedEndFastqManifestPhred33V2 to import my data?
2) Is there any way to inspect the attributes (for example, barcodes) of the current run file?
Hi @qamrq25,
I haven’t looked through the data you linked, but from the looks of it, it is one fastq file. Is this correct? If so, the reads are not demultiplexed (unless the file holds a single sample), so the manifest import will not be applicable. Instead, go back to the Moving Pictures tutorial and follow that example by importing single-end reads. You’ll need to demultiplex these yourself, so you will need the barcode information. Hopefully that is readily available too.
Hey, thanks for the reply. It is one fastq file from the run, i.e. HOL_UK2_S389_Prokaryotes_SR1, but when I searched the study I also found another run file, namely HOL_UK2_S389_Prokaryotes_SR2.
So I am guessing there are two fastq files per run.
Regarding the barcode information, since I am using another study’s data I don’t have it. Any workaround would be appreciated.
Hi @qamrq25,
HOL_UK2_S389_Prokaryotes_SR1 is most likely your forward reads, while HOL_UK2_S389_Prokaryotes_SR2 is the reverse reads. Unfortunately, without the barcode information there is no way you can demultiplex these. Are you sure it isn’t provided in the metadata file from the study? Also, sometimes the barcodes are placed in the header line of the sequences. There isn’t a way to demultiplex these in QIIME 2 at the moment, but if the barcodes are indeed in the headers, at least you will be able to find a tool elsewhere that can accomplish this.
Also, while going through the fastq file headers, I saw forward and reverse primers, but the primers are not included in the sequence line. Does this mean they have already been trimmed, or do these files come like this for microbiome samples? My previous animal sequencing fastq files had the primers in the sequence line as well.
So now the problem is to deal with barcodes. I am trying to see if I can locate barcodes in the Supplementary material but I am sure they aren’t provided.
Hi @qamrq25,
I’m honestly not sure what has happened to these reads. Your best bet is to track down the publication/group that handled this data and ask them directly for more info. Every sequencing facility does things a bit differently, so there is no single format you can “expect” these files to be in. For example, sometimes they may give you data with primers included and sometimes with primers removed. As you noted, for this data the primers have been removed, and it appears that they are using the headers as storage for a whole bunch of QC information.
If you can’t contact the original creators of this data, you could try extracting the information from these headers. In particular, forward_tag=tgctccaa and reverse_tag=tgctccaa might be the dual-indexed barcodes, though I’m just guessing here.
You’ll have to find a custom way of extracting that information from the headers to recreate a mapping file with barcode info. I don’t have any more recommendations on this task but you may want to ask/search around, someone out there surely has a script for handling these.
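As a starting point for that kind of custom extraction, here is a minimal Python sketch. It assumes the headers store annotations as key=value pairs separated by semicolons (which is how forward_tag=tgctccaa and reverse_tag=tgctccaa appear to be written) and that every record spans exactly four lines. The function names and the example read are hypothetical, so the pattern may need adjusting to the real header layout.

```python
import re
from collections import Counter

# Assumption: annotations look like "key=value;" somewhere after the read ID.
TAG_PATTERN = re.compile(r"(\w+)=([^;]+)")


def parse_header_tags(header):
    """Return the key=value annotations found in one FASTQ header line."""
    return {key: value.strip() for key, value in TAG_PATTERN.findall(header)}


def count_tag_pairs(fastq_lines):
    """Count (forward_tag, reverse_tag) pairs over 4-line FASTQ records."""
    counts = Counter()
    for index, line in enumerate(fastq_lines):
        if index % 4 != 0:  # only the first line of each record is a header
            continue
        tags = parse_header_tags(line)
        pair = (tags.get("forward_tag"), tags.get("reverse_tag"))
        if None not in pair:
            counts[pair] += 1
    return counts


# Hypothetical record, made up for illustration only.
example = [
    "@read_1 forward_tag=tgctccaa; reverse_tag=tgctccaa; count=1;",
    "ACGT",
    "+",
    "IIII",
]
print(count_tag_pairs(example))
```

Running count_tag_pairs over an open file handle would give a table of how many reads carry each tag pair, which could then be matched against sample names if the authors supply them.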
Thank you for the input. Yes! I have asked the author of the study to share his run file information with me. Hopefully I will get a positive response from him. In the meantime, I am experimenting with Trimmomatic to see if I can extract the actual sequences for analysis.
Hi @Mehrbod_Estaki
So I have read some literature and some other queries posted by users on forums. I was able to import the fastq files and got the following demux.qzv. I think this is not the correct sequence file. What do you think?
I used the sratoolkit to dump fastq files from the SRA run for both the forward and reverse reads (I have not used the raw fastq files because they give an error during import).
Next, I used the paired-end manifest tutorial to import the above run files.
I converted the demux.qza to demux.qzv and got the results which I have shared above.
Further I am planning to denoise the data but wanted some insights from you guys.
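For reference, the manifest used in the import step above can be generated with a short script. This is only a sketch: it assumes the tab-separated PairedEndFastqManifestPhred33V2 layout with the columns sample-id, forward-absolute-filepath and reverse-absolute-filepath, and the sample ID and file names below are hypothetical placeholders rather than the actual run files.

```python
import csv
import io
from pathlib import Path

MANIFEST_COLUMNS = [
    "sample-id",
    "forward-absolute-filepath",
    "reverse-absolute-filepath",
]


def write_manifest(samples, handle):
    """Write a paired-end manifest; samples maps sample ID -> (forward, reverse)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(MANIFEST_COLUMNS)
    for sample_id, (forward, reverse) in sorted(samples.items()):
        # The manifest expects absolute paths, so relative ones are resolved here.
        writer.writerow([sample_id, Path(forward).resolve(), Path(reverse).resolve()])


# Hypothetical sample ID and file names, for illustration only.
samples = {"sample-1": ("run_1.fastq", "run_2.fastq")}
buffer = io.StringIO()
write_manifest(samples, buffer)
print(buffer.getvalue())
```

Passing an open file instead of the StringIO buffer writes the manifest to disk. Note that a manifest only makes sense when each sample has its own pair of files.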
Can you expand on this a bit more please? What did you use exactly prior to importing these files? The exact commands and perhaps a few lines of the fastq file?
Assuming this is the same dataset that you mentioned from the beginning of this thread, these were multiplexed reads, as in you only had 1 forward and 1 reverse fastq file for all of your samples. With this format you shouldn't be using the manifest format as I mentioned earlier:
When you download the fastq files, how many files per sample are there?
Also, these quality scores look pretty odd to me. The link you provided mentions these are Illumina MiSeq reads, but they are not in the regular Phred33 or Phred64 quality score formats: the values in the NCBI explorer are showing quality scores between 40 and 80. I'm surprised QIIME 2 didn't raise some red flags when you were importing these. I'll have to ask around a bit on this, stay tuned.
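One quick way to sanity-check the encoding yourself is to look at the range of ASCII codes used in the quality lines. The sketch below assumes plain 4-line FASTQ records; the helper names and the cut-off of 59 (the lowest character used by the older +64-style encodings) are my own simplification, not something QIIME 2 does internally.

```python
def quality_ascii_range(fastq_lines):
    """Return (min, max) ASCII codes over the quality lines of 4-line records."""
    codes = []
    for index, line in enumerate(fastq_lines):
        if index % 4 == 3:  # the fourth line of each record holds the qualities
            codes.extend(ord(char) for char in line.rstrip("\r\n"))
    if not codes:
        raise ValueError("no quality lines found")
    return min(codes), max(codes)


def describe_encoding(low, high):
    """Give a rough, heuristic guess at the quality offset."""
    if low < 33 or high > 126:
        return "not valid FASTQ quality characters"
    if low < 59:
        return "Phred+33"
    return "possibly Phred+64 (or Phred+33 with only high scores)"


# Hypothetical record, made up for illustration only.
example = ["@read_1", "ACGT", "+", "II5I"]
low, high = quality_ascii_range(example)
print(low, high, describe_encoding(low, high))
print("scores if Phred+33:", low - 33, "to", high - 33)
```

If the decoded scores sit well above the usual range for raw Illumina reads, that is a hint the qualities were recomputed by some processing step before upload.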
I noticed there were too many header lines in the raw fastq files. So I planned to use SRA files since the header is short in them as compared to the header I posted before:
This downloaded the fastq files, which I then used with the fastq manifest format. The FastQC image follows. I noticed the author has already done some tweaking before uploading the data files to NCBI:
Hi @qamrq25,
So I took a look at the links you provided and the whole thing is very confusing.
The description says these are Shotgun data using a MiSeq,
The only explanation I can think of for these files is that these were PE reads that were merged prior to uploading onto SRA. The oddly high Phred scores you see there in the middle are typical when reads have been merged and re-assigned quality scores. In short, I think you're right that some preprocessing was done on these reads prior to uploading them, which is unfortunate, but you still have options. You can either a) contact the authors and try to obtain the original unprocessed fastq files, or b) just use these reads as is, in which case you won't be able to use DADA2 for denoising. You can use Deblur or do OTU clustering with VSEARCH instead.
Thank you very much for taking a detailed look at the sequence. I have contacted the authors but still no reply. I will try processing the sequence with Deblur and let you guys know what happens.