Generally speaking, these red boxes don’t mean there is an error, just that the number of sequences observed at that particular bp position is less than the maximum observed number of sequences. Keep that in mind once you get into bp positions greater than the length of your shortest read.
With that said, your quality plot does look a little strange. Can you provide some details on the sequencing tech, any preprocessing that might have happened, or anything else noteworthy? It doesn’t really look like typical Illumina reads to me. The shortest reads observed were ~35 bp long, which is quite a bit shorter than the rest of the observed reads, and that is certainly noteworthy. One thing to note is that this viz performs a random subsampling of 10,000 sequences (controlled by a parameter, if you want to override it). I also noticed that by the last bp position only about 5000 subsampled reads remain, which is a pretty drastic drop-off.
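To make the "red box" idea concrete, here is a minimal Python sketch of the bookkeeping behind it. It is not the visualizer's actual code; the function names are made up for illustration. Given the lengths of the (subsampled) reads, it counts how many reads reach each bp position and lists the positions covered by fewer reads than the first position.

```python
import random
from collections import Counter


def subsample_lengths(read_lengths, n=10000, seed=0):
    """Randomly pick at most n read lengths (hypothetical stand-in for the viz subsampling)."""
    if n >= len(read_lengths):
        return list(read_lengths)
    return random.Random(seed).sample(read_lengths, n)


def reads_covering_each_position(read_lengths):
    """Entry i is the number of reads that are at least i + 1 bases long."""
    if not read_lengths:
        return []
    ends_at = Counter(read_lengths)      # how many reads end exactly at each length
    remaining = len(read_lengths)
    coverage = []
    for pos in range(1, max(read_lengths) + 1):
        coverage.append(remaining)
        remaining -= ends_at.get(pos, 0)  # these reads do not reach pos + 1
    return coverage


def positions_below_max(coverage):
    """1-based positions covered by fewer reads than the first position."""
    if not coverage:
        return []
    return [i + 1 for i, c in enumerate(coverage) if c < coverage[0]]
```

For example, with read lengths `[3, 5, 5]` the coverage is `[3, 3, 3, 2, 2]`, so positions 4 and 5 would be the ones flagged. In a plot like the one above, every position past ~35 bp falls in that category.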
Let us know what you learn about these data. Thanks!
Sure, the sequencing tech used was Illumina MiSeq, following the manufacturer's guidelines. The sequencing company provided the raw reads as fastq, which were separated by sample ID using software provided by the company. This software produced an output directory with Read 1 and Read 2 fastq files for each sampleID and a directory containing the index file (barcodes). These files were imported into qiime2 with the Casava1.8 format, and then I proceeded to denoise with dada2.
There wasn't a problem with the sequencing as far as I know.
Thanks for the info @Vixer! These reads sure do look strange to me — and if I was a bettin’ man, I would wager that DADA2 is not going to like these data at all (you could however run this through deblur, I don’t expect there to be any issues there). Did you run “this software” to demultiplex, or did the sequencing center (it isn’t clear to me who did what)? What was the software used?
Personally, I would check in with the sequencing center to learn a bit more about how these reads were prepared, and if there was any pre-processing or quality filtering that was applied - these are all good things to know when interpreting downstream results such as diversity metrics and taxonomic classification.
Keep us posted!
I did the demultiplexing. The company provides a FastQ processor which comes with the following description: " files generated from Illumina NGS platforms and creates a directory containing fastq files for each individual sample. The input files required by the user are the unzipped RAW Read 1 and Read 2 fastq files as well as the corresponding mapping file. The output directory will contain a new Read 1 and Read 2 fastq file for each sampleID as well an additional directory containing the index file. This tool was created in order to assist users who are interested in using mothur or qiime in order to process their 16s rRNA gene sequences This tool was created in order to assist users who are interested in using mothur in order to process their 16s rRNA gene sequences. The output directory can be referenced by the mothur make.file command, which will generate the necessary files to proceed with the mothur MiSeq workflow. The necessary index and oligos file are also provided should you choose to not to use the mothur make.file command."
Also, I have already run DADA2 on my samples; after processing I lost around 20,000-30,000 reads per sample. The sequences still have the primers, so that's why I used dada2 to remove them. Can you trim the 5’ end with deblur too?
And I'm going to write to them asking for more info about their process. Thanks for the suggestion!
EDIT: just checked the pipeline and there wasn't anything strange about the sequencing process; the files are fully raw and unbinned. So I guess my samples are a rare case.
Another edit: looking around the forum I noticed the post "Importing and Demultiplex process for 4 Fastq Files: R1, R2, Index1 and Index2", in which the data come with R1, R2, index1 and index2 files to import into qiime2. I think I may have imported the data wrong (I used Casava1.8 because the files looked like the ones you can import with that method), but checking my separated R1 and R2 files, there doesn't seem to be a barcode in the sequences.
Hi Vixer - did your files from the sequencing center come with separate index read files? If so I needed to use a special script with QIIME 1 to add the barcodes back to the read 1 and read 2 files… If this is the case I can provide you with the detailed steps I had to take in QIIME 1 (Barcode attachment, Barcode extraction, Joining, and then demultiplexing) before I could import the seq.fna file into QIIME 2.
Also, demultiplexed samples should not have the barcodes or adaptor sequences in the reads.
I also could not use Deblur or DADA2 with my sequence files and had to process and analyze my data with VSEARCH.
Well, as I mentioned before, I got the raw R1 and R2 data and the metadata. The sequencing company provided software that takes these 3 files and separates the data into R1 and R2 for each sample (these are without barcodes, but with the primers), named like this: sample1_S29_L002_R1_001 and sample1_S29_L002_R2_001, but it also produced two index files. So at the time I thought my sequences could be imported with the casava1.8 process, but the software used to separate the samples implies that if you don't want to use these separated files, you can just use the raw data with the index files and follow the method you used in your post.
I think I'm overthinking everything and just making everything more complicated hahaha.
Hi @Vixer, let's assume your demux process via Mr. DNA's software is okay.
Here are the quality score boxplots for your data:
Here are the quality score boxplots for "typical" Illumina Paired-end data:
The narrow band of high-quality scores in your reads seems pretty suspicious to me --- it looks like some kind of quality filtering has already been applied to these data, since they don't exhibit the "typical" error profile one would expect to see. Anyway, keep us posted!
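If you want to check the "narrow band" impression numerically rather than by eye, a rough sketch like the following could help. It assumes the quality lines use the Phred+33 encoding (standard for current Illumina fastq files) and that you have already pulled the quality strings out of the fastq file; the function names are hypothetical. A minimum score that never drops below some threshold at any position would be consistent with a quality filter having been applied upstream, though it does not prove it.

```python
from statistics import median


def phred33_scores(quality_line):
    """Decode one fastq quality line, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_line]


def per_position_summary(quality_lines):
    """Return a list of (min, median, max) quality scores, one tuple per position."""
    columns = {}
    for line in quality_lines:
        for pos, score in enumerate(phred33_scores(line)):
            columns.setdefault(pos, []).append(score)
    return [
        (min(scores), median(scores), max(scores))
        for _, scores in sorted(columns.items())
    ]


def lowest_score_seen(quality_lines):
    """Smallest quality score anywhere in the given reads (None if there are no scores)."""
    summary = per_position_summary(quality_lines)
    return min(low for low, _, _ in summary) if summary else None
```

Typical unfiltered paired-end data usually show low scores toward the 3' end of at least some reads, so a `lowest_score_seen` value that stays high across a large subsample is worth raising with the sequencing center.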
Yeah, I noticed that too, but after checking the pipeline and asking the sequencing company, the data weren't filtered or modified. Thanks for the help. Also, a few posts ago you commented this:
I processed my data using DADA2. Could using deblur give different results, or is there a reason for using deblur instead of DADA2?
Hi @Vixer! I recommend checking out the DADA2 and Deblur papers to learn about the differences between the denoising methods. Also searching the forum will yield several topics comparing DADA2 and Deblur.
I highly recommend obtaining the raw sequence data and quality scores from the sequencing center if possible, and trying out DADA2 and/or Deblur with those data. You could also try out those tools with the current sequence data you have on hand, and carefully compare the results to make sure they look sane. I can’t make any guarantees that either method will work well with these data though. Also, it may be hard to publish a paper without details about the initial quality filtering steps that appear to have been performed. Sorry to not have a more definitive solution for you!
Okay, thanks for the help!
Hello again, just to keep you updated: I used the raw data and imported it into qiime 2 following the q2-cutadapt tutorial for paired-end reads, and the new quality plot looked exactly like the one I posted days ago. The Q25 sequence data wasn't filtered, so I guess it's really just a rare case.
Anyways, thanks for the support!
Thanks for the followup, @Vixer. These data are almost certainly quality filtered; it just looks like that happened prior to demultiplexing. It is really important for you to have all the details on this step, since you will most likely want/need to report that quality filtering when publishing these data. As you found, though, this probably won’t impact your results when running DADA2; at worst the filtering is probably just a redundant step. Anyway, if you learn something more about how the quality filtering was applied at Mr. DNA HQ, we would love to know, as I am sure other users will have similar questions in the future. Thanks!
Yes, I also ran into this problem. My MiSeq data are from MR DNA (L002_R1_001.fastq and L002_R2_001.fastq). I tried the FastQ processor, and the sequence IDs are a complete mess after demultiplexing. They did not use long concatamer primers as part of the Illumina data but their own barcodes (8-mers) and primers. The barcode is in the forward primers, but the R1 and R2 files each contain a mix of forward and reverse reads, and in the raw files R1 and R2 are both in the 5’-3’ orientation.
I tried extract_barcodes.py and then wanted to demultiplex using the EMP protocol, but the barcodes are not right, perhaps because of the mix of forward and reverse reads in R1 and R2. I also tried q2-cutadapt demux-paired, but q2 reported that it cannot find the forward or reverse seq.
My question is: how do I demultiplex these files? Thanks a lot!
Yikes! I don’t have a really great solution for you - I would contact Mr DNA and see if there is a way for them to provide you with demultiplexed sequences. It sounds like their tool “FASTQ Processor” is intended for dealing with this - if you are having issues with that program I would suggest you reach out to their support channels. Keep us posted on this. Thanks!
Hi, may I ask if you heard back from Mr DNA? I have the same issue.
Hello, are you importing the R1 and R2 fastq directly into QIIME2?
In their pipeline they recommend using their FASTQ processor (you can get it here: http://www.mrdnafreesoftware.com/). It separates the R1 and R2 per sample. You must decompress the R1 and R2 raw data and select them in the software; the barcode file is your mapping file (mapping.txt, which contains the F and R primers).
After that, compress the demultiplexed files into .gz format, import them into QIIME2 with the Casava 1.8 paired-end demultiplexed fastq method, and you are ready to work with your reads.
After that you’ll get your demux artifact, and maybe your quality will look like mine. I already talked to them and there wasn't a filtering process. It's the raw data straight from the sequencing machine; even without using the FASTQ processor my quality plot looked the same, but I didn't have problems working with my reads.
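For the compress-and-check part of the steps above, something along these lines might save some manual work. It is only a sketch: the helper names are invented, and the filename pattern encodes my understanding of what the Casava 1.8 demultiplexed import expects (sample identifier, barcode sequence or identifier, lane number, read direction, set number, separated by underscores, ending in `.fastq.gz`).

```python
import gzip
import re
import shutil
from pathlib import Path

# Assumed Casava 1.8 style name, e.g. sample1_S29_L002_R1_001.fastq.gz
CASAVA_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[^_]+)_L(?P<lane>\d+)_R(?P<read>[12])_(?P<set>\d+)\.fastq\.gz$"
)


def parse_casava_name(filename):
    """Return the filename fields as a dict, or None if the name does not fit the pattern."""
    match = CASAVA_NAME.match(filename)
    return match.groupdict() if match else None


def gzip_fastq(src, dest_dir):
    """Write a gzip-compressed copy of src into dest_dir and return the new path."""
    src = Path(src)
    dest = Path(dest_dir) / (src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


def compress_and_check(fastq_dir, dest_dir):
    """Gzip every *.fastq file in fastq_dir; return names that do not look like Casava 1.8."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    bad_names = []
    for fastq in sorted(Path(fastq_dir).glob("*.fastq")):
        gz_path = gzip_fastq(fastq, dest_dir)
        if parse_casava_name(gz_path.name) is None:
            bad_names.append(gz_path.name)
    return bad_names
```

Any names returned by `compress_and_check` would need renaming before the import, since the importer relies on the filename to work out the sample ID and read direction.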
Hope this helps you.
Thank you @Vixer. Sorry if I am confused, but after processing my samples with the FASTQ Processor, I still have reads with the forward primer and other reads with the reverse primer in both the R1 and R2 files for each sample. So the FASTQ Processor that Mr DNA provides does not solve the issue of having both forward and reverse reads in the R1 and R2 files. Am I right?
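One way to put numbers on the mixed-orientation question is to count how many reads in a given file carry the forward primer versus the reverse primer near their 5' end. The sketch below is an assumption-laden illustration, not a demultiplexing solution: the function names are made up, the primers in the usage note are placeholders (swap in the F and R primers from your own mapping.txt), and the 40 bp search window is an arbitrary choice meant to leave room for an 8-mer barcode in front of the primer.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a (possibly degenerate) primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def classify_reads(sequences, fwd_primer, rev_primer, window=40):
    """Count reads whose first `window` bases contain the forward or the reverse primer."""
    fwd = primer_regex(fwd_primer)
    rev = primer_regex(rev_primer)
    counts = {"forward": 0, "reverse": 0, "neither": 0}
    for seq in sequences:
        head = seq.upper()[:window]
        if fwd.search(head):
            counts["forward"] += 1
        elif rev.search(head):
            counts["reverse"] += 1
        else:
            counts["neither"] += 1
    return counts
```

As a usage example, calling `classify_reads(r1_sequences, "GTGCCAGCMGCCGCGGTAA", "GGACTACHVGGGTWTCTAAT")` (515F and 806R, used here purely as placeholder primers) on the sequences from one sample's R1 file would return the three counts. If both "forward" and "reverse" are substantial for the same file, that would support the reading that the per-sample files still mix orientations.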