I am trying to learn how to use Qiime2. I have already tried the tutorials and now I am working on my own data.
I am dealing with two runs of pair end data. Unfortunately, I did not get a barcode.fastq file from the sequencing company. I could not find a way to transform my data to what is needed for Qiime2 according to the "Atacama tutorial", so I have done the following.
I started off with files provided by the company that had already been joined.
I then concatenated the fasta files from both runs, as well as the qual files from both runs.
Converted them to fastq
Extracted the barcodes with Qiime1 (extract_barcodes.py) without reverse complementing them (when I did reverse them, I only had 3 samples left).
Compressed fastq files to fastq.gz
Imported them into qiime2 as single end sequences
Demultiplex them as single-end reads
Summarize the data
When viewing the results, I obtain this very weird graph:
I would very much appreciate if anyone could help me understand this graph, and where may I have made a mistake.
Sounds like you are having similar trouble to this user. I think the funny quality plot that you are seeing is technically okay — it looks like maybe what is happening is the read joining method used by the company is assigning an arbitrarily high PHRED score to the overlapping bases. The non-overlapping bases are unadjusted, leaving you with a goofy looking plot. But I think that is okay! (though the lack of real quality scores may impact the ability of methods like dada2/deblur to detect noisy sequences within this overlapping region )
Check out this tutorial on working with joined paired-end sequences — note that the quality plots on that page look similar to yours.
I have a follow-up question. I have just realized that the company attempts to join reads if the amplicon is longer than 300bp, which is not my case (I used 515F-806R). Given the length of my amplicon, would joining the fwd and reverse reads be of bad quality?
If this is the case, would it be better if I just analyze the data as single reads without joining them? Since the company provides a mixture of forward and reverse reads in each of the R1 and R2 files, how could I proceed? Should I reorient the reverse reads in both files, concatenate the files and remove primers and adaptors, then import them to qiime2 as single end sequences?
If it’s possible to separately upload these reads into QIIME2, I would personally go that route and run through dada2 (which will join reads after QC).
It sounds like that might not be possible, though.
It is worth using your data as-is and see what happens — if the results seem rather noisy upon further inspection, it may be that the paired-end joining prior to dada2 is preventing detection of sequence errors. @benjjneb — do you have any thoughts on how read joining prior to dada2 would impact error detection?
I am not sure if this will prevent importation, but having reads in mixed orientations will cause issues for alignment and ASV/OTU detection downstream. The post that you linked to has the additional issue that reverse primers are contained in the read (which also implies that non-biological sequence is downstream of that!), which should not be an issue with your reads. So perhaps we can make something work.
Perhaps you could re-orient your reads, e.g., if the reverse primer is detected in a read then reverse complement it? (and vice versa for the reverse read). If primers are not still in reads, perhaps there is still a way (outside of QIIME2)… e.g., it sounds like you do have barcodes in-line in the reads:
Concatenate R1 + R2 --> “forward” reads
Concatenate R2 + R1 --> “reverse” reads
extract barcodes from “forward” reads.
Import into QIIME2 as paired reads.
demux and only read pairs in the correct forward orientation will be retained.
Would that work?
Alternatively, with a custom script you could determine whether the first X bases match an expected barcode; if yes, write out the forward/reverse reads as they are, and write out the barcode to a barcodes file; if no, determine whether the first X bases on the reverse read match an expected barcode; if yes, write out the reverse complements, and write out the barcode to the barcodes file.
I hope one of those solutions will solve your issue! Please let us know — this could help us decide how best to support sequence data following your setup in the future.
Indeed, it sounds to me, that you encounter the same problems I had (with the same company?).
If I understood right, the company generates first an amplicon using the barcode+forward primer and reverse primer (without technical sequencing adapters). Then, they clone the amplicon into a vector, which introduces the Illumina sequencing adapater (this reduces PCR bias). I am not sure about the details, but you end up with amplicons oriented in both directions. Therefore, the company joins the reads from the forward run and the reverse run and re-orientates everything in the same direction. For joining they allow a lot of mistakes.
Therefore, I ended up with asking the company for the raw sequencing data (they were downloadable from Illumina basespace) to join and re-orientate them by my own allowing as less joining mistakes as possible. I found this discussion very helpful. Finally I ended up with the following qiime1 script:
thanks for your message and the scripts! I will definitely try them as they seem exactly what I need.
Thank you also for bringing to my attention the qiime1 forum discussion, I also find it very useful.
I also happen to have the reverse primers in the sequence, so I will follow the steps that you and @thermokarst described in your thread. Thank you again! They are very informative.
So, would the pipeline look like this? Extract barcodes, join reads, import your sequences to Qiime2 with the multiplexed single-end protocol, demultiplex, then trim fwd and reverse primers, deblur?
And one last question (sorry, I am not very familiar with Ilumina data yet): why wouldn’t it make sense for my reads to be joined? (As a bit of background information, just in case it helps, my amplicon is 291bp long and I have used 2*251bp sequencing).