Hi, I received an old multiplexed data set. I imported, demultiplexed, and denoised the reads, but classifying with the sklearn classifier didn't work: the job was killed. As I read in several posts that mixed-orientation reads are a problem for sklearn, I looked back at my fastq files and found that both the 2.1 file (which I renamed to forward when importing) and the 2.2 fastq file contain both the forward and the reverse primer (marked in yellow).
As I understood it, 2:N:0 indicates reverse reads, and this is in the header of all sequences in the 2.2 file. But as you can see, the file contains both primers. When I tried to use Cutadapt with the --p-mixed-orientation flag, it responded that this could be used with dual-indexed reads. I was not aware that I had those?
So my first question is: are these mixed-orientation reads?
And my second one is: what is the right way to process these?
I'm not sure I have a perfect solution to this problem...
Are these reads in a random orientation, or is this a pair of R1 and R2 files that have been 'interleaved' so the one fastq file includes both the forward and reverse Illumina reads?
If so, you can use the reformat.sh tool from the BBTools package to deinterleave a fastq file, and perhaps your fasta file too.
reformat.sh in=reads.fq out1=read1.fq out2=read2.fq
This should fix interleaved reads, but will not help with mixed orientation reads.
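For readers without BBTools at hand, the deinterleaving idea can be sketched in plain Python. This is only a minimal sketch, assuming an uncompressed FASTQ with exactly four lines per record and records that strictly alternate R1, R2, R1, R2; the function names and file names are hypothetical, and it is not a replacement for reformat.sh.

```python
from itertools import islice


def deinterleave(lines):
    """Split interleaved FASTQ lines into (r1_lines, r2_lines).

    Assumption: 4 lines per record, records alternating R1, R2, R1, R2, ...
    """
    it = iter(lines)
    r1, r2 = [], []
    index = 0
    while True:
        record = list(islice(it, 4))  # next 4-line FASTQ record
        if not record:
            break
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        # Even-numbered records go to R1, odd-numbered records to R2.
        (r1 if index % 2 == 0 else r2).extend(record)
        index += 1
    if index % 2 != 0:
        raise ValueError("odd number of records; not cleanly interleaved")
    return r1, r2


def deinterleave_file(in_path, out1_path, out2_path):
    """File-based wrapper; the paths are whatever the caller chooses."""
    with open(in_path) as handle:
        r1, r2 = deinterleave(handle)
    with open(out1_path, "w") as out1, open(out2_path, "w") as out2:
        out1.writelines(r1)
        out2.writelines(r2)
```

Calling deinterleave_file("reads.fq", "read1.fq", "read2.fq") would mirror the reformat.sh command above. Like that command, it only helps with interleaved files, not with mixed-orientation reads.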
The big question here is 'how did my reads get like this,' which we have to answer before we can move on to 'how do I convert them back.'
Hi Colin, thx for your reply. I'm not sure I can answer the question of whether these are interleaved or mixed-orientation reads. This is what I know:
This is a dataset which was sequenced in 2017 on the Illumina HiSeq PE300 platform with V4 primers. As usual, all the people doing the analysis at the time have moved on in science and are not available anymore. I got the files (4 libraries, each with 2 files named 2.1 and 2.2) and a metafile containing 1 barcode per sample (the first 8 nucleotides of each read). These are fastq files, but I deleted the lines with quality information in the post to give a better overview.
I really hope that this is enough information to work out which type of problem it is.
Good, fastq files are the ideal input for our read repair tools. But I'm not sure that's needed.
As far as I can tell, these are normal paired end Illumina sequencing reads.
Files ending in _1.fastq.bz2 are your forward reads (confirmed by 1:N: in the read header), and files ending in _2.fastq.bz2 are your reverse reads (confirmed by 2:N: in the header).
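The header check described above can be automated for a whole file. Here is a minimal sketch, assuming four lines per record and Illumina-style headers in which the second whitespace-separated field starts with the read number (as in 1:N:0 or 2:N:0); the function names are hypothetical.

```python
import bz2
from collections import Counter


def read_number_counts(lines):
    """Count the read-number field ('1' or '2') across FASTQ headers.

    Assumption: 4 lines per record and headers shaped like
    '@<instrument fields> 1:N:0:<index>'.
    """
    counts = Counter()
    for i, line in enumerate(lines):
        if i % 4 != 0:
            continue  # only the first line of each record is a header
        parts = line.split()
        if len(parts) < 2:
            counts["unknown"] += 1  # header without a read-number field
            continue
        counts[parts[1].split(":")[0]] += 1
    return counts


def read_number_counts_bz2(path):
    """Convenience wrapper for a bzip2-compressed FASTQ file."""
    with bz2.open(path, "rt") as handle:
        return read_number_counts(handle)
```

If a file ending in _1.fastq.bz2 reports only '1' and its partner only '2', the headers are consistent with ordinary paired-end files. A mix of '1' and '2' within one file would point toward interleaving instead.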
When processing these reads, did you use DADA2, vsearch, or deblur? Were you able to get them to pair?
Maybe I'm getting ahead of myself. How did you import these demultiplexed reads into Qiime2?
I'm thinking the reads themselves may be fine, but there is some other issue in the pipeline to resolve, and I'm looking for clues!
I used the "standard" pipeline with DADA2 (version 2021.8), renaming the sequences to forward and reverse in the directory muxed-pe-barcode-in-seq-L26:
qiime tools import --type MultiplexedPairedEndBarcodeInSequence --input-path muxed-pe-barcode-in-seq-L26 --output-path multiplexed-seqsL26.qza
Then demultiplexing: demultiplexed-seqsL26.qzv (317.4 KB)
Yes, it's strange that both your 1:N:0 forward and 2:N:0 reverse reads include both the forward and reverse primers in them. This is what made me think these were 'interleaved' by some upstream program.
Given that you can't get the original BCL files and demultiplex again, my only idea would be to run that reformat.sh tool and see what it can produce.
I think I may be stumped! Have any other @moderators seen this before?
Hi Mike, I was going to try your solution, but it is still not so clear to me. (I'm not that experienced with QIIME.)
In the protocol described in the solution, it seems demultiplexed reads are used.
But after importing, my reads are still multiplexed. The demultiplexing is already the step where things go wrong, isn't it? Otherwise, wouldn't demux keep only the reads with barcodes belonging to the forward files in the forward.fastq.gz file, and vice versa for the reverse file?
Is there a way to filter or separate forward and reverse reads based on the primers before the demultiplexing step? That would have to be done without trimming, otherwise you would lose the barcodes. But then again, how do you know which forward and reverse reads belong together, since you are then fiddling with the order of the reads?
A lot of questions for one post maybe, but I hope it makes clear what I can't get my head around. I hope it makes sense. Wieneke
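On the question of which mates belong together: one way to picture it is to walk both files in lockstep, so that record i of the first file and record i of the second file always travel as a pair, and to swap the two mates whenever the primer that follows the barcode shows the pair is flipped. Below is a minimal sketch of that idea, not a tested pipeline. It assumes the 8-nucleotide barcode sits at the very start of each read (as described earlier in the thread), that the primer follows the barcode directly with no spacer, and that primers match exactly. The primer sequences used in the usage note are the commonly cited 515F/806R V4 primers and are an assumption, since the thread does not state which sequences were used; all function names are hypothetical.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a degenerate primer into a regex (exact match, no mismatches)."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def classify(seq, fwd_re, rev_re, barcode_len=8):
    """Return 'fwd', 'rev' or 'unknown' from the primer right after the barcode."""
    body = seq[barcode_len:]  # barcode is skipped, not trimmed from the record
    if fwd_re.match(body):
        return "fwd"
    if rev_re.match(body):
        return "rev"
    return "unknown"


def orient_pairs(records1, records2, fwd_primer, rev_primer, barcode_len=8):
    """Yield (forward_record, reverse_record) tuples with mates kept together.

    records1 / records2: iterables of (header, seq, qual) tuples, in the
    same order in both files. Pairs without one primer on each mate are
    skipped.
    """
    fwd_re, rev_re = primer_regex(fwd_primer), primer_regex(rev_primer)
    for rec1, rec2 in zip(records1, records2):
        kinds = (
            classify(rec1[1], fwd_re, rev_re, barcode_len),
            classify(rec2[1], fwd_re, rev_re, barcode_len),
        )
        if kinds == ("fwd", "rev"):
            yield rec1, rec2  # already in the expected orientation
        elif kinds == ("rev", "fwd"):
            yield rec2, rec1  # flipped pair: swap the mates
```

For example, orient_pairs(records1, records2, "GTGCCAGCMGCCGCGGTAA", "GGACTACHVGGGTWTCTAAT") would use the assumed 515F/806R sequences. Because the records are only reordered within a pair and never trimmed, the barcodes stay on the reads and the output can still be demultiplexed afterwards.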
Running the cutadapt command only removes the primers from the sequences; it does not demultiplex them, though cutadapt does have some demultiplexing options. If those are not appropriate for your data, you can try this approach: