(Amit) #1

I have mixed-orientation reads from Illumina and would like to ask whether QIIME 2 can handle this kind of data. The main problem is demultiplexing and pre-processing the MiSeq reads, both of which depend on the reads having a defined orientation.

(Colin Brislawn) #2

Unfortunately, QIIME 2 does not have a built-in method for dealing with these. :disappointed:

You can deal with these outside of QIIME and then import the result, but that might be outside the scope of your question. There is some more advice over here:

May I ask how you came into possession of mixed-orientation reads? Most Illumina reads are in one orientation.



There is a very straightforward solution to this problem. I came across it after many trials and much reading and asking around.

Basically it involves a single (quick) step in qiime1 using the raw illumina data as input. After that step you can import into qiime2 as paired end sequences, and do your processing there as you would with ‘normal’ data.

If you are willing to work with qiime1 I am happy to share my code to help you easily overcome this.

(Nicholas Bokulich) #4

Please do share — even if @amit does not want to install qiime1 to accomplish this, other community members may find it useful!

It might also be possible to translate your qiime1 code into a :qiime2: workflow.

Thanks @shira!

(Amit) #5

I would be glad if the steps were explained along with the script. This will solve all my problems. Thank you so much for your support.



I am very happy to share the simple solution here, and hope it can help others! I struggled quite a bit with this before arriving at this solution, and would love to spare others the pain. I must give credit to @William, who gave me the code and patiently answered my many, many questions. Obviously it would be fabulous to have this integrated into the qiime2 pipeline :grinning:. I know there are quite a few people struggling with this (@Martin, @emescioglu, @amit, to name a few), and at least one big company producing this kind of mixed-orientation data. Here is the simple code:

Use qiime1 to sort the sequences into R1 and R2 based on primer sequences, and extract barcodes

extract_barcodes.py -f SAM1-31_S2_L001_R1_001.fastq -r SAM1-31_S2_L001_R2_001.fastq -o ext_barcodes_data --bc1_len 8 --bc2_len 0 --input_type barcode_paired_end -m 120117SNwhoi341F-mapping.csv --attempt_read_reorientation --verbose

#This works like a charm to divide the reads into forward and reverse and extract the barcodes, resulting in true R1 and R2 files, as well as a barcode file for demultiplexing later. The primers are still in the sequence. You will need your R1 and R2 fastq files as well as your mapping file. Adjust the barcode length, in my case it was 8nt and only found on the forward reads.
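A quick sanity check after this step is to confirm that the three output files still hold the same number of records, since the barcodes are matched to reads by position. The sketch below assumes plain 4-line FASTQ records; the helper names `fastq_count` and `fastq_in_sync` are made up for illustration and are not part of qiime1.

```shell
# Count the records in a FASTQ file, assuming 4 lines per record.
fastq_count() {
    awk 'END { print NR / 4 }' "$1"
}

# Print the record count of every file given and
# return non-zero if the counts are not all identical.
fastq_in_sync() {
    for f in "$@"; do
        printf '%s\t%s\n' "$f" "$(fastq_count "$f")"
    done | awk -F '\t' '
        { print }
        NR == 1 { n = $2 }
        $2 != n { bad = 1 }
        END { exit bad }
    '
}

# Usage, with the file names used in this thread:
# fastq_in_sync ext_barcodes_data/reads1.fastq \
#               ext_barcodes_data/reads2.fastq \
#               ext_barcodes_data/barcodes.fastq
```

A count that is not a whole number points to a truncated or malformed file.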

Rename, move and Zip files

mkdir data
mv ext_barcodes_data/reads1.fastq data/forward.fastq
mv ext_barcodes_data/reads2.fastq data/reverse.fastq
mv ext_barcodes_data/barcodes.fastq data/
gzip data/*.fastq
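Before importing, it may be worth confirming that the directory contains the three gzipped files named forward.fastq.gz, reverse.fastq.gz and barcodes.fastq.gz, and that each one is a valid gzip archive. This is only a sketch; `check_emp_paired_dir` is a hypothetical helper, not a QIIME command.

```shell
# Check that a directory holds the three gzipped FASTQ files prepared above
# and that each of them passes gzip's integrity test.
check_emp_paired_dir() {
    local dir name status
    dir=$1
    status=0
    for name in forward.fastq.gz reverse.fastq.gz barcodes.fastq.gz; do
        if [ ! -f "$dir/$name" ]; then
            echo "missing: $dir/$name" >&2
            status=1
        elif ! gzip -t "$dir/$name" 2>/dev/null; then
            echo "not a valid gzip file: $dir/$name" >&2
            status=1
        fi
    done
    return $status
}

# Usage, run from the directory that contains data/:
# check_emp_paired_dir data && echo "ready to import"
```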

Import into QIIME2 and start processing

qiime tools import \
--type EMPPairedEndSequences \
--input-path data/ \
--output-path emp-paired-end-sequences.qza

qiime demux emp-paired \
--m-barcodes-file sample-metadata.tsv \
--m-barcodes-column BarcodeSequence \
--i-seqs emp-paired-end-sequences.qza \
--o-per-sample-sequences demux.qza

qiime demux summarize \
--i-data demux.qza \
--o-visualization demux.qzv

mkdir 277-223

qiime dada2 denoise-paired \
--i-demultiplexed-seqs demux.qza \
--p-trim-left-f 17 \
--p-trim-left-r 21 \
--p-trunc-len-f 277 \
--p-trunc-len-r 223 \
--o-table 277-223/table.qza \
--o-representative-sequences 277-223/rep-seqs.qza \
--o-denoising-stats 277-223/denoising-stats.qza \
--p-n-threads 0

#Note about denoising - as you probably know, you will need to try several options to determine how best to trim the reads. I do this by first randomly subsampling the data using seqkit, and it seems crucial to allow enough overlap between the fwd and rev reads. In most cases I see the best results when I leave enough length to have about 50 bp overlap (assuming 450bp contigs), while trimming more from the rev reads than from the fwd reads.
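The overlap mentioned in this note can be estimated with simple arithmetic: forward truncation length plus reverse truncation length minus amplicon length. The sketch below assumes the 450 bp figure is counted from the start of the reads (primers included), which is how the truncation positions are counted; `read_overlap` is a hypothetical helper. The commented seqkit lines show one possible way to subsample, with the 10% proportion, the seed and the `sub/` directory chosen purely for illustration.

```shell
# Expected overlap between the truncated forward and reverse reads.
# Arguments: forward truncation length, reverse truncation length, amplicon length.
read_overlap() {
    echo $(( $1 + $2 - $3 ))
}

read_overlap 277 223 450   # prints 50

# Possible subsampling step before testing trim settings
# (same seed for both files so the read pairs stay matched):
# seqkit sample -p 0.1 -s 11 data/forward.fastq.gz -o sub/forward.fastq.gz
# seqkit sample -p 0.1 -s 11 data/reverse.fastq.gz -o sub/reverse.fastq.gz
```

A negative result means the truncated reads no longer meet, so they cannot be merged.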

I hope this helps!:rainbow:

(Amit) #7

Hi @shira, is this to be done for Illumina MiSeq data as well? Kindly help.



Yes, that is the platform my data came from.

(Amit) #9

Hi @shira, I tried using the extract_barcodes.py command in qiime1. The problem I noticed is that the primers are also removed.
I gave this command: extract_barcodes.py -f Pool4_S4_L001_R1_001.fastq -r Pool4_S4_L001_R2_001.fastq -o ext_barcodes_data --bc1_len 7 --bc2_len 7 --input_type barcode_paired_end -m mapping_file_barcodes_pool4_new.csv --attempt_read_reorientation --verbose
It removes the primers. Can you kindly help? I have attached the mapping file information. mapping_file_barcodes_pool4_new.csv (839 Bytes)


How did you obtain your mapping file? Do you have some info on how the libraries were built? Your code is removing 7 nt from both ends of your sequences; are you sure you have barcodes there? In general, your mapping file looks odd. Was the same contig amplified in each sample? If so, your primer column should list the same sequence in all rows, in the 5’–3’ direction.
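One way to check the primer column is to list its distinct values; if the same contig was amplified in every sample, only one sequence should come back. This is a rough sketch that assumes a delimited mapping file with a header row and a column named `LinkerPrimerSequence` (the usual qiime1 name; adjust it if your file differs). `distinct_column_values` is a hypothetical helper.

```shell
# List the distinct values found in one named column of a delimited file.
# Arguments: file, column name, field separator (defaults to a tab).
distinct_column_values() {
    local sep
    sep=${3:-$(printf '\t')}
    awk -F "$sep" -v col="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
        /^#/    { next }             # skip comment lines after the header
        c && $c != "" { print $c }   # print the value of the chosen column
    ' "$1" | sort -u
}

# Usage for a tab-separated mapping file:
# distinct_column_values mapping.txt LinkerPrimerSequence
# Usage for a comma-separated file:
# distinct_column_values mapping.csv LinkerPrimerSequence ,
```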

(Amit) #11

Hi @shira… The fastq R1 and R2 files were given to me by my collaborators. The mapping files I made to analyze the data are derived from these sheets (see attachments). The file sample_library has all the details of the barcodes and the data pools. The file Primers is where the primers were selected from. I have also attached a mapping file for Pool3 which I had used in Qiime1 with the extract_barcodes.py command. mapping_file_barcodes_pool3.csv (743 Bytes)
Primers.csv (814 Bytes)
This amplicon library was from MiSeq. sample_library.csv (2.1 KB)
I KNOW THIS IS TOO MUCH FOR YOU TO LOOK INTO :unamused:, but kindly help me as I have no clue why the command is removing the primers. Am I making the mapping files correctly? Kindly respond.

warm regards,


Talk to your collaborators to understand how they prepared the sequencing libraries.

There is no mystery in the code. Based on the primer sequences in your mapping file it will sort your reads into forward and reverse, and based on your numeric input it will remove the first x nucleotides from the forward and reverse ends and store them in a separate barcode file, with reference to the read they are associated with.
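To make the barcode-removal part concrete, here is a rough awk illustration of splitting the first x nucleotides (and their quality scores) off every read while keeping the read header in both outputs. It is not the qiime1 implementation and does no read reorientation; `split_leading_barcode` is a hypothetical name.

```shell
# Split the first N bases and qualities off every read in a FASTQ file.
# Arguments: input FASTQ, barcode length N, barcode output, trimmed-read output.
split_leading_barcode() {
    awk -v n="$2" -v bc="$3" -v out="$4" '
        # Header and "+" lines are copied unchanged to both outputs.
        NR % 4 == 1 || NR % 4 == 3 { print > bc; print > out; next }
        # Sequence and quality lines are split at position N.
        { print substr($0, 1, n) > bc; print substr($0, n + 1) > out }
    ' "$1"
}

# Usage: take 7 nt off each read of a hypothetical input file.
# split_leading_barcode reads.fastq 7 barcodes_only.fastq reads_trimmed.fastq
```

If the number passed is larger than the real barcode length, the extra bases come out of whatever follows the barcode, which would look like the primer being removed.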

It is always useful to take a look at the raw data and see if you understand the structure of the reads. Do they start with the barcode? Does the primer follow? Can you distinguish between fwd and rev reads?
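These questions can be checked from the command line. The sketch below counts how many reads carry a given primer directly after a barcode of known length; it does exact matching only, so degenerate bases in a primer are not handled. `count_primer_after_barcode` is a hypothetical helper, and `FWD_PRIMER` is a placeholder for your own primer sequence.

```shell
# Count reads in which the primer starts right after a barcode of the given length.
# Arguments: FASTQ file, primer sequence (exact match), barcode length.
count_primer_after_barcode() {
    awk -v p="$2" -v n="$3" '
        NR % 4 == 2 {                                      # sequence lines only
            total++
            if (substr($0, n + 1, length(p)) == p) hits++
        }
        END { printf "%d of %d reads\n", hits, total }
    ' "$1"
}

# Usage: look at the first two records, then compare R1 and R2.
# head -n 8 Pool4_S4_L001_R1_001.fastq
# count_primer_after_barcode Pool4_S4_L001_R1_001.fastq "$FWD_PRIMER" 7
# count_primer_after_barcode Pool4_S4_L001_R2_001.fastq "$FWD_PRIMER" 7
```

If the forward primer shows up in a sizeable share of both R1 and R2, the reads are in mixed orientation; if it shows up in neither, the assumed barcode length is probably wrong.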

(Diana Taft) #13


I’ve run into this issue as well, and solved it using a tool called sabre. If you want, I can share the code I used with sabre too.

(Amit) #14

Hi @willowblade… that would be kind of you, because I am lost :tired_face:
