Mixed_oriented_reads

shira · October 2, 2018, 8:21am

I am very happy to share the simple solution here, and hope it can help others! I have struggled quite a bit with this until arriving at this solution, and would love to spare the pain from others. I must give credit to @William who gave me the code and patiently answered my many many questions. Obviously it would be fabulous to have this integrated into the qiime2 pipeline , I know there are quite a bit of people struggling with this (@Martin, @emescioglu, @amit - to name a few), and at least one big company producing this kind of mixed-orientation data. Here is the simple code:

Use qiime1 to sort the sequences into R1 and R2 based on primer sequences, and extract barcodes

extract_barcodes.py -f SAM1-31_S2_L001_R1_001.fastq -r SAM1-31_S2_L001_R2_001.fastq -o ext_barcodes_data --bc1_len 8 --bc2_len 0 --input_type barcode_paired_end -m 120117SNwhoi341F-mapping.csv --attempt_read_reorientation --verbose

#This works like a charm to divide the reads into forward and reverse and extract the barcodes, resulting in true R1 and R2 files, as well as a barcode file for demultiplexing later. The primers are still in the sequence. You will need your R1 and R2 fastq files as well as your mapping file. Adjust the barcode length, in my case it was 8nt and only found on the forward reads.

Rename, move and Zip files

mkdir data
mv ext_barcodes_data/reads1.fastq data/forward.fastq
mv ext_barcodes_data/reads2.fastq data/reverse.fastq
mv ext_barcodes_data/barcodes.fastq data/
cd data/
gzip .

Import into QIIME2 and start processing

qiime tools import \
–type EMPPairedEndSequences \
–input-path data/ \
–output-path emp-paired-end-sequences.qza

qiime demux emp-paired \
–m-barcodes-file sample-metadata.tsv \ 
–m-barcodes-column BarcodeSequence \
–i-seqs emp-paired-end-sequences.qza \
–o-per-sample-sequences demux

qiime demux summarize \
–i-data demux.qza \
–o-visualization demux.qzv

mkdir 277-223

qiime dada2 denoise-paired \
–i-demultiplexed-seqs demux.qza \
–p-trim-left-f 17 \
–p-trim-left-r 21 \
–p-trunc-len-f 277 \
–p-trunc-len-r 223 \
–o-table 277-223/table.qza \
–o-representative-sequences 277-223/rep-seqs.qza \
–o-denoising-stats 277-223/denoising-stats.qza \
–p-n-threads 0 \
–verbose

#Note about denoising - as you probably know you will need to try several options to determine how to best trim the reads. I do this by first randomly subsampling the data using seqkit, and it seems crucial to allow enough overlap between the fwd and rev reads. In most cases I see the best results when I leave enough length to have about 50 bp overlap (assuming 450bp contigs), while trimming more from the rev reads that the fwd reads.

I hope this helps!