Hi. I have metabarcoding data that has already been demultiplexed into separate left and right reads for each sample, in PHRED-33 encoded fastq files. However, each pair of files contains two PCR products (16S and ITS).
I am able to separate the two PCR products using Obitools, but I am regretting it. The process is as follows:
Merge paired-end reads using obitools illuminapairedend:
illuminapairedend -r R1.fastq R2.fastq > merged.fastq
Separate ITS and 16S using obitools ngsfilter:
ngsfilter \
-t ngsfilter_sample_description.txt \
--sanger \
--nuc \
--uppercase \
--fastq-output \
merged.fastq > sorted.fastq
In this case, ngsfilter_sample_description.txt contains the left and right primers (with ambiguous nucleotides) for both the ITS and 16S PCR products. ngsfilter simply adds a tag (ITS or 16S) to the header of each read in the output file. The 16S and ITS reads can then be separated with a simple grep command:
grep --no-group-separator -w -A3 'sample=ITS: experiment=myexperiment' sorted.fastq > ITS.fastq
The text ‘sample=ITS: experiment=myexperiment’ is unique enough that it doesn’t accidentally occur in the quality scores (I checked).
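For anyone who would rather not rely on the tag never appearing in a quality line, here is a minimal sketch of the same split done per record with awk. It assumes strict 4-line FASTQ records (no wrapped sequences), and the function name `split_by_tag` is just something I made up for illustration.

```shell
# Split a FASTQ file by a tag in the header line.
# Assumption: every record is exactly 4 lines (header, sequence, '+', quality).
split_by_tag() {
    # $1 = tag text to look for, $2 = input FASTQ; matching records go to stdout
    awk -v tag="$1" '
        NR % 4 == 1 { keep = (index($0, tag) > 0) }  # test header lines only
        keep                                          # print every line of a kept record
    ' "$2"
}
```

Usage would look like `split_by_tag 'sample=ITS: experiment=myexperiment' sorted.fastq > ITS.fastq`. Because only header lines are tested, a quality string that happens to contain the tag text cannot trigger a match.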
So far, so good. This outputs two sets of fastq files that can be imported into QIIME2 using
qiime tools import \
--type SampleData[SequencesWithQuality] \
--input-path fastq_manifest.csv \
--output-path Single_end_demux.qza \
--source-format SingleEndFastqManifestPhred33;
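For reference, the manifest for SingleEndFastqManifestPhred33 is a CSV with the header `sample-id,absolute-filepath,direction`. A minimal sketch for generating one is below; it assumes one FASTQ file per sample in a single directory, with the file name (minus `.fastq`) used as the sample ID. That layout and the function name `make_manifest` are my own assumptions, not part of the workflow above.

```shell
# Write a single-end manifest for every *.fastq file in a directory.
# Assumption: one file per sample, named <sample-id>.fastq (hypothetical layout).
make_manifest() {
    # $1 = directory holding the per-sample FASTQ files
    echo 'sample-id,absolute-filepath,direction'
    for f in "$1"/*.fastq; do
        [ -e "$f" ] || continue                      # skip if nothing matched the glob
        sample=$(basename "$f" .fastq)
        dir=$(cd "$(dirname "$f")" && pwd)           # the manifest needs absolute paths
        printf '%s,%s/%s,forward\n' "$sample" "$dir" "$(basename "$f")"
    done
}
```

Something like `make_manifest ITS_reads > fastq_manifest.csv` would then produce the file passed to `--input-path`, where `ITS_reads` is a placeholder directory name.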
Again, so far so good. The problem occurs when you look at the quality scores and realise that the illuminapairedend command appears to be simply adding the quality scores where there is an overlap, which is absolutely ridiculous and makes the whole workflow unusable. EDIT: with all due respect to the obitools developers, this actually appears to be working fine, in that the quality score is the product of the two quality scores where the nucleotide is the same.
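To spell out why "adding" and "product" can describe the same thing: a Phred score Q corresponds to an error probability p = 10^(-Q/10), so multiplying two error probabilities is the same as adding the two Phred scores. The sketch below just does that arithmetic with awk. It assumes the two reads' errors are treated as independent, and I have not checked that this is exactly what illuminapairedend does internally; `combined_phred` is a made-up name.

```shell
# Combine two Phred scores by multiplying their error probabilities.
# Assumption: errors in the two reads are independent.
combined_phred() {
    # $1, $2 = Phred scores of the same base in the left and right read
    awk -v q1="$1" -v q2="$2" 'BEGIN {
        p1 = 10 ^ (-q1 / 10)                      # error probability in read 1
        p2 = 10 ^ (-q2 / 10)                      # error probability in read 2
        p  = p1 * p2                              # both reads wrong at once
        printf "%.0f\n", -10 * log(p) / log(10)   # convert back to a Phred score
    }'
}

combined_phred 30 35   # prints 65
```

So two agreeing bases at Q30 and Q35 come out at Q65, which is higher than a single read would normally report and would explain the inflated scores in the overlap.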
I know of no way to separate the ITS and 16S sequences without merging, because each unmerged read contains only one of the two primers. I suppose I could separate based on just the left primer in the left reads and just the right primer in the right reads, but is this accurate enough?
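To make the single-primer idea concrete, here is a rough sketch that expands a degenerate primer into a regular expression and keeps only the reads that start with it. It is deliberately naive: it assumes 4-line FASTQ records, requires the primer at the very start of the read, and allows no mismatches or indels, so it says nothing about whether the approach is accurate enough. The function names and the primer in the usage note are placeholders.

```shell
# Expand IUPAC ambiguity codes in a primer into regex character classes.
iupac_to_regex() {
    printf '%s\n' "$1" | sed \
        -e 's/R/[AG]/g'  -e 's/Y/[CT]/g'  -e 's/S/[CG]/g'  -e 's/W/[AT]/g' \
        -e 's/K/[GT]/g'  -e 's/M/[AC]/g'  -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
        -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Print the FASTQ records whose sequence starts with the given primer.
# Assumption: exact match only, 4 lines per record.
reads_starting_with() {
    # $1 = primer with IUPAC codes, $2 = FASTQ file
    awk -v re="^$(iupac_to_regex "$1")" '
        NR % 4 == 1 { header = $0 }                            # remember the header
        NR % 4 == 2 { keep = ($0 ~ re); if (keep) { print header; print } }
        NR % 4 == 3 || NR % 4 == 0 { if (keep) print }         # "+" and quality lines
    ' "$2"
}
```

For example, `reads_starting_with 'GTGYCAGCM' R1.fastq > R1_16S.fastq`, with the primer string replaced by the real left primer. The matching right reads would still have to be pulled out by read ID to keep the pairs in sync, which this sketch does not do.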
I am aware that I can filter afterwards, as mentioned here: How Do I Recover 18s Reads after Filtering in Qiime 2?
… but my primers are degenerate. I am also aware that I can use exclude-seqs with the blastn-short method, as mentioned here: How Do I Recover 18s Reads after Filtering in Qiime 2?
but I am not sure if blastn-short is going to accidentally throw out sequences because they don’t get a high enough blast score, artificially decreasing the diversity. I am somewhat bewildered at the range of possible methods.
Help?