Separating two PCR products in already demultiplexed paired FASTQ files

Hi. I have a situation where I have metabarcoding data that has already been demultiplexed into separate left and right reads for each sample, in PHRED-33 encoded fastq files. However, each pair of files contains two PCR products (16s and ITS).

I am able to separate the two PCR products using Obitools, but I am regretting it. The process is as follows:

Merge paired-end reads using obitools illuminapairedend:

illuminapairedend -r R1.fastq R2.fastq > merged.fastq

Separate ITS and 16S using obitools ngsfilter

ngsfilter \
-t ngsfilter_sample_description.txt \
--sanger \
--nuc \
--uppercase \
--fastq-output \
merged.fastq > sorted.fastq

In this case, ngsfilter_sample_description.txt contains the left and right primers… with ambiguous nucleotides… for both the ITS and 16s PCR products. It simply adds a tag (ITS and 16s) into the output files, Then the 16s and ITS reads can be separated with a simple grep command:

grep --no-group-separator -w -A3 ‘sample=ITS: experiment=myexperiment’

The text ‘sample=ITS: experiment=myexperiment’ is unique enough that it doesn’t accidentally occur in the quality scores (I checked).

So far, so good. This outputs two sets of fastq files that can be imported into QIIME2 using

qiime tools import \
  --type SampleData[SequencesWithQuality] \
  --input-path fastq_manifest.csv \
  --output-path Single_end_demux.qza \
  --source-format SingleEndFastqManifestPhred33;

Again, so far so good. The problem occurs when you look at the quality scores and realise that the illuminapairedend command appears to be simply adding the quality scores where there is an overlap, which is absolutely ridiculous, and makes the whole workflow unusable. EDIT: all due respect to obitools developers, this appears to be actually working fine in that the quality score is the product of the two quality scores where the nucleotide is the same.

I know of no way to separate the ITS and 16s sequences without merging, because then the unmerged files do not have both the left and right primers. I suppose I could unmerge based on just the left primer in the left reads and just the right primer in the right reads, but is this acccurate enough?

I am aware that I can filter afterwards, mentioned here: How Do I Recover 18s Reads after Filtering in Qiime 2?

… but my primers are degenerate. I am also aware I can use exclude-seqs with the blastn-short method as mentioned here: How Do I Recover 18s Reads after Filtering in Qiime 2?

but I am not sure if blastn-short is going to accidentally throw out sequences because they don’t get a high enough blast score, artificially decreasing the diversity. I am somewhat bewildered at the range of possible methods.


1 Like

I would suggest giving that a try.

Alternatively, just use deblur denoise-other for denoising! Deblur does not use the quality scores (it has a static error model) so it does not matter if the quality scores are ruined by merging. You will just use a reference sequence database like UNITE 97% OTUs as the input to --i-reference-seqs and it should work like a charm :gem: .

I hope that helps!

Appreciated. Can I ask if there is an existing reference database for UNITE 97% OTUs that is compatible with QIIME2? I can see . Or should I prepare a new file directly from the fasta and the reference taxonomy?

On reflection, the merged scores do make a certain amount of sense. If, at a merged position, you have two identical bases with a certain probability of error (= PHRED score) then in the merged read, your probability of error would drop a lot, and therefore the PHRED score would be very high, because you have two pieces of evidence that there really is a particular base there, thus seemingly ridiculously high PHRED scores. In this case the output PHRED score would be the product of the two PHRED scores.

Hi, thanks for your help- quick follow-up question. If using deblur denoise-other with (for example) UNITE 97% OTUs (for ITS region of fungi) or SILVA (for 16s region of bacteria), does this mean that only sequences that are present in the reference sequence database are retained, or is it the case that sequences that are similar to sequences in the reference database are also retained? If so, how similar do they need to be?

Whoa don’t go digging around a graveyard. :ghost:

Yes, UNITE has QIIME-compatible downloads (and for future reference we link to information like that from this page)

No, the sequences do not need to match the reference. It is only a very rough filter. I am not sure exactly what the % similarity needs to be — but it’s low, in the ballpark of 80-90% similar. This is just to exclude obviously non-target DNA. The deblur paper probably has that information.

For reference, the deblur denoise-16S method does not have a reference-sequences input, because it comes pre-packaged with the Greengenes 88% OTUs (I believe) as a nimble but effective filtering database. The idea is just to be a rough filter. So don’t sweat the details — 97% will “just work” (I’ve tested it personally :wink:)

I hope that helps!

Much appreciated. I’ll accept the answer once I’ve nursed the workflow through to completion. Yes, graveyard digging made me a bit uncomfortable.

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.