Used degenerate primers for sequencing, concerned that my reads are being confused as chimeras

bkramer · August 28, 2021, 10:24pm

Hello,

I have demultiplexed paired-end reads that were amplified with the igk3/dvv primer pair, which has been used to taxonomically classify diazotrophic communities. These primers:

IGK = GCNWTHTAYGGNAARGGNGGNATHGGNAA
DVV = ATNGCRAANCCNCCRCANACNACRTC

Both could be argued to be fairly degenerate, especially considering the number of N's in each primer.
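To put a rough number on that, here is a minimal sketch that multiplies together how many bases each IUPAC code can stand for. The `degeneracy` helper is hypothetical (written for this post, not part of QIIME 2) and assumes upper-case primer sequences.

```shell
# Hypothetical helper: count how many distinct oligos a degenerate primer expands to.
# Runs in a subshell, so its variables do not leak into the calling shell.
degeneracy() (
  rest=$1
  total=1
  while [ -n "$rest" ]; do
    base=${rest%"${rest#?}"}   # first character of what is left
    rest=${rest#?}             # drop that character
    case $base in
      A|C|G|T)     n=1 ;;      # unambiguous bases
      R|Y|S|W|K|M) n=2 ;;      # two-base codes
      B|D|H|V)     n=3 ;;      # three-base codes
      N)           n=4 ;;      # any base
      *) echo "unknown IUPAC code: $base" >&2; exit 1 ;;
    esac
    total=$((total * n))
  done
  echo "$total"
)

degeneracy GCNWTHTAYGGNAARGGNGGNATHGGNAA   # IGK -> 73728
degeneracy ATNGCRAANCCNCCRCANACNACRTC      # DVV -> 8192
```

By this count IGK (five N's, two H's, and one each of W, Y and R) expands to 73,728 distinct oligos, and DVV (five N's and three R's) to 8,192.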

When I ran the dada2 denoise-paired command, however, only 3 sequences passed the filtering steps, out of the 90,000 forward and 90,000 reverse reads I started with. It appears that the overwhelming majority of my sequences are being flagged as chimeras. I suspect this is because the primers I used are degenerate, so at least some genuine sequences might be mistaken for chimeras.

I've tried several ways to work around this issue, such as retaining borderline chimeras (see the "Identifying and filtering chimeric feature sequences with q2-vsearch" tutorial in the QIIME 2 2021.4.0 documentation), although I think that approach requires my chimeras to survive DADA2 first. So I've tried options such as --p-chimera-method none (exactly as written, changed from the default of consensus) and --p-min-fold-parent-over-abundance 8 (exactly as written, changed from the default of 1), but the chimera filter apparently stays on, as still only 3 sequences pass the filtering.

I've attached the original results from DADA2 (without any changes to the chimera detection settings). Any advice anyone has on how to deal with this issue would be greatly appreciated. Thank you!!

Ben

reads-cutadapt.qzv (319.8 KB) rep-seqs_nifH.qzv (203.0 KB) table.qzv (406.1 KB)

SoilRotifer · September 1, 2021, 9:49pm

Hi @bkramer

Given the insane degeneracy of the primer sequences, I suspect this primer pair amplifies an awful lot of other things, making denoising quite difficult. That is, the generated sequence data may not all be orthologous, i.e. the sequences may be a mix of different genes, which may negatively impact the denoising process.

Even running BLAST on these primer sequences returns no significant hits. If you remember, I referred you to an article in an earlier post and noted how much work the authors did to QA/QC the sequence data for nifH. That is, they used a comprehensive pipeline, including HMMs, etc. I did not read through it thoroughly, but it seemed like an onerous process.

I'd suggest that you do a little experiment and forgo denoising for now. That is, use the traditional OTU dereplication/clustering approach to generate your representative sequences, then tabulate the sequences like so:

qiime feature-table tabulate-seqs \
  --i-data rep-seqs.qza \
  --o-visualization rep-seqs.qzv

Then click on the sequences within the resulting qzv file. This will take you to a BLAST form to perform a query search. I'd suggest randomly picking a few sequences and running BLAST on them to see how consistent, or not, the results are (i.e. whether they hit the same gene).
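For the dereplication/clustering step that produces the rep-seqs.qza used in the command above, a rough sketch with q2-vsearch might look like the following. Treat it as an outline under assumptions rather than a tested recipe: the `otu_cluster` wrapper is hypothetical, the artifact file names are placeholders, the 97% identity default is just the conventional OTU threshold, and no quality filtering or chimera checking is included. The read-merging action is called join-pairs in 2021.4; I believe newer releases rename it to merge-pairs.

```shell
# Hypothetical wrapper around three q2-vsearch actions; all file names are placeholders.
# Usage: otu_cluster <demultiplexed-paired-end.qza> [identity]
otu_cluster() (
  demux=$1
  identity=${2:-0.97}   # assumed default: conventional 97% OTU threshold

  # 1. Merge forward and reverse reads.
  qiime vsearch join-pairs \
    --i-demultiplexed-seqs "$demux" \
    --o-joined-sequences joined.qza || exit 1

  # 2. Collapse identical sequences into a feature table plus sequences.
  qiime vsearch dereplicate-sequences \
    --i-sequences joined.qza \
    --o-dereplicated-table table-derep.qza \
    --o-dereplicated-sequences rep-seqs-derep.qza || exit 1

  # 3. Cluster the dereplicated sequences de novo into OTUs.
  qiime vsearch cluster-features-de-novo \
    --i-table table-derep.qza \
    --i-sequences rep-seqs-derep.qza \
    --p-perc-identity "$identity" \
    --o-clustered-table table.qza \
    --o-clustered-sequences rep-seqs.qza || exit 1
)
```

Calling something like `otu_cluster reads-cutadapt.qza 0.97` inside an activated QIIME 2 environment (the input name is a guess based on the attached reads-cutadapt.qzv) should leave a rep-seqs.qza that can be tabulated and spot-checked with BLAST as described above.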