Odd start to certain 16S sequences following demux and denoising

Hi,

I am analysing 16S V4 paired-end reads generated using the EMP protocol and am a novice at using Qiime2.

Following demultiplexing and running the data through dada2, I eyeballed the representative sequences of the different features and found several (ca. 25 out of 1000+ features) that had strange starts to their nucleotide sequences, dominated by Cs and Ts (see a couple of examples below). Blasting these sequences shows that the start sections remain unaligned, whilst the remaining part of each sequence aligns strongly, suggesting that the starts are artifactual. Whilst the aberrant sequences I've found are a relatively low percentage of the total (although there are probably more I haven't found), I am worried that I have carried out one of the steps in Qiime2 incorrectly and that these aberrant sequences could negatively impact downstream analyses.

So does anyone have any advice on whether this pattern is normal and, if so, how best to filter the aberrant sequences? If this is normal in 16S datasets, why were these sequences not removed by dada2? If not, what am I doing wrong?

Many thanks in advance!

I am using qiime2-2021.8

My demux commands were:

qiime demux emp-paired \
  --i-seqs paired-end-sequences.qza \
  --m-barcodes-file demux_metadata.txt \
  --m-barcodes-column barcode-sequence \
  --p-rev-comp-mapping-barcodes \
  --p-rev-comp-barcodes \
  --o-per-sample-sequences demux-full.qza \
  --o-error-correction-details demux-details.qza \
  --verbose

My dada2 command:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-full.qza \
  --p-trunc-len-f 187 \
  --p-trunc-len-r 133 \
  --p-max-ee-f 1 \
  --p-max-ee-r 1 \
  --p-min-overlap 20 \
  --p-pooling-method independent \
  --p-chimera-method consensus \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza \
  --verbose

Example aberrant sequences:

6a8b0cbf68229d21edd2d74dc3ac7642
TCCTTCTTTTTCCCTCTTTCCTCTTCCTTCCTTTTCTTCCCTCTCCCTCCTTCTTTTTTTTCCTTCCTCTTTTCCCTCCCCTTTCTCCCCCTTTTCCCTTCCTCTTATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG

23264447185f6bed89f70d40c35d279f
TCCTTCTTTTTCCCTCTTTCTCCTTCTTTCTTTTTCTTCCCTCTCTTTCCTTCTTTTTTCTCCTTCTTCTTTTCCCTCCTTCTTCTTCCCCTTCTCCTTTCCTCTTAAACTGGATAACTTGAGTGCAGAAGAGGGTAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTACCTGGTCTGCAACTGACGCTGAGACTCGAAAGCATGGGTAGCGAACAGG
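
Since more of these are probably hiding among the 1000+ features, a rough screen may be quicker than checking by eye. The sketch below is only an illustration: it assumes the representative sequences have been exported to a FASTA file (for example with `qiime tools export`, which writes `dna-sequences.fasta`), and the function name `flag_ct_rich_starts`, the 50-base window and the 0.9 cutoff are arbitrary choices, not values taken from any QIIME 2 documentation.

```shell
# flag_ct_rich_starts FASTA [WINDOW] [MIN_FRACTION]
# Hypothetical helper: prints "id<TAB>fraction" for every sequence whose
# first WINDOW bases consist of at least MIN_FRACTION C or T.
flag_ct_rich_starts() {
  awk -v window="${2:-50}" -v min_frac="${3:-0.9}" '
    # Evaluate the sequence collected so far (handles multi-line FASTA records).
    function report(   prefix, n, ct) {
      if (id == "") return
      prefix = toupper(substr(seq, 1, window))
      n = length(prefix)
      if (n == 0) return
      ct = gsub(/[CT]/, "", prefix)   # gsub returns the number of C/T bases replaced
      if (ct / n >= min_frac) printf "%s\t%.2f\n", id, ct / n
    }
    /^>/ { report(); id = substr($1, 2); seq = ""; next }
    { seq = seq $0 }
    END { report() }
  ' "$1"
}

# Example call (path is an assumption):
# flag_ct_rich_starts exported/dna-sequences.fasta 50 0.9
```

The two sequences above would be flagged with these settings, since their first 50 bases are entirely C and T. The cutoff would need tuning on real data, because some genuine sequences could also be pyrimidine-rich over short windows.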

You could check with the sequencing facility that prepared the sequences to confirm what kinds of non-biological adapter sequence you should expect to find in the reads.

The analytical steps you present here all look reasonable to me. If the sequencing center does report back that there are additional adapters to remove, you could add a q2-cutadapt trim-paired step right before running q2-dada2 to remove them.
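
As a rough sketch of what that extra step could look like, the wrapper below calls q2-cutadapt's paired-end trimming action, `qiime cutadapt trim-paired`. The function name `trim_adapters`, the output filename and the adapter arguments are placeholders: the real adapter sequences would have to come from the sequencing center, and whether they should be removed from the 5' end (`--p-front-f`/`--p-front-r`, assumed here) or the 3' end (`--p-adapter-f`/`--p-adapter-r`) depends on where they sit in the reads.

```shell
# trim_adapters INPUT_QZA ADAPTER_F ADAPTER_R OUTPUT_QZA
# Hypothetical wrapper; assumes the adapters are at the 5' end of each read.
trim_adapters() {
  qiime cutadapt trim-paired \
    --i-demultiplexed-sequences "$1" \
    --p-front-f "$2" \
    --p-front-r "$3" \
    --o-trimmed-sequences "$4" \
    --verbose
}

# Example call (adapter sequences are placeholders, not real values):
# trim_adapters demux-full.qza FORWARD_ADAPTER REVERSE_ADAPTER demux-trimmed.qza
```

The trimmed artifact would then replace demux-full.qza as the input to `qiime dada2 denoise-paired`, and the truncation lengths may need revisiting because the reads will be shorter.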
