I am new to qiime2 and metabarcoding analysis. I have a paired-end 16S dataset of diet samples generated with a combination of 2 primer pairs in multiplex to target two distinct groups of prey. The expected amplicon size ranges between 260 and 310 bp. The FASTQ files come from an Illumina PE 250 run, and I received them already demultiplexed from the sequencing facility.
After running DADA2 denoise-paired, I realised the primers were still attached to the reads, and a considerable number of unexpectedly long sequences (>400 bp) were retained in the representative sequences list. These appear to be contaminants (BLAST hits are poor and point to bacteria). They have no primer match at the beginning. Instead, they start with a long string of Cs, have a short sequence in the middle with poor BLAST hits, and end with another long string of Gs.
The seven-number summary of sequence lengths indicates that sequences >400 bp correspond to the 75th percentile. I have two blanks included, which might inflate the amount of such contaminants in the whole dataset.
I repeated the denoising step, trimming the primer length at the 5’ end, but this does not discard the >400 bp contaminants. The percentage of input non-chimeric after denoising is lower or much lower than 75% for most samples.
I think it would be better to discard these contaminants before denoising. I thought of using cutadapt trim-paired to remove the primers and to discard contaminant reads that lack a primer, but I don’t know how to do it with more than one set of primers:
Ceph_ 16S_F +
I thought I could use wildcards, but the primers seem too different from each other for that. It would require a lot of ambiguities.
I thought of removing the primers sequentially, but then I cannot use the option --p-discard-untrimmed to get rid of contaminant reads, because the first pass would also discard the reads carrying the other primer pair.
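To make clear what I am after, here is a minimal Python sketch of the filter I have in mind: keep a read pair only if it starts with either primer pair, trim the primers, and drop everything else. The primer sequences below are hypothetical placeholders, not my real primers, and the function names are made up for illustration. It uses exact matching on toy reads only, whereas a real trimming tool would tolerate mismatches, so it is not meant as a replacement for cutadapt.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a primer (IUPAC codes allowed) into a regex anchored at the 5' end."""
    return re.compile("^" + "".join(IUPAC[base] for base in primer.upper()))


def trim_pair(fwd_read, rev_read, primer_pairs):
    """Return primer-trimmed (fwd, rev) if any primer pair matches, else None."""
    for fwd_primer, rev_primer in primer_pairs:
        fwd_match = primer_regex(fwd_primer).match(fwd_read)
        rev_match = primer_regex(rev_primer).match(rev_read)
        if fwd_match and rev_match:
            # Cut the matched primer off the start of each read.
            return fwd_read[fwd_match.end():], rev_read[rev_match.end():]
    # No primer pair found at the start of both reads: treat as contaminant.
    return None


# HYPOTHETICAL placeholder primers (forward, reverse), one tuple per primer pair.
PRIMER_PAIRS = [
    ("ACGTRC", "TTGAYC"),
    ("GGCATN", "CCATGS"),
]

# Toy read pairs: one per primer pair, plus a poly-C / poly-G contaminant.
reads = [
    ("ACGTACAAAT", "TTGACCGGGA"),
    ("GGCATTCCCC", "CCATGCTTTT"),
    ("CCCCCCCCCCACGT", "GGGGGGGGGGTTGA"),
]

kept = [pair for pair in (trim_pair(f, r, PRIMER_PAIRS) for f, r in reads) if pair]
print(kept)
```

With these toy reads, the first two pairs are kept with their primers removed and the third pair is dropped because neither primer pair matches at its start. That is the behaviour I would like to get in a single pass over both primer sets.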
What would be the best solution to filter out contaminants before DADA2 denoising?