Over 120K "unique" sequences called in Microbiome dataset

Hi @neempatel,

There may very well be that many unique sequences! However, here is a quick drive-by comment :racing_car:.

I noticed you ran cutadapt as follows:

I would highly suggest you set the following additional parameters:

 --p-match-adapter-wildcards \
 --discard-untrimmed \

Using --p-match-adapter-wildcards will ensure that the ambiguous IUPAC bases in your primer sequence will correctly match the bases in your reads. Otherwise, they will be counted as differences.

Using --discard-untrimmed, will discard any pair of reads in which either primer sequence can not be detected from the paired reads. I consider this an essential QA/QC step. For example, if the quality of the sequence where the primer resides is poor (for either R1 and/or R2), cutadapt may be unable to detect the primer. Resulting in that portion of the sequence remaining within the read and ultimatley remaining within your dataset, unless you choose to discard the read pair. Which is what I'd suggest.

Otherwise, if these (untrimmed) reads are retained, then you'll likely keep longer reads in the output (relative to the trimmed reads). This in turn artificially inflating the number of sequence variants. That is, we are trying to obtain single-base resolution here, which means any indels, will likely be considered a sequence variant.

-Cheers!
-Mike