Good classification, poor clustering

Hi Team!

I've got a weird conundrum, and I'm hoping for some hive mind wisdom.

I've got a data set of 16S V34 Illumina paired-end reads that I'm running in qiime2-2022.2. Because of several project-related constraints, I'm sort of stuck with this version.

My reads are part of a meta-analysis, and so I've been clustering them closed reference. My current pipeline for the data is:

  1. Trim primers using cutadapt; keep untrimmed reads since they were trimmed before processing
  2. Join paired ends using q2-vsearch
  3. Quality filter with q2-quality-filter using default parameters
  4. Denoise using deblur denoise-16S, trimming roughly the first 15 nt and using a reasonable trim length for the ASVs
  5. Apply a full length Silva 138.1 feature classifier to the ASVs and check the taxonomy using classify-sklearn
  6. Cluster the data closed reference at 99% against the same Silva 138.1 reference sequences I used to build the classifier using q2-vsearch.
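
To make step 6 concrete, here is a toy sketch of what closed-reference clustering does with each ASV: it is kept only if its best reference hit meets the identity threshold, and dropped otherwise. This is not vsearch's algorithm. The `identity` function below is a simplified ungapped comparison standing in for a real global alignment, and the sequences, IDs, and function names are all made up for illustration.

```python
def identity(a: str, b: str) -> float:
    """Ungapped identity over the shared length (toy stand-in for a real alignment)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def closed_reference(queries: dict, references: dict, threshold: float):
    """Assign each query to its best reference at or above threshold.

    Queries with no hit at or above the threshold are returned as unmatched,
    which in a closed-reference workflow means they are discarded.
    """
    assigned, unmatched = {}, []
    for qid, qseq in queries.items():
        best_id, best_score = None, 0.0
        for rid, rseq in references.items():
            score = identity(qseq, rseq)
            if score > best_score:
                best_id, best_score = rid, score
        if best_id is not None and best_score >= threshold:
            assigned[qid] = best_id
        else:
            unmatched.append(qid)
    return assigned, unmatched


# Hypothetical 20 nt sequences, far shorter than a real amplicon.
references = {"ref_A": "ACGTACGTACGTACGTACGT", "ref_B": "TTTTGGGGCCCCAAAATTTT"}
queries = {
    "asv_1": "ACGTACGTACGTACGTACGT",  # identical to ref_A
    "asv_2": "ACGTACGTACCTACGTACGT",  # one mismatch against ref_A
    "asv_3": "GGGGGGGGGGGGGGGGGGGG",  # nothing similar in the reference
}

strict_assigned, strict_unmatched = closed_reference(queries, references, 0.99)
relaxed_assigned, relaxed_unmatched = closed_reference(queries, references, 0.90)
```

With these toy inputs, the strict threshold keeps only the exact match, while the relaxed threshold also keeps the ASV with a single mismatch. The sequence with no similar reference is dropped either way.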

When I look at the high level ASV taxonomy, it looks reasonably good. The community composition reflects the expected environment, there's reasonable variation, and it passes the sniff test.

The problem is that almost none of the representative sequences are clustering against the reference database, and the few I do get to cluster don't make sense. (Mostly Bacilli for a fecal community.)

I've tried:

  1. Switching the primer trimming (no dice)
  2. Running single and paired ends
  3. Changing the denoising trim length
  4. Relaxing my clustering identity
  5. Allowing mixed orientation reads
  6. Crying

Thus far, nothing has worked.

I'm hoping someone here might have some brilliant insight?



Hi @jwdebelius ,
I think the issue might just be step 6, the closed-reference clustering at 99%.

99% is quite high, and could lead to many failures, especially if the reads are just a little bit noisy. I recommend reducing this to see if you start getting an acceptable number of reads passing. Looks like you already did this (you listed relaxing your clustering identity), but how much did you relax, and what was the effect?
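
As a rough back-of-the-envelope check on why 99% is strict, the sketch below counts how many mismatched positions a sequence can carry and still meet a given identity threshold. It assumes identity is simply matches divided by length with no gaps, which is simpler than how vsearch scores an alignment. The 400 nt length is a hypothetical value, not the actual trim length used in this thread.

```python
def max_mismatches(length: int, identity_pct: int) -> int:
    """Largest number of mismatches that still meets the identity threshold.

    Uses integer arithmetic: (length - m) / length >= identity_pct / 100
    is equivalent to m <= length * (100 - identity_pct) / 100.
    """
    return length * (100 - identity_pct) // 100


amplicon_length = 400  # hypothetical ASV length in nt
budget = {pct: max_mismatches(amplicon_length, pct) for pct in (99, 98, 97)}

for pct, allowed in budget.items():
    print(f"{pct}% identity over {amplicon_length} nt allows up to {allowed} mismatches")
```

Under these assumptions, 99% leaves room for only 4 differences across the whole sequence, while 97% allows 12, so a handful of noisy or genuinely divergent positions is enough to lose a hit at the stricter threshold.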


Hi @Nicholas_Bokulich,

Thanks for your brilliant insight! I think I only went down to about 98%, which clearly wasn't enough. When I reduced to 97%, I rescued a large portion of the samples, and they look like stool. I was hoping to stay higher, but it's at least justifiable.