Cluster features closed reference increases features with decreasing similarity

Hello,

I wanted to reduce feature space by reducing similarity thresholds with closed reference OTU picking through q2-vsearch's cluster-features-closed-reference. I was expecting to see the number of unique features to decrease as the similarity threshold was relaxed (consistent with reduced similarity levels of the Greengenes 13_8 OTUs). However, the number of unique features seems to increase as similarity decreases which I'm not sure how to interpret.

The observation immediately below is based off 100 arbitrary samples from the AGP using Greengenes2 2022.10:

# percent_similarity : (number_of_features, number_of_samples)
0.80 : (3865, 100)
0.81 : (3864, 100)
0.82 : (3862, 100)
0.83 : (3856, 100)
0.84 : (3850, 100)
0.85 : (3848, 100)
0.86 : (3836, 100)
0.87 : (3819, 100)
0.88 : (3806, 100)
0.89 : (3773, 100)
0.90 : (3751, 100)
0.91 : (3686, 100)
0.92 : (3644, 100)
0.93 : (3526, 100)
0.94 : (3458, 100)
0.95 : (3300, 100)
0.96 : (3206, 100)
0.97 : (2990, 100)
0.98 : (2877, 100)
0.99 : (2601, 100)

Out of caution, a separate database was also tested (Greengenes 13.8), and the same behavior resulted. Additionally, a separate input dataset was tested as well.

All code run used QIIME 2 2023.2 on a Linux system with SLURM. The code to reproduce is follows.

run.sh

#!/bin/bash

set -x
set -e

ctx=Deblur_2021.09-Illumina-16S-V4-150nt-ac8c0b
redbiom fetch samples-contained \
    --context ${ctx} | \
        grep "^10317\." | \
        head -n 100 | \
        redbiom fetch samples \
            --context ${ctx} \
            --resolve-ambiguities merge \
            --output 100agp.biom

biom table-ids \
    -i 100agp.biom \
    --observations | \
        awk '{ print ">" $1 "\n" $1 }' > 100agp.fna

qiime tools import \
    --input-path 100agp.biom \
    --output-path 100agp.biom.qza \
    --type FeatureTable[Frequency]

qiime tools import \
    --input-path 100agp.fna \
    --output-path 100agp.fna.qza \
    --type FeatureData[Sequence]

if [[ ! -f 2022.10.backbone.full-length.fna.qza ]];
then
    wget http://ftp.microbio.me/greengenes_release/2022.10/2022.10.backbone.full-length.fna.qza
fi

if [[ ! -f 99_otus.fasta ]];
then
    wget http://ftp.microbio.me/greengenes_release/gg_13_8_otus/rep_set/99_otus.fasta
fi

qiime tools import \
    --input-path 99_otus.fasta \
    --output-path 99_otus.fasta.qza \
    --type FeatureData[Sequence]

g2=$(sbatch \
        --parsable \
        --export name=2022.10,ref=2022.10.backbone.full-length.fna.qza \
        vsearch.sbatch)
g1=$(sbatch \
        --parsable \
        --export name=gg138,ref=99_otus.fasta.qza \
        vsearch.sbatch)

sbatch \
    --dependency=afterok:${g2} \
    --export name=2022.10 \
    assess.sbatch
sbatch \
    --dependency=afterok:${g1} \
    --export name=gg138 \
    assess.sbatch

vsearch.sbatch

#!/bin/bash
#SBATCH -J vsearch
#SBATCH -N 1
#SBATCH -c 2
#SBATCH --mem 8gb
#SBATCH --time 2:00:00
#SBATCH --array 80-99

source activate qiime2-2023.2

perc=0.${SLURM_ARRAY_TASK_ID}
qiime vsearch cluster-features-closed-reference \
    --i-sequences 100agp.fna.qza \
    --i-table 100agp.biom.qza \
    --i-reference-sequences ${ref} \
    --p-perc-identity ${perc} \
    --output-dir ${name}-100agp-cref-$perc \
    --p-threads 2

assess.sbatch

#!/bin/bash
#SBATCH -J assess
#SBATCH -N 1
#SBATCH -c 1

source activate qiime2-2023.2

for i in $(seq 80 99)
do
    f=${name}-100agp-cref-0.${i}/clustered_table.qza
    what=$(python -c "import qiime2,biom;print(qiime2.Artifact.load('${f}').view(biom.Table).shape)")
    echo "0.${i} : ${what}"
done
1 Like

I may have opened this thread prematurely. What would explain this is if the number of unmatched features decreases, which it does.

3 Likes

Thank you for the code examples.

1 Like

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.