Processing Thermo Ion 16S Metagenomics Kit data in QIIME2: primer trimming, denoise-pyro, and SILVA classifier choice

Hello QIIME2 community,

I am analysing 16S rRNA amplicon data generated using the Thermo Ion 16S Metagenomics Kit on an Ion Torrent/Thermo platform. The data are single-end FASTQ files, with 34 samples in total: 22 controls and 12 cases.

The Thermo kit targets multiple 16S regions, and I identified the following primer sequences in the data:

V9_reverse: GCGTCGTAGTCCGGATTGG
V4_reverse: GGACTACCAGGGTATCTAATCCTGT
V4_forward: CCAGCAGCCGCGGTAATA
V3_forward: ACTGAGACACGGTCCARACT
V9_forward: GTTACGACTTCACCCCAGTCA
V67_forward: ACAAGCGGHGGARCATGT
V8_reverse: CGATTACTAGCGAYTCCGACTTCA
V2_reverse: GCTGCCTCCCGTAGGAGT
V67_reverse: GACGTCATCCCCACCTTCC
V3_reverse: GTATTACCGCGGCTGCTG
V2_forward: GGCGSACGGGTGAGTAA
V8_forward: GYTGTCGTCAGCTCGTGT

I compared three preprocessing approaches:

A: raw reads + dada2 denoise-pyro + trim-left 15 + no truncation
B: primer-specific cutadapt trimming + dada2 denoise-pyro + trim-left 0 + no truncation
C: primer-specific cutadapt trimming + dada2 denoise-pyro + trim-left 15 + no truncation

The main commands used for Option B were:

qiime cutadapt trim-single
--i-demultiplexed-sequences demux-single.qza
--p-front GCGTCGTAGTCCGGATTGG
--p-front GGACTACCAGGGTATCTAATCCTGT
--p-front CCAGCAGCCGCGGTAATA
--p-front ACTGAGACACGGTCCARACT
--p-front GTTACGACTTCACCCCAGTCA
--p-front ACAAGCGGHGGARCATGT
--p-front CGATTACTAGCGAYTCCGACTTCA
--p-front GCTGCCTCCCGTAGGAGT
--p-front GACGTCATCCCCACCTTCC
--p-front GTATTACCGCGGCTGCTG
--p-front GGCGSACGGGTGAGTAA
--p-front GYTGTCGTCAGCTCGTGT
--p-error-rate 0.12
--p-overlap 12
--p-cores 0
--o-trimmed-sequences demux-single_primertrimmed.qza

Then:

qiime dada2 denoise-pyro \
  --i-demultiplexed-seqs demux-single_primertrimmed.qza \
  --p-trim-left 0 \
  --p-trunc-len 0 \
  --o-table B_table_primertrim_pyro_trim0_notrunc.qza \
  --o-representative-sequences B_repseqs_primertrim_pyro_trim0_notrunc.qza \
  --o-denoising-stats B_denoising_primertrim_pyro_trim0_notrunc.qza \
  --o-base-transition-stats B_base_transition_primertrim_pyro_trim0_notrunc.qza \
  --p-n-threads 0

I selected Option B because exact primer trimming seemed more biologically conservative than trimming a fixed 15 bp from all reads after primer removal. The first bases after primer trimming did not show poor quality, so I did not apply additional trim-left 15.

The Option B DADA2 output looked like this:

Samples retained: 34/34
Number of ASVs/features: 9,262
Total frequency: 14,999,201
Minimum frequency/sample: 141,904
Median frequency/sample: 417,467.5
Maximum frequency/sample: 854,763
Non-chimeric retention: approximately 55–65%

For taxonomy, I used:

qiime feature-classifier classify-sklearn --i-classifier silva-138-99-nb-classifier.qza --i-reads B_repseqs_primertrim_pyro_trim0_notrunc.qza --o-classification B_taxonomy_silva_nb.qza

After taxonomy assignment, I removed non-bacterial/off-target features:

Eukaryota
mitochondria
chloroplast
Before filtering, I observed:

Eukaryota: 2,730 ASVs, 4,454,248 reads
Mitochondria: 7 ASVs, 1,928 reads
Chloroplast: 1 ASV, 11 reads

After filtering, my final bacterial table was:


Samples: 34
Bacterial ASVs/features: 6,524
Total bacterial reads: 10,543,014
Minimum/sample: 96,946
Median/sample: 275,456
Maximum/sample: 687,220

I then rarefied to 90,000 reads/sample and ran alpha/beta diversity.

My questions are:

  1. Does this workflow look appropriate for Thermo Ion 16S Metagenomics Kit data?

  2. Is dada2 denoise-pyro with --p-trunc-len 0 reasonable here, given the Ion Torrent platform and high read retention?

  3. For taxonomy, is using the full-length SILVA 138 99% Naive Bayes classifier appropriate for this mixed-region Thermo 16S kit?

  4. Since SILVA reference sequences are often much longer than my reads/ASVs, would it be better to train a custom classifier using extracted reads around the Thermo primer regions, or is the full-length SILVA classifier acceptable?

  5. Would VSEARCH-based taxonomy assignment be preferable for this type of multi-region Thermo 16S data?

  6. I also did alpha and beta diversity.

    Alpha diversity metric p-value Interpretation
    Shannon 0.047 Significant / borderline
    Evenness 0.0158 Significant
    Observed features 0.871 Not significant
    Faith PD 0.589 Not significant
Metric Type p-value Interpretation
Bray-Curtis Abundance-based 0.028 Significant
Jaccard Presence/absence 0.006 Significant
Unweighted UniFrac Phylogenetic presence/absence 0.097 Not significant
Weighted UniFrac Phylogenetic abundance 0.076 Borderline, not significant

Please let me me know if the processing is correct or not and also if there are other downstream analysis that can be done using OTU tables

Any suggestions on improving this pipeline would be greatly appreciated.

1 Like

You found the sequences!

We were looking for these 6 years ago.
At the time, some forums users were working with the kits and were having a hard time partly because the primers and regions were not published.

The fact that you got this all the way through DADA2 is remarkable.

Good job!


To your questions:

It's a multi-region kit, so any workflow is going to be non-standard. It has to be!

The taxonomy assignment looks promising. Did you include any positive controls?

Sure. Cutadapt has other settings you could use for extra trimming, if that was a problem.

Yes! You need full length because you have all the regions. I don't know if SILVA has been benchmarked with multiple regions, but you could do this with the positive controls!

Yes, per-region classifiers should perform better, but you would need to make quite a few of them! I think it's easier to use a full-length skl classifier or switch to a search+LCA classifier like vsearch. Speaking of which...

I'm not sure any of the classifiers were optimized for multi-region classification. I'm a big fan of vsearch because it's conceptually simple and runs fast. Try it out and report back!


It's hard to comment more about alpha and beta diversity without know more about the study design. I'll wait to write more. :microbe:

Thank you for sharing your pipeline with us. And thank you for posting those primers after 6 years.

Hi @colinbrislawn

Grateful to you for response.

he taxonomy assignment looks promising. Did you include any positive controls?

With respect to this, could you please clarify on what parameters you are telling that the taxonomy assignment looks promising? Also, we do not have any positive controls.

dada2 denoise-pyro with --p-trunc-len 0 reasonable here,Sure. Cutadapt has other settings you could use for extra trimming, if that was a problem.

Is there any basis to identify if there is needs for trimming? Any speciifc graph that could help us to infer?

Yes, per-region classifiers should perform better, but you would need to make quite a few of them! I think it's easier to use a full-length skl classifier or switch to a search+LCA classifier like vsearch. Speaking of which...

I will definitely share an update once the VSEARCH results are ready. Meanwhile, I am also waiting for expert input on which classifier would be the most suitable for this 16S run.

Also, @colinbrislawn

We ran samples in three different batches at three different time points. Are you aware of the best practices to pool samples