Hello QIIME2 community,
I am analysing 16S rRNA amplicon data generated using the Thermo Ion 16S Metagenomics Kit on an Ion Torrent/Thermo platform. The data are single-end FASTQ files, with 34 samples in total: 22 controls and 12 cases.
The Thermo kit targets multiple 16S regions, and I identified the following primer sequences in the data:
V9_reverse: GCGTCGTAGTCCGGATTGG
V4_reverse: GGACTACCAGGGTATCTAATCCTGT
V4_forward: CCAGCAGCCGCGGTAATA
V3_forward: ACTGAGACACGGTCCARACT
V9_forward: GTTACGACTTCACCCCAGTCA
V67_forward: ACAAGCGGHGGARCATGT
V8_reverse: CGATTACTAGCGAYTCCGACTTCA
V2_reverse: GCTGCCTCCCGTAGGAGT
V67_reverse: GACGTCATCCCCACCTTCC
V3_reverse: GTATTACCGCGGCTGCTG
V2_forward: GGCGSACGGGTGAGTAA
V8_forward: GYTGTCGTCAGCTCGTGT
I compared three preprocessing approaches:
A: raw reads + dada2 denoise-pyro + trim-left 15 + no truncation
B: primer-specific cutadapt trimming + dada2 denoise-pyro + trim-left 0 + no truncation
C: primer-specific cutadapt trimming + dada2 denoise-pyro + trim-left 15 + no truncation
The main commands used for Option B were:
qiime cutadapt trim-single
--i-demultiplexed-sequences demux-single.qza
--p-front GCGTCGTAGTCCGGATTGG
--p-front GGACTACCAGGGTATCTAATCCTGT
--p-front CCAGCAGCCGCGGTAATA
--p-front ACTGAGACACGGTCCARACT
--p-front GTTACGACTTCACCCCAGTCA
--p-front ACAAGCGGHGGARCATGT
--p-front CGATTACTAGCGAYTCCGACTTCA
--p-front GCTGCCTCCCGTAGGAGT
--p-front GACGTCATCCCCACCTTCC
--p-front GTATTACCGCGGCTGCTG
--p-front GGCGSACGGGTGAGTAA
--p-front GYTGTCGTCAGCTCGTGT
--p-error-rate 0.12
--p-overlap 12
--p-cores 0
--o-trimmed-sequences demux-single_primertrimmed.qza
Then:
qiime dada2 denoise-pyro \
--i-demultiplexed-seqs demux-single_primertrimmed.qza \
--p-trim-left 0 \
--p-trunc-len 0 \
--o-table B_table_primertrim_pyro_trim0_notrunc.qza \
--o-representative-sequences B_repseqs_primertrim_pyro_trim0_notrunc.qza \
--o-denoising-stats B_denoising_primertrim_pyro_trim0_notrunc.qza \
--o-base-transition-stats B_base_transition_primertrim_pyro_trim0_notrunc.qza \
--p-n-threads 0
I selected Option B because exact primer trimming seemed more biologically conservative than trimming a fixed 15 bp from all reads after primer removal. The first bases after primer trimming did not show poor quality, so I did not apply additional trim-left 15.
The Option B DADA2 output looked like this:
Samples retained: 34/34
Number of ASVs/features: 9,262
Total frequency: 14,999,201
Minimum frequency/sample: 141,904
Median frequency/sample: 417,467.5
Maximum frequency/sample: 854,763
Non-chimeric retention: approximately 55–65%
For taxonomy, I used:
qiime feature-classifier classify-sklearn --i-classifier silva-138-99-nb-classifier.qza --i-reads B_repseqs_primertrim_pyro_trim0_notrunc.qza --o-classification B_taxonomy_silva_nb.qza
After taxonomy assignment, I removed non-bacterial/off-target features:
Eukaryota
mitochondria
chloroplast
Before filtering, I observed:
Eukaryota: 2,730 ASVs, 4,454,248 reads
Mitochondria: 7 ASVs, 1,928 reads
Chloroplast: 1 ASV, 11 reads
After filtering, my final bacterial table was:
Samples: 34
Bacterial ASVs/features: 6,524
Total bacterial reads: 10,543,014
Minimum/sample: 96,946
Median/sample: 275,456
Maximum/sample: 687,220
I then rarefied to 90,000 reads/sample and ran alpha/beta diversity.
My questions are:
-
Does this workflow look appropriate for Thermo Ion 16S Metagenomics Kit data?
-
Is
dada2 denoise-pyrowith--p-trunc-len 0reasonable here, given the Ion Torrent platform and high read retention? -
For taxonomy, is using the full-length SILVA 138 99% Naive Bayes classifier appropriate for this mixed-region Thermo 16S kit?
-
Since SILVA reference sequences are often much longer than my reads/ASVs, would it be better to train a custom classifier using extracted reads around the Thermo primer regions, or is the full-length SILVA classifier acceptable?
-
Would VSEARCH-based taxonomy assignment be preferable for this type of multi-region Thermo 16S data?
-
I also did alpha and beta diversity.
Alpha diversity metric p-value Interpretation Shannon 0.047 Significant / borderline Evenness 0.0158 Significant Observed features 0.871 Not significant Faith PD 0.589 Not significant
| Metric | Type | p-value | Interpretation |
|---|---|---|---|
| Bray-Curtis | Abundance-based | 0.028 | Significant |
| Jaccard | Presence/absence | 0.006 | Significant |
| Unweighted UniFrac | Phylogenetic presence/absence | 0.097 | Not significant |
| Weighted UniFrac | Phylogenetic abundance | 0.076 | Borderline, not significant |
Please let me me know if the processing is correct or not and also if there are other downstream analysis that can be done using OTU tables
Any suggestions on improving this pipeline would be greatly appreciated.