I'm currently analyzing 16S rRNA sequencing data, and I've observed a substantial number of ASVs that seem unevenly distributed across my dataset. Even in biological replicate samples (those with identical amplicon libraries sequenced twice), I've noticed a considerable diversity in ASVs.
My dataset is derived from 16S rRNA sequencing of in vivo-cultured biofilms, with approximately 20 replicate samples for each biofilm group. Interestingly, these replicates show significant variation in ASVs. In total, my dataset comprises 607 ASVs across 207 samples, post data-trimming based on read count and ASV abundance. (To be retained in the dataset, ASVs must be present in two or more samples at a relative abundance of 0.01%. When I increase the trimming threshold to five samples and 0.1%, I lose 60% of the ASVs.)
One would anticipate that replicate samples would share a substantial number of the same ASVs. While it's plausible that these samples are inherently more diverse than expected, I'm contemplating whether there could be any issues with my QIIME2/DADA2 analysis. Is it possible that my DADA2 parameters are misconfigured? I've included a snippet of my script below for reference—note that the sequences are imported into QIIME2 with sequencing adapters already trimmed using Cutadapt, and reads are trimmed with Sickle based on a quality score of 20.
Is there anything that I am obviously doing wrong or has anyone experienced anything like this before?
primers
primerf="CCGAGTTTGATCMTGGCTCAG"
primerr="TGCTGCCTCCCGTAGGAGT"
truncation lengths
denoise_truncf=230
denoise_truncr=230
classifier=0 #keep as 0 recommended
import the data
qiime tools import
--type 'SampleData[PairedEndSequencesWithQuality]'
--input-path "$PATH_TO_MANIFEST"
--output-path "$PATH_TO_OUTPUT/paired-end-demux.qza"
--input-format PairedEndFastqManifestPhred33
Visualise the samples loaded previously with below command
qiime demux summarize
--i-data "$PATH_TO_OUTPUT/paired-end-demux.qza"
--o-visualization "$PATH_TO_OUTPUT/paired-end-demux.qzv"
Illumina adapters already trimmed from the data, but still need to trim the PCR primers for the V1-2 region of 16s rRNA gene. Because our amplicon is ~320 bp, our pair-end reads should overlap in the middle (2x250bp long). Use cut adapt to first remove the PCR primers, and can visualise again to check
qiime cutadapt trim-paired
--i-demultiplexed-sequences "$PATH_TO_OUTPUT/paired-end-demux.qza"
--p-front-f "$primerf"
--p-front-r "$primerr"
--o-trimmed-sequences "$PATH_TO_OUTPUT/paired-end-demux.trim.qza"
qiime demux summarize
--i-data "$PATH_TO_OUTPUT/paired-end-demux.trim.qza"
--o-visualization "$PATH_TO_OUTPUT/paired-end-demux.trim.qzv"
qiime dada2 denoise-paired
--i-demultiplexed-seqs "$PATH_TO_OUTPUT/paired-end-demux.trim.qza"
--p-trunc-len-f "$denoise_truncf"
--p-trunc-len-r "$denoise_truncr"
--o-representative-sequences "$PATH_TO_OUTPUT/rep-seqs-dada2.qza"
--o-table "$PATH_TO_OUTPUT/table-dada2.qza"
--o-denoising-stats "$PATH_TO_OUTPUT/dada2-stats.qza"
--verbose
visualise the ASV table
qiime feature-table summarize
--i-table "$PATH_TO_OUTPUT/table-dada2.qza"
--o-visualization "$PATH_TO_OUTPUT/table-dada2.qzv"
--m-sample-metadata-file "$PATH_TO_METADATA"
Visualise the denoising stats
qiime metadata tabulate
--m-input-file "$PATH_TO_OUTPUT/dada2-stats.qza"
--o-visualization "$PATH_TO_OUTPUT/dada2-stats.qzv"
qiime tools import --input-path
/pub59/rsoftley/GreenGenes/99_otus.fasta
--output-path "$PATH_TO_OUTPUT/gg-13-8.99_otus.qza"
--type 'FeatureData[Sequence]'
then need to import the taxonomy data
qiime tools import --input-path
/pub59/rsoftley/GreenGenes/99_otu_taxonomy.txt
--output-path "$PATH_TO_OUTPUT/gg-13-8.99.taxa.qza"
--input-format HeaderlessTSVTaxonomyFormat
--type 'FeatureData[Taxonomy]'
qiime feature-classifier extract-reads
--i-sequences "$PATH_TO_OUTPUT/gg-13-8.99_otus.qza"
--p-f-primer "$primerf"
--p-r-primer "$primerr"
--p-trunc-len "$classifier"
--o-reads "$PATH_TO_OUTPUT/gg-13-8.99.ref.seqs.qza"
now need to train the classifier, this uses Naive Bayes which is a very complicated algorithm
qiime feature-classifier fit-classifier-naive-bayes
--i-reference-reads "$PATH_TO_OUTPUT/gg-13-8.99.ref.seqs.qza"
--i-reference-taxonomy "$PATH_TO_OUTPUT/gg-13-8.99.taxa.qza"
--o-classifier "$PATH_TO_OUTPUT/gg.classifier.trained.qza"
we can now use our trained classifier to assign the taxonomy to our ASV representative sequences
use the 0.7 confidence threshold for limiting taxonomic depth, each taxonomic level must reach a level of => 0.7 to be classified
e.g. Genus level assignment >0.7 but species is <0.7, so will be classified to Genus level only
qiime feature-classifier classify-sklearn
--i-classifier "$PATH_TO_OUTPUT/gg.classifier.trained.qza"
--p-confidence 0.7
--i-reads "$PATH_TO_OUTPUT/rep-seqs-dada2.qza"
--o-classification "$PATH_TO_OUTPUT/gg.taxonomy.sklearn.qza"