DADA2 generating different ASVs from the same sample?

I'm currently analyzing 16S rRNA sequencing data, and I've observed a substantial number of ASVs that seem unevenly distributed across my dataset. Even in biological replicate samples (those with identical amplicon libraries sequenced twice), I've noticed a considerable diversity in ASVs.

My dataset is derived from 16S rRNA sequencing of in vivo-cultured biofilms, with approximately 20 replicate samples for each biofilm group. Interestingly, these replicates show significant variation in ASVs. In total, my dataset comprises 607 ASVs across 207 samples, post data-trimming based on read count and ASV abundance. (To be retained in the dataset, ASVs must be present in two or more samples at a relative abundance of 0.01%. When I increase the trimming threshold to five samples and 0.1%, I lose 60% of the ASVs.)

One would anticipate that replicate samples would share a substantial number of the same ASVs. While it's plausible that these samples are inherently more diverse than expected, I'm contemplating whether there could be any issues with my QIIME2/DADA2 analysis. Is it possible that my DADA2 parameters are misconfigured? I've included a snippet of my script below for reference—note that the sequences are imported into QIIME2 with sequencing adapters already trimmed using Cutadapt, and reads are trimmed with Sickle based on a quality score of 20.

Is there anything that I am obviously doing wrong or has anyone experienced anything like this before?

primers

primerf="CCGAGTTTGATCMTGGCTCAG"
primerr="TGCTGCCTCCCGTAGGAGT"

truncation lengths

denoise_truncf=230
denoise_truncr=230
classifier=0 #keep as 0 recommended

import the data

qiime tools import
--type 'SampleData[PairedEndSequencesWithQuality]'
--input-path "$PATH_TO_MANIFEST"
--output-path "$PATH_TO_OUTPUT/paired-end-demux.qza"
--input-format PairedEndFastqManifestPhred33

Visualise the samples loaded previously with below command

qiime demux summarize
--i-data "$PATH_TO_OUTPUT/paired-end-demux.qza"
--o-visualization "$PATH_TO_OUTPUT/paired-end-demux.qzv"

Illumina adapters already trimmed from the data, but still need to trim the PCR primers for the V1-2 region of 16s rRNA gene. Because our amplicon is ~320 bp, our pair-end reads should overlap in the middle (2x250bp long). Use cut adapt to first remove the PCR primers, and can visualise again to check

qiime cutadapt trim-paired
--i-demultiplexed-sequences "$PATH_TO_OUTPUT/paired-end-demux.qza"
--p-front-f "$primerf"
--p-front-r "$primerr"
--o-trimmed-sequences "$PATH_TO_OUTPUT/paired-end-demux.trim.qza"

qiime demux summarize
--i-data "$PATH_TO_OUTPUT/paired-end-demux.trim.qza"
--o-visualization "$PATH_TO_OUTPUT/paired-end-demux.trim.qzv"

qiime dada2 denoise-paired
--i-demultiplexed-seqs "$PATH_TO_OUTPUT/paired-end-demux.trim.qza"
--p-trunc-len-f "$denoise_truncf"
--p-trunc-len-r "$denoise_truncr"
--o-representative-sequences "$PATH_TO_OUTPUT/rep-seqs-dada2.qza"
--o-table "$PATH_TO_OUTPUT/table-dada2.qza"
--o-denoising-stats "$PATH_TO_OUTPUT/dada2-stats.qza"
--verbose

visualise the ASV table

qiime feature-table summarize
--i-table "$PATH_TO_OUTPUT/table-dada2.qza"
--o-visualization "$PATH_TO_OUTPUT/table-dada2.qzv"
--m-sample-metadata-file "$PATH_TO_METADATA"

Visualise the denoising stats

qiime metadata tabulate
--m-input-file "$PATH_TO_OUTPUT/dada2-stats.qza"
--o-visualization "$PATH_TO_OUTPUT/dada2-stats.qzv"

qiime tools import --input-path
/pub59/rsoftley/GreenGenes/99_otus.fasta
--output-path "$PATH_TO_OUTPUT/gg-13-8.99_otus.qza"
--type 'FeatureData[Sequence]'

then need to import the taxonomy data

qiime tools import --input-path
/pub59/rsoftley/GreenGenes/99_otu_taxonomy.txt
--output-path "$PATH_TO_OUTPUT/gg-13-8.99.taxa.qza"
--input-format HeaderlessTSVTaxonomyFormat
--type 'FeatureData[Taxonomy]'

qiime feature-classifier extract-reads
--i-sequences "$PATH_TO_OUTPUT/gg-13-8.99_otus.qza"
--p-f-primer "$primerf"
--p-r-primer "$primerr"
--p-trunc-len "$classifier"
--o-reads "$PATH_TO_OUTPUT/gg-13-8.99.ref.seqs.qza"

now need to train the classifier, this uses Naive Bayes which is a very complicated algorithm

qiime feature-classifier fit-classifier-naive-bayes
--i-reference-reads "$PATH_TO_OUTPUT/gg-13-8.99.ref.seqs.qza"
--i-reference-taxonomy "$PATH_TO_OUTPUT/gg-13-8.99.taxa.qza"
--o-classifier "$PATH_TO_OUTPUT/gg.classifier.trained.qza"

we can now use our trained classifier to assign the taxonomy to our ASV representative sequences
use the 0.7 confidence threshold for limiting taxonomic depth, each taxonomic level must reach a level of => 0.7 to be classified
e.g. Genus level assignment >0.7 but species is <0.7, so will be classified to Genus level only

qiime feature-classifier classify-sklearn
--i-classifier "$PATH_TO_OUTPUT/gg.classifier.trained.qza"
--p-confidence 0.7
--i-reads "$PATH_TO_OUTPUT/rep-seqs-dada2.qza"
--o-classification "$PATH_TO_OUTPUT/gg.taxonomy.sklearn.qza"

Hello @rowsoft17,

I'm currently analyzing 16S rRNA sequencing data, and I've observed a substantial number of ASVs that seem unevenly distributed across my dataset. Even in biological replicate samples (those with identical amplicon libraries sequenced twice), I've noticed a considerable diversity in ASVs.

Sequencing the same amplicon library twice would be a technical replicate, is this what you're doing? Or are you assuming that repeated extractions from the same culture will yield identical libraries, because this certainly won't be the case (though you would expect similarity). Was everything sequenced in the same run?

Your workflow looks totally reasonable as a whole. It could help trouble shoot if you posted some of the intermediate results, such as the demux visualizer and the dada2 stats (if you're comfortable doing so).

3 Likes

Hi @colinvwood,

To clarify - the data consists of technical replicates (same library sequenced twice) and 'biological' replicates (biofilms grown in same fermenter under identical conditions for the exact same time, DNA extracted via same protocol, same library prep, same sequencing runs). Everything has been sequenced in the same MiSeq 250bp runs (everything twice), I performed the initial PCR and the sequencing facility performed the second-round PCR for adapter ligation. I actually kept these two runs separate until I moved the data into phyloseq for downstream analysis.

My analysis consists of removal of very low abundance ASVs (in less than 2 samples at 0.01% relative abundance) and then running the package 'Decontam' to remove any potential contaminating sequences. After this, my biofilm samples have 607 ASVs across ~210 libraries (105 samples that have been sequenced twice and are yet to be merged). The vast majority of these ASVs are shared amongst only 2 libraries (the two technical replicates). Whilst one would expect some variation in the dataset, the amount of ASVs shared among biological replicates is incredibly low and makes me think something is amiss.

Importantly, many of these ASVs have the same taxonomic ID. For instance, a group of biological replicates has near identical taxonomic composition (relative abundance) but the ASVs are very different.

My initial thought was adapter/barcode contamination on the ends of the reads. The reads comes from the sequencing provider already trimmed of adapters and barcodes, however, when I inspect the demultiplexed reads (attached) the length is 250nt for both F and R reads - would it not be expected to be shorter if the adapters have been trimmed? I do not currently have the adapter sequences to re-do this myself.

The files I've provided are for pre and post primer trimming, the ASV table and dada2 stats. All of the biofilms are cultivated from the same inoculum, so there should be a decent degree of common ASVs. The lack of shared ASVs between inoculum and biofilms was how I discovered the ASV variation within the experimental biofilms themselves.

I have also tried playing around with truncation lengths and trimming of the 3' end, which has actually increased the variety of ASVs. I am currently running dada2 with these parameters (just to see what happens):

qiime dada2 denoise-paired \
--i-demultiplexed-seqs "$PATH_TO_INPUT/all_trim_paired-end-demux.trim.qza" \
--p-trunc-len-f 170 \
--p-trunc-len-r 170 \
--p-trim-left-f 50 \
--p-trim-left-r 50 \
--o-representative-sequences "$PATH_TO_OUTPUT/all_trim_rep-seqs-dada2.qza" \
--o-table "$PATH_TO_OUTPUT/all_trim_table-dada2.qza" \
--o-denoising-stats "$PATH_TO_OUTPUT/all_trim_dada2-stats.qza" \
--p-n-threads 16 \
--verbose

I have also tried clustering the ASVs based on identity. This had little effect on the data.

qiime vsearch cluster-features-de-novo \
--i-sequences "$PATH_TO_INPUT/allexp_rep-seqs-dada2.qza" \
--i-table "$PATH_TO_INPUT/allexp_table-dada2.qza" \
--p-perc-identity 1.0 \
--o-clustered-table "$PATH_TO_OUTPUT/allexp_clustered_table-dada2.qza" \
--o-clustered-sequences "$PATH_TO_OUTPUT/allexp_clustered_rep-seqs-dada2.qza"

FYI - the .qzv files are for sequence data across multiple experiments, so there are many more ASVs than detailed here. However, the trend of having exorbitant ASVs is seen across all of the experiments.
allexp_paired-end-demux.trim.qzv (348.0 KB)
allexp_paired-end-demux.qzv (342.3 KB)
allexp_table-dada2.qzv (654.9 KB)
allexp_dada2-stats.qzv (1.3 MB)

Hello @rowsoft17,

The vast majority of these ASVs are shared amongst only 2 libraries (the two technical replicates). Whilst one would expect some variation in the dataset, the amount of ASVs shared among biological replicates is incredibly low and makes me think something is amiss.

I see. To explain this there's some variable across biological replicates but not across technical replicates that isn't actual composition (hopefully). From your explanation, there's nothing obvious on the wet lab side of things that I can think of.

Importantly, many of these ASVs have the same taxonomic ID. For instance, a group of biological replicates has near identical taxonomic composition (relative abundance) but the ASVs are very different.

At which taxonomic levels are they similar? All the way to species?

One thing that came to mind here is sequencing quality--base calling errors could shuffle your ASVs around slightly but maintain similar taxonomic classifications. But this would then also be a problem for the technical replicates and it isn't. I could see this happening if e.g. groups of technical replicates were sequenced separately but not the case here. Also, your demux visualizer shows no evidence of poor read quality.

My initial thought was adapter/barcode contamination on the ends of the reads. The reads comes from the sequencing provider already trimmed of adapters and barcodes, however, when I inspect the demultiplexed reads (attached) the length is 250nt for both F and R reads - would it not be expected to be shorter if the adapters have been trimmed? I do not currently have the adapter sequences to re-do this myself.

You would only expect a subset of reads to be lower than 250--those with smaller insert sizes. Many reads (depending on insert size) wouldn't have the adapter in them. The read length distribution in the allexp_paired-end-demux.qzv visualizer shows that adapter trimming did in fact happen.

I have also tried playing around with truncation lengths and trimming of the 3' end, which has actually increased the variety of ASVs. I am currently running dada2 with these parameters (just to see what happens):

Trimming is done to the 5' end and truncation to the 3' end just to be clear. This is always good idea I think, you might be able to figure out what's driving the asv variation. Trimming and truncating can also stand in for primers that slipped through the cracks (but only in a diagnostic approach).

I have also tried clustering the ASVs based on identity. This had little effect on the data.

Clustering to to percent identity of 100% isn't going to change the ASVs at all. You would have to lower this below 1 to get an effect.

EDIT (forgot to add this yesterday): It looks like the biggest problem I would want to address first is in dada2. You're only merging around 40% of reads. This is significantly lower than ideal. You could probably bump up your forward read truncation to the whole length of the read.

Hi @colinvwood - sorry for the delayed response, I've been trying to figure out what's going on with my samples.

As discussed previously, my biological replicates experience high levels of variation regarding the ASVs generated by DADA2. Technical replicates have large amounts of shared ASVs (same library, sequenced twice) but even the exact same template DNA across multiple libraries appears to generate highly different ASVs. My positive control, for instance, is the exact same template DNA and was used to generate 4 libraries across the analysis. Each technical replicate has almost the exact same set of ASVs but each library has different ASVs.

Whilst the ASVs are being labelled by DADA2 as distinctly different, the taxonomy is the exact same all the way down to species level (some will obviously be NA at species/genus level but this classification is consistent across all biological replicates, even though the ASVs are 'different').

I have experimented with both truncation of 3' and trimming of 5' ends. Initially trying to use the full F read and truncation at 235 of the R reads, I experienced a 10% increase in total ASV numbers across the dataset and a misclassification of certain taxa - this can be verified by the Zymo mock community positive control I am using across all of my experiments. Furthermore, the average length of my ASVs (refseqs) was 350 bp - my amplicon should be ~310 bp (28F, 338R primers). It's also worth noting that this did increase the number of merged reads to 60-80% in most samples though, as you predicted.

Clustering my ASVs with vsearch cluster-features-de-novo at 98% and 97% did decrease the total amount of ASVs and increased the amount shared in some samples, but the vast majority of ASVs were still shared between technical replicates only and the proportion of shared ASVs between groups remained consistent despite clustering. Interestingly, clustering did 'improve' the effect of the Decontam package in R, likely due to ASVs in the negative controls being previously distinct but now clustering with ASVs in the samples, thus resulting in their detection and removal by Decontam.

I have also modified my primers used for cutadapt to include the Illumina adapter overhand sequences that were actually used during library generation - this had almost no noticeable effect the outcome of the analysis.

Ultimately, I still don't really know what is going on here. My samples look completely consistent when taxonomy is merged at species or genus level (something that most analyses tend to do anyway and this is fine for much of my analysis) but I really wanted to see if I could trace ASVs across experiments, from inoculation until the final sampling point.

I may try a different denoising method (Deblur) as this may generate ESVs (exact sequence variants) that are consistent across samples due to less stringent error handling, although I think I'll try and find another way to trace my ASVs.

Thanks for your help so far and any other suggestion would be greatly appreciated.

Hello @rowsoft17,

Technical replicates have large amounts of shared ASVs (same library, sequenced twice) but even the exact same template DNA across multiple libraries appears to generate highly different ASVs.

If I understand correctly, the biological replicates are not the same template DNA however, they're separate extractions from different biofilms (that are albeit grown under identical conditions). Although I understand that one would thus expect similar composition, it's important to be clear about just how "replicated" the samples are.

You say that you're seeing far more shared ASVs between technical replicates than between biological replicates. To you have some numbers for the scale of this difference? And what did you use to get these numbers?

Furthermore, the average length of my ASVs (refseqs) was 350 bp - my amplicon should be ~310 bp (28F, 338R primers).

This could be concerning--you trimmed both primers with cutadapt, correct?

I have also modified my primers used for cutadapt to include the Illumina adapter overhand sequences that were actually used during library generation - this had almost no noticeable effect the outcome of the analysis.

This isn't recommendable because in a 16S workflow primers will be upstream of the adapters and trimming the former will trim the latter, and now you have to find both together to trim anything which means you'll miss primer sequences that exist by themselves.

1 Like

Hi @colinvwood,

This is correct... kind of. Most of the biological replicates are indeed as you have described - separate extractions from different biofilms, grown in the same fermenter, in the same conditions, with the same inoculating agent. However, I also have some samples (like my positive control mock community) that were extracted once and then the same template DNA was used to generate multiple libraries.

If I look at the samples from one subset of data, I have 103 biofilm samples all grown in the same fermenter and all cultivated from a shared inoculum with a total of 421 ASVs (after data trimming based on ultra low abundance in 2 or less samples). Whilst all biofilms were cultivated with the same inoculum, each set of 10 biological replicates was taken at 6 hour intervals - so biofilms should be very similar at the same time point and subtly different between time points (i.e. as time progresses different taxa shift in abundance within the biofilms, but core taxa remaining the same).

I generated a histogram of ASV frequency within the biofilms (see code below - ASVs with no presence in the subsetted data were removed with prune_taxa prior to histogram generation) and 174 of the 421 ASVs are observed in only two samples. Shared ASVs are then characterised by a large drop off as the number of samples increases (I've included this histogram).

It can be assumed that the technical replicates are the samples sharing the majority of ASVs, as when sequencing runs are separated there is very little change in total number of ASVs and the histogram shifts to samples being most frequent in only one sample.

for (i in seq_along(histo_listo)) {
  asv_tally <- rowSums(otu_table(histo_listo[[i]]) > 0)
  asv_data <- data.frame(asv_id = seq_along(asv_tally), shared_samples = asv_tally)
  samples_2 <- asv_data$shared_samples == 2
  samples_1 <- asv_data$shared_samples == 1
  
  ps_name <- names(histo_listo)[[i]]
  
  
  p <- ggplot(asv_data, aes(x = shared_samples)) +
    geom_histogram(binwidth = 1, fill = "blue", color = "blue", boundary = 0.5) +
    labs(title = paste("Frequency of ASVs Across Samples ", ps_name),
         x = "Number of Samples",
         y = "Frequency of ASVs") +
    theme_minimal()
  
  print(p)
}

When a similar plot is generated for the positive control, the number of shared ASVs is actually quite high among all samples. However, it is interesting that the mock community used as the positive control contains only 8 bacterial taxa and DADA2 generated nearly 25 ASVs.

The adapters are supposed to have been trimmed by the sequencing provider and I have trimmed both primers (F and R).

Hello @rowsoft17,

Once you take into account ASV frequency your data seem less unexpected than you think. If you were to filter low frequency ASVs you would see far greater sample overlap on average. From looking at your feature table visualizer for example, all ASVs that were found in 3 or fewer samples were only present once in each sample. You can experiment with the effect of filtering on shared ASV frequencies using the qiime feature-table filter-features command. Long story short, weighting ASVs by abundance changes the picture quite a bit here.

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.