First we load the necessary packages.
library(dada2)
library(here)
library(data.table)
# Set seed
set.seed(1910)
Set the path to the fastq files:
path <- here("data", "raw", "casava-18-paired-end-demultiplexed")
head(list.files(path))
## [1] "AqFl2-01_S1_L001_R1_001.fastq.gz" "AqFl2-01_S1_L001_R2_001.fastq.gz"
## [3] "AqFl2-02_S12_L001_R1_001.fastq.gz" "AqFl2-02_S12_L001_R2_001.fastq.gz"
## [5] "AqFl2-03_S23_L001_R1_001.fastq.gz" "AqFl2-03_S23_L001_R2_001.fastq.gz"
Now we read in the names of the fastq files, and perform some string manipulation to get matched lists of the forward and reverse fastq files.
# Forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq.gz and SAMPLENAME_R2_001.fastq.gz
fnFs <- sort(list.files(path, pattern="_R1_001.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_R2_001.fastq.gz", full.names = TRUE))
# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq.gz
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
head(sample.names)
## [1] "AqFl2-01" "AqFl2-02" "AqFl2-03" "AqFl2-04" "AqFl2-05" "AqFl2-06"
We start by visualizing the quality profiles of the forward reads:
plotQualityProfile(fnFs[1:2])
In gray-scale is a heat map of the frequency of each quality score at each base position. The median quality score at each position is shown by the green line, and the quartiles of the quality score distribution by the orange lines. The red line shows the scaled proportion of reads that extend to at least that position (this is more useful for other sequencing technologies, as Illumina reads are typically all the same lenghth, hence the flat red line).
The forward reads are good quality. We generally trim the last few nucleotides to avoid less well-controlled errors that can arise there. These quality profiles do not suggest that any additional trimming is needed. We will truncate the forward reads at position 290 (trimming the last 10 nucleotides).
Now we visualize the quality profile of the reverse reads:
plotQualityProfile(fnRs[1:2])
The reverse reads are of significantly worse quality, especially at the end, which is common in Illumina sequencing. This isn’t too worrisome, as DADA2 incorporates quality information into its error model which makes the algorithm robust to lower quality sequence, but trimming as the average qualities crash will improve the algorithm’s sensitivity to rare sequence variants. Based on these profiles, we will truncate the reverse reads at position 248 where the quality distribution crashes.
Considerations: the reads must still overlap after truncation in order to merge them later! For less-overlapping primer sets, like the V1-V2 we used for the present study, the truncLen
must be large enough to maintain 20 + biological.length.variation
nucleotides of overlap between them.
Assign filenames for the filtered fastq files.
# Place filtered files in filtered/ subdirectory
filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
names(filtFs) <- sample.names
names(filtRs) <- sample.names
We’ll use standard filtering parameters: maxN=0
(DADA2 requires no Ns), truncQ=2
, rm.phix=TRUE
and maxEE=2
. The maxEE
parameter sets the maximum number of “expected errors” allowed in a read, which is a better filter than simply averaging quality scores. We’ll also trim off the primer sequence from the forward and reverse reads by setting trimLeft=c(20, 18)
. Trimming and filtering is performed on paired reads jointly, i.e. both reads must pass the filter for the pair to pass.
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft=c(20, 18),
truncLen=c(290,248), maxN=0, maxEE=c(2,2), truncQ=2,
rm.phix=TRUE, compress=TRUE, multithread=TRUE)
head(out)
## reads.in reads.out
## AqFl2-01_S1_L001_R1_001.fastq.gz 161307 116840
## AqFl2-02_S12_L001_R1_001.fastq.gz 166913 137326
## AqFl2-03_S23_L001_R1_001.fastq.gz 189889 155067
## AqFl2-04_S34_L001_R1_001.fastq.gz 108284 81544
## AqFl2-05_S45_L001_R1_001.fastq.gz 174292 135843
## AqFl2-06_S56_L001_R1_001.fastq.gz 182618 147361
Considerations: The standard filtering parameters are starting points, not set in stone. If speeding up downstream computation is needed, consider tightening maxEE
. If too few reads are passing the filter, consider relaxing maxEE
, perhaps especially on the reverse reads (eg. maxEE=c(2,5)
), and reducing the truncLen
to remove low quality tails. Remember though, when choosing truncLen
for paired-end reads we must maintain overlap after truncation in order to merge them later.
The DADA2 algorithm makes use of a parametric error model (err
) and every amplicon dataset has a different set of error rates. The learnErrors
method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).
errF <- learnErrors(filtFs, multithread=TRUE)
## 110492910 total bases in 409233 reads from 3 samples will be used for learning the error rates.
errR <- learnErrors(filtRs, multithread=TRUE)
## 112878710 total bases in 490777 reads from 4 samples will be used for learning the error rates.
It is always worthwhile, as a sanity check if nothing else, to visualize the estimated error rates:
plotErrors(errF, nominalQ=TRUE)
plotErrors(errR, nominalQ=TRUE)
The error rates for each possible transition (A→C, A→G, …) are shown. Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates after convergence of the machine-learning algorithm. The red line shows the error rates expected under the nominal definition of the Q-score. Here the estimated error rates (black line) are a good fit to the observed rates (points), and the error rates drop with increased quality as expected. Everything looks reasonable and we proceed with confidence.
We are now ready to apply the core sample inference algorithm to the data. By default, the dada
function processes each sample independently. However, pooling information across samples can increase sensitivity to sequence variants that may be present at very low frequencies in multiple samples. Here, we set pool=TRUE
to retain the singletons for the downstream species richness estimation.
dadaFs <- dada(filtFs, err=errF, pool=TRUE, multithread=TRUE)
## 80 samples were pooled: 8357936 reads in 375761 unique sequences.
dadaRs <- dada(filtRs, err=errR, pool=TRUE, multithread=TRUE)
## 80 samples were pooled: 8357936 reads in 521582 unique sequences.
Inspect the dada-class
object
dadaFs[[1]]
## dada-class: object describing DADA2 denoising results
## 2741 sequence variants were inferred from 19881 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
We now merge the forward and reverse reads together to obtain the full denoised sequences. Merging is performed by aligning the denoised forward reads with the reverse-complement of the corresponding denoised reverse reads, and then constructing the merged “contig” sequences. By default, merged sequences are only output if the forward and reverse reads overlap by at least 12 bases, and are identical to each other in the overlap region (but these conditions can be changed via function arguments).
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE)
# Inspect the merger data.frame from the first sample
head(mergers[[1]])
## sequence
## 1 ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTCTCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAACGTTCGGAAACGGACGCTAATACCGCATACGTCCTACGGGAGAAAGCAGGGGACCTTCGGGCCTTGCGCTATCAGATGAGCCTAGGTCGGATTAGCTAGTTGGTGAGGTAATGGCTCACCAAGGCGACGATCCGTAACTGGTCTGAGAGGATGATCAGTCACACTGGAACTGAGACACGGTCCAG
## 2 GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAACGCTGCTTTTTCCACCGAAGCTTGCTTCACCGGAAAAAGCGGAGTGGCGGACGGGTGAGTAACACGTGGGTAACCTGCCCTTCAGCGGGGGATAACACTTGGAAACAGGTGCTAATACCGCATAGGATTTCTGTTCGCATGAACGGAGAAGGAAAGACGGCGTAAGCTGTCACTGAAGGATGGACCCGCGGTGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCAATGATGCATAGCCGACCTGAGAGGGTAATCGGCCACACTGGGACTGAGACACGGCCCAG
## 3 GATTAACGCTAGCGGGATGCCTAATACATGCAAGTCGAACGAAGTATTTATACTTAGTGGCGAACGGGTGAGTAACACGTATCCAACATACCCATAAGAGGGGGATAACTAGTTGAAAAACTAGCTAATACCCCATAGCATATTTTTACATAAGTATTAATATTTAAAGTTGCGTTTGCAACACTTTTGGATTGGGGTGCGACGTATTAGATTGTTGGTGAGGTAGTGGCTCACCAAGTCGATGATGCGTAGCTGTGCTGAGAGGCAGAACAGCCACAATGGGACTGAGACACGGCCCAT
## 4 GACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGAAAGGCCTGCAGCTTGCTGCGGGTACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGATCTGCCCTCAACTTTGGGATAAGCCTGGGAAACTGGGTCTAATACCGGATATGACCTTTCTTTAGTGTGGAAGGTGGAAAGCTTTTGCGGTTGGGGATGAGCTCGCGGCCTATCAGCTTGTTGGTGGGGTAATGGCCTACCAAGGCGTCGACGGGTAGCCGGCCTGAGAGGGTGTACGGCCACATTGGGACTGAGATACGGCCCAG
## 5 GATTAACGCTAGCGGGATGCCTAATACATGCAAGTCGAACGAAGTATTTATACTTAGTGGCGAACGGGTGAGTAACACGTATCCAACATACCCATAAGAGGGGGATAACTAGTTGAAAAACTAGCTAATACCCCATAGCATATTTTTACATAAGTATTGATATTTAAAGTTGCGTTTGCAACACTTTTGGATTGGGGTGCGACGTATTAGATTGTTGGTGAGGTAATGGCTCACCAAGTCGATGATGCGTAGCTGTGCTGAGAGGCAGAACAGCCACAATGGGACTGAGACACGGCCCAT
## 6 GACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGACCGGCAGAGCTTGCTCTGTTTAGGTTAGTGGCGAACGGGTGAGTAACACGTGAGTAACCTGCCCCCTTCTTTGGGATAAGCACCCGAAAGGGTGTCTAATACCGGATATTCTGCTGAGGCCGCATGGTTTTGGTTGGAAAGGGTTTCTGGTGGGGGATGGGCTCGCGGCTTATCAGCTTGTTGGTGGGGTAATGGCTTACCAAGGCTTTGACGGGTAACCGGCCTGAGAGGGTGACCGGTCACATTGGGACTGAGATACGGCCCAG
## abundance forward reverse nmatch nmismatch nindel prefer accept
## 1 14855 1 1 196 0 0 1 TRUE
## 2 10126 6 6 172 0 0 1 TRUE
## 3 6499 4 4 200 0 0 1 TRUE
## 4 6031 10 10 192 0 0 1 TRUE
## 5 3073 13 15 200 0 0 1 TRUE
## 6 2629 17 18 191 0 0 1 TRUE
Considerations: Most of reads should successfully merge. If that is not the case, upstream parameters may need to be revisited: Did we trim away the overlap between the reads?
We can now construct an amplicon sequence variant table (ASV) table, a higher-resolution version of the OTU table produced by traditional methods.
seqtab <- makeSequenceTable(mergers)
The sequence table is a matrix
with rows corresponding to (and named by) the samples, and columns corresponding to (and named by) the sequence variants.
dim(seqtab)
## [1] 80 7590
Inspect distribution of sequence lengths
table(nchar(getSequences(seqtab)))
##
## 270 272 273 274 275 276 277 278 279 280 281 282 283 284 285 287 288 289 290 291 292 293 294 295 296
## 17 1 2 2 2 16 2 31 22 240 3 2 1 1 1 6 2 1 8 10 10 23 7 20 12
## 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321
## 31 13 20 80 25 67 43 397 102 175 191 322 296 468 122 219 87 105 127 291 375 547 162 121 157
## 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346
## 52 61 72 17 251 124 551 108 190 100 185 46 29 59 91 48 7 1 9 16 78 355 8 57 4
## 348 354 357 358 360 361 362 369 370 374 379 380 382 384 385 389 390 392 393 399 400 401 404 405 407
## 1 1 1 1 2 2 1 1 5 1 2 1 1 1 1 5 5 2 1 1 2 1 1 1 1
## 409 410 412 413 418 420 425 426 427 428 431 434 435 437 438 439 441 447 451 452 467 469 470 476 480
## 1 1 1 1 1 2 1 1 2 2 1 1 9 1 1 4 1 1 1 1 2 1 1 1 1
## 482 484 488
## 2 1 1
Considerations: Sequences that are much longer or shorter than expected may be the result of non-specific priming. The sequence lengths fall within the range of the expected amplicon sizes, most of which fall betwen 263 and 367 with a long tail on the right. We’ll just leave them as they are.
The core dada
method corrects substitution and indel errors, but chimeras remain. Fortunately, the accuracy of sequence variants after denoising makes identifying chimeric ASVs simpler than when dealing with fuzzy OTUs. Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant “parent” sequences. As we used pool=TRUE
during the sample inference, we should use method="pooled"
for the chimera removal.
seqtab.nochim <- removeBimeraDenovo(seqtab, method="pooled", multithread=TRUE, verbose=TRUE)
sum(seqtab.nochim)/sum(seqtab)
## [1] 0.9539282
The frequency of chimeric sequences varies substantially from dataset to dataset, and depends on on factors including experimental procedures and sample complexity.
Considerations: Most of the reads should remain after chimera removal (it is not uncommon for a majority of sequence variants to be removed though). If most of the reads were removed as chimeric, upstream processing may need to be revisited. In almost all cases this is caused by primer sequences with ambiguous nucleotides that were not removed prior to beginning the DADA2 pipeline.
As a final check of our progress, we’ll look at the number of reads that made it through each step in the pipeline:
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
DT::datatable(track)
Looks good! We kept the majority of our raw reads, and there is no over-large drop associated with any single step.
Considerations: This is a great place to do a last sanity check. Outside of filtering, there should no step in which a majority of reads are lost. If a majority of reads failed to merge, one may need to revisit the truncLen
parameter used in the filtering step and make sure that the truncated reads span the amplicon. If a majority of reads were removed as chimeric, one may need to revisit the removal of primers, as the ambiguous nucleotides in unremoved primers interfere with chimera identification.
Export feature table:
write.table(t(seqtab.nochim), here("data", "dada2", "seqtab-nochim.txt"),
sep="\t", row.names=TRUE, col.names=NA, quote=FALSE)
Export representative sequences:
uniquesToFasta(seqtab.nochim, fout=here("data", "dada2", "rep-seqs.fna"),
ids=colnames(seqtab.nochim))
The processing of raw sequence data into an ASV table is based on the online DADA2 tutorial (1.12). For more documentations and tutorials, visit the DADA2 website.
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=nb_NO.UTF-8
## [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=nb_NO.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=nb_NO.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=nb_NO.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.12.2 here_0.1 dada2_1.12.1 Rcpp_1.0.2 knitr_1.24
## [6] BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] Biobase_2.44.0 jsonlite_1.6 RcppParallel_4.4.3
## [4] shiny_1.3.2 assertthat_0.2.1 BiocManager_1.30.4
## [7] stats4_3.6.1 latticeExtra_0.6-28 GenomeInfoDbData_1.2.1
## [10] Rsamtools_2.0.0 yaml_2.2.0 pillar_1.4.2
## [13] backports_1.1.4 lattice_0.20-38 glue_1.3.1
## [16] digest_0.6.20 GenomicRanges_1.36.0 RColorBrewer_1.1-2
## [19] promises_1.0.1 XVector_0.24.0 colorspace_1.4-1
## [22] htmltools_0.3.6 httpuv_1.5.1 Matrix_1.2-17
## [25] plyr_1.8.4 pkgconfig_2.0.2 ShortRead_1.42.0
## [28] bookdown_0.12 zlibbioc_1.30.0 xtable_1.8-4
## [31] purrr_0.3.2 scales_1.0.0 later_0.8.0
## [34] BiocParallel_1.18.1 tibble_2.1.3 IRanges_2.18.1
## [37] ggplot2_3.2.1 DT_0.8 SummarizedExperiment_1.14.1
## [40] BiocGenerics_0.30.0 lazyeval_0.2.2 magrittr_1.5
## [43] crayon_1.3.4 mime_0.7 evaluate_0.14
## [46] hwriter_1.3.2 tools_3.6.1 matrixStats_0.54.0
## [49] stringr_1.4.0 S4Vectors_0.22.0 munsell_0.5.0
## [52] DelayedArray_0.10.0 Biostrings_2.52.0 compiler_3.6.1
## [55] GenomeInfoDb_1.20.0 rlang_0.4.0 grid_3.6.1
## [58] RCurl_1.95-4.12 rstudioapi_0.10 htmlwidgets_1.3
## [61] crosstalk_1.0.0 bitops_1.0-6 rmarkdown_1.14
## [64] gtable_0.3.0 codetools_0.2-16 reshape2_1.4.3
## [67] R6_2.4.0 GenomicAlignments_1.20.1 dplyr_0.8.3
## [70] rprojroot_1.3-2 stringi_1.4.3 parallel_3.6.1
## [73] tidyselect_0.2.5 xfun_0.8