Getting ready

First we load the necessary packages.

library(dada2)
library(here)
library(data.table)

# Set seed
set.seed(1910)

Set the path to the fastq files:

path <- here("data", "raw", "casava-18-paired-end-demultiplexed")
head(list.files(path))
## [1] "AqFl2-01_S1_L001_R1_001.fastq.gz"  "AqFl2-01_S1_L001_R2_001.fastq.gz" 
## [3] "AqFl2-02_S12_L001_R1_001.fastq.gz" "AqFl2-02_S12_L001_R2_001.fastq.gz"
## [5] "AqFl2-03_S23_L001_R1_001.fastq.gz" "AqFl2-03_S23_L001_R2_001.fastq.gz"

Now we read in the names of the fastq files, and perform some string manipulation to get matched lists of the forward and reverse fastq files.

# Forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq.gz and SAMPLENAME_R2_001.fastq.gz
fnFs <- sort(list.files(path, pattern="_R1_001.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_R2_001.fastq.gz", full.names = TRUE))

# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq.gz
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
head(sample.names)
## [1] "AqFl2-01" "AqFl2-02" "AqFl2-03" "AqFl2-04" "AqFl2-05" "AqFl2-06"

Inspect read quality

We start by visualizing the quality profiles of the forward reads:

plotQualityProfile(fnFs[1:2])

In gray-scale is a heat map of the frequency of each quality score at each base position. The median quality score at each position is shown by the green line, and the quartiles of the quality score distribution by the orange lines. The red line shows the scaled proportion of reads that extend to at least that position (this is more useful for other sequencing technologies, as Illumina reads are typically all the same lenghth, hence the flat red line).

The forward reads are good quality. We generally trim the last few nucleotides to avoid less well-controlled errors that can arise there. These quality profiles do not suggest that any additional trimming is needed. We will truncate the forward reads at position 290 (trimming the last 10 nucleotides).

Now we visualize the quality profile of the reverse reads:

plotQualityProfile(fnRs[1:2])

The reverse reads are of significantly worse quality, especially at the end, which is common in Illumina sequencing. This isn’t too worrisome, as DADA2 incorporates quality information into its error model which makes the algorithm robust to lower quality sequence, but trimming as the average qualities crash will improve the algorithm’s sensitivity to rare sequence variants. Based on these profiles, we will truncate the reverse reads at position 248 where the quality distribution crashes.

Considerations: the reads must still overlap after truncation in order to merge them later! For less-overlapping primer sets, like the V1-V2 we used for the present study, the truncLen must be large enough to maintain 20 + biological.length.variation nucleotides of overlap between them.

Filter and Trim

Assign filenames for the filtered fastq files.

# Place filtered files in filtered/ subdirectory
filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fastq.gz"))
names(filtFs) <- sample.names
names(filtRs) <- sample.names

We’ll use standard filtering parameters: maxN=0 (DADA2 requires no Ns), truncQ=2, rm.phix=TRUE and maxEE=2. The maxEE parameter sets the maximum number of “expected errors” allowed in a read, which is a better filter than simply averaging quality scores. We’ll also trim off the primer sequence from the forward and reverse reads by setting trimLeft=c(20, 18). Trimming and filtering is performed on paired reads jointly, i.e. both reads must pass the filter for the pair to pass.

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft=c(20, 18),
                     truncLen=c(290,248), maxN=0, maxEE=c(2,2), truncQ=2, 
                     rm.phix=TRUE, compress=TRUE, multithread=TRUE) 
head(out)
##                                   reads.in reads.out
## AqFl2-01_S1_L001_R1_001.fastq.gz    161307    116840
## AqFl2-02_S12_L001_R1_001.fastq.gz   166913    137326
## AqFl2-03_S23_L001_R1_001.fastq.gz   189889    155067
## AqFl2-04_S34_L001_R1_001.fastq.gz   108284     81544
## AqFl2-05_S45_L001_R1_001.fastq.gz   174292    135843
## AqFl2-06_S56_L001_R1_001.fastq.gz   182618    147361

Considerations: The standard filtering parameters are starting points, not set in stone. If speeding up downstream computation is needed, consider tightening maxEE. If too few reads are passing the filter, consider relaxing maxEE, perhaps especially on the reverse reads (eg. maxEE=c(2,5)), and reducing the truncLen to remove low quality tails. Remember though, when choosing truncLen for paired-end reads we must maintain overlap after truncation in order to merge them later.

Learn the error rates

The DADA2 algorithm makes use of a parametric error model (err) and every amplicon dataset has a different set of error rates. The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution. As in many machine-learning problems, the algorithm must begin with an initial guess, for which the maximum possible error rates in this data are used (the error rates if only the most abundant sequence is correct and all the rest are errors).

errF <- learnErrors(filtFs, multithread=TRUE)
## 110492910 total bases in 409233 reads from 3 samples will be used for learning the error rates.
errR <- learnErrors(filtRs, multithread=TRUE)
## 112878710 total bases in 490777 reads from 4 samples will be used for learning the error rates.

It is always worthwhile, as a sanity check if nothing else, to visualize the estimated error rates:

plotErrors(errF, nominalQ=TRUE)

plotErrors(errR, nominalQ=TRUE)

The error rates for each possible transition (A→C, A→G, …) are shown. Points are the observed error rates for each consensus quality score. The black line shows the estimated error rates after convergence of the machine-learning algorithm. The red line shows the error rates expected under the nominal definition of the Q-score. Here the estimated error rates (black line) are a good fit to the observed rates (points), and the error rates drop with increased quality as expected. Everything looks reasonable and we proceed with confidence.

Sample inference

We are now ready to apply the core sample inference algorithm to the data. By default, the dada function processes each sample independently. However, pooling information across samples can increase sensitivity to sequence variants that may be present at very low frequencies in multiple samples. Here, we set pool=TRUE to retain the singletons for the downstream species richness estimation.

dadaFs <- dada(filtFs, err=errF, pool=TRUE, multithread=TRUE) 
## 80 samples were pooled: 8357936 reads in 375761 unique sequences.
dadaRs <- dada(filtRs, err=errR, pool=TRUE, multithread=TRUE)
## 80 samples were pooled: 8357936 reads in 521582 unique sequences.

Inspect the dada-class object

dadaFs[[1]]
## dada-class: object describing DADA2 denoising results
## 2741 sequence variants were inferred from 19881 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

Merge paired reads

We now merge the forward and reverse reads together to obtain the full denoised sequences. Merging is performed by aligning the denoised forward reads with the reverse-complement of the corresponding denoised reverse reads, and then constructing the merged “contig” sequences. By default, merged sequences are only output if the forward and reverse reads overlap by at least 12 bases, and are identical to each other in the overlap region (but these conditions can be changed via function arguments).

mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose=TRUE)

# Inspect the merger data.frame from the first sample
head(mergers[[1]])
##                                                                                                                                                                                                                                                                                                                                   sequence
## 1                         ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGTAGAGAGAAGCTTGCTTCTCTTGAGAGCGGCGGACGGGTGAGTAATGCCTAGGAATCTGCCTGGTAGTGGGGGATAACGTTCGGAAACGGACGCTAATACCGCATACGTCCTACGGGAGAAAGCAGGGGACCTTCGGGCCTTGCGCTATCAGATGAGCCTAGGTCGGATTAGCTAGTTGGTGAGGTAATGGCTCACCAAGGCGACGATCCGTAACTGGTCTGAGAGGATGATCAGTCACACTGGAACTGAGACACGGTCCAG
## 2 GATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAACGCTGCTTTTTCCACCGAAGCTTGCTTCACCGGAAAAAGCGGAGTGGCGGACGGGTGAGTAACACGTGGGTAACCTGCCCTTCAGCGGGGGATAACACTTGGAAACAGGTGCTAATACCGCATAGGATTTCTGTTCGCATGAACGGAGAAGGAAAGACGGCGTAAGCTGTCACTGAAGGATGGACCCGCGGTGCATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCAATGATGCATAGCCGACCTGAGAGGGTAATCGGCCACACTGGGACTGAGACACGGCCCAG
## 3                             GATTAACGCTAGCGGGATGCCTAATACATGCAAGTCGAACGAAGTATTTATACTTAGTGGCGAACGGGTGAGTAACACGTATCCAACATACCCATAAGAGGGGGATAACTAGTTGAAAAACTAGCTAATACCCCATAGCATATTTTTACATAAGTATTAATATTTAAAGTTGCGTTTGCAACACTTTTGGATTGGGGTGCGACGTATTAGATTGTTGGTGAGGTAGTGGCTCACCAAGTCGATGATGCGTAGCTGTGCTGAGAGGCAGAACAGCCACAATGGGACTGAGACACGGCCCAT
## 4                     GACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGAAAGGCCTGCAGCTTGCTGCGGGTACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGATCTGCCCTCAACTTTGGGATAAGCCTGGGAAACTGGGTCTAATACCGGATATGACCTTTCTTTAGTGTGGAAGGTGGAAAGCTTTTGCGGTTGGGGATGAGCTCGCGGCCTATCAGCTTGTTGGTGGGGTAATGGCCTACCAAGGCGTCGACGGGTAGCCGGCCTGAGAGGGTGTACGGCCACATTGGGACTGAGATACGGCCCAG
## 5                             GATTAACGCTAGCGGGATGCCTAATACATGCAAGTCGAACGAAGTATTTATACTTAGTGGCGAACGGGTGAGTAACACGTATCCAACATACCCATAAGAGGGGGATAACTAGTTGAAAAACTAGCTAATACCCCATAGCATATTTTTACATAAGTATTGATATTTAAAGTTGCGTTTGCAACACTTTTGGATTGGGGTGCGACGTATTAGATTGTTGGTGAGGTAATGGCTCACCAAGTCGATGATGCGTAGCTGTGCTGAGAGGCAGAACAGCCACAATGGGACTGAGACACGGCCCAT
## 6                    GACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGACCGGCAGAGCTTGCTCTGTTTAGGTTAGTGGCGAACGGGTGAGTAACACGTGAGTAACCTGCCCCCTTCTTTGGGATAAGCACCCGAAAGGGTGTCTAATACCGGATATTCTGCTGAGGCCGCATGGTTTTGGTTGGAAAGGGTTTCTGGTGGGGGATGGGCTCGCGGCTTATCAGCTTGTTGGTGGGGTAATGGCTTACCAAGGCTTTGACGGGTAACCGGCCTGAGAGGGTGACCGGTCACATTGGGACTGAGATACGGCCCAG
##   abundance forward reverse nmatch nmismatch nindel prefer accept
## 1     14855       1       1    196         0      0      1   TRUE
## 2     10126       6       6    172         0      0      1   TRUE
## 3      6499       4       4    200         0      0      1   TRUE
## 4      6031      10      10    192         0      0      1   TRUE
## 5      3073      13      15    200         0      0      1   TRUE
## 6      2629      17      18    191         0      0      1   TRUE

Considerations: Most of reads should successfully merge. If that is not the case, upstream parameters may need to be revisited: Did we trim away the overlap between the reads?

Construct sequence table

We can now construct an amplicon sequence variant table (ASV) table, a higher-resolution version of the OTU table produced by traditional methods.

seqtab <- makeSequenceTable(mergers)

The sequence table is a matrix with rows corresponding to (and named by) the samples, and columns corresponding to (and named by) the sequence variants.

dim(seqtab)
## [1]   80 7590

Inspect distribution of sequence lengths

table(nchar(getSequences(seqtab)))
## 
## 270 272 273 274 275 276 277 278 279 280 281 282 283 284 285 287 288 289 290 291 292 293 294 295 296 
##  17   1   2   2   2  16   2  31  22 240   3   2   1   1   1   6   2   1   8  10  10  23   7  20  12 
## 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 
##  31  13  20  80  25  67  43 397 102 175 191 322 296 468 122 219  87 105 127 291 375 547 162 121 157 
## 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 
##  52  61  72  17 251 124 551 108 190 100 185  46  29  59  91  48   7   1   9  16  78 355   8  57   4 
## 348 354 357 358 360 361 362 369 370 374 379 380 382 384 385 389 390 392 393 399 400 401 404 405 407 
##   1   1   1   1   2   2   1   1   5   1   2   1   1   1   1   5   5   2   1   1   2   1   1   1   1 
## 409 410 412 413 418 420 425 426 427 428 431 434 435 437 438 439 441 447 451 452 467 469 470 476 480 
##   1   1   1   1   1   2   1   1   2   2   1   1   9   1   1   4   1   1   1   1   2   1   1   1   1 
## 482 484 488 
##   2   1   1

Considerations: Sequences that are much longer or shorter than expected may be the result of non-specific priming. The sequence lengths fall within the range of the expected amplicon sizes, most of which fall betwen 263 and 367 with a long tail on the right. We’ll just leave them as they are.

Remove chimeras

The core dada method corrects substitution and indel errors, but chimeras remain. Fortunately, the accuracy of sequence variants after denoising makes identifying chimeric ASVs simpler than when dealing with fuzzy OTUs. Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant “parent” sequences. As we used pool=TRUE during the sample inference, we should use method="pooled" for the chimera removal.

seqtab.nochim <- removeBimeraDenovo(seqtab, method="pooled", multithread=TRUE, verbose=TRUE)
sum(seqtab.nochim)/sum(seqtab)
## [1] 0.9539282

The frequency of chimeric sequences varies substantially from dataset to dataset, and depends on on factors including experimental procedures and sample complexity.

Considerations: Most of the reads should remain after chimera removal (it is not uncommon for a majority of sequence variants to be removed though). If most of the reads were removed as chimeric, upstream processing may need to be revisited. In almost all cases this is caused by primer sequences with ambiguous nucleotides that were not removed prior to beginning the DADA2 pipeline.

Track reads through the pipeline

As a final check of our progress, we’ll look at the number of reads that made it through each step in the pipeline:

getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
DT::datatable(track)

Looks good! We kept the majority of our raw reads, and there is no over-large drop associated with any single step.

Considerations: This is a great place to do a last sanity check. Outside of filtering, there should no step in which a majority of reads are lost. If a majority of reads failed to merge, one may need to revisit the truncLen parameter used in the filtering step and make sure that the truncated reads span the amplicon. If a majority of reads were removed as chimeric, one may need to revisit the removal of primers, as the ambiguous nucleotides in unremoved primers interfere with chimera identification.

Export data to QIIME2

Export feature table:

write.table(t(seqtab.nochim), here("data", "dada2", "seqtab-nochim.txt"), 
            sep="\t", row.names=TRUE, col.names=NA, quote=FALSE)

Export representative sequences:

uniquesToFasta(seqtab.nochim, fout=here("data", "dada2", "rep-seqs.fna"), 
               ids=colnames(seqtab.nochim))

Acknowledgements

The processing of raw sequence data into an ASV table is based on the online DADA2 tutorial (1.12). For more documentations and tutorials, visit the DADA2 website.

Session information

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/libblas/libblas.so.3.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=nb_NO.UTF-8       
##  [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=nb_NO.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=nb_NO.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=nb_NO.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.12.2 here_0.1          dada2_1.12.1      Rcpp_1.0.2        knitr_1.24       
## [6] BiocStyle_2.12.0 
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.44.0              jsonlite_1.6                RcppParallel_4.4.3         
##  [4] shiny_1.3.2                 assertthat_0.2.1            BiocManager_1.30.4         
##  [7] stats4_3.6.1                latticeExtra_0.6-28         GenomeInfoDbData_1.2.1     
## [10] Rsamtools_2.0.0             yaml_2.2.0                  pillar_1.4.2               
## [13] backports_1.1.4             lattice_0.20-38             glue_1.3.1                 
## [16] digest_0.6.20               GenomicRanges_1.36.0        RColorBrewer_1.1-2         
## [19] promises_1.0.1              XVector_0.24.0              colorspace_1.4-1           
## [22] htmltools_0.3.6             httpuv_1.5.1                Matrix_1.2-17              
## [25] plyr_1.8.4                  pkgconfig_2.0.2             ShortRead_1.42.0           
## [28] bookdown_0.12               zlibbioc_1.30.0             xtable_1.8-4               
## [31] purrr_0.3.2                 scales_1.0.0                later_0.8.0                
## [34] BiocParallel_1.18.1         tibble_2.1.3                IRanges_2.18.1             
## [37] ggplot2_3.2.1               DT_0.8                      SummarizedExperiment_1.14.1
## [40] BiocGenerics_0.30.0         lazyeval_0.2.2              magrittr_1.5               
## [43] crayon_1.3.4                mime_0.7                    evaluate_0.14              
## [46] hwriter_1.3.2               tools_3.6.1                 matrixStats_0.54.0         
## [49] stringr_1.4.0               S4Vectors_0.22.0            munsell_0.5.0              
## [52] DelayedArray_0.10.0         Biostrings_2.52.0           compiler_3.6.1             
## [55] GenomeInfoDb_1.20.0         rlang_0.4.0                 grid_3.6.1                 
## [58] RCurl_1.95-4.12             rstudioapi_0.10             htmlwidgets_1.3            
## [61] crosstalk_1.0.0             bitops_1.0-6                rmarkdown_1.14             
## [64] gtable_0.3.0                codetools_0.2-16            reshape2_1.4.3             
## [67] R6_2.4.0                    GenomicAlignments_1.20.1    dplyr_0.8.3                
## [70] rprojroot_1.3-2             stringi_1.4.3               parallel_3.6.1             
## [73] tidyselect_0.2.5            xfun_0.8