CSS normalization of sequence counts with metagenomeSeq in R

All R code:

TUTORIAL ON CSS normalization with metagenomeSeq

BY: Carly Muletz Wolz
Date: May 24, 2018

install metagenomeSeq, had some issues here, finally worked

make sure all packages are up to date

source(“https://bioconductor.org/biocLite.R”)
biocLite(“metagenomeSeq”)
biocLite(c(“GenomicFeatures”, “AnnotationDbi”))

library(metagenomeSeq)

I read in all of my files separately into phyloseq. It seems to work the best. Check out this new tool to import the .qza files into R: Tutorial: Integrating QIIME2 and R for data visualization and analysis using qiime2R. With the tutorial, I am using a dataset already in phyloseq that is merged. This is extra code to show how to bring in each separate piece with your own data

BIOM FILE, may or may not need the parseFunction

biom <- import_biom(“otu_tables/Ch3_noneg.biom”, parseFunction = parse_taxonomy_greengenes)

MAPPING FILE

map <- import_qiime_sample_data(“Mapping_Ch3_final_selDays.txt”)

TREE FILE

library(ape)
tree = read.tree(“Ch3_all_final_rooted.tre”)

Create phyloseq Object

ch3 <- merge_phyloseq(biom, tree, map)

Remove residual taxa that don’t have any sequences

sum(taxa_sums(ch3) == 0)
ch3 <- filter_taxa(ch3, function(x) sum(x) != 0, TRUE)
sum(taxa_sums(ch3) == 0)

CODE for CSS normalization using preloaded data

################# READ in preloaded phyloseq object data GlobalPatterns ########
library(phyloseq)
?GlobalPatterns

data(GlobalPatterns)

sort(sample_sums(GlobalPatterns))

40x difference in sequencing depth between samples. CSS normalization is not going to ‘fix’ this much difference. In this case I would recommend rarefaction, but we will move forward since this is the example dataset in phyloseq

max(sample_sums(GlobalPatterns))/min(sample_sums(GlobalPatterns))

Convert file types to use in metagenomeSeq

MGS <- phyloseq_to_metagenomeSeq(GlobalPatterns)

MGS

Perform normalization following vignette https://bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/inst/doc/metagenomeSeq.pdf

p <- cumNormStatFast(MGS)
p
MGS <- cumNorm(MGS, p =p)

returns normalized factors for each sample

normFactors(MGS)

To export normalized count matrix

normmybiom <- MRcounts(MGS, norm = T)

Text file of normalized OTU table that you can look at, but we want the biom file for future R analyses

exportMat(normmybiom, file = “GP_norm.txt”, sep = “\t”)

biom file

b <- MRexperiment2biom(MGS, norm = T)

library(biomformat)

write_biom(b, biom_file = “GP_norm.biom”)

library(RCurl)

import_biom2_script <- getURL(“https://gist.githubusercontent.com/jnpaulson/324ac1fa3eab1bc7f845/raw/2ef62334d4e9bc5446a5ee6dd198f52484097dae/import_biom2.R”, ssl.verifypeer = FALSE)

eval(parse(text = import_biom2_script))

BIOM FILE

biom <- read_biom(“GP_norm.biom”)
biom2 <- import_biom2(biom)

sort(sample_sums(biom2))

This really did nothing to fix the variation in sampling depth. I find CSS will sometimes decrease the fold-difference in sampling depth but not always. Your choice on whether to use. Maybe if it increases the fold-difference, don’t use…but I have nothing to back that up. I use these packages, and have not helped develop them

max(sample_sums(biom2))/min(sample_sums(biom2))

MAPPING FILE, Your mapping file as above

map <- import_qiime_sample_data(“Mapping_Ch3_final.txt”)

TREE FILE, Your tree file as above

tree = read.tree(“Ch3_all_final_rooted.tre”)
ch3 <- merge_phyloseq(biom2, tree, map)
ch3
6 Likes

Thanks a lot for the tutorial.

Is it normal that after CSS normalization, the difference in sequencing depth between samples get increased?

Thank you for asking. I have been meaning to ask the metagenomeSeq folks about this. I just posted on github. Hopefully, we get a response.

2 Likes

I am not sure why you would recommend rarefaction when there is 40x difference?

thanks,
Xiao