All R code:
TUTORIAL ON CSS normalization with metagenomeSeq
BY: Carly Muletz Wolz
Date: May 24, 2018
install metagenomeSeq, had some issues here, finally worked
make sure all packages are up to date
source("https://bioconductor.org/biocLite.R")
biocLite("metagenomeSeq")
biocLite(c("GenomicFeatures", "AnnotationDbi"))
library(metagenomeSeq)
I read in all of my files separately into phyloseq. It seems to work the best. Check out this new tool to import the .qza files into R: Tutorial: Integrating QIIME2 and R for data visualization and analysis using qiime2R. With the tutorial, I am using a dataset already in phyloseq that is merged. This is extra code to show how to bring in each separate piece with your own data
BIOM FILE, may or may not need the parseFunction
biom <- import_biom("otu_tables/Ch3_noneg.biom", parseFunction = parse_taxonomy_greengenes)
MAPPING FILE
map <- import_qiime_sample_data("Mapping_Ch3_final_selDays.txt")
TREE FILE
library(ape)
tree = read.tree("Ch3_all_final_rooted.tre")
Create phyloseq Object
ch3 <- merge_phyloseq(biom, tree, map)
Remove residual taxa that don't have any sequences
sum(taxa_sums(ch3) == 0)
ch3 <- filter_taxa(ch3, function(x) sum(x) != 0, TRUE)
sum(taxa_sums(ch3) == 0)
CODE for CSS normalization using preloaded data
################# READ in preloaded phyloseq object data GlobalPatterns ########
library(phyloseq)
?GlobalPatterns
data(GlobalPatterns)
sort(sample_sums(GlobalPatterns))
40x difference in sequencing depth between samples. CSS normalization is not going to 'fix' this much difference. In this case I would recommend rarefaction, but we will move forward since this is the example dataset in phyloseq
max(sample_sums(GlobalPatterns))/min(sample_sums(GlobalPatterns))
Convert file types to use in metagenomeSeq
MGS <- phyloseq_to_metagenomeSeq(GlobalPatterns)
MGS
Perform normalization following vignette https://bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/inst/doc/metagenomeSeq.pdf
p <- cumNormStatFast(MGS)
p
MGS <- cumNorm(MGS, p =p)
returns normalized factors for each sample
normFactors(MGS)
To export normalized count matrix
normmybiom <- MRcounts(MGS, norm = T)
Text file of normalized OTU table that you can look at, but we want the biom file for future R analyses
exportMat(normmybiom, file = "GP_norm.txt", sep = "\t")
biom file
b <- MRexperiment2biom(MGS, norm = T)
library(biomformat)
write_biom(b, biom_file = "GP_norm.biom")
library(RCurl)
import_biom2_script <- getURL("https://gist.githubusercontent.com/jnpaulson/324ac1fa3eab1bc7f845/raw/2ef62334d4e9bc5446a5ee6dd198f52484097dae/import_biom2.R", ssl.verifypeer = FALSE)
eval(parse(text = import_biom2_script))
BIOM FILE
biom <- read_biom("GP_norm.biom")
biom2 <- import_biom2(biom)
sort(sample_sums(biom2))
This really did nothing to fix the variation in sampling depth. I find CSS will sometimes decrease the fold-difference in sampling depth but not always. Your choice on whether to use. Maybe if it increases the fold-difference, don't use...but I have nothing to back that up. I use these packages, and have not helped develop them
max(sample_sums(biom2))/min(sample_sums(biom2))