A larger sample size leads to a fatal error when analysing with QIIME

Yujun · May 11, 2017, 1:35am

Hi, @jairideout! OK, let's focus on QIIME2.

###1.
My QIIME2 version is 2017.2.0.

###2.

The exact commands I used in QIIME2 were:

qiime tools import --input-path ./01.data --type SampleData[SequencesWithQuality] --output-path demux.qza
qiime dada2 denoise-single --i-demultiplexed-seqs demux.qza --p-trim-left 0 --p-trunc-len 100 --o-representative-sequences rep-seqs.qza --o-table table.qza --verbose
qiime feature-classifier classify --i-classifier ../97_classifier.qza --i-reads rep-seqs.qza --o-classification taxonomy.qza
qiime taxa collapse --i-table table.qza --i-taxonomy taxonomy.qza --p-level 2 --o-collapsed-table table-l2.qza
qiime tools export table-l2.qza --output-dir table-l2
biom convert --to-tsv -i ./table-l2/feature-table.biom -o ./table-l2/feature-table.tsv --table-type "Taxon table"
qiime taxa collapse --i-table table.qza --i-taxonomy taxonomy.qza --p-level 6 --o-collapsed-table table-l6.qza
qiime tools export table-l6.qza --output-dir table-l6
biom convert --to-tsv -i ./table-l6/feature-table.biom -o ./table-l6/feature-table.tsv --table-type "Taxon table"
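
The collapse, export and convert steps above are the same for level 2 and level 6, so they could be wrapped in a loop. This is only a sketch: the `run` helper, the `collapse_levels` function and the `DRY_RUN` variable are names I made up for illustration, and the commands are copied from the ones listed above. By default it only prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch: repeat the collapse / export / convert steps for several taxonomic levels.
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 to execute them.
set -eu

DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: process every level given as an argument.
collapse_levels() {
    local level
    for level in "$@"; do
        run qiime taxa collapse --i-table table.qza --i-taxonomy taxonomy.qza \
            --p-level "$level" --o-collapsed-table "table-l${level}.qza"
        run qiime tools export "table-l${level}.qza" --output-dir "table-l${level}"
        run biom convert --to-tsv -i "./table-l${level}/feature-table.biom" \
            -o "./table-l${level}/feature-table.tsv" --table-type "Taxon table"
    done
}

collapse_levels 2 6
```

In dry-run mode the quotes around "Taxon table" are not shown in the printed line, so the printed commands are for checking, not for copy and paste.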

I compared the level 6 taxon tables in R. The R code was:

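library(pheatmap)

# Columns to keep: the taxon column plus the six most unstable samples
name <- c("OTU", "ZY006", "ZY055", "2004", "2017", "2036", "2049")

# First level 6 table; tag its sample columns with '_q2'
d <- read.table('./l6_q2_z7.tsv', header = TRUE, sep = '\t', check.names = FALSE)
d_sel <- d[, colnames(d) %in% name]
is_sample <- colnames(d_sel) != 'OTU'
colnames(d_sel)[is_sample] <- paste0(colnames(d_sel)[is_sample], '_q2')

# Second level 6 table; tag its sample columns with '_q2_latter'
e <- read.table('./feature-table_stand.tsv', header = TRUE, sep = '\t', check.names = FALSE)
e_sel <- e[, colnames(e) %in% name]
is_sample <- colnames(e_sel) != 'OTU'
colnames(e_sel)[is_sample] <- paste0(colnames(e_sel)[is_sample], '_q2_latter')

# Join the two tables on taxon and keep only taxa present in both
comb <- merge(d_sel, e_sel, by = 'OTU', all = TRUE)
comb <- na.omit(comb)

# Pairwise correlations between all sample columns, printed and drawn as a heatmap
cor(comb[-1])
pheatmap(cor(comb[-1]))

# Sort taxa by their standard deviation across samples, most variable first
comb_t <- comb[order(apply(comb[-1], 1, sd), decreasing = TRUE), ]
write.table(comb_t, 'merge.txt', quote = FALSE, sep = '\t', row.names = FALSE)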

###3.

The 60 samples and the 240 samples came from two independent sequencing runs. So I think you are right: a batch/run effect may be the main reason that no significant differences could be detected between the two groups. THANK YOU!

I am now focusing on the problem that several samples have different taxon tables in the two QIIME2 analyses (60 samples and 300 samples). A larger sample size means more reads are analysed at one time, and I suspect that the larger number of reads may introduce some errors in OTU clustering. So I am going to run more analyses with different sample sizes (such as 150 and 200 samples).

Meanwhile, I trained the 97% classifier myself from data I downloaded from Greengenes, so the classifier may be unreliable. I will re-train a classifier based on the SILVA database and then repeat the same comparison.
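
For the runs with different sample sizes, something like the following could be used to build the input subsets. This is only a sketch under assumptions that are not stated above: that `./01.data` holds one `*.fastq.gz` file per sample, that GNU `shuf` and `find` are available, and that the import step needs nothing else from the directory. The function name `make_subset` and the `subset-N` directory names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: build a directory containing N randomly chosen samples from a source directory.
# Assumes one *.fastq.gz file per sample (an assumption, not confirmed by the data layout).
set -eu

# make_subset SRC_DIR N OUT_DIR
# Symlink N randomly chosen *.fastq.gz files from SRC_DIR into OUT_DIR.
make_subset() {
    local src_abs n out
    src_abs="$(cd "$1" && pwd)"   # absolute path so the symlinks stay valid
    n="$2"
    out="$3"
    mkdir -p "$out"
    find "$src_abs" -maxdepth 1 -type f -name '*.fastq.gz' \
        | shuf -n "$n" \
        | while IFS= read -r path; do
            ln -s "$path" "$out/$(basename "$path")"
        done
}

# Example use (commented out; the import command is the one from step 2):
# for n in 150 200; do
#     make_subset ./01.data "$n" "./subset-$n"
#     qiime tools import --input-path "./subset-$n" \
#         --type 'SampleData[SequencesWithQuality]' --output-path "demux-$n.qza"
# done
```

The draw is random, so it does not guarantee that the samples being compared (ZY006, ZY055, 2004, 2017, 2036, 2049) end up in every subset. They would have to be added by hand, or the selection changed to always include them.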
Thank you very much!