Losing reads after the merging and chimera removal steps

Dear all,

I am using RStudio to analyze my samples. I exported the demux-paired-end.qza data to R and started analyzing my samples.
I used the following tutorial
https://benjjneb.github.io/dada2/tutorial.html

I ran the following commands on my data:

Filtering and trimming
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(200, 200),
maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
compress=TRUE, multithread=FALSE) # On Windows set multithread=FALSE

Merging paired reads
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose = TRUE)

The output here is:

  abundance forward reverse nmatch nmismatch nindel prefer accept
1       284      18      13    106         0      0      1   TRUE
2       282      21      30    106         0      0      1   TRUE
3       268      12      54    106         0      0      1   TRUE
4       253      16      12    106         0      0      1   TRUE
5       228      19      19    106         0      0      1   TRUE
6       222      17      14    106         0      0      1   TRUE

#construct sequence table
seqtab <- makeSequenceTable(mergers)
dim(seqtab)

Output : 13 9641

Inspect distribution of sequence lengths

table(nchar(getSequences(seqtab)))
Output:
 292  293  294  295  296  297  301  305
   1    3 1865 4119  305 3345    1    2

seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=FALSE, verbose=TRUE)
dim(seqtab.nochim)
sum(seqtab.nochim)/sum(seqtab)

Output:
13 1431
0.5271426

getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))

colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
head(track)

Output:
                                              input filtered denoisedF denoisedR merged nonchim
Bac18-041119-F2-R22_S233_L001_R1_001.fastq.gz 25548    24682     24185     23374  21075   13273
Bac18-041119-F2-R23_S249_L001_R1_001.fastq.gz 28864    28044     27416     26477  22356   18298
Bac18-041119-F2-R24_S265_L001_R1_001.fastq.gz  3323     3167      3085      3024   2440    2300
Bac18-041119-F3-R13_S74_L001_R1_001.fastq.gz  20947    19977     19558     19489  18377   10665
Bac18-041119-F3-R14_S90_L001_R1_001.fastq.gz  29165    28268     27616     27432  25183   12378
Bac18-041119-F3-R16_S122_L001_R1_001.fastq.gz 32142    31383     31061     30701  29621   15933

It looks like most of the reads, around 50%, are lost during chimera removal. What steps should I take to avoid losing so many of my reads in chimera removal?

Can someone please help.

Thank you so much!

Hi @pbaranw,

Welcome to the :qiime2: forum. I've reclassified this as "Other tools" since we primarily support q2-dada2 rather than the R version.

Best,
Justine

Hi @pbaranw,

The usual suspects come to mind here:

A) Have you removed all your primers and other non-biological sequences from your reads before running DADA2?
B) Are your samples from a low-biomass source, or is the ratio of host to microbial DNA too high? These can often lead to excessive chimeras.
C) What do these "chimeric" reads look like? You can try BLASTing some of them to see.
D) If you think real reads are being discarded as chimeric, you can increase the minFoldParentOverAbundance parameter to, say, 4 or 8 and see if that helps.
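As a minimal sketch of option D, using the object names from the DADA2 tutorial (the threshold of 4 is illustrative, not a recommendation):

```r
library(dada2)

# Retry chimera removal with a higher minFoldParentOverAbundance, so a
# sequence is flagged as chimeric only if its candidate "parents" are at
# least 4-fold more abundant than it is.
seqtab.nochim2 <- removeBimeraDenovo(seqtab, method="consensus",
                                     minFoldParentOverAbundance=4,
                                     multithread=FALSE, verbose=TRUE)

# Compare the retained fraction to the default run:
sum(seqtab.nochim2)/sum(seqtab)
```

If raising this parameter recovers most of the lost reads, that is a hint that abundant real sequences were being mistaken for parents of chimeras; if the fraction barely changes, leftover primers are the more likely culprit.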


Hello @Mehrbod_Estaki

Thank you for the reply!

I am not sure whether I removed the primers or not, but I used the following command:

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(200, 200),
maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
compress=TRUE, multithread=FALSE) # On Windows set multithread=FALSE

head(out)

and I got the following result:

                                              reads.in reads.out
Bac18-041119-F2-R22_S233_L001_R1_001.fastq.gz    25548     24682
Bac18-041119-F2-R23_S249_L001_R1_001.fastq.gz    28864     28044
Bac18-041119-F2-R24_S265_L001_R1_001.fastq.gz     3323      3167
Bac18-041119-F3-R13_S74_L001_R1_001.fastq.gz     20947     19977
Bac18-041119-F3-R14_S90_L001_R1_001.fastq.gz     29165     28268
Bac18-041119-F3-R16_S122_L001_R1_001.fastq.gz    32142     31383

Do you think my primers are removed through this step?

The quality plots for the forward (R1) and reverse (R2) reads are as follows:

The quality looks okay to me, so I truncated at position 200 for both R1 and R2. I thought this step would also trim the primer sequences.

Hi @pbaranw,
Assuming you are following the DADA2 online tutorial, have a look at the very first section:

Starting point

This workflow assumes that your sequencing data meets certain criteria:

  • Samples have been demultiplexed, i.e. split into individual per-sample fastq files.
  • Non-biological nucleotides have been removed, e.g. primers, adapters, linkers, etc.

So, based on that second bullet point, you need to remove these before running DADA2; DADA2 itself does not do this.
You can use a tool like cutadapt in QIIME 2 to remove these, if they are in fact still in your reads. You can also look through your fastq files to check whether they have been removed. Use a command like zcat myfilename.fastq.gz | head to look at the first few lines.
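Here is a minimal sketch of such a check. The file name and the primer sequence (the common 515F primer, GTGYCAGCMGCCGCGGTAA) are assumptions for illustration only; substitute your own file and primer:

```shell
# Hypothetical example: create a tiny gzipped FASTQ so the commands below
# are self-contained. Replace example.fastq.gz with your real file.
printf '@read1\nGTGCCAGCAGCCGCGGTAATACGGAGG\n+\nIIIIIIIIIIIIIIIIIIIIIIIIIII\n' \
  | gzip > example.fastq.gz

# Peek at the first record, as suggested above:
zcat example.fastq.gz | head -4

# Count how many of the first 100 reads still begin with the primer.
# Y and M are IUPAC ambiguity codes, written here as character classes:
zcat example.fastq.gz | head -400 | awk 'NR % 4 == 2' \
  | grep -c '^GTG[CT]CAGC[AC]GCCGCGGTAA'
```

If the primer shows up at the start of most reads, it has not been removed, and that alone can explain an inflated chimera rate.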