2 taxa of 18S mock communities are missing after deblur dereplication step

LivYeh · April 30, 2019, 12:06am

Hi all,

I'm using deblur on 18S mock communities amplified using V4 primers and found that 2 taxa are missing at the ASV table. Then, I looked into the stats and found out they are discarded at the dereplication step, which is out of my expectation. I wonder if anyone ran into the same issue and figured a way to fix it.

Nicholas_Bokulich · May 2, 2019, 12:20pm

Do those taxa have 100% sequence similarity (or high similarity) with other taxa in the V4 region? it sounds like that is the case, in which case there is probably nothing you can do.

LivYeh · May 2, 2019, 6:42pm

Thanks for your response.
These two taxa are similar but not 100% identical. I thought that may happen at the denoising stage because deblur may recognize them as artifacts, but they got removed at the dereplication stage, which raised my concern. I actually tried both DADA2 and deblur. DADA2 gave me the exact number of taxa I put in, whereas deblur removed two of them. I know I can just use DADA2 for my work, but feel it's good to let people be aware of this issue and maybe debug it somehow.

Nicholas_Bokulich · May 2, 2019, 6:45pm

Thanks @LivYeh! I will notify the deblur developers — they may follow up here if they have questions.

And I agree: for your analysis, it makes sense to move ahead with dada2 if that "just works".

wasade · May 2, 2019, 8:44pm

Hi @LivYeh,

Just wanted to check, how was it determined the sequences were removed during the dereplication step, and would it be possible to share the stats output? And, f you have a moment, would it be possible to share the command line calls made to execute both Deblur and DADA2?

All the best,
Daniel

LivYeh · May 3, 2019, 12:24am

I attached the stats.csv here.

Below are command lines
qiime deblur denoise-other
--i-reference-seqs /home/db/bbsplit-db/SILVA_132_and_PR2_EUK.cdhit95pc.qza
--i-demultiplexed-seqs 04-QCd/filtered_sequences.qza
--p-trim-length 380
--p-sample-stats
--output-dir 09-deblurred

qiime dada2 denoise-paired
--i-demultiplexed-seqs 18s.qza
--p-trim-left-f 0
--p-trim-left-r 0
--p-trunc-len-f 220
--p-trunc-len-r 220
--output-dir 03-DADA2d
--p-n-threads 10

stats.csv (1.5 KB)

wasade · May 6, 2019, 4:37pm

Thank you, @LivYeh. Unique dereplicated reads are the number of unique sequences within a sample. For example, if there were three sequences in your sample:

AAATTTGGG
AAATTTGGG
CCCTTTGGG

The dereplication would reduce that to the following two unique sequences:

AAATTTGGG
CCCTTTGGG

What the stats output suggests to me is that the sequences expected were dropped prior to execution of deblur. There is an additional confounding between the dada2 and deblur output as well since different input data is used. Would it be possible to rerun deblur using the same input as dada2? This may help to better isolate where the sequences you're expecting are dropping out. Note as well that the total number of input reads is pretty small, so the frequency of a given unique sequence will be small as well.

Best,
Daniel

LivYeh · May 7, 2019, 5:08pm

Hi Daniel,

Thanks for your explanation.
I should explain my amplicon data more clearly. It’s HiSeq PE250. That'se why input files for two pipelines are different (Deblur requires merged paired-end reads but data2 demands reads without merging as input files). So I don’t know how to rerun Deblur using the same input as dada2.

The first 4 samples in the stats spreadsheet are in silico mock communities which are used for pipeline testing, so their sample size is relatively small.

Best,
Liv

wasade · May 7, 2019, 5:35pm

Ah, okay thanks. Is it possible that the merged reads are not sufficiently long to differentiate the mock community organisms?

Best,
Daniel

LivYeh · May 7, 2019, 11:15pm

Hi Daniel,

But then I should get the same answer with DADA2. The same amplicon data were processed with both pipelines. With Deblur, I merged reads using Vsearch and then denoised with Deblur. With DADA2, I denoised and then merged reads both in DADA2.

I think my main question is what "reads-derep" column means in the stats spreadsheet. If it means reads get removed at dereplication step, why though? I thought dereplication produces unique sequences without discarding anything.

Thanks,
Liv

wasade · May 7, 2019, 11:47pm

I don't think that assumption is safe as the processing is different; is there sufficient information in the result of the merging with vsearch to differentiate the missing organisms from the mock community?

Just to verify, are the additional results from DADA2 correct such that the observed sequences are a perfect match to mock community members not observed by Deblur?

The reads-derep column expresses the result of applying vsearch's dereplication method to an input sample. This step is performed so that subsequent multiple sequence alignments, with MAFFT, are performed with only unique sequences. This in part helps to avoid substantial excess compute on redundant data. Nothing is discarded in the depreplication as the number of replicated sequences is tracked.

Best,
Daniel

LivYeh · May 8, 2019, 12:24am

Hi Daniel,

I attached the in silico mock files after merging with vesearch here If you don't mind taking a look. insilico_18S_mock_merged_Vsearch.zip (29.3 KB)

I find 10 unique sequences in even mock and 16 unique sequences in staggered mock, but end up having 9 in even and 14 in staggered after running deblur.

And yes, the results from DADA2 are correct. The real mock communities perfectly match the in silico.

Thanks,
Liv

wasade · May 8, 2019, 1:17am

Hi @LivYeh,

There doesn't appear to be a difference between the QC and non QC files sent:

$ find . -name "*.gz" -exec gunzip {} \;
$ find . -name "*.fastq" -exec md5 {} \;
MD5 (./18S_insilico_mock_merged/18s-mock-EukV4YRR-staggered-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_1_L001_R1_001.fastq) = bd23b0520cfaf8e6f3cd6587826a30c5
MD5 (./18S_insilico_mock_merged/18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fastq) = cdd43f26b682158ddd42e39f222649e8
MD5 (./18S_insilico_mock_merged_QCd/18s-mock-EukV4YRR-staggered-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_1_L001_R1_001.fastq) = bd23b0520cfaf8e6f3cd6587826a30c5
MD5 (./18S_insilico_mock_merged_QCd/18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fastq) = cdd43f26b682158ddd42e39f222649e8

The number of unique reads in the even sample seems to make sense, and those reads are unique over the first 100 nucleotides as well:

$ grep "^[ATGC]" 18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fasta | sort - | uniq | wc -l
      10
$ grep "^[ATGC]" 18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fasta | sort - | cut -c 1-100 | uniq | wc -l
      10
$ grep "^[ATGC]" 18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fasta | sort - | cut -c 1-100 | uniq -c
  10 AGCTCCAAGAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGTATAACGCGCCTGGCCCGCTTTTGTGAGTGCCGGTGCGCGT
  10 AGCTCCAATAGCGTATACTAACGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGTTAGGGTGAGGCGGCCGGCCACTCGTGGTTGTAGCTTGTTAT
  10 AGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGGAGGGTGCCGCCGTCCGGCGTGTCCGTGTGCAGTGGCGCCC
  10 AGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGTGGGAGTGATCGGTCCTTCACTTAGTGTTGGAACCTGATTG
  10 AGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAGAACGCTCGTAGTCGGATTTCGGGGCGGTCCGACCGGTCTGCCGATGGGTATGCACTGGTCGGAG
  10 AGCTCCAATAGCGTATATTAAAGTTGTTGCGGTTAAAAAGCTCGTAGTTGGATTTCTGCCGAGGACGACCGGTCCGCCCTCTGGGTGTGTATCTGGCTCG
  10 AGCTCCAATAGCGTATATTAAAGTTGTTGCGGTTAAAAAGCTCGTAGTTGGATTTCTGCTGAAGCAAACCGGTCCGCCCTCTGGGTGAGCATCTGGTTTT
  10 AGCTCCAATAGCGTATATTAAAGTTGTTGCGGTTAAAAAGCTCGTAGTTGGATTTCTGTCGGTGAGTGAAAGTCCGCTCTCAGTGGTTGGTACTTTTCAC
  10 AGCTCCAATAGCGTATATTAAAGTTGTTGTGGTTAAAAAGCTCGTAGTTGGATCTCAACAGCCTTGAAGCGGTTAACTTTGTAGTTTTACTGCTTTATAG
  10 AGCTCCAATAGCGTATGTTAAAGTTGTTGCGGCTAAAAAGCTCGTAGTTGGATTTCTAGATGAGAGGAGTGGTCCACTCTATGAGTGTGTATCTACTTCC

I then converted the input file to FASTA (as the PHRED type detection requires additional FASTQ record information), and ran the even sample through Deblur. Note that I trimmed to 100nt as the above assessment showed the data are unique over the first 100nt.

$ grep -A 1 "^@" 18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fastq | grep -v "\-\-" | tr "@" ">" > 18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fasta
$ deblur workflow --seqs-fp 18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fasta --output-dir test -t 100

I observe 10 unique Deblur OTUs, which is what I'd expect given the input data.

$ grep -c "^>" test/all.seqs.fa
10

I then reran Deblur using the trim length you had used (380), and observed 9 sequences from Deblur:

$ deblur workflow --seqs-fp 18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fasta --output-dir test-full -t 380
$ grep -c "^>" test-full/all.seqs.fa
9

On closer inspection, not all of the in silico reads are of the same length. The trim length of 380 would omit 10 sequences.

$ for i in $(grep "^[ATGC]" 18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fasta); 
do   
    echo ${#i}; 
done | sort | uniq -c
  10 373
  40 381
  20 382
  20 383
  10 384

The sequences that are of length 373 all the same as well.

$ for i in $(grep "^[ATGC]" 18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fasta); 
do 
    if [[ ${#i} -eq 373 ]]; 
    then   
        echo $i; 
    fi; 
done | sort | uniq -c
  10 AGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCTGGGAGGGTGCCGCCGTCCGGCGTGTCCGTGTGCAGTGGCGCCCTTCCATCCTTCTGTTAGCGTCTCTTGGCATTCATTTGCTGGTGGCGGGCTCAGATATTTTACCTTGAGAAAATTAGAGTGTTTCAGGCAGGCTAGGCCGGAATACATTAGCATGGAATAATGGAATAGGACTACGGTCTCTTTGTTGGTTTGAGGGACTGCAGTAATGATTAATAGGGATAGTTGGGGGCATTAGTATTTAATTGTCAGAGGTGGAATTCTCAGATTTGTTAAAGACTAACTTATGCGAAAGCATTTGCCAAGGATGTTTTCA

In the stats output, the reason there are only 9 dereplicated sequences is because the the filter for sequence length is applied before dereplication. A filter length of 380 would omit these 10 sequences, leaving only 9 unique ones left over. This can be confirmed by rerunning deblur with a trim length of 373:

$ deblur workflow --seqs-fp 18s-mock-EukV4YRR-even-insilico.trimmed.SILVA-132-and-PR2-EUK.cdhit95pc-1.fastq_0_L001_R1_001.fasta --output-dir test-full-373 -t 373
$ grep -c "^>" test-full-373/all.seqs.fa
10

Best,
Daniel

LivYeh · May 8, 2019, 10:15pm

Hi Daniel,

Thank you so much! I didn't realize the trim length option will discard reads that are shorter than the parameter I gave. I know there is an option to disable trimming by setting it -1. I wonder if you'd suggest to do it this way.

Thanks again,
Liv

wasade · May 8, 2019, 10:20pm

I'd advise a consistent trim length. Mixing lengths can be a source of technical bias. Some of this is discussed in Debelius et al.

Best,
Daniel

LivYeh · May 8, 2019, 10:42pm

OK. Thanks for pointing me this paper!

system · June 9, 2019, 4:42am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.