Deblur workflow questions

Luke_Thompson · August 3, 2018, 4:21pm

Hi @maitreyi, let me take a stab at your questions about Deblur.

1- Sequences are trimmed from the 3' end (the last bases to be sequenced) because there is a decline in quality as the sequencing proceeds. Typically, bases at the 5' end are high quality, although there may be lower quality in the first position.

2- Sequences are removed if they differ by only ~1-2 bases from a much more abundant sequence, according to an error model. Assuming an average error rate of 0.006 per position and sequences 300 bp in length, 1-(1-0.006)^300 = 83.6% of sequences are expected to contain at least one error, leaving about 16.4% error-free. Under that error rate, and assuming Deblur works perfectly, a sample with 20,000 sequences would have about 3,280 sequences left after Deblur. So the algorithm may be working as expected. (In the EMP paper, with -t 90, we removed ~half of the sequences.)
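
To make the arithmetic in point 2 easy to rerun with your own read length or error rate, here is a minimal Python sketch. It only reproduces the back-of-the-envelope estimate, assuming independent errors at a uniform per-base rate; it is not Deblur's actual error model, and the function names are made up for illustration.

```python
def fraction_with_error(error_rate: float, length: int) -> float:
    """Probability that a read of `length` bases contains at least one error,
    assuming independent errors at the same rate at every position."""
    return 1.0 - (1.0 - error_rate) ** length


def expected_error_free(n_reads: int, error_rate: float, length: int) -> float:
    """Expected number of reads with no errors under the same assumption."""
    return n_reads * (1.0 - error_rate) ** length


# Values taken from the estimate above: 0.006 errors per base, 300 bp reads,
# 20,000 reads in the sample.
p_error = fraction_with_error(0.006, 300)
n_left = expected_error_free(20_000, 0.006, 300)

print(f"Reads expected to contain an error: {p_error:.1%}")
print(f"Reads expected to be error-free:    {n_left:.0f}")
```

The unrounded count comes out a few reads above 3,280, since that figure uses the rounded 16.4%. Trimming to a shorter length raises the error-free fraction quickly, which is one reason the trim length matters so much.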

3- It's possible you could have Ns in your representative sequences if there are Ns in your actual sequences. But I would think those would be filtered out. So I'm not sure.

4- With a fasta file, you would need to use a script to merge it with another fasta file. With BIOM tables, it's easy to merge using merge_otu_tables.py in QIIME 1. However, both are straightforward if you're using QIIME 2 artifacts: the commands are qiime feature-table merge-seqs (for sequences) and qiime feature-table merge (for tables). It's worth learning QIIME 2. It has active development, more features than QIIME 1, and it's supported (on this forum).
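
For the plain fasta case in point 4, the merging script could look something like the sketch below. It is a hypothetical helper, not part of QIIME or Deblur, and it assumes you want each unique sequence kept once under the first header it was seen with. Change that if your IDs carry meaning.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; sequences may span several lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_fasta(*sources: Iterable[str]) -> Dict[str, str]:
    """Merge fasta inputs into {sequence: header}, keeping the first header
    seen for each unique sequence (an assumption, see the note above)."""
    merged: Dict[str, str] = {}
    for lines in sources:
        for header, seq in read_fasta(lines):
            merged.setdefault(seq, header)
    return merged


def format_fasta(merged: Dict[str, str]) -> str:
    """Render the merged records back to fasta text, one line per sequence."""
    return "".join(f">{header}\n{seq}\n" for seq, header in merged.items())


# In-memory demo; with real files, pass open file handles instead of lists.
run_a = [">s1", "ACGT", ">s2", "GGCC"]
run_b = [">x1", "ACGT", ">x2", "TT", "AA"]
merged = merge_fasta(run_a, run_b)
print(format_fasta(merged), end="")
```

Keying on the sequence itself means the same sequence from two runs collapses into one record, which is usually what you want for representative sequences.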

Luke