Deblur workflow questions

Hello,

I have been trying to run the deblur workflow directly (independently of QIIME) on my samples, and I have a few questions about choosing parameters and interpreting the results. FYI, I am using a cyanobacteria-specific ITS region and have therefore used my own database for positive filtering.

1 - When I set a specific trim length, why are sequences trimmed from one side rather than the other?

2 - On what grounds are sequences eliminated? In my current analysis, all of my samples have >20,000 reads, but after deblur the number of sequences per sample ranges from 500 to 10,000, with most below 7,000. I have tried different trim lengths as well as keeping singletons (my region is ~350 bp, and I have tried trim lengths from 250 to 350). I am fairly confident that most sequences should be hits to my fasta database, so I can’t figure out where I’m losing so many sequences!

My command is:

deblur workflow --seqs-fp all_ITS_June2018_fordeblur.fasta --output-dir all_ITS_deblur_output_300_singletons_choidb -t 300 --min-reads 1 --pos-ref-fp /ITS_db_ref_edited2.fasta -w

3 - What does it mean when there are Ns in the consensus sequences? In some of my runs there is one consensus sequence that is all Ns; it’s not abundant in the OTU table, but I’m not sure what it means…

4 - From what I understand, one of the strengths of deblur is that you don’t have to re-run all of the data when you add new samples to a dataset. But I am not sure how to add new samples to an existing deblurred set. Do you concatenate the new sequences to the all.seqs.fa from the previous deblur run?

Thank you!

Hi @maitreyi, let me take a stab at your questions about Deblur.

1- Sequences are trimmed from the 3’ end (the last bases to be sequenced) because quality declines as sequencing proceeds. Bases at the 5’ end are typically high quality, although there may be lower quality in the first few positions.

2- Sequences are removed if they differ by only ~1-2 bases from a much more abundant sequence, according to an error model. Assuming an average error rate of 0.006 per position and sequences 300 bp in length, 1 - (1 - 0.006)^300 = 83.6% of reads are expected to contain at least one error. At this error rate, and assuming Deblur works perfectly, a sample with 20,000 sequences would have ~3,280 sequences left after Deblur, so the algorithm may be working as expected. (In the EMP paper, with -t 90, we removed roughly half of the sequences.)
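If you want to sanity-check that arithmetic yourself, here is a quick one-liner (this assumes Python is on your PATH; the 0.006 error rate and 300 bp length are just the illustrative values from above):

# expected fraction of 300 bp reads with at least one error, at 0.006 errors per position
python -c "p = (1 - 0.006)**300; print(f'{1 - p:.1%} with errors; ~{20000 * p:.0f} of 20,000 error-free')"
# prints: 83.6% with errors; ~3288 of 20,000 error-free (rounding to 83.6% first gives the ~3,280 above)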

3- It’s possible you could have Ns in your representative sequences if there are Ns in your actual sequences. But I would expect those to be filtered out, so I’m not sure.
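One quick diagnostic (using the input filename from your command above, and assuming each sequence sits on a single line, as is typical for demultiplexed deblur input) is to count how many sequence lines contain an N; if it prints 0, the Ns must be introduced later in the workflow:

# count sequence lines in the input fasta that contain at least one N
grep -v '^>' all_ITS_June2018_fordeblur.fasta | grep -c N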

4- With fasta files, you would need a script to merge one file with another. With BIOM tables, merging is easy using merge_otu_tables.py in QIIME 1. Both steps are straightforward if you’re using QIIME 2 artifacts: the commands are qiime feature-table merge-seqs and qiime feature-table merge, respectively. It’s worth learning QIIME 2: it’s under active development, has more features than QIIME 1, and is supported (on this forum). :grinning:
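As a sketch of both routes (the filenames here are hypothetical placeholders, and option names can vary between QIIME 2 releases, so check qiime feature-table merge --help for your version):

# QIIME 1: merge two BIOM tables
merge_otu_tables.py -i run1_table.biom,run2_table.biom -o merged_table.biom

# QIIME 2: merge feature tables, then the matching representative sequences
qiime feature-table merge --i-tables run1_table.qza --i-tables run2_table.qza --o-merged-table merged_table.qza
qiime feature-table merge-seqs --i-data run1_rep_seqs.qza --i-data run2_rep_seqs.qza --o-merged-data merged_rep_seqs.qza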

Luke
