I have been trying to run the deblur workflow directly (independently of QIIME) on my samples, and I have a few questions about choosing parameters and interpreting the results. FYI, I am working with a cyanobacteria-specific ITS region and have therefore used my own database for positive filtering.
1 - When I set a specific trim length, why are sequences trimmed from one end rather than the other?
2 - What are all the grounds on which sequences can be eliminated? In my current analysis, all of my samples have >20,000 reads, but after deblur the number of sequences per sample ranges from 500 to 10,000, with most <7,000. I have experimented with different trim lengths as well as with keeping singletons. (My region is ~350 bp, and I have tried trim lengths from 250 to 350.) I am fairly confident that most sequences should be hits to my FASTA database, so I can't figure out where I'm losing so many sequences!
My command is:

```
deblur workflow --seqs-fp all_ITS_June2018_fordeblur.fasta --output-dir all_ITS_deblur_output_300_singletons_choidb -t 300 --min-reads 1 --pos-ref-fp /ITS_db_ref_edited2.fasta -w
```
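(For reference, the per-file totals I quote are just FASTA header counts; `demo.fasta` below is a tiny stand-in for my real input and output files:)

```shell
# Count sequences in a FASTA by counting '>' header lines.
# demo.fasta is a placeholder created here just for illustration.
printf '>seq1\nACGT\n>seq2\nTTTT\n' > demo.fasta
grep -c '^>' demo.fasta   # prints 2
```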
3 - What does it mean when there are Ns in the consensus sequences? In some of my runs there is one consensus sequence that is all Ns; it is not abundant in the OTU table, but I am not sure what it means.
4 - From what I understand, one of deblur's strengths is that you do not have to re-run all of the data when you add new samples to a dataset, but I am not sure how to actually add new samples to an existing deblurred set. Do you concatenate the new sequences onto the all.seqs.fa from the previous deblur run?
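(If simple concatenation is in fact the right approach, I assume the step itself would just be the following; the `printf` lines create tiny demo files standing in for my real deblur output and new reads:)

```shell
# Sketch of the concatenation I have in mind; file contents here are
# placeholders, not real deblur data.
printf '>old1\nACGTACGT\n' > all.seqs.fa
printf '>new1\nGGCCGGCC\n' > new_samples.fasta
cat new_samples.fasta >> all.seqs.fa   # append new reads to the previous set
grep -c '^>' all.seqs.fa               # prints 2
```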