Dear Qiimers,
I am trying to set up a reference database and train a classifier for a marker gene other than 16S. I have the primer set and a FASTA file; the sequences currently vary widely in length (500 to 5,000 nt), and I am not sure whether there will be duplicates after the ‘extract-reads’ step. I have a couple of questions about this:
- I did a test run using the GreenGenes database and noticed that after ‘extract-reads’ the number of reference sequences is about half of the original set (99_otus.fasta). How does this work? Does the algorithm merge sequences that become identical once the segment between the primers has been extracted? And does it then merge their taxonomic assignments to the lowest common classification?
- Should I trim my reference sequences to length, then cluster them at a given similarity percentage before generating the database? At this point I am planning to run Deblur (or DADA2) on my sequence data, then use the taxonomic assignment just as a rough guide.
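To make the first question concrete, here is a minimal Python sketch of the behaviour I am imagining: identical amplicons are grouped, and each group receives the taxonomy prefix shared by all of its members. This only illustrates my guess; I do not know whether ‘extract-reads’ does anything like it. The function names, sequences, and lineages are all made up for the example.

```python
from collections import defaultdict


def lowest_common_taxonomy(lineages):
    """Return the longest rank prefix shared by every lineage string."""
    split = [[rank.strip() for rank in lineage.split(";")] for lineage in lineages]
    common = []
    for ranks in zip(*split):
        # Stop at the first rank where the lineages disagree.
        if len(set(ranks)) != 1:
            break
        common.append(ranks[0])
    return "; ".join(common)


def dereplicate(amplicons, taxonomy):
    """Group identical amplicons and assign each group its shared taxonomy."""
    groups = defaultdict(list)
    for seq_id, seq in amplicons.items():
        groups[seq].append(seq_id)
    return {
        seq: (ids, lowest_common_taxonomy([taxonomy[i] for i in ids]))
        for seq, ids in groups.items()
    }


# Hypothetical toy data: ref1 and ref2 become identical after trimming.
amplicons = {"ref1": "ACGTACGT", "ref2": "ACGTACGT", "ref3": "TTGACCGA"}
taxonomy = {
    "ref1": "k__Bacteria; p__Firmicutes; g__Bacillus",
    "ref2": "k__Bacteria; p__Firmicutes; g__Listeria",
    "ref3": "k__Bacteria; p__Proteobacteria; g__Vibrio",
}

merged = dereplicate(amplicons, taxonomy)
for seq, (ids, lineage) in merged.items():
    print(seq, ids, lineage)
```

In this toy case ref1 and ref2 would collapse into one entry labelled only down to the phylum, which is the kind of merging I am asking about.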
I hope this makes sense; please let me know if you need additional details.
Thank you for your kind attention,
Max