Asking for some details regarding the feature classifier - training for marker gene database

Dear Qiimers,

I am trying to set up a database and train a classifier for a marker gene other than 16S. I have the primer set and a FASTA file; the sequences currently vary widely in length (500 to 5,000 nt), and I am not sure whether there will be duplicates after the ‘extract-reads’ step. I have a couple of questions:

  1. I did a test run using the Greengenes database and noticed that after ‘extract-reads’ the number of reference sequences is about half of the original set (99_otus.fasta). How does this work? Does the algorithm merge sequences that become identical after extracting the segment between the primers, and then merge their taxonomic assignments to the lowest common classification?

  2. Should I trim my reference sequences to length, then cluster them at a given similarity percentage before generating the database? At this point I am planning to run Deblur (or DADA2) on my sequence data, then use the taxonomic assignment just as a rough guide.

Hope this makes sense, please let me know if you need additional details.

Thank you for your kind attention,

Sequences that do not match the primers within X mismatches are dropped (see the help docs; there is a parameter to adjust the % mismatch tolerance). There is no dereplication going on.
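To make that behaviour concrete, here is a minimal Python sketch of primer-based extraction. It is not the QIIME 2 implementation: the function names are made up for illustration, it counts absolute mismatches instead of using a percentage, and it does not handle degenerate bases. It only shows the two points above: references without a primer match are dropped, and identical extracted segments are kept as separate records.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_primer(seq, primer, max_mismatches):
    """Start index of the first window matching within the tolerance, else None."""
    for i in range(len(seq) - len(primer) + 1):
        if count_mismatches(seq[i:i + len(primer)], primer) <= max_mismatches:
            return i
    return None


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def extract_reads(refs, f_primer, r_primer, max_mismatches=0):
    """Keep the segment between the primers; drop references lacking a match.

    Hypothetical helper for illustration only. No dereplication is done,
    so identical segments from different references are all retained.
    """
    r_site = reverse_complement(r_primer)
    kept = {}
    for seq_id, seq in refs.items():
        f_start = find_primer(seq, f_primer, max_mismatches)
        if f_start is None:
            continue  # forward primer not found: reference dropped
        amp_start = f_start + len(f_primer)
        r_offset = find_primer(seq[amp_start:], r_site, max_mismatches)
        if r_offset is None:
            continue  # reverse primer site not found: reference dropped
        kept[seq_id] = seq[amp_start:amp_start + r_offset]
    return kept


# Toy data (made up): ref1 and ref2 share the same inter-primer segment,
# ref3 contains neither primer site.
refs = {
    "ref1": "GGACGTTTTTAAAACCAAGG",
    "ref2": "CACGTTTTTAAAACCAAT",
    "ref3": "GGGGGGGGGGGGGGGG",
}
kept = extract_reads(refs, f_primer="ACGT", r_primer="TTGG")
print(kept)
```

In this toy case ref3 disappears because no primer site is found, while ref1 and ref2 both survive even though their extracted segments are identical.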

See the notes in the feature classifier training tutorial. Trimming helps, but it is not essential.

Clustering is not necessary, though it would make some steps more efficient. As yet, we have no best-practices guide for how to do this, nor can the whole procedure be done in QIIME 2: you can cluster the sequences, but there is no way to create a consensus taxonomy from those clusters.
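If you do cluster outside QIIME 2, one way to build a consensus taxonomy is the "lowest common classification" idea from the question: keep the ranks that all members of a cluster agree on. The sketch below assumes semicolon-delimited taxonomy strings and a hypothetical `clusters` mapping of cluster ID to member sequence IDs produced by whatever clustering tool you use. It is an illustration, not an established best practice.

```python
def lca_taxonomy(taxa):
    """Shared leading ranks across semicolon-delimited taxonomy strings."""
    split = [[rank.strip() for rank in t.split(";")] for t in taxa]
    consensus = []
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break  # first rank where the members disagree
        consensus.append(ranks[0])
    return "; ".join(consensus)


def consensus_by_cluster(clusters, taxonomy):
    """Map each cluster ID to the lowest common taxonomy of its members."""
    return {
        cluster_id: lca_taxonomy([taxonomy[m] for m in members])
        for cluster_id, members in clusters.items()
    }


# Toy data (made up) to show the behaviour.
taxonomy = {
    "seqA": "k__Bacteria; p__Firmicutes; c__Bacilli",
    "seqB": "k__Bacteria; p__Firmicutes; c__Clostridia",
    "seqC": "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
}
clusters = {"cluster1": ["seqA", "seqB"], "cluster2": ["seqC"]}
consensus = consensus_by_cluster(clusters, taxonomy)
print(consensus)
```

Here cluster1 is only resolved to phylum because its two members disagree at class level, while the single-member cluster2 keeps its full lineage.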

Good luck!

Thank you for the additional details and advice.

Thanks :wink:

Have a nice day,
