We want to compare different QIIME 2 naive Bayes classifiers (trained on different databases) to assess which database works best with our data. We plan to create a test dataset of known composition to compare the performance of the classifiers. We were wondering how to perform the taxonomy assignment of the test dataset. Is it necessary to simulate the Illumina reads and create the rep-seqs, or can we use the fasta file of the test dataset directly as input for feature-classifier classify-sklearn? Which option is more correct?
If the main focus is comparing taxonomic assignments from different classifiers, then you can stick with the rep-seqs file (or at least dereplicated sequence data). You might be interested in the RESCRIPt plugin, which has several tools for such comparative analyses. You can read our paper for more details.
Thanks for your quick response. If I understood correctly, you recommend using the rep-seqs file for the taxonomic assignment, but why not use the fasta file of the test dataset directly? The test dataset was created with sequences from a database.
I guess I misunderstood what each of these files was going to be used for... Upon re-reading, I think I follow what you were asking. You can use the fasta file directly; no need to simulate anything.
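For the comparison step itself, here is a minimal Python sketch of how each classifier's assignments could be scored against the known taxonomy of the test dataset. It is only an illustration, not how RESCRIPt does it. The function names, sequence IDs, and taxonomy strings are all hypothetical. It also assumes the expected and predicted taxonomies use the same semicolon-delimited label format, which may need harmonizing first when the classifiers were trained on different databases.

```python
from collections import Counter


def split_ranks(taxonomy, depth):
    """Return the first `depth` ranks of a semicolon-delimited taxonomy string."""
    ranks = [r.strip() for r in taxonomy.split(";") if r.strip()]
    return ranks[:depth]


def score_classifier(expected, predicted, depth):
    """Score predicted taxonomy against expected taxonomy at a given rank depth.

    Both arguments are dicts mapping sequence ID -> taxonomy string
    (hypothetical inputs, e.g. parsed from exported taxonomy tables).
    Returns the fraction of test sequences in each outcome class.
    """
    counts = Counter()
    for seq_id, true_tax in expected.items():
        true_ranks = split_ranks(true_tax, depth)
        pred_ranks = split_ranks(predicted.get(seq_id, ""), depth)
        if pred_ranks == true_ranks:
            counts["match"] += 1
        elif len(pred_ranks) < len(true_ranks) and pred_ranks == true_ranks[:len(pred_ranks)]:
            # Correct as far as it goes, but stops at a shallower rank.
            counts["underclassified"] += 1
        else:
            counts["mismatch"] += 1
    total = len(expected)
    return {k: counts[k] / total for k in ("match", "underclassified", "mismatch")}


# Hypothetical known composition of the test dataset.
expected = {
    "seq1": "k__Bacteria; p__Firmicutes; c__Bacilli",
    "seq2": "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
    "seq3": "k__Bacteria; p__Bacteroidetes; c__Bacteroidia",
    "seq4": "k__Bacteria; p__Actinobacteria; c__Actinobacteria",
}

# Hypothetical output of one classifier on the same sequences.
predicted_a = {
    "seq1": "k__Bacteria; p__Firmicutes; c__Bacilli",
    "seq2": "k__Bacteria; p__Proteobacteria",
    "seq3": "k__Bacteria; p__Firmicutes; c__Clostridia",
    "seq4": "k__Bacteria; p__Actinobacteria; c__Actinobacteria",
}

scores_a = score_classifier(expected, predicted_a, depth=3)
print(scores_a)
```

Running the same function on each classifier's output at the same depth gives directly comparable numbers. Keeping "underclassified" separate from "mismatch" helps distinguish a classifier that is conservative from one that is wrong.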