Questions about mock microbiome data sets

Nicholas_Bokulich · January 2, 2020, 9:36pm

It sounds like these have the wrong Phred encoding, as @Mehrbod_Estaki mentioned (only the barcodes have artificial quality scores, so that is not the issue, but the encoding is). It is not your fault: you are using the correct command, but right now the EMP format does not have a way to select the Phred encoding, and it just assumes that you are using Phred33 (Phred64 is an old encoding! mock-8 is a very old dataset). If you want to use mock-8, see the workaround described here:

For all benchmarking it is best to use multiple datasets so that you aren't overfitting on a single dataset or single mock community. So there is no "best", but I can offer some advice. The smaller the number, the older the dataset... so try mock-16 through mock-23; these are more up-to-date 16S datasets (i.e., more "modern" read lengths, and probably Phred33, but I don't know offhand).

Some of the older datasets are still very useful (e.g., mock-3 is very small, so it is handy for setting up a test pipeline quickly), but older datasets can have issues with obsolete data formats, as you have discovered.
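If you want to check which encoding a given FASTQ file uses before importing it, a rough heuristic is to look at the range of quality characters. The sketch below is only an illustration and is not part of QIIME 2 or mockrobiota: the function names are made up for this example, and the cutoffs (characters below ASCII 59 imply Phred33, characters above ASCII 74 imply an offset of 64) are the commonly cited ranges for these encodings, so treat an ambiguous result as "inspect more reads" rather than as an answer.

```python
import gzip


def iter_quality_lines(path):
    """Yield the quality line (every 4th line) of a FASTQ file.

    Assumes a plain 4-line-per-record FASTQ; gzip is used if the
    (hypothetical) path ends in .gz.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 3:
                yield line.rstrip("\n")


def guess_phred_offset(quality_lines):
    """Return 33, 64, or None (ambiguous) from the observed character range."""
    codes = [ord(ch) for line in quality_lines for ch in line]
    if not codes:
        return None
    if min(codes) < 59:
        # Characters this low cannot occur with an offset of 64.
        return 33
    if max(codes) > 74:
        # Characters this high are not expected with an offset of 33.
        return 64
    return None


def phred64_to_phred33(quality):
    """Shift a Phred64 quality string down by 31 so it reads as Phred33."""
    return "".join(chr(ord(ch) - 31) for ch in quality)
```

For example, `guess_phred_offset(iter_quality_lines("forward.fastq.gz"))` (file name assumed) returning 64 would suggest converting the quality lines before importing with a format that assumes Phred33.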

If you have or find more recent mock communities, please feel free to post them in mockrobiota so that others can re-use them.

I hope that helps!