Simulated data?

Hi all,

Not sure if this is the right place to post this, so please redirect as necessary. I’m trying to simulate a microbial community from a pre-defined set of reads. I’ve been through about 8 Illumina read simulators, and each one has one major flaw that makes it not quite right for my application. (Mason takes out a random fragment, InSilicoSeq doesn’t allow you to set read lengths, and Grinder won’t give me quality scores for all the money I pour into the swear jar.) I’m wondering if anyone else uses simulated communities for benchmarking and, if so, which tool they use to simulate reads.



Hello Justine,

Art? :art:
BEAR? :bear:
EAGLE? :eagle:
neat-genreads? :this-is-not-an-emoji:

Twist: Depending on your question, you could try a different tactic: What if you don't simulate reads, but instead simply randomly subsample and mix from real samples? This gives you mock samples with known composition, in which the errors and issues are 100% real.
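
For anyone who wants to try the subsample-and-mix idea, here is a minimal sketch in plain Python. It assumes uncompressed, four-line-per-record FASTQ files that fit in memory. The function names (`read_fastq`, `mix_samples`, `write_fastq`) and the `source=` header tag are made up for illustration and do not come from any particular tool.

```python
import random


def read_fastq(path):
    """Return a list of (header, sequence, plus, quality) records from an
    uncompressed FASTQ file with exactly four lines per record."""
    records = []
    with open(path) as handle:
        while True:
            record = [handle.readline().rstrip("\n") for _ in range(4)]
            if not record[0]:  # empty header line means end of file
                return records
            records.append(tuple(record))


def mix_samples(sources, total_reads, seed=0):
    """Build a mock sample with known composition.

    sources maps a sample name to (records, fraction), where fraction is the
    share of total_reads drawn from that sample without replacement.
    """
    rng = random.Random(seed)  # fixed seed so the mock sample is reproducible
    mixed = []
    for name, (records, fraction) in sources.items():
        n_reads = round(total_reads * fraction)
        if n_reads > len(records):
            raise ValueError(f"{name} has only {len(records)} reads, need {n_reads}")
        for header, seq, plus, qual in rng.sample(records, n_reads):
            # Tag each read with its source so the true composition stays traceable.
            mixed.append((f"{header} source={name}", seq, plus, qual))
    rng.shuffle(mixed)
    return mixed


def write_fastq(records, path):
    """Write records back out in four-line FASTQ format."""
    with open(path, "w") as handle:
        for record in records:
            handle.write("\n".join(record) + "\n")
```

With two real samples loaded via `read_fastq`, a call like `mix_samples({"gut": (gut_reads, 0.75), "soil": (soil_reads, 0.25)}, total_reads=10000)` would give a 75/25 mock sample. The sample names and file contents here are placeholders. For paired-end data you would need to sample read pairs together, which this sketch does not do.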



Hi @colinbrislawn,


I just pulled Art, and I’m going to look into Eagle more. The lack of emoji-able software is frustrating!

I’m doing some subsampling, but it’s a bit of a weird application! I need multiple hyper-variable regions from the same sample in known mixtures that were sequenced on Illumina. I’ve got several sets of primers, so I think using Grinder to simulate PCR and then some other platform to sequence the Grinder reads is probably my best approach.
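
To make the "simulate PCR first, then sequence" idea concrete, here is a rough sketch of in-silico amplicon extraction in Python. It is not how Grinder does it: it only finds exact primer matches on the forward strand and ignores degenerate bases and mismatches, so treat it as an illustration under those assumptions. The function names and the toy sequences are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def extract_amplicon(template, forward_primer, reverse_primer):
    """Return the region a primer pair would amplify, primers included.

    Assumes exact matches only. Returns None if either primer site is missing.
    """
    start = template.find(forward_primer)
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so on the forward
    # strand we look for its reverse complement downstream of the forward primer.
    reverse_site = reverse_complement(reverse_primer)
    end = template.find(reverse_site, start + len(forward_primer))
    if end == -1:
        return None
    return template[start:end + len(reverse_site)]


# Toy example with made-up sequences.
template = "TTTTACGTACGGGGCCCCGACCTTAAAA"
amplicon = extract_amplicon(template, "ACGTAC", "AAGGTC")
print(amplicon)
```

Running one primer set per hyper-variable region over the same reference sequences would give one amplicon file per region, which could then be handed to a read simulator.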



I just wanted to update and say that Art :art: seems to be the best answer, in case anyone else runs into a similar application.



I have found CAMISIM very useful in the past for simulating reads. It wraps a few of the tools mentioned here in an easier-to-use format. The best part is that it takes BIOM files as input.


Does CAMISIM work with amplicon data? When I read the paper, it looked like it was only for metagenomics?