Simulated data?

jwdebelius · April 3, 2019, 4:30pm

Hi all,

Not sure if this is the right place to post this (so please redirect as necessary). I'm trying to simulate a microbial community from a pre-defined set of reads. I've been through about eight Illumina read profilers, and each one has one major flaw that makes it not quite right for my application. (Mason takes out a random fragment, InSilicoSeq doesn't allow you to set read lengths, and Grinder won't give me quality scores for all the money I pour into the swear jar.) I'm wondering if anyone else uses simulated communities for benchmarking and, if so, which tool they'd suggest for simulating reads.

Thanks,
Justine

colinbrislawn · April 11, 2019, 11:38pm

Hello Justine,

Art?
BEAR?
EAGLE?
neat-genreads? :this-is-not-an-emoji:

Twist: Depending on your question, you could try a different tactic: What if you don't simulate reads, but instead simply randomly subsample and mix from real samples? This gives you mock samples with known composition, in which the errors and issues are 100% real.
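To make the subsample-and-mix idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function name `mix_samples`, the in-memory lists of reads, and the fixed seed are hypothetical, and real FASTQ parsing is left out.

```python
import random


def mix_samples(samples, proportions, total_reads, seed=0):
    """Build a mock sample by drawing reads from real samples.

    samples: dict mapping sample name -> list of reads (hypothetical input)
    proportions: dict mapping sample name -> fraction of the mock sample
    total_reads: number of reads wanted in the mock sample
    Returns a shuffled list of (source_name, read) tuples.
    """
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    rng = random.Random(seed)  # fixed seed so the mock sample is reproducible
    mock = []
    for name, fraction in proportions.items():
        n = round(total_reads * fraction)
        if n > len(samples[name]):
            raise ValueError(f"sample {name!r} has fewer than {n} reads")
        # Sample without replacement so no real read is duplicated.
        mock.extend((name, read) for read in rng.sample(samples[name], n))
    rng.shuffle(mock)  # interleave sources, as in a real sequencing run
    return mock
```

Keeping the source name next to each read is what gives you the known composition to benchmark against.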

Colin

jwdebelius · April 12, 2019, 8:35am

Hi @colinbrislawn,

Thanks!

I just pulled Art, and I'm going to look into Eagle more. The lack of emoji-able software is frustrating!

I'm doing some subsampling, but it's a bit of a weird application! I need multiple hyper-variable regions from the same sample, in known mixtures, that were sequenced on Illumina. I've got several sets of primers, so I think using Grinder to simulate PCR and then some other tool to sequence the Grinder reads is probably my best approach.
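For anyone unfamiliar with the "simulate PCR" step, a rough Python sketch of the idea follows. This is not how Grinder does it; it is a simplified stand-in that assumes exact primer matches (IUPAC degenerate bases allowed, no mismatches) and returns only the first hit. The function names and the toy sequences are hypothetical.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence, IUPAC codes included."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a (possibly degenerate) primer into a regex pattern."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_amplicon(template, forward, reverse):
    """Return the first region spanning both primer sites, or None.

    The reverse primer is given 5'->3' as ordered, so its reverse
    complement is what appears on the template strand.
    """
    template = template.upper()
    fwd = re.search(primer_regex(forward), template)
    if fwd is None:
        return None
    downstream = template[fwd.end():]
    rev = re.search(primer_regex(reverse_complement(reverse)), downstream)
    if rev is None:
        return None
    # Amplicon runs from the start of the forward site to the end of the reverse site.
    return template[fwd.start():fwd.end() + rev.end()]


# Toy example: one template, one primer pair.
amplicon = extract_amplicon("GGGACGTAAATTTCCGGATCCTTT", "ACGY", "ATCCGG")
```

Running this once per primer set on the same reference sequences would give one amplicon per hyper-variable region, which could then be passed to a read simulator.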

Best,
Justine

jwdebelius · April 16, 2019, 1:28pm

I just wanted to update and say that Art seems to be the best answer, in case anyone else runs into a similar application.

Thanks!

cmartino · April 16, 2019, 11:22pm

I have found CAMISIM very useful in the past for simulating reads. It wraps a few of the tools mentioned here in an easier-to-use format. The best part is that it takes BIOM files as input.

jwdebelius · April 17, 2019, 7:37am

Does CAMISIM work with amplicon data? When I read the paper, it looked like it was only for metagenomics.