What AWS EC2 size is best for teaching QIIME 2?

PatrickO · February 11, 2019, 8:04pm

Hello, I am a grad student TA trying to teach QIIME 2 to a couple of small groups in the fall. I thought a good way to do this would be to set up an AWS EC2 instance that the students could log into and run the tutorials on (their personal PCs often run Windows, so teaching the Linux command line usually requires something like this anyway).

So I figured out that the EC2 free tier is simply not enough to run QIIME 2 (please correct me if I'm wrong here!). I'd like to ask the department to pay for this resource, but I'd like to keep it small enough to just run the basics. What EC2 instance size do you think I'll need to run this? How much disk space do you put on your instances that run QIIME 2?

Sorry if this is a really newbie question, I just really want to make this work!

Patrick

thermokarst · February 13, 2019, 8:23pm

I can't really give you specific guidance, since this depends on so many factors (how many students? how many accounts per machine? what dataset will you run? what workflows will be run?), but I can give you some general guidelines for what the QIIME 2 workshop instructors use when teaching QIIME 2 Workshops.

We typically deploy one m4.2xlarge for every 7ish students in attendance (we run small ad hoc clusters when we teach open-enrollment workshops). By "ad hoc" I mean that these clusters are short-lived: we spin them up the day before a workshop and tear them down at the conclusion of the workshop.

Now, with that said, the m4.2xlarge is probably overkill; we just don't want to be in a situation where 75 workshop attendees are left in the lurch because we opted for a smaller EC2 instance. That m4.2xlarge has 8 vCPUs and 32 GB RAM, and we generally shoot for one CPU per tenant on the machine. Keep in mind that this isn't practical when using a "real" dataset: we teach the QIIME 2 workshops with a "lite" dataset that has been trimmed and crafted to go easy on resources (in the past we have typically used the Moving Pictures tutorial dataset).
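To make that rule of thumb concrete, here is a minimal Python sketch of the arithmetic. It assumes the figures quoted above (about 7 students per m4.2xlarge, 8 vCPUs per instance); the function name and defaults are made up for illustration and are not part of any QIIME 2 or AWS tooling.

```python
import math

# Figures taken from the post above; adjust them for your own class.
STUDENTS_PER_INSTANCE = 7   # "one m4.2xlarge for every 7ish students"
VCPUS_PER_INSTANCE = 8      # an m4.2xlarge has 8 vCPUs


def instances_needed(students: int, per_instance: int = STUDENTS_PER_INSTANCE) -> int:
    """Round up so that no student is left without a seat on a machine."""
    if students < 0 or per_instance < 1:
        raise ValueError("students must be >= 0 and per_instance must be >= 1")
    return math.ceil(students / per_instance)


for class_size in (7, 20, 75):
    count = instances_needed(class_size)
    print(f"{class_size} students: {count} instance(s), "
          f"{count * VCPUS_PER_INSTANCE} vCPUs in total")
```

With these numbers, a 75-person workshop works out to 11 instances. A small class on a trimmed dataset could probably raise the students-per-instance figure, since the one-CPU-per-tenant target is described above as conservative.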

Hope that helps! :qiime2:

devonorourke · June 18, 2019, 11:17am

@PatrickO,
A little late to the party but I thought it was worth highlighting an idea that @thermokarst already mentioned: workshops don't have to focus on using everyone's unique (and full) datasets - they can start with a single workable example that everyone tackles.
If you take that approach, then you can scale down the AWS EC2 instance needs dramatically. You can also generate the files that take the longest ahead of time and just have folks skip over one or two steps. In my experience the slowest step is denoising, but with a subsampled dataset you can get it down to minutes instead of hours. In a high school or college class setting where time is fixed in blocks of less than an hour, I've adopted a prepackaged approach. Sure, I let students execute the code, and there's no substitute for a new user getting the satisfaction of plugging in some text and seeing this magical file in the output. But if things go awry (the instance session disconnects for some unknown reason, a student runs rm * to see what happens, etc.), I already have all the files needed to keep the class moving forward even if they make a mistake. That also means that you can skip ahead if you're running short on time, even if everything is working smoothly.
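One way to set up that safety net is to keep a read-only folder of the expected outputs on the instance and copy back whatever a student is missing. The Python sketch below is only an illustration: the function name and both directory paths are hypothetical, and it assumes the precomputed files sit in a single flat folder.

```python
import shutil
from pathlib import Path


def restore_missing(backup_dir, work_dir):
    """Copy files from backup_dir into work_dir when they are absent there.

    Files the student already has are left alone, so their own results are
    never overwritten. Returns the names of the files that were restored.
    """
    backup = Path(backup_dir)
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)

    restored = []
    for src in sorted(backup.iterdir()):
        dest = work / src.name
        if src.is_file() and not dest.exists():
            shutil.copy2(src, dest)
            restored.append(src.name)
    return restored


# Example call (both paths are placeholders):
# restore_missing("/opt/workshop/precomputed", "/home/student01/tutorial")
```

Running it again after a mishap only fills in the gaps, so it can also be used to hand out the result of a step the class decides to skip.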

I don't see any reason why you couldn't get away with free tier resources if you're willing to preload the necessary files. If you want to skip denoising, for example (or classification, or whatever), just have the finished (expected) artifact in a repository they can curl/wget their way into.
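For anyone who would rather script that fetch than type curl/wget by hand, here is a small sketch using only the Python standard library. The URL and file name in the example call are placeholders, not a real repository, and the helper name is made up for this post.

```python
from pathlib import Path
from urllib.request import urlretrieve


def ensure_artifact(url, dest, fetch=urlretrieve):
    """Download url to dest unless dest already exists.

    Returns True if a download happened and False if the file was already
    there. The fetch argument is only there so the download step can be
    swapped out (for example, in a test without network access).
    """
    dest = Path(dest)
    if dest.exists():
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    fetch(url, str(dest))
    return True


# Example call (hypothetical URL; point it at wherever you host the files):
# ensure_artifact("https://example.org/workshop/table.qza", "table.qza")
```

Because it skips files that already exist, students who did run the slow step themselves keep their own output.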

I wonder if you could do this entire thing in a Binder/Jupyter notebook? That's probably been done before but I haven't tried.

Good luck, and don't be afraid to keep asking big questions - there's no harm here in asking for broad advice on how to tackle something, with the understanding that you might get 10 different responses.