Threads vs Nodes

jwdebelius · August 30, 2019, 8:27am

Hi All,

I'm trying to understand the difference between --p-n-jobs and --p-threads (for instance, comparing qiime feature-classifier classify-sklearn and qiime fragment-insertion sepp). When I run them on a server, I see multiple jobs on the server when I pass --p-n-jobs but not with --p-threads. Is threading internal?

Sorry, I'm trying to understand how to speed things up, parallelise, and move to a new HPC environment.

Best,
Justine

ebolyen · August 30, 2019, 4:06pm

That's a great question @jwdebelius!

This has to do with how parallelism is achieved inside an operating system. There are a few predominant strategies, two of the most common being processes and threads.

A process is hopefully the more intuitive concept: it is how an operating system multitasks. So you would have a process for your browser (at least one, anyhow), a process for your console, a process for your solitaire game, etc.

Each process lives in its own virtual bubble of memory and related resources (like file references and other namespaced things). The operating system calls this bubble the process table; it is unique to each process and keeps track of all the bookkeeping. The net effect is that a process cannot touch another process without jumping through some significant hoops.

(Incidentally, whenever you see a segmentation fault, this means code running inside the process bubble has attempted to escape its bubble. The operating system, having none of that, kills the process immediately.)

Now, a process is allowed to make more processes (called child processes). These can share some parts of the process table with the parent, but importantly, memory is not one of them (without jumping through hoops).
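To make the "memory is not shared" point concrete, here is a minimal sketch using Python's standard multiprocessing module. It is only an illustration, not how QIIME 2 or any plugin is implemented. The names counter, bump, and run_child are made up for the example, and it assumes a Unix-like system where the "fork" start method is available.

```python
import multiprocessing as mp

counter = 0  # lives in the parent's memory


def bump(queue):
    """Runs in the child: increments the child's own copy of counter."""
    global counter
    counter += 1
    queue.put(counter)  # explicitly send the child's view back to the parent


def run_child():
    """Start one child process; return (value seen by child, value seen by parent)."""
    ctx = mp.get_context("fork")  # assumption: Unix-like OS with fork available
    queue = ctx.Queue()
    child = ctx.Process(target=bump, args=(queue,))
    child.start()
    child_value = queue.get()  # wait for the child's report
    child.join()
    return child_value, counter


if __name__ == "__main__":
    child_value, parent_value = run_child()
    print("child saw:", child_value)
    print("parent still has:", parent_value)
```

The child changes its own copy of counter, and the parent's copy stays as it was. The queue is one of the "hoops": the only way the parent learns the child's value is by having it sent over explicitly.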

A thread, on the other hand, is a much simpler construction. It doesn't have its own process table; it lives inside the same process table as all of the other threads in that process. This means it can modify memory whenever it sees fit (as long as it does not try to leave the bubble created by the operating system). This is both a strength and a terrible curse, as the programmer needs to very carefully consider the order in which memory is changed, knowing that the specific order is actually entirely undefined.

Threads are really powerful for handling logically independent tasks (such as waiting for the disk to return some data). But extreme care needs to be taken when the threads are doing logically similar things, as it becomes very easy to misstep and have the threads walk over each other in undefined ways, altering memory arbitrarily (often cascading into the program corrupting itself to the point of attempting to leave the process's bubble, giving us a segfault).
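As a rough illustration of threads sharing one bubble of memory, here is a small sketch with Python's standard threading module. The function name count_with_threads and its parameters are invented for the example. Every thread updates the same variable, and a lock is what imposes an order on those updates; without it, the read-modify-write steps of different threads could interleave and updates could be lost.

```python
import threading


def count_with_threads(n_threads=4, increments=10_000):
    """Have several threads add to one shared total, guarded by a lock."""
    total = 0
    lock = threading.Lock()

    def work():
        nonlocal total
        for _ in range(increments):
            # Only one thread at a time may read, add, and write back.
            with lock:
                total += 1

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(count_with_threads())  # n_threads * increments
```

Note that nothing has to be sent anywhere here: all of the threads see the same total directly, which is exactly what makes threads convenient and also what makes them easy to get wrong.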

In the context of QIIME 2 and HPC, when we use the term "job" we are talking about a process. In principle this process could live anywhere (it has its own memory and so is, from an implementation view, independent), so it could be sent to another node on the cluster provided the data is also shipped along. We don't have facilities for that at the moment, but that is why we use the more generic term.

When we use threads, those are very directly tied to a single process table, and so cannot logically ever leave a given node. In other words, in the future, even if we could ship processes to another cluster node, the threads are still stuck and cannot ever grow beyond the capabilities of a given node (e.g. setting 64 threads on a node that has only 32 CPU cores would be pointless, as half will always be waiting for the other half, and since they are tied to the same process, they can't be placed on a different node in the cluster).
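One way to sanity-check a thread count before passing it along is to compare it with what the node actually has. The helper below is a hypothetical sketch (usable_workers is not part of QIIME 2); it uses os.cpu_count() from the Python standard library, which reports the logical CPUs on the machine and may be more than a scheduler has actually allocated to you.

```python
import os


def usable_workers(requested, available=None):
    """Cap a requested thread/job count at the number of CPUs on this node."""
    if requested < 1:
        raise ValueError("requested must be at least 1")
    if available is None:
        # os.cpu_count() can return None if the count cannot be determined.
        available = os.cpu_count() or 1
    return min(requested, available)


if __name__ == "__main__":
    # The 64-threads-on-32-cores case from above.
    print(usable_workers(64, available=32))
    # Whatever this machine reports.
    print(usable_workers(64))
```

The resulting number is what you would then pass as the --p-threads or --p-n-jobs value.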

jwdebelius · August 30, 2019, 7:35pm

@ebolyen,

Thank you for such a simple description. As a follow-up question, just so I understand: if I'm requesting resources, a multi-threaded job wants its own node, and a multi-process job can be spread out and therefore doesn't require a node allocation?

Thank you!
Justine

ebolyen · August 30, 2019, 9:46pm

Yes, that is correct, except the very last bit isn't supported yet. So in practice (for now) you should treat them the same: you'll always need a single node large enough to fit the thread/job argument chosen.

In the future, I would like to see QIIME 2 interact with the queuing system itself so it can map out jobs as needed (when they are job-based and not thread-based).