In the core version of StarCluster, when you add many nodes at once (command “addnode -n #”), StarCluster runs three sequential checks[*] that all nodes must pass before it moves on and eventually starts configuring the new nodes within the cluster.
- Wait for the spot instance requests to propagate.
- Wait for all spot instance requests to become active.
- Wait for ssh on all those nodes to be active.
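To make the barrier behavior concrete, here is a minimal sketch of those three all-or-nothing waits. This is not StarCluster's actual code; the check functions (`request_propagated`, `request_active`, `ssh_up`) are hypothetical stand-ins for the corresponding EC2/ssh probes:

```python
import time

# Hypothetical stand-ins for the real EC2/ssh probes; each returns
# True once the given node has passed that stage.
def request_propagated(node):
    return True

def request_active(node):
    return True

def ssh_up(node):
    return True

def wait_for_all(nodes, check, interval=0.01):
    """Block until `check` passes for EVERY node (barrier behavior)."""
    while not all(check(n) for n in nodes):
        time.sleep(interval)

def add_nodes_batch(nodes, configure):
    # The three sequential checks: no node advances until all nodes pass.
    wait_for_all(nodes, request_propagated)  # 1. spot requests visible to the API
    wait_for_all(nodes, request_active)      # 2. spot requests fulfilled
    wait_for_all(nodes, ssh_up)              # 3. sshd reachable on every node
    for node in nodes:                       # only now does configuration start
        configure(node)
```

The key point is that `configure` runs only after the slowest node clears the last check, so early nodes sit idle in the meantime.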
If you add a single node, that’s fine, but if you add 10, you lose time: the first node might be ready a few minutes before the last one. In other words, you are paying for computing time you cannot use.
So what I did was fairly simple. I removed the “all nodes must be ready at the same time” condition and switched to a streaming process. Each node still goes through the same checks, but individually, and as soon as ssh is up on a node, it gets added to the cluster. That means no more computing time is wasted waiting for the rest of the nodes to come up.
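The streaming idea can be sketched as each node running its own check pipeline concurrently, with configuration kicked off per node the moment its checks pass. Again, this is an illustrative sketch with hypothetical check functions, not the actual pull request code:

```python
import threading
import time

def add_node_streaming(node, checks, configure, interval=0.01):
    """Run one node through its checks independently; configure it
    as soon as its own ssh check passes, regardless of other nodes."""
    for check in checks:
        while not check(node):
            time.sleep(interval)
    configure(node)

def add_nodes_streaming(nodes, checks, configure):
    # One worker per node: a slow node no longer holds back the others.
    threads = [
        threading.Thread(target=add_node_streaming, args=(n, checks, configure))
        for n in nodes
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With this shape, the time-to-first-configured-node depends only on the fastest node, while the batch version depends on the slowest.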
That optimization might not look like a huge gain, and if you only start 2 nodes at once, it isn’t. If you start, say, 10 nodes at once, however, then depending on the spot market and the instance type, minutes may pass between the time the first and the last node become ready. Multiply that across a sizable cluster over a long period and the savings add up. Cost reduction is always welcome, right?
If you are interested in that feature, check out this streaming_node_add pull request. You are also welcome to have a look at our vanilla improvements version which is, as the name implies, a version with various improvements over the vanilla version. This is the base of what we use in production at Datacratic.
* That is for adding spot instances. When adding on-demand instances, only two checks are made.