Autoscaling from zero nodes


#1

Hello, how does the autoscaler detect load when there are zero initial compute nodes?
We are experimenting with the 2017.1r1 release. In my experience, the nodes spawned by the autoscaler never go down but stay up doing nothing. It was like that with the prior version and it is still the case. The only option is to manually shut the extra nodes down to the min level.
Thanks!


#2

Hi @ink,

At the moment a zero-node configuration requires you to manage your initial node scale out manually.

To scale out from 0 nodes

  • open the AWS console and visit the EC2 dashboard
  • locate your autoscaling group
  • modify the “Desired” and “Min” values as required – I’d suggest simply increasing “Desired” to 1; once the first node has joined the cluster, autoscaling will proceed up to the “Max” value and back down to 0 when nodes are idle.

Note that increasing “Min” will cause the autoscaler to be unable to scale back to 0 nodes – instead it will scale back to the new “Min” figure.

To scale back to 0 nodes

  • if you have simply increased the “Desired nodes” figure then you can leave your nodes to go idle; they will automatically be scaled back to 0 when they are detected as idle near the end of their AWS billing hour.
  • if you have also increased “Minimum nodes” then you’ll need to open the AWS console and set this figure back to 0. The nodes will be automatically scaled back to 0 nodes as above. You can also choose to set “Desired nodes” to 0 if you want the nodes to be scaled back immediately for any reason.
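
If you prefer not to use the console, the same values can be set with the AWS CLI – a rough sketch only, assuming the region and autoscaling group name from the example further down (load the services/aws module first, as shown below):

[root@login1(mycluster) alces]# aws --region us-east-1 autoscaling update-auto-scaling-group --auto-scaling-group-name mycluster-FlightComputeGroup-1KOKS8MR22KJZ --min-size 0 --desired-capacity 0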

We’re looking into adding some tooling that can be used from within the cluster to ease this process in our next release. In the meantime, it’s possible to use something like this with the region and cluster name updated as appropriate (note you should be root):

[root@login1(mycluster) alces]# module load services/aws services/clusterware

[root@login1(mycluster) alces]# aws --region us-east-1 autoscaling describe-auto-scaling-groups | jq -r '.AutoScalingGroups[]|select(.AutoScalingGroupName|startswith("mycluster-")).AutoScalingGroupName'
mycluster-FlightComputeGroup-1KOKS8MR22KJZ

[root@login1(mycluster) alces]# aws --region us-east-1 autoscaling set-desired-capacity --auto-scaling-group-name mycluster-FlightComputeGroup-1KOKS8MR22KJZ --desired-capacity 1

This will update the autoscaling group and cause a single compute node to be spun up.

Nodes staying up while idle shouldn’t happen while autoscaling is enabled – you can check that autoscaling is enabled using the alces configure autoscaling status command:

[alces@login1(mycluster) ~]$ alces configure autoscaling status
Autoscaling: enabled

#3

Thanks! This is useful. Increasing Desired does the trick.
As for scaling in, reading more about AutoScaling revealed that it takes billing periods into account for the instances it terminates, although I could not immediately find what the default termination period is. Perhaps I need to experiment more and wait longer.


#4

Great! Here are a few more details on the scale-in timing that you might find useful:

The autoscaling cycle runs every 5 minutes. The process is roughly:

  • Detect which nodes are idle (i.e. have no jobs running on them)
  • Temporarily disable idle nodes in the scheduler so further jobs can’t get scheduled during the shutdown process
  • Determine whether the node is “exhausted”, by which we mean that it is within the last 7 minutes of its current billing hour. Because the check runs every 5 minutes, a node detected as exhausted has at most 7 minutes of run time left and at least 2, which gives it a chance to be terminated before slipping into a further billing hour (see the sketch after this list).
  • If it is exhausted, shut it down; otherwise leave it alone until the next autoscaling cycle in 5 minutes.
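
To make the “exhausted” check concrete, here’s an illustrative shell sketch (not the actual autoscaler code) that works out how many minutes remain in an instance’s current billing hour – the instance ID here is just a hypothetical placeholder:

# illustrative only – not the real autoscaler logic
launch=$(aws --region us-east-1 ec2 describe-instances --instance-ids i-0123456789abcdef0 --query 'Reservations[0].Instances[0].LaunchTime' --output text)
elapsed=$(( ( $(date +%s) - $(date -d "$launch" +%s) ) / 60 ))   # minutes since launch
remaining=$(( 60 - (elapsed % 60) ))                             # minutes left in the current billing hour
if [ "$remaining" -le 7 ]; then
  echo "exhausted: ${remaining} minute(s) left in billing hour – candidate for shutdown"
fi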

HTH!


#5

Thank you. Very helpful!


#6

Where can I read more on how autoscaling works? Do I understand correctly that Flight is using simple autoscaling, and that openlava and slurm can somehow interact with it? Still, I could not find in the AWS docs what the default CPU load threshold is for autoscaling to kick in. The issue I’m having is that AWS probably thinks the load is low at 50% because HT doubles the core count. I have disabled HT and yet AWS still shows 50% load. Could this be why the pending job does not trigger a new node?
I understand I can manually increase the number of desired nodes, but I’m wondering if autoscaling could do the trick.

=# bjobs -u all
JOBID  USER  STAT  QUEUE   FROM_HOST  EXEC_HOST   JOB_NAME    SUBMIT_TIME
105    xxx   RUN   normal  login1     flight-076  *TKv2p1[4]  Aug 11 10:30
                                      flight-076
106    xxx   PEND  normal  login1                 index       Aug 11 11:19

=# /var/log/clusterware/autoscaler.log

Aug 11 12:10:02 [autoscaler:setup_schedulers] Found autoscaling-capable scheduler(s): openlava slurm
Aug 11 12:10:02 [autoscaler:scale-out] Default autoscaling group is r4.large
Aug 11 12:10:02 [autoscaler:cores_for_group] Found cores for groups: xxx-FlightComputeGroup-11OEW4910UNB0:2
Aug 11 12:10:02 [autoscaler:cores_for_group] Selecting first group ‘xxx-FlightComputeGroup-11OEW4910UNB0:2’ with cores: 2
Aug 11 12:10:02 [autoscaler:aws-scale-out] Retrieving job state data for scheduler openlava in queues for r4.large
Aug 11 12:10:02 [autoscaler:aws-scale-out] Retrieving job state data for scheduler slurm in queues for r4.large
Aug 11 12:10:02 [autoscaler:aws-scale-out] Autoscaling group r4.large (xxx-FlightComputeGroup-11OEW4910UNB0) has demand for 1 nodes
Aug 11 12:10:03 [autoscaler] Performing scale-in check
Aug 11 12:10:03 [autoscaler:scale-in] Found empty nodes:

I read this output as meaning the autoscaler does not think another node is needed.


#7

AWS isn’t involved in the scale-up/scale-down processes, which are orchestrated by the master node. Autoscaling isn’t related to the load reported by compute nodes but to the number of nodes/cores required to satisfy jobs waiting in the queue.

The way that OpenLava reports jobs is not currently compatible with the autoscaling engine in the Community Edition of Flight Compute. The Community Edition currently provides autoscaling via the Slurm job scheduler as a supported configuration.
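
As an illustration only (this isn’t part of the autoscaler itself), on a Slurm-based cluster you can see the queued demand that drives scale-out – each pending job’s ID, requested node count, requested CPU count and pending reason – with something like:

[alces@login1(mycluster) ~]$ squeue --states=PENDING --noheader --format='%i %D %C %r'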


#8

I see. Based on the autoscaler output I thought that OpenLava was supported. Thank you!


#9

Got the same problem of scaling from 0 and just found this post. Manually changing the desired number did solve the problem. One small issue: jobs that are already in the queue cannot make use of newly-launched nodes. When the first compute node is added, Slurm reports "job has been allocated resources" but the launch then fails:

$ srun --pty /bin/bash
srun: Required node not available (down, drained or reserved)
srun: job 5 queued and waiting for resources
srun: job 5 has been allocated resources
srun: error: fwd_tree_thread: can't find address for host flight-158, check slurm.conf
srun: error: Task launch for 5.0 failed on node flight-158: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

Re-submitted jobs work fine. I am using 2017.2r1.
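
For anyone else hitting this, a few commands (a sketch only – flight-158 is just the node name from the output above) to check whether a newly-launched node has registered with Slurm and resolves correctly before launching a job:

$ sinfo -N -l
$ scontrol show node flight-158 | grep -E 'NodeAddr|State'
$ getent hosts flight-158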