Slurmctld keeps dying


#1

Hi everyone,

I’m running SLURM on a cluster of up to 20 nodes on AWS with autoscaling.

Things were going well for about a day, but now I’ve noticed that slurmctld keeps dying on the login node.

I can restart it with `sudo systemctl start clusterware-slurm-slurmctld.service`, then it runs for a while and exits again.

I’ve looked for errors in the log file:

sudo grep error /var/log/slurm/slurmctld.log

And I see lots that look like:

error: Node flight-064 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.

Indeed the slurm.conf files are different between the login node and the compute node, but I’ve no idea why they are different. I assume they were changed by SLURM itself, or perhaps by autoscaling?

I also see errors like this:

`error: build_part_bitmap: invalid node name flight-158`

I’ve got about 100 jobs running, and I’d like to keep them running if possible.

Cheers,
Bernie


#2

Hi Bernie,

We wouldn’t normally expect differences in slurm.conf files to cause slurmctld to die, but we’d highly recommend keeping them identical, as mismatches can cause odd problems - could this be related to the changes you were making to your slurm.conf from your other forum posts? A quick check with diff should show you exactly what differs and hopefully help pinpoint when the changes are happening.
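
For example, assuming the config lives at the same path on both machines and you can ssh to the compute node from the login node, something along these lines (with the node name adjusted to suit) should show the differences:

diff <(ssh flight-042 cat /opt/clusterware/opt/slurm/etc/slurm.conf) /opt/clusterware/opt/slurm/etc/slurm.conf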

If your slurmctld is stopping for some reason, the first thing to check is whether you’ve got any fatal: errors showing in your slurmctld.log file - these normally indicate the error that caused slurmctld to stop.
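
Something like the following should pick those up, along with whatever systemd recorded about why the service exited (adjust the log path if yours differs):

sudo grep -i fatal /var/log/slurm/slurmctld.log
sudo journalctl -u clusterware-slurm-slurmctld.service -n 50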

If you still experience issues with slurmctld stopping, we’d suggest modifying the systemd unit file for clusterware-slurm-slurmctld.service so that it automatically restarts after a stoppage - this should at least prevent it from staying down entirely.

Hope this helps,
Ruan


#3

Thanks Ruan,

When I diff slurm.conf from a node and slurm.conf from the login node I see:

>  diff flight-042.slurm.conf /opt/clusterware/opt/slurm/etc/slurm.conf
> 159c159,169
> < PartitionName=m4.xlarge  Nodes=autoscaling-slot-m4.xlarge-1, 
> ---
> > PartitionName=m4.xlarge Nodes=flight-093,flight-099,flight-136,flight-180,flight-165,flight-095,flight-167,flight-036,flight-049,flight-204,flight-147,flight-128,flight-178,flight-168,flight-042,flight-080,flight-074,flight-067,flight-068,flight-197,autoscaling-slot-m4.xlarge-20,autoscaling-slot-m4.xlarge-19,autoscaling-slot-m4.xlarge-18,autoscaling-slot-m4.xlarge-17,autoscaling-slot-m4.xlarge-16,autoscaling-slot-m4.xlarge-15,autoscaling-slot-m4.xlarge-14,autoscaling-slot-m4.xlarge-13,autoscaling-slot-m4.xlarge-12,autoscaling-slot-m4.xlarge-11,autoscaling-slot-m4.xlarge-10,autoscaling-slot-m4.xlarge-9,autoscaling-slot-m4.xlarge-8,autoscaling-slot-m4.xlarge-7,autoscaling-slot-m4.xlarge-6,autoscaling-slot-m4.xlarge-5,autoscaling-slot-m4.xlarge-4,autoscaling-slot-m4.xlarge-3,autoscaling-slot-m4.xlarge-2,autoscaling-slot-m4.xlarge-1,
> > NodeName=autoscaling-slot-m4.xlarge-10 CPUs=4 RealMemory=15885 State=FUTURE
> > NodeName=autoscaling-slot-m4.xlarge-11 CPUs=4 RealMemory=15885 State=FUTURE
> > NodeName=autoscaling-slot-m4.xlarge-12 CPUs=4 RealMemory=15885 State=FUTURE
> > NodeName=autoscaling-slot-m4.xlarge-13 CPUs=4 RealMemory=15885 State=FUTURE
> > NodeName=autoscaling-slot-m4.xlarge-14 CPUs=4 RealMemory=15885 State=FUTURE
> > NodeName=autoscaling-slot-m4.xlarge-15 CPUs=4 RealMemory=15885 State=FUTURE
> > NodeName=autoscaling-slot-m4.xlarge-16 CPUs=4 RealMemory=15885 State=FUTURE
> > NodeName=autoscaling-slot-m4.xlarge-17 CPUs=4 RealMemory=15885 State=FUTURE
> > NodeName=autoscaling-slot-m4.xlarge-18 CPUs=4 RealMemory=15885 State=FUTURE
> > NodeName=autoscaling-slot-m4.xlarge-19 CPUs=4 RealMemory=15885 State=FUTURE
> 160a171,179
> > NodeName=autoscaling-slot-m4.xlarge-20 CPUs=4 RealMemory=15885 State=FUTURE
> > NodeName=autoscaling-slot-m4.xlarge-2 CPUs=4 RealMemory=15885 State=FUTURE
> > NodeName=autoscaling-slot-m4.xlarge-3 CPUs=4 RealMemory=15885 State=FUTURE
> > NodeName=autoscaling-slot-m4.xlarge-4 CPUs=4 RealMemory=15885 State=FUTURE
> > NodeName=autoscaling-slot-m4.xlarge-5 CPUs=4 RealMemory=15885 State=FUTURE
> > NodeName=autoscaling-slot-m4.xlarge-6 CPUs=4 RealMemory=15885 State=FUTURE
> > NodeName=autoscaling-slot-m4.xlarge-7 CPUs=4 RealMemory=15885 State=FUTURE
> > NodeName=autoscaling-slot-m4.xlarge-8 CPUs=4 RealMemory=15885 State=FUTURE
> > NodeName=autoscaling-slot-m4.xlarge-9 CPUs=4 RealMemory=15885 State=FUTURE

These are not changes I have made to the file, but appear to be changes made by autoscaling when it is adding nodes.

The only change I make to slurm.conf is:

/usr/bin/sed -i '/#Epilog=/c\Epilog=/opt/clusterware/opt/slurm/bin/epilog.sh' /opt/clusterware/opt/slurm/etc/slurm.conf

That is, I just change the Epilog line, and I do that consistently on every node, including the login node.

I don’t see any “fatal” errors in /var/log/slurm/slurmctld.log.

Though I do see lots of lines like:

error: Nodes flight-[093,099,136,180] not responding

Cheers,
Bernie


#4

Hi Bernie,

Those slurm.conf differences don’t look like they’re the underlying cause of the problem. I’m a bit surprised there are no fatal messages in the slurmctld.log file, as these would usually be present when slurmctld dies due to a configuration problem or similar.

Does the problem seem to occur during the autoscaling procedure - i.e. are all the nodes up when slurmctld dies, or are they coming and going? There may be further clues in the /var/log/clusterware logs, specifically the autoscaling.log and cluster-slurm.log files.
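
For instance, something like this should show whether an autoscaling event lines up with the time slurmctld went down (the exact contents will vary with your Clusterware version):

sudo tail -n 100 /var/log/clusterware/autoscaling.log
sudo tail -n 100 /var/log/clusterware/cluster-slurm.log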

As @ruan.ellis mentioned, it might be worth trying to configure the clusterware-slurm-slurmctld.service to automatically restart. Though obviously this is treating the symptom rather than the cause, it could be enough until we’re able to get to the bottom of what’s causing slurmctld to fail. :hammer:

A modification to the service file as below will convince it to keep attempting to restart every 5s until it succeeds:

#Restart=on-failure
Restart=always
#RestartSec=1
RestartSec=5
StartLimitIntervalSec=0
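
After editing the unit file, something like the following should pick up the change and bring the service back up:

sudo systemctl daemon-reload
sudo systemctl restart clusterware-slurm-slurmctld.service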

Thanks,

Mark.