We’d usually expect the node to be removed automatically when it is detected as having left the cluster. It sounds like this hasn’t happened for some reason, possibly because the scheduler can refuse to remove nodes when it detects a job is in an uninterruptible state within the job queue. Also, if you’re using an older release of Flight Compute, it’s possible you’re hitting an issue that’s since been fixed.
You should be able to manually remove the node by running a few commands (as root):
# Ensure the gridscheduler module is loaded
module load services/gridscheduler
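# Remove the node from the @allhosts and core count-specific hostgroups
# (each command opens the hostgroup in an editor: delete the node's entry
# from the hostlist and save; use the hostgroup that matches the node's
# core count, e.g. @8slot)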
qconf -mhgrp @allhosts
qconf -mhgrp @8slot
# Remove the node from execution host list
qconf -de ip-10-75-XXX-YYY
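If you'd rather not edit the hostgroups interactively, the same steps can be sketched as a small shell function. This is only a sketch, under a few assumptions: it relies on the standard Grid Engine `qconf -dattr hostgroup hostlist` option being available in your release (worth checking against the `qconf` man page first), and the names `run`, `remove_node` and `DRY_RUN` are made up for illustration. Setting `DRY_RUN=1` prints the commands instead of running them, so you can check them before touching the scheduler.

```bash
# Hypothetical helper: print the command when DRY_RUN=1, otherwise run it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: remove a node from each listed hostgroup, then
# delete it from the execution host list.
# Usage: remove_node <node> <hostgroup> [<hostgroup> ...]
remove_node() {
  local node="$1"
  shift
  local group
  for group in "$@"; do
    # Non-interactive equivalent of deleting the node in `qconf -mhgrp`
    run qconf -dattr hostgroup hostlist "$node" "$group"
  done
  run qconf -de "$node"
}
```

As root, with the gridscheduler module loaded, you could then run `DRY_RUN=1 remove_node ip-10-75-XXX-YYY @allhosts @8slot` to preview the commands, and repeat without `DRY_RUN=1` to apply them.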
You’ll probably also need to edit the /etc/hosts files to remove any unnecessary entries.
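As a rough sketch of that cleanup, the function below drops any line of a hosts file that lists a given hostname. The name `remove_hosts_entry` is hypothetical, and the sketch assumes the stale node appears under exactly the name you pass in (a fully qualified alias would need a second call). It keeps a `.bak` copy of the file and leaves comment lines alone.

```bash
# Hypothetical helper: remove lines naming <host> from a hosts file.
# Usage: remove_hosts_entry <host> [<file>]   (file defaults to /etc/hosts)
remove_hosts_entry() {
  local host="$1"
  local file="${2:-/etc/hosts}"
  local tmp
  tmp="$(mktemp)" || return 1
  # Keep a backup so the change is easy to undo
  cp -p "$file" "$file.bak" || { rm -f "$tmp"; return 1; }
  # Skip a line when any name after the address field equals the host;
  # comment lines are always kept.
  if awk -v h="$host" '
      /^[[:space:]]*#/ { print; next }
      { for (i = 2; i <= NF; i++) if ($i == h) next; print }
    ' "$file" > "$tmp"; then
    # Write back in place so ownership and permissions are preserved
    cat "$tmp" > "$file"
  fi
  rm -f "$tmp"
}
```

Run as root, `remove_hosts_entry ip-10-75-XXX-YYY` would edit /etc/hosts directly, so it may be worth trying it on a copy of the file first.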
It’s also worth noting that under some circumstances the easiest approach is to terminate the cluster and fire up another one! The Flight Compute Solo product on AWS Marketplace is primarily intended to provision short-running, ephemeral clusters. If you need a solution for a longer-term Flight Compute cluster, please drop an email to email@example.com and one of our consultants can get in touch to suggest some options that might suit you better.