Module Request: Cluster Flow


#21

Yes, exactly - it’s around 4.6TB, I think. We already have a research grant (though I’d rather spend it on something else), and I’ve been in touch with the open data team about a public dataset since December. But thanks for the pointers!

Hah, ok yes a year - that’s great then. I like the idea of being able to leave a Cluster Flow setup with everything configured and ready to go, then having it spawn expensive compute nodes when I kick off a pipeline and shut them down automatically when done. No need to worry then about forgetting to shut the cluster off when it’s finished. The 30GB limit will probably be restrictive, though. I may add a section to the end of the docs mentioning this possibility (while absolving myself of any risk!).

Phil


#22

Hi @mjtko,

Quick update to say that the paper revision is now live: https://f1000research.com/articles/5-2824/v2

Any progress on the module? Shout if there’s anything I can do to help :slight_smile:

Phil


#23

Hi @ewels,

The team has done some work on creating Gridware packages for the additional tools required by Cluster Flow as well as generating a Gridware package for Cluster Flow itself.

I hope to be able to get back to you with some details on how you can go about taking a look over the next couple of days – watch this space!


#24

Great, sounds good! Looking forward to it…


#25

Hi @ewels,

In an effort to make things as easy as possible, we’ve implemented a clusterflow customization feature. To get things going, all that should be required is to enter clusterflow in the “Additional features to enable” field.

This will spin up a cluster as usual and then additionally:

  • install Cluster Flow via a Gridware depot, pulling in Cluster Flow v0.6.0-20170502 (a build of the GitHub master branch as of 2nd May) along with all required tools.
  • download and extract the ngi-rna_test_set from S3.
  • configure the ngi-rna_test_set as a test genome (as per your docs).

Once logged in, users can then immediately load the apps/clusterflow module. Note that this currently also loads all of the required tools: Gridware doesn’t provide a way to express runtime dependencies for a package without also configuring the package to load those dependencies whenever the package itself is loaded (it isn’t expecting tools to have Environment Modules support built into them!).
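For reference, that load step is just the usual Environment Modules command (module name as described above):

    module load apps/clusterflow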

When the Cluster Flow module is loaded for the first time, it performs an additional step that detects the scheduler and the number of cores and amount of RAM of the compute nodes (currently GridScheduler and Slurm are supported), and writes this information to the system-wide clusterflow.config file.
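For illustration only, the values written might look something like the lines below - the option names are my best guess at Cluster Flow’s config keys, and the figures are invented for a small compute node:

    @cluster_environment    SLURM
    @total_cores            8
    @total_mem              30G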

At this stage, everything should be pretty much ready to go! Users may want to proceed with configuration of their own clusterflow.config to set up their email address etc.

Some caveats:

  1. The detection routine requires a compute node to be running when the Cluster Flow module is loaded for the first time. This is because it uses the information that the scheduler has on the running node to detect the number of cores and amount of RAM available on a compute node. If no nodes are running, this step will probably break. :slight_smile: A possible enhancement is to skip the detection routine if no nodes are currently running.
  2. Cluster Flow submits jobs to GridScheduler using -l h_vmem=<mem>. When submitting to a parallel environment (i.e. using -pe smp <cores>), the value currently provided by Cluster Flow can be problematic, as GridScheduler expects this to be memory per slot rather than the total memory for the job. This can lead to “stuck jobs” when, for example, a c4.large node (which only has 3.5GiB of RAM) is asked to handle a 2-slot job requesting 3GB per slot, i.e. 6GB in total (see the sketch after this list).
  3. The customization feature is currently only available in eu-west-1 (Ireland) as it’s using yet-to-be-released Gridware binaries.
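To make caveat 2 concrete, a submission of the kind described there would look roughly like this (hypothetical script name; flags as above):

    # 2 slots x 3G h_vmem = 6G in total, which a c4.large can never satisfy,
    # so the job sits in the queue indefinitely.
    qsub -pe smp 2 -l h_vmem=3G cf_job.sh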

Give it a go and let us know how you get on!


#26

Fantastic, thanks @mjtko! I’ll give it a go in the coming days.

Not sure that we need to download and configure the test dataset / genome every time - it may clutter things a little for users. But I guess it doesn’t hurt at this point :slight_smile:

I’ll let you know once I’ve had a play (and will probably update the docs accordingly when I do so).

Cheers,

Phil


#27

Yup, fair enough - this was just to make life easier while we were testing! We could extract it into an additional feature profile, say clusterflow-test-data.

I think the main remaining issue is the memory figure used when submitting the job. For SGE this has to be per slot (which maps to cores), though for Slurm it can either be “per CPU” (--mem-per-cpu, which also maps to cores) or for the job (--mem). Note that I’ve not looked into how other schedulers that CF supports handle this.
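As a rough sketch of the difference (flags only; the script name is a placeholder), the same 2-core job wanting 6GB in total would be submitted like this under each interpretation:

    # GridScheduler/SGE: h_vmem is per slot, so ask for 3G per slot
    qsub -pe smp 2 -l h_vmem=3G job.sh

    # Slurm, per-CPU form: also 3G per CPU
    sbatch -c 2 --mem-per-cpu=3G job.sh

    # Slurm, per-job form: 6G for the whole job
    sbatch -c 2 --mem=6G job.sh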


#28

So the defaults that are hardcoded in Cluster Flow are currently --mem for SLURM (here) and -l h_vmem for SGE (here). LSF uses -M and -R "rusage[mem={{mem}}]" (here).

These are just based on what people have suggested to me in the past / what I’ve used and found to work, so I’m not necessarily saying that they’re correct…! Does this help? Do you think we’ll need to use something different?

Phil


#29

I think for GridScheduler/SGE some arithmetic needs to be applied – perhaps take the memory figure, “dehumanize” it into KiB or even bytes, and divide by the number of cores to be used. Otherwise, when -pe <env> <N> and -l h_vmem=<M> are used together, GridScheduler will wait for a node with N*M memory capacity available (of which there may sometimes be none, since <M> may already be at the ceiling of the value configured in clusterflow.config).
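Something along these lines, say - a bash sketch only, with an invented function name and simplistic unit handling:

    # "Dehumanize" a memory string into KiB and divide by the slot count,
    # giving a per-slot figure suitable for -l h_vmem=.
    per_slot_mem() {
      local mem="$1" slots="$2" kib
      case "$mem" in
        *G|*g) kib=$(( ${mem%[Gg]} * 1024 * 1024 )) ;;
        *M|*m) kib=$(( ${mem%[Mm]} * 1024 )) ;;
        *K|*k) kib=$(( ${mem%[Kk]} )) ;;
        *)     kib=$(( mem / 1024 )) ;;   # assume plain bytes
      esac
      echo "$(( kib / slots ))K"
    }

    per_slot_mem 6G 2   # -> 3145728K, i.e. 3GiB per slot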

Slurm is fine as is (when using --mem); I’m not familiar with how LSF deals with this. Probably in a similarly sane way to Slurm though! :slight_smile:


#30

Hi @mjtko,

I just spun up an AF cluster and it worked a treat! First time, with no problems - amazing :slight_smile: I updated my docs accordingly.

I’ve also been working on the S3 reference genomes recently - Amazon gave us a grant to pay for the hosting, so it should be around for a year now. I’m now trying to turn this into an easy-to-use resource with accompanying code. For example, I’d like to make it easier to fetch genomes by providing an interactive command-line script for downloading references. I’ve made a start here, but it doesn’t work yet: https://github.com/ewels/AWS-iGenomes - when it’s done I think it’ll be super cool and really helpful for this setup.

The detection routine requires a compute node to be running when the Cluster Flow module is loaded for the first time.

Yup, I came across this. In the docs I added an instruction to fire off an empty job if no nodes are running, so that some are created before loading CF. It’s a bit of a pain as you have to wait a few minutes, but it seems to work ok. I take it there’s no way to pass information about the cluster configuration on to the environment?
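For anyone reading along, the “empty job” trick is just something like the below (illustrative only; exact flags depend on your scheduler):

    # Submit a throwaway job so the autoscaler spins up a compute node,
    # then wait for it to start before loading the Cluster Flow module.
    echo "sleep 60" | qsub -N warmup       # GridScheduler/SGE
    sbatch -J warmup --wrap "sleep 60"     # Slurm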

The customization feature is currently only available in eu-west-1 (Ireland) as it’s using yet-to-be-released Gridware binaries.

This is fine for now - that’s also where I run everything and where the above AWS-iGenomes S3 bucket is. So no rush from my end here.

-l h_vmem SGE problem

Coming back to this - the default environment from my old cluster was orte. Is this different to smp? In other words, should I always do this division for SGE jobs, or is it only with specific setups? I guess I can either hardcode this behaviour or set it to only happen when a config option is specified. Apologies, I’m a complete novice when it comes to this stuff.

Thanks again for your work on this - I’m amazed at how easy it is to run now! Once the dust has settled a little more, I think I’ll polish the docs and record a screencast for the Cluster Flow website.

Phil


#31

Quick update - I now have a basic S3 iGenomes download script working; details are available here: https://ewels.github.io/AWS-iGenomes/

I’ve updated the CF Alces Flight docs accordingly. A few more updates to come (checking that the reference type actually exists for a given build, detecting the CF installation and adding it to the config automatically), but it seems to be working pretty well already.

Let me know if you get a chance to try it out!

Phil


#32

Hi @mjtko,

I’ve just updated Cluster Flow v0.6dev to have a new config option, cluster_mem_per_cpu. This instructs CF to divide the job memory requirement by the number of cores requested. The underlying CF module is still told the total amount of memory available.

You can use it in a CF config file as follows:

@cluster_mem_per_cpu	true

I hope this does what you want with the SGE setup! Let me know if you run into any problems with it.

Phil


#33

Hi @mjtko,

Have you had a chance to update the Cluster Flow config with these changes yet?

Cheers,

Phil


#34

Hi @ewels,

Quick note to say that we’ve been working towards a new release of Flight Compute, so this has been on the back burner for a bit! We did have some limited success with the new configuration setting: it worked, but it was rounding the memory values up, which meant that jobs weren’t reliably being scheduled on instances with lower memory capacities.

I’ll be looking to have another go at this in the near future. We also have an impending release of the Alces Gridware main repository, and it’d be great to get Cluster Flow in there in time for that.


#35

Ok great! Ping me if you need any updates or changes… (or I guess a release would be appropriate if it works as required).

Phil


#36

Thanks Phil, will do!