Module Request: Cluster Flow


#1

Hi there,

I’ve written a bioinformatics pipeline tool called Cluster Flow, designed to run on queueing systems such as SGE and SLURM. It looks like it should work well with Alces Flight (it supports environment module loading out of the box).

Website and docs here: http://clusterflow.io/
Code and download is here: https://github.com/ewels/clusterflow

Many thanks in advance,

Phil


#2

Having thought about it a bit more, this may not be suitable, as the installation requires some basic config, such as which queue software is being used. That’s fine if the cluster is always built the same way, but it could be tricky if the setup varies.


#3

Hi Phil,

Thanks for the suggestion; it’s certainly possible to include that sort of package. Flight has multiple methods of getting things configured, so hopefully one will be suitable for your project. We’ll look into making this happen and get back to you.

Cheers

Steve


#4

Ok great! The package itself is pretty simple: just a case of copying the files and adding the directory to the PATH. It will then need a central configuration file to make it work. Two things specifically:

  • The cluster environment, e.g. specifying GRIDEngine. See the example config.
  • Environment Module aliases. Cluster Flow works best when it can module load the tools it needs. In some cases it will need aliases to make sure that the module can be loaded. See the CF docs.

These should be the same for every user (assuming a comparable cluster setup). Each user can then run cf --setup to generate a user-specific config file on top of this, with things like an e-mail address for notifications.
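To make that concrete, the central config described above might look something like this. The directive names are as I remember them from the CF docs, and the alias targets are purely hypothetical; check the example config for the real syntax:

```
/* Cluster Flow central config - sketch only */
@cluster_environment    GRIDEngine

/* Map the names CF asks for onto the modules this cluster provides */
@environment_module_alias    fastqc     FastQC
@environment_module_alias    samtools   SAMtools
```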

One extra thing: CF assumes that everything is available through the environment module system, but AF needs an extra tool-installation step before the modules exist. So there’s scope for some kind of extension to install the tools automatically when the module load command is run. This isn’t currently possible, but I could have a look into it if you’re interested.

Phil


#5

Looks like it’s pretty straightforward to install. The main trick is going to be working out how to generate the config file for the current Flight Compute environment (what scheduler is configured etc). That might be feasible by using some kind of detection when loading the CF modulefile itself. i.e. generate an appropriate ~/.clusterflow/clusterflow.config file for the current environment (if one doesn’t already exist).
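As a sketch of the kind of detection logic I have in mind (the scheduler name would come from a Clusterware query, which is an assumption here, so it’s passed in as an argument; the name-to-environment mapping is also illustrative):

```shell
#!/bin/bash
# Sketch: generate a per-user Cluster Flow config on modulefile load,
# based on the detected scheduler, if one doesn't already exist.
generate_cf_config() {
  local cfg="${1:-$HOME/.clusterflow/clusterflow.config}"
  local scheduler="${2:-gridscheduler}"   # would come from Clusterware
  [ -e "$cfg" ] && return 0               # never clobber an existing config
  local env
  case "$scheduler" in
    gridscheduler|sge) env="GRIDEngine" ;;
    slurm)             env="SLURM" ;;
    *)                 env="local" ;;
  esac
  mkdir -p "$(dirname "$cfg")"
  printf '@cluster_environment\t%s\n' "$env" > "$cfg"
}
```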

Re dependencies: it would probably make sense to install all the possible dependencies when CF is installed; Gridware packages can specify other Gridware packages that they rely on. Is there a definitive list of which packages and modulefiles (or aliases) are expected, or is it a case of looking through the Perl scripts in the modules/ subdirectory?

Are there many workflows that only require a subset of the tools? That is, are some of the packages considered more “core” or “optional” than others, or is it usually the case that all of the packages are going to be required?


#6

That’s an interesting idea - I’d never really thought about trying to auto-detect the environment from within Cluster Flow. It could work, though to be honest with the current setup it can often take a bit of tinkering to get it to run properly. For instance, I’ve never managed to get it to run on PBS, even though it should be pretty similar to SGE. However, I don’t have anything against trying to guess if nothing is specified. Any suggestions on how to do this?

Most tools are only needed for a subset of pipelines, but to be honest the list of supported tools isn’t so huge that it would be infeasible to install them all (especially given how fast the AF installer scripts seem to be). I think that seems like the simplest and best way to go.

It doesn’t look like I’ve ever put together a list of the environment modules. I’ve just added a step to the testing script to print this information, though. This is the list of names used in the module load commands:

  • BEDTools
  • STAR
  • bismark
  • bowtie
  • bowtie2
  • bwa
  • cutadapt
  • deepTools
  • fastq_screen
  • fastqc
  • hicup
  • hisat2
  • htseq
  • kallisto
  • multiqc
  • phantompeakqualtools
  • picard
  • preseq
  • rseqc
  • samtools
  • sratoolkit
  • subread
  • tophat
  • trim_galore

Phil


#7

I was thinking that it would be a Gridware responsibility rather than a CF code change (that is to say, it would take place in the Gridware modulefile for Cluster Flow). I haven’t put a huge amount of thought into it yet, TBH! Perhaps some kind of script that hooks into the Clusterware configuration to query which scheduler has been selected. (Aside: I also spotted a requirement for a GridScheduler orte PE, which Flight Compute doesn’t provide, so there might need to be some changes there too.)

Thanks for the list of modules. I think Gridware packages already exist for most of them; hicup, hisat2 and deepTools are the ones that jump out as packages we’ve not encountered before.


#8

Ok great - yes I suspect that if the config is built on the Gridware side of things then it’s much more likely to work properly :slight_smile:

Missing tools that you mention are:

(apologies for the spaces in links, the forum software really doesn’t like me posting them).

As for a next step: I think it probably makes sense for me to set up an Alces Flight cluster myself and try to get Cluster Flow running manually (installation and config). Once I’ve done this I can report back with the steps I took and the config I used. You can then look into this and think about generalising it for a Gridware installation. Would this be a good approach?

Phil


#9

Thanks for the further information about the tools; we’ll take a look at getting them scheduled into a future Gridware release.

Yup, getting Cluster Flow running on a Flight Compute cluster sounds like a good next step!


#10

Hi all,

I had a play with Cluster Flow on Alces Flight today and got it working in the end. It’s not super simple unfortunately, but with a bit of magic I hope that it may be possible.

I started off with the intention of writing some documentation, but it ended up being just a list of commands that I used (with comments). You can see that here: https://github.com/ewels/clusterflow/blob/alces-flight/docs/alces-flight.md

The Cluster Flow config file I used is here: https://github.com/ewels/clusterflow/blob/alces-flight/clusterflow_aws.config

Note that a bunch of things in that config file could be set automatically on your end, which would be helpful - mainly @total_cores and @total_mem, so that modules don’t request more resources than are available (I ran into this: jobs just sat in the queue forever).
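For reference, the directives in question look like this in the config file (the numbers here are made up; they should match the chosen compute node type):

```
@total_cores    16
@total_mem      64G
```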

One thing I wasn’t expecting is that you have your own version of the environment module system, so the built-in Cluster Flow handling doesn’t work (it calls modulecmd perl load XXX and evals the result). I’m not sure if we can get around this without just installing everything and adding it to the node! Maybe that’s easiest.

I found a few things that needed fixing during this; it’s really helpful to have a blank cluster to test on! That’s always been very difficult when developing this tool.

Anyway, let me know what you think.

Phil


#11

Hi Phil,

Great to hear that you got it working. A couple of points/questions:

  • The command list URL doesn’t seem to be the correct one… :smile:
  • For the total_cores and total_mem values – are these aggregate across the whole cluster, for an individual node in the queue or what needs to be requested “per slot” for the selected cluster_queue_environment?
  • There is a modulecmd binary available as part of the Clusterware installation – you can find it in /opt/clusterware/opt/modules/bin/modulecmd. I think this should be suitable for use by Cluster Flow, although there will likely need to be a bunch of aliases set up in order to make use of the naming scheme used by Alces Gridware.
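A minimal sketch of how that binary could be picked up (the destination directory is just an example; anywhere on the user’s PATH would do):

```shell
# Symlink the Clusterware-provided modulecmd somewhere on the PATH so
# that Cluster Flow's `modulecmd perl load ...` calls can find it.
link_modulecmd() {
  local src="${1:-/opt/clusterware/opt/modules/bin/modulecmd}"
  local dest_dir="${2:-$HOME/bin}"
  [ -x "$src" ] || { echo "no modulecmd at $src" >&2; return 1; }
  mkdir -p "$dest_dir"
  ln -sf "$src" "$dest_dir/modulecmd"
}
```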

That’s really good to hear. Using your own, dedicated Flight Compute environment on AWS for developing and testing cluster tools is definitely easier than waiting for a shared cluster to quieten down!


#12
  • Oops, sorry! URL edited above (too much multitasking :wink: ) This is the correct one: https://github.com/ewels/clusterflow/blob/alces-flight/docs/alces-flight.md
  • total_cores and total_mem - this should be what’s available per compute node. Each module requests a certain amount of memory and cores; if that’s more than is available, Cluster Flow ignores the request and uses the maximum available.
    • Not sure exactly what “per slot” means - basically, if I can only submit a qsub job with a maximum of 16 CPUs (anything more wouldn’t run), then total_cores should be 16.
  • Ok, great to hear! I had a bit of a dig around but couldn’t find it - I’ll have another play with this then. I guess we could add this to the PATH when Cluster Flow is loaded? Or I could add a config option to specify the path if necessary.
  • I was thinking that I should probably refactor the environment module naming system. Adding aliases is fine, but it’d be nicer to have this in a dedicated config file really.

For testing it’s more than just busy clusters - it’s about having any access at all to a cluster with a particular queue system. I first worked on this tool in my old job, where we ran GRIDEngine, but now I only have access to a SLURM cluster. That makes fixing system-specific bugs almost impossible…

Phil


#13

Haha, thanks, I know how you feel! :smiley:

Ok, gotcha, so this value should be the number of cores and amount of RAM for the selected compute node type.

I guess it might make more sense to use the smp PE rather than the mpislots PE if none of the ClusterFlow jobs make use of MPI (ref: http://docs.alces-flight.com/en/stable/sge/sge.html#running-multi-threaded-jobs) – there’s no guarantee that jobs submitted to the mpislots PE won’t be distributed across multiple nodes while the smp PE enforces this. However, note:

Memory limits are enforced per CPU-core-slot; for example, if your default memory request is 1.5GB and you request -pe smp 4, your 4-core job will be allocated 4 x 1.5GB = 6GB of RAM in total.

Does ClusterFlow calculate and provide a -l h_vmem value? I ask because, given the above, when using the smp or mpislots PEs this value would need to be “total amount of RAM needed for the job / number of slots requested”.
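To illustrate the arithmetic with made-up numbers:

```shell
# Per-slot memory rule from the Flight docs quoted above: h_vmem is
# multiplied by the slot count, so a job needing 6 GiB in total across
# 4 smp slots should request 6144 / 4 = 1536 MiB per slot.
total_mem_mb=6144
slots=4
per_slot_mb=$(( total_mem_mb / slots ))
request="-pe smp ${slots} -l h_vmem=${per_slot_mb}M"
echo "$request"
```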

Yup, I don’t see any reason why not.

Naming things is one of the two hardest computer science problems! :wink:


#14

If you think that’s better then we can use that. I added the environment as a config option when I was working through the above, so it’s just a config change.

Yes it does - this is how the memory requirements are used. The default job submission command is:

echo "/path/to/clusterflow.module --params etc" | qsub -clear -b n -cwd -V -S /bin/bash -pe {{pe_env}} {{cores}} {{qname}} -l h_vmem={{mem}} -o {{outfn}} -j y -N {{job_id}}
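For illustration, with the placeholders filled in, the expanded command would look something like this (all the values here - PE, core count, queue name, job name - are hypothetical):

```shell
# Hypothetical expansion of the submission template above; the values
# ({{pe_env}}=smp, {{cores}}=4, and so on) are illustrative only.
pe_env="smp"; cores=4; qname="-q all.q"; mem="1536M"
outfn="run_1.log"; job_id="cf_job_1"
submit="qsub -clear -b n -cwd -V -S /bin/bash -pe ${pe_env} ${cores} ${qname} -l h_vmem=${mem} -o ${outfn} -j y -N ${job_id}"
echo "$submit"   # inspect the command before piping the module call into it
```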

I’ll try to find time to refactor the module naming stuff soon. Though this is a low priority as it’s possible to do already. When we’re a bit further along I’ll add these to the alces config file.

Do my mangled bash-command notes make any sense? Do you see any problems there?

Phil


#15

Hi @mjtko,

I’ve just run through the process a second time and cleaned up my docs considerably (and written about how to set up an Alces Flight cluster - mostly so that I remember next time!). See https://github.com/ewels/clusterflow/blob/alces-flight/docs/alces-flight.md

I added a step to softlink modulecmd and it seems to work brilliantly :slight_smile: I also added commands to install all software packages currently supported by Alces Flight, and updated the config file to have module aliases for all of the package names so that they load properly. I did a quick test run and it seems to work exactly as expected.

There are a few packages missing from Alces Flight, I couldn’t find the following:

Seriously impressive that it’s so few!

Anyway, my hope is that we can wrap everything under Section 2 in the readme into the Alces Flight install somehow. If that’s possible, it’ll be ridiculously simple to run a full analysis - especially if I can set up some kind of a public reference genome resource so that people don’t have to create their own every time.

Let me know what you think!

Phil


#16

ps. I’m currently preparing a revision for our recent paper on Cluster Flow: https://f1000research.com/articles/5-2824/v1

I was thinking that it would be great to add a section about Alces Flight. Would this be ok? Do you guys have anything in particular that you’d like me to cite? (If in doubt I’ll just put your URL).

I’m hoping to submit this soon after Easter. I’m guessing that we won’t be done with this by then (would be fantastic if we are), but it shouldn’t make too much difference - I’ll just point people towards the documentation I’ve written above.

Phil


#17

Hi Phil,

Hope you had a great Easter break!

Great, thanks. We’ll take a look at what we can do to provide some further automation.

Some of those are available in the volatile Gridware repository (preseq, rseqc, multiqc), though we don’t recommend enabling this repository without understanding the implications! We’ll look at propagating the packages we already have through to the main repo during the next Gridware release and produce some Gridware packages for the remaining outstanding dependencies: deepTools, hicup, hisat2 and salmon.

Sure, that’s fine by us. Citing us by URL is fine, but if there’s something that would suit better, let us know!


#18

Great stuff, thanks @mjtko!

Third time’s the charm - just had another run through the docs and simplified a couple of bits. I’ve also merged to master and collected everything into a directory: https://github.com/ewels/clusterflow/tree/master/alces-flight

Note that I couldn’t get the preseq and rseqc volatile packages to install properly (ERROR: Unable to download source.).

I think I’ll probably push ahead with the manuscript as it’s not clear how long it’ll be before all of this settles down. This is the draft text for the relevant paragraph:

Cloud computing is becoming an increasingly practical solution to the requirements of high-throughput bioinformatics analyses. Unfortunately, the world of cloud solutions can be confusing to newcomers. We are working with the team behind Alces Flight (http://alces-flight.com) to provide a quick route to using the Amazon AWS cloud. Alces Flight provides a simple web-based tool for creating elastic compute clusters, which come with the popular Open Grid Scheduler (SGE) preinstalled. Numerous bioinformatics tools are available as environment modules, compatible with Cluster Flow. We hope that Cluster Flow will soon be available and preconfigured as such an app, allowing a powerful and simple route to running analyses in the cloud in just a few minutes with only a handful of commands.

I’m not sure that there’s much more I can do for now, so I’ll leave this for a bit. Shout if there’s anything I can do to help! Meanwhile I’ll keep working on the s3 reference genomes stuff (mentioned in the link above) and see if I can get anywhere with that (currently > $4 per day, so not sustainable).

Phil


#19

ps. A question from the bottom of that link that you may know the answer to: if the head node is running on a micro instance, am I right in thinking that it should be free to keep alive on the free tier? In that case, could people create an elastic cluster with that as the head node, set minimum compute nodes to 0, then just leave the head node running forever?

I guess that Alces Flight attaches an EBS volume though, which is not free? And probably other stuff that I haven’t thought of…


#20

Looking good!

We’ll look into it – it’s possible that, since the packages were initially created, the source location URL has changed or the source archive content has been updated/modified (leading to a checksum validation failure). Such is the nature of the volatile repo! The real solution here is to propagate these packages through to the main repo and generate some binary packages.

The draft manuscript text looks great! Regarding S3 storage, at $4 per day you must be looking at storing a large (~5TiB?) quantity of reference genomes. It might be worth contacting AWS to talk about what you’re storing; their research and education and public data set programs come to mind.

Forever is a long time! :smile: On the free tier, users are granted a year’s worth of runtime on a micro instance, so it should be free to keep alive for the first year. EBS up to 30GB is also included, so as long as the instance isn’t started with a disk larger than that, it should also fall within the free tier. In general, costs should be minimal when running in this way - things to look out for include data transfer fees, though an amount of data transfer is also included. For researchers, the research and education program is definitely worth looking into.

The best reference for all these details is the AWS Free Usage Tier FAQs page.