Opened 4 months ago

Last modified 8 weeks ago

#1259 new task

Add support for Slurm

Reported by: Nicklas Nordborg
Owned by:
Priority: critical
Milestone: Job scheduler extension v1.4
Component: net.sf.basedb.opengrid
Keywords:
Cc:

Description

Slurm (https://slurm.schedmd.com) is a workload manager that will probably be used on the new cluster instead of the Open Grid Engine. It has similar functionality and it should not be too difficult to implement support for Slurm for the things that we currently use in Open Grid Engine.

Basically, we need to replace the commands that we use:

  • qsub --> sbatch and srun
  • qstat and qacct --> squeue and sacct
  • qdel --> scancel

Scripts that are submitted may need some minor modifications. Slurm sets different environment variables than Open Grid. In some cases it may be possible to simply copy a value, but other differences need more work. Known differences so far:

  • Slurm sets the number of CPUs in SLURM_JOB_CPUS_PER_NODE, which we can copy to NSLOTS.
  • Job priority values have a different range and sign (negative priority in Open Grid and positive 'niceness' in Slurm).
  • The Open Grid option for requesting slots accepts a range (smp 8-16), but in Slurm we need to ask for a specific number (--cpus-per-task).
  • Slurm seems to lack automatic assignment and cleanup of a temporary directory.
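
The first item in the list could be handled roughly as in the sketch below. It is only an illustration in Python and not the extension's code; the function names are hypothetical. It assumes that only the CPU count of the first node matters, since Slurm may report multi-node allocations in a compact form such as "8(x2)".

```python
import re


def nslots_from_slurm(env):
    """Return a value for NSLOTS derived from SLURM_JOB_CPUS_PER_NODE, or None."""
    raw = env.get("SLURM_JOB_CPUS_PER_NODE")
    if not raw:
        return None
    # Multi-node allocations may look like "8(x2)" or "8,4".
    # Assumption: only the CPU count of the first node is used.
    match = re.match(r"\d+", raw)
    return match.group(0) if match else None


def with_nslots(env):
    """Return a copy of env with NSLOTS filled in when it is missing."""
    result = dict(env)
    slots = nslots_from_slurm(env)
    if slots is not None and "NSLOTS" not in result:
        result["NSLOTS"] = slots
    return result
```

A script that already reads NSLOTS would then not have to care which scheduler started it.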

And possibly some more things will be found when starting to implement this...

Change History (13)

comment:1 Changed 4 months ago by Nicklas Nordborg

In 5981:

References #1259: Add support for Slurm

Added support for configuring a Slurm cluster in the opengrid-config.xml file by setting <cluster type="slurm" ..>. The major difference is that a different default command (sinfo -V) is used for getting information about the Slurm version.

Added interface ClusterEngine to handle this and hopefully also other differences in the implementation. But the design may change...

comment:2 Changed 4 months ago by Nicklas Nordborg

In 5982:

References #1259: Add support for Slurm

Implemented a way to submit jobs to Slurm. Options are mostly hard-coded since what goes into the JobConfig is Open Grid centric. We will either need some kind of translation between options, or the actual clients (e.g. Reggie) must be updated to understand the differences.

The Slurm engine is currently generating a script for debugging that exits before the actual script is reached.

comment:3 Changed 3 months ago by Nicklas Nordborg

In 5984:

References #1259: Add support for Slurm

Implemented support for getting status information about waiting, running and completed jobs on a Slurm cluster. The generated script is still a no-op script, but it seems like the regular flow is working as expected. Jobs can be submitted, they are started when there is a free slot, and progress information is returned as expected. It is possible to manually abort both a waiting and a running job. Errors seem to be handled as expected.
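
As an illustration of the status handling, here is a minimal Python sketch that classifies Slurm job states into waiting, running, done and error. It is not the extension's code. The internal status names and the "id|state" input format (as could be produced by something like squeue -h -o "%i|%T") are assumptions, and so is the choice to treat every state not listed as an error.

```python
WAITING, RUNNING, DONE, ERROR = "WAITING", "RUNNING", "DONE", "ERROR"

# Slurm job state names mapped to the hypothetical internal statuses.
_STATE_MAP = {
    "PENDING": WAITING,
    "CONFIGURING": WAITING,
    "RUNNING": RUNNING,
    "COMPLETING": RUNNING,
    "SUSPENDED": RUNNING,
    "COMPLETED": DONE,
}


def classify(state):
    """Map a Slurm state string to an internal status."""
    parts = state.split()
    if not parts:
        return ERROR
    # sacct may print e.g. "CANCELLED by 1000" or a truncated "CANCELLED+".
    name = parts[0].rstrip("+").upper()
    # Assumption: CANCELLED, FAILED, TIMEOUT etc. all count as errors.
    return _STATE_MAP.get(name, ERROR)


def parse_job_states(text):
    """Parse lines of the form 'jobid|STATE' into {jobid: status}."""
    jobs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, _, state = line.partition("|")
        jobs[job_id] = classify(state)
    return jobs
```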

comment:4 Changed 3 months ago by Nicklas Nordborg

In 5985:

References #1259: Add support for Slurm

Temporary directory handling is moved to the 'srun' script since it seems like environment variables are not updated until then (but it may depend on the actual configuration of Slurm and the implementation of prolog scripts).
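
One possible shape for such a script is sketched below in Python, which only generates the shell lines. This is an assumption about the approach and not the generated script itself: the function name and the base_dir parameter are hypothetical, and the directory is created with mktemp and removed with a shell trap when the script exits.

```python
def srun_script_with_tmpdir(command, base_dir="/tmp"):
    """Build a shell script that runs command with a private TMPDIR."""
    lines = [
        "#!/bin/sh",
        # SLURM_JOB_ID is set by Slurm in the job environment.
        'TMPDIR=$(mktemp -d "%s/job_${SLURM_JOB_ID}_XXXXXX")' % base_dir,
        "export TMPDIR",
        # Remove the directory when the script exits, also on failure.
        "trap 'rm -rf \"$TMPDIR\"' EXIT",
        command,
    ]
    return "\n".join(lines) + "\n"
```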

comment:5 Changed 3 months ago by Nicklas Nordborg

In 5986:

References #1259: Add support for Slurm

Added cluster type to the list when viewing the configuration for clusters.

Changed names and titles to use "Job scheduler" instead of "Open Grid scheduler" now that it can handle Slurm clusters as well.

comment:6 Changed 2 months ago by Nicklas Nordborg

In 5987:

References #1259: Add support for Slurm

Implemented support for setting the 'nice' value for jobs on a Slurm cluster. A simple conversion is made between Open Grid 'priority' and Slurm 'nice' values. The ranges are different:

  • Open Grid: -1023..+1024 (low -> high)
  • Slurm: -2147483645..+2147483645 (high -> low)


A client may call either setPriority() or setSlurmNice() and the other value is automatically updated.
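
The conversion could look something like the Python sketch below. It assumes the simplest possible rule (flip the sign and clamp to the target range); the actual rule used by the extension is not stated here, and the class and method names are only Python stand-ins for setPriority() and setSlurmNice().

```python
OG_MIN, OG_MAX = -1023, 1024
NICE_MIN, NICE_MAX = -2147483645, 2147483645


def clamp(value, low, high):
    return max(low, min(high, value))


def priority_to_nice(priority):
    """Open Grid priority -> Slurm nice (assumed rule: negate and clamp)."""
    return clamp(-clamp(priority, OG_MIN, OG_MAX), NICE_MIN, NICE_MAX)


def nice_to_priority(nice):
    """Slurm nice -> Open Grid priority (assumed rule: negate and clamp)."""
    return clamp(-clamp(nice, NICE_MIN, NICE_MAX), OG_MIN, OG_MAX)


class JobOptions:
    """Keeps the two values in sync, whichever one the client sets."""

    def __init__(self):
        self.priority = 0
        self.slurm_nice = 0

    def set_priority(self, priority):
        self.priority = clamp(priority, OG_MIN, OG_MAX)
        self.slurm_nice = priority_to_nice(self.priority)

    def set_slurm_nice(self, nice):
        self.slurm_nice = clamp(nice, NICE_MIN, NICE_MAX)
        self.priority = nice_to_priority(self.slurm_nice)
```

Since the Slurm range is much wider, large nice values lose information when mapped back to an Open Grid priority.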

comment:7 Changed 2 months ago by Nicklas Nordborg

In 5988:

References #1259: Add support for Slurm

Changed message that is displayed while connecting to clusters.

comment:8 Changed 2 months ago by Nicklas Nordborg

In 5989:

References #1259: Add support for Slurm

Implemented support for setting options that go into the sbatch file.
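
For illustration, options can be written into a batch file as #SBATCH lines. The Python sketch below shows the general idea under the assumption that options are kept as a simple name-to-value mapping; the function name is hypothetical and this is not the extension's implementation.

```python
def sbatch_header(options):
    """Render options as the '#SBATCH' lines at the top of a batch script."""
    lines = ["#!/bin/sh"]
    for name, value in options.items():
        if value is None:
            # Flag-style option without a value, e.g. --exclusive
            lines.append("#SBATCH --%s" % name)
        else:
            lines.append("#SBATCH --%s=%s" % (name, value))
    return "\n".join(lines) + "\n"
```

For example, sbatch_header({"cpus-per-task": 8, "nice": 100}) gives a shebang line followed by "#SBATCH --cpus-per-task=8" and "#SBATCH --nice=100".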

comment:9 Changed 2 months ago by Nicklas Nordborg

In 5990:

References #1259: Add support for Slurm

Added a list of options that are ignored, since they may interfere with other things if they are set by the user.

comment:10 Changed 2 months ago by Nicklas Nordborg

In 5991:

References #1259: Add support for Slurm

Implemented a simple solution for translating options between Open Grid and Slurm. So far, the implemented translation is between 'pe' and 'cpus-per-task' since that is the only option we use in Reggie.
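
A rough Python sketch of such a translation is shown below. It is not the extension's code. It assumes the 'pe' value looks like "smp 8" or "smp 8-16" and, since Slurm needs a single number, it picks the lower bound of the range; which end the real implementation picks is not stated here.

```python
import re

# Matches "<environment> <n>" or "<environment> <low>-<high>".
_PE_PATTERN = re.compile(r"^\s*(\S+)\s+(\d+)(?:\s*-\s*(\d+))?\s*$")


def translate_options(og_options):
    """Translate Open Grid options to Slurm options (only 'pe' so far)."""
    slurm = {}
    for name, value in og_options.items():
        if name == "pe":
            match = _PE_PATTERN.match(value)
            if match is None:
                raise ValueError("Unrecognized pe value: %r" % value)
            # Assumption: use the lower bound of a range such as "8-16".
            slurm["cpus-per-task"] = int(match.group(2))
        # Other options are not translated in this sketch.
    return slurm
```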

comment:11 Changed 2 months ago by Nicklas Nordborg

In 5994:

References #1259: Add support for Slurm

Updated some documentation.

comment:12 Changed 2 months ago by Nicklas Nordborg

In 5995:

References #1259: Add support for Slurm

Added support for filtering when retrieving configured clusters.

comment:13 Changed 8 weeks ago by Nicklas Nordborg

Milestone: Open Grid Scheduler extension v1.4 → Job scheduler extension v1.4

Milestone renamed
