Opened 4 years ago

Closed 3 years ago

#1259 closed task (fixed)

Add support for Slurm

Reported by: Nicklas Nordborg
Owned by:
Priority: critical
Milestone: Job scheduler extension v1.4
Component: net.sf.basedb.opengrid
Keywords:
Cc:

Description

Slurm (https://slurm.schedmd.com) is a workload manager that will probably be used on the new cluster instead of the Open Grid Engine. It has similar functionality and it should not be too difficult to implement support for Slurm for the things that we currently use in Open Grid Engine.

Basically, we need to replace the commands that we use:

  • qsub --> sbatch and srun
  • qstat and qacct --> squeue and sacct
  • qdel --> scancel
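
The command mapping above can be written down as a small lookup table. This is only a sketch of the mapping as stated in this ticket; the class name SlurmCommandMap and its method are hypothetical and are not part of the extension.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of net.sf.basedb.opengrid): maps each
// Open Grid Engine command named in this ticket to the Slurm command(s)
// proposed as its replacement.
final class SlurmCommandMap {
    private static final Map<String, List<String>> REPLACEMENTS = Map.of(
        "qsub", List.of("sbatch", "srun"),
        "qstat", List.of("squeue", "sacct"),
        "qacct", List.of("squeue", "sacct"),
        "qdel", List.of("scancel"));

    private SlurmCommandMap() {}

    // Returns the Slurm replacements, or an empty list for unknown commands.
    static List<String> slurmEquivalents(String openGridCommand) {
        return REPLACEMENTS.getOrDefault(openGridCommand, List.of());
    }
}
```

For example, SlurmCommandMap.slurmEquivalents("qsub") lists both sbatch and srun, since the ticket proposes using both for submitting jobs.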

Scripts that are submitted may need some minor modifications. Slurm sets different environment variables than Open Grid. In some cases it may be possible to simply make a copy, for example:

  • Slurm sets the number of CPUs in SLURM_JOB_CPUS_PER_NODE, which we can copy to NSLOTS.
  • Job priority values have a different range and sign (negative priority in Open Grid and positive 'niceness' in Slurm).
  • The option for specifying the wanted number of slots in Open Grid takes a range (smp 8-16), but in Slurm we need to ask for a specific number (--cpus-per-task).
  • Slurm seems to lack automatic assignment and cleanup of a temporary directory.

And possibly some more things will be found when starting to implement this...
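
As an illustration of the first point in the list (copying SLURM_JOB_CPUS_PER_NODE to NSLOTS), here is a minimal sketch. The class SlotCount is hypothetical. It assumes a single-node job where the variable holds a plain number, and only keeps the leading digits if Slurm reports a multi-node form such as "8(x2)".

```java
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical helper: derives a value for NSLOTS from the environment
// of a Slurm job. Not taken from the extension's source code.
final class SlotCount {
    private SlotCount() {}

    static OptionalInt fromSlurmEnv(Map<String, String> env) {
        String value = env.get("SLURM_JOB_CPUS_PER_NODE");
        if (value == null) return OptionalInt.empty();
        // Keep only the leading digits, so "8" and "8(x2)" both give 8.
        int end = 0;
        while (end < value.length() && Character.isDigit(value.charAt(end))) {
            end++;
        }
        if (end == 0) return OptionalInt.empty();
        try {
            return OptionalInt.of(Integer.parseInt(value.substring(0, end)));
        } catch (NumberFormatException ex) {
            // More digits than fit in an int; treat as unknown.
            return OptionalInt.empty();
        }
    }
}
```

A caller could pass System.getenv() and, if a value is present, export it as NSLOTS before running the script that was written for Open Grid.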

Change History (14)

comment:1 by Nicklas Nordborg, 4 years ago

In 5981:

References #1259: Add support for Slurm

Added support for configuring Slurm cluster in the opengrid-config.xml file by setting <cluster type="slurm" ..>. The major difference is that a different default command for getting information about the slurm version is used (sinfo -V).

Added interface ClusterEngine to handle this and hopefully also other differences in the implementation. But the design may change...

comment:2 by Nicklas Nordborg, 4 years ago

In 5982:

References #1259: Add support for Slurm

Implemented a way to submit jobs to Slurm. Options are mostly hard-coded since what goes into the JobConfig is Open Grid centric. We will either need some kind of translation between options, or actual clients (e.g. Reggie) must be updated to understand the differences.

The Slurm engine is currently generating a script for debugging that exits before the actual script is reached.

comment:3 by Nicklas Nordborg, 4 years ago

In 5984:

References #1259: Add support for Slurm

Implemented support for getting status information about waiting, running and completed jobs on a Slurm cluster. The generated script is still a no-op script, but it seems like the regular flow is working as expected. Jobs can be submitted, they are started when there is a free slot and progress information is returned as expected. It is possible to manually abort both a waiting and a running job. Errors seem to be handled as expected.
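
To illustrate how state strings reported by squeue and sacct might be grouped into waiting, running and completed jobs, here is a minimal sketch. The enum SimpleJobStatus and the class SlurmStateMapper are hypothetical and do not reflect the actual status model in the extension; the grouping of Slurm states is an assumption.

```java
import java.util.Locale;

// Hypothetical, simplified status model (not the extension's own).
enum SimpleJobStatus { WAITING, RUNNING, DONE, ERROR, UNKNOWN }

final class SlurmStateMapper {
    private SlurmStateMapper() {}

    static SimpleJobStatus fromSlurmState(String state) {
        if (state == null || state.trim().isEmpty()) {
            return SimpleJobStatus.UNKNOWN;
        }
        // sacct may report e.g. "CANCELLED by 1000" or a truncated
        // "CANCELLED+"; keep the first word and drop a trailing '+'.
        String key = state.trim().split("\\s+")[0].toUpperCase(Locale.ROOT);
        if (key.endsWith("+")) {
            key = key.substring(0, key.length() - 1);
        }
        switch (key) {
            case "PENDING":
                return SimpleJobStatus.WAITING;
            case "RUNNING":
            case "COMPLETING":
                return SimpleJobStatus.RUNNING;
            case "COMPLETED":
                return SimpleJobStatus.DONE;
            case "FAILED":
            case "CANCELLED":
            case "TIMEOUT":
            case "NODE_FAIL":
            case "OUT_OF_MEMORY":
                return SimpleJobStatus.ERROR;
            default:
                return SimpleJobStatus.UNKNOWN;
        }
    }
}
```

Treating a manually aborted job (CANCELLED) as an error is a choice made for this sketch only; a client may want a separate status for it.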

comment:4 by Nicklas Nordborg, 4 years ago

In 5985:

References #1259: Add support for Slurm

Temporary directory handling is moved to the 'srun' script since it seems like environment variables are not updated until then (but it may depend on actual configuration of Slurm and implementation of prolog scripts).

comment:5 by Nicklas Nordborg, 4 years ago

In 5986:

References #1259: Add support for Slurm

Added cluster type to the list when viewing the configuration for clusters.

Changed names and titles to use "Job scheduler" instead of "Open Grid scheduler" now that it can handle Slurm clusters as well.

comment:6 by Nicklas Nordborg, 4 years ago

In 5987:

References #1259: Add support for Slurm

Implemented support for setting 'nice' value for jobs on Slurm cluster. A simple conversion is made between Open Grid 'priority' and Slurm 'nice' values. The ranges are different:

  • Open Grid: -1023..+1024 (low -> high)
  • Slurm: -2147483645..+2147483645 (high -> low)


A client may call either setPriority() or setSlurmNice() and the other value is automatically updated.
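
The exact conversion is not spelled out in this comment. One simple possibility, shown here only as a sketch, is to negate the value and clamp it to the Open Grid range. The class PriorityNiceConverter and its method names are hypothetical; they are not the setPriority() and setSlurmNice() methods mentioned above.

```java
// Hypothetical sketch of a priority <-> nice conversion. The formula
// (negate and clamp) is an assumption, not the extension's actual code.
final class PriorityNiceConverter {
    static final int MIN_PRIORITY = -1023; // Open Grid lower bound
    static final int MAX_PRIORITY = 1024;  // Open Grid upper bound

    private PriorityNiceConverter() {}

    // Higher Open Grid priority becomes lower (more favourable) Slurm nice.
    static int toSlurmNice(int priority) {
        int clamped = Math.max(MIN_PRIORITY, Math.min(MAX_PRIORITY, priority));
        return -clamped;
    }

    // Nice values outside the Open Grid range are clamped after negation.
    static int toOpenGridPriority(int nice) {
        long negated = -((long) nice); // long avoids overflow on negation
        return (int) Math.max(MIN_PRIORITY, Math.min(MAX_PRIORITY, negated));
    }
}
```

With this scheme the round trip is lossless for priorities within -1023..+1024, while most of the much larger Slurm range collapses onto the two Open Grid end points.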

comment:7 by Nicklas Nordborg, 4 years ago

In 5988:

References #1259: Add support for Slurm

Changed message that is displayed while connecting to clusters.

comment:8 by Nicklas Nordborg, 4 years ago

In 5989:

References #1259: Add support for Slurm

Implemented support for setting options that go into the sbatch file.

comment:9 by Nicklas Nordborg, 4 years ago

In 5990:

References #1259: Add support for Slurm

Added options that should be ignored, since they may interfere with other things if they are set by the user.

comment:10 by Nicklas Nordborg, 4 years ago

In 5991:

References #1259: Add support for Slurm

Implemented a simple solution for translating options between Open Grid and Slurm. So far, the implemented translation is between 'pe' and 'cpus-per-task' since that is the only option we use in Reggie.
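
A translation from a 'pe' value to 'cpus-per-task' could look roughly like the sketch below. The class PeTranslator is hypothetical. It assumes the value has the form "smp 8-16" or "smp 8" as in the ticket description, and it picks the upper bound of the range, which is an arbitrary choice for this example and not necessarily what the extension does.

```java
// Hypothetical sketch: turns an Open Grid 'pe' value such as "smp 8-16"
// into a Slurm --cpus-per-task option. Not the extension's actual code.
final class PeTranslator {
    private PeTranslator() {}

    static String toCpusPerTask(String peValue) {
        // Expect "<environment> <n>" or "<environment> <min>-<max>".
        String[] parts = peValue.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Unexpected pe value: " + peValue);
        }
        String range = parts[1];
        int dash = range.indexOf('-');
        // Assumption: use the upper bound when a range is given.
        String upper = dash < 0 ? range : range.substring(dash + 1);
        int cpus = Integer.parseInt(upper); // throws NumberFormatException if not a number
        if (cpus < 1) {
            throw new IllegalArgumentException("Invalid slot count: " + peValue);
        }
        return "--cpus-per-task=" + cpus;
    }
}
```

Choosing the lower bound instead would make jobs easier to schedule at the cost of fewer CPUs; which one is right depends on the cluster and the job.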

comment:11 by Nicklas Nordborg, 4 years ago

In 5994:

References #1259: Add support for Slurm

Updated some documentation.

comment:12 by Nicklas Nordborg, 4 years ago

In 5995:

References #1259: Add support for Slurm

Added support for filtering when retrieving configured clusters.

comment:13 by Nicklas Nordborg, 4 years ago

Milestone: Open Grid Scheduler extension v1.4 → Job scheduler extension v1.4

Milestone renamed

comment:14 by Nicklas Nordborg, 3 years ago

Resolution: fixed
Status: new → closed

There are probably still issues with Slurm depending on configuration of the Slurm cluster. It has only been tested in a development environment.
