Opened 2 years ago

Closed 2 years ago

#1408 closed enhancement (fixed)

Add support for using Slurm without accounting enabled

Reported by: Nicklas Nordborg
Owned by:
Priority: major
Milestone: Job scheduler extension v1.7
Component: net.sf.basedb.opengrid
Keywords:
Cc:

Description (last modified by Nicklas Nordborg)

The current implementation of the Slurm engine uses the sacct command to get information about finished jobs. However, this requires that accounting has been enabled in Slurm (it is disabled by default). It would be nice to have an alternative that uses a status file, similar to the "direct" engine. It should be relatively easy to let the current wrapper scripts create and update a status file. The drawback is that it will not catch problems that happen before the job is started, but this is probably rare and should only happen if the cluster is misconfigured.

This should be a configurable option for the <cluster> entry in opengrid-config.xml. The default should be to use accounting.

Change History (7)

comment:1 by Nicklas Nordborg, 2 years ago

Description: modified (diff)

comment:2 by Nicklas Nordborg, 2 years ago

Accidentally checked in with references to ticket number #14078 instead of #1408 in changesets:

Log messages are copied in comments below.

comment:3 by Nicklas Nordborg, 2 years ago

In [6827]:

Added support for setting custom options for <cluster> definitions in the config file. For Slurm clusters it is possible to set the <slurm-accounting-disabled>1</slurm-accounting-disabled> flag. This should trigger the use of a status file for storing information about completed jobs instead of using the sacct command (not yet implemented).

comment:4 by Nicklas Nordborg, 2 years ago

In [6829]:

A status file is now created if accounting has been disabled. Handling a manual abort is a bit tricky since the running script (in srun) doesn't produce an exit code. The run.sh script must handle the TERM signal and set the exit code to 137 (for compatibility with the Open Grid Engine).
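
As a rough illustration of the idea (not the actual run.sh from the changeset), a wrapper could trap TERM and record the outcome in a status file roughly like this. The function names, the key=value file format and the state names RUNNING, DONE and ABORTED are all assumptions made for this sketch; only the TERM handling and the exit code 137 come from the comment above.

```shell
#!/bin/sh
# Sketch of a job wrapper that records its outcome in a status file.
# The file format (key=value lines) and all names are hypothetical.

# Write the status via a temporary file and a rename, so that a reader
# never sees a half-written file.
# $1 = status file, $2 = state, $3 = exit code (may be empty)
write_status() {
  printf 'state=%s\nexit_code=%s\n' "$2" "$3" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Run a command and keep the status file up to date.
# usage: run_with_status STATUS_FILE COMMAND [ARGS...]
run_with_status() {
  rws_file=$1
  shift
  rws_child=
  # On TERM (for example after a manual abort): stop the child,
  # record exit code 137 and exit with that code.
  trap 'if [ -n "$rws_child" ]; then kill "$rws_child" 2>/dev/null; fi
        write_status "$rws_file" ABORTED 137
        exit 137' TERM
  write_status "$rws_file" RUNNING ""
  # Run the command in the background and wait for it, so that the
  # shell can react to TERM while the command is still running.
  "$@" &
  rws_child=$!
  rws_code=0
  wait "$rws_child" || rws_code=$?
  trap - TERM
  write_status "$rws_file" DONE "$rws_code"
  return "$rws_code"
}
```

Calling, for example, `run_with_status job.status ./my-job.sh` (both names are placeholders) leaves `exit_code=` set to the command's own exit code on normal completion, and to 137 if the wrapper receives TERM first. The command is started in the background because a shell only runs a trap handler once the current foreground command has finished, whereas `wait` is interrupted as soon as a trapped signal arrives.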

comment:5 by Nicklas Nordborg, 2 years ago

In [6830]:

Sometimes there is an extra request for status information for a job that has already been registered as finished. This seems to happen if a request for a new status update comes in while the processAsyncRequests() method is executing. The solution is to remove those new requests at the end of the method.

comment:6 by Nicklas Nordborg, 2 years ago

In [6831]:

Fixes an issue with cancelling jobs and trapping the signals that Slurm sends out. Something worked differently between my development machine and the new cluster.

comment:7 by Nicklas Nordborg, 2 years ago

Resolution: fixed
Status: new → closed