Opened 2 years ago
Closed 2 years ago
#1408 closed enhancement (fixed)
Add support for using Slurm without accounting enabled
Reported by: | Nicklas Nordborg | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | Job scheduler extension v1.7 |
Component: | net.sf.basedb.opengrid | Keywords: | |
Cc: |
Description (last modified by )
The current implementation of the Slurm engine uses the sacct command to get information about finished jobs. However, this requires that accounting has been enabled in Slurm (it is disabled by default). It would be nice to have an alternative that uses a status file, similar to the "direct" engine. It should be relatively easy to let the current wrapper scripts create and update a status file. The drawback is that it will not catch problems that happen before the job is started, but this should be rare and only happen if there is a misconfiguration on the cluster.
This should be a configurable option for the <cluster> entry in opengrid-config.xml. The default should be to use accounting.
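For reference, the accounting-based check queries Slurm's accounting database with sacct. A minimal sketch of that kind of query; the job id and the selected fields are illustrative, not necessarily what the engine uses:

```sh
# Hypothetical sacct query for a finished job; job id and fields are examples.
sacct -j 12345 --noheader --parsable2 --format=JobID,State,ExitCode,Elapsed
# If accounting storage is not enabled this reports an error instead of
# job data, which is why a status-file alternative is needed.
```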
Change History (7)
comment:1 by , 2 years ago
Description: | modified (diff) |
---|
comment:2 by , 2 years ago
comment:3 by , 2 years ago
In [6827]:
Added support for setting custom options for <cluster> definitions in the config file. For Slurm clusters it is possible to set the <slurm-accounting-disabled>1</slurm-accounting-disabled> flag. This should trigger the use of a status file for storing information about completed jobs instead of using the sacct command (not yet implemented).
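A rough sketch of how such a <cluster> definition might look in opengrid-config.xml; apart from the slurm-accounting-disabled flag mentioned above, the element names and attributes are assumptions, not the actual schema:

```xml
<!-- Hypothetical cluster entry; only the slurm-accounting-disabled flag
     is taken from this ticket, the rest is illustrative. -->
<cluster id="my-slurm-cluster" type="slurm">
  <address>slurm.example.com</address>
  <!-- 1 = sacct is not available; use the status file written by the
       wrapper scripts to get information about completed jobs. -->
  <slurm-accounting-disabled>1</slurm-accounting-disabled>
</cluster>
```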
comment:4 by , 2 years ago
In [6829]:
A status file is now created if accounting has been disabled. Handling a manual abort is a bit tricky, since the script that is running (in srun) doesn't produce an exit code. The run.sh script must handle the TERM signal and set the exit code to 137 (for compatibility with the Open Grid engine).
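As a rough illustration of what that TERM handling could look like in a wrapper script (the status file name and format here are assumptions, not the extension's actual run.sh):

```sh
#!/bin/bash
# Hypothetical fragment of a run.sh-style wrapper script.
STATUS_FILE="job.status"

# Slurm sends TERM when the job is cancelled; translate that into
# exit code 137 for compatibility with the Open Grid engine.
trap 'echo "exit_code=137" > "$STATUS_FILE"; exit 137' TERM

srun ./actual-job.sh
rc=$?

# Record the real exit code so the engine can read it from the status
# file instead of querying sacct.
echo "exit_code=$rc" > "$STATUS_FILE"
exit $rc
```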
comment:5 by , 2 years ago
In [6830]:
Sometimes there is an extra request for status information for a job that has already been registered as finished. This seems to happen if a request for a new status update comes in while the processAsyncRequests() method is executing. The solution is to remove those new requests at the end of the method.
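In outline, the fix is the common pattern of dropping requests that were queued while the processing loop was running. A hypothetical sketch of that pattern; the class and field names are made up and do not come from the actual net.sf.basedb.opengrid code:

```java
import java.util.*;

// Hypothetical illustration only; not the actual opengrid implementation.
class StatusPoller {
    private final List<String> pendingJobIds =
        Collections.synchronizedList(new ArrayList<>());
    private final Set<String> finishedJobIds = new HashSet<>();

    void processAsyncRequests() {
        // Work on a snapshot of the requests that exist when the method starts.
        for (String jobId : new ArrayList<>(pendingJobIds)) {
            if (checkIfFinished(jobId)) {
                finishedJobIds.add(jobId);
            }
        }
        // Requests that arrived while the loop was running may refer to jobs
        // that were just registered as finished; remove them here so they are
        // not processed again in the next round.
        pendingJobIds.removeIf(finishedJobIds::contains);
    }

    private boolean checkIfFinished(String jobId) {
        // Placeholder for the real check (sacct or the status file).
        return true;
    }
}
```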
comment:6 by , 2 years ago
In [6831]:
Fixes an issue with cancelling jobs and trapping the signals that Slurm sends out. Something worked differently between my development machine and the new cluster.
comment:7 by , 2 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Accidentally checked in with references to ticket number #14078 instead of #1408 in the changesets. Log messages are copied in the comments below.