Opened 8 years ago
Closed 8 years ago
#904 closed task (fixed)
Implement an Open Grid Scheduler service extension
Reported by: | Nicklas Nordborg | Owned by: | Nicklas Nordborg |
---|---|---|---|
Priority: | critical | Milestone: | Open Grid Scheduler service v1.0 |
Component: | net.sf.basedb.opengrid | Keywords: | |
Cc: |
Description (last modified by )
Reggie already has a lot of functionality for submitting and monitoring jobs on an Open Grid Scheduler cluster. In the future we may want to do the same from other extensions. For a number of reasons it is not a good idea to simply duplicate the functionality from Reggie:
- Each extension would need its own monitoring service. It would be better to have a single service for this.
- Each service needs to be able to identify which jobs it should monitor. This could, for example, be done by defining multiple queues on the cluster, but it would be better not to require any special configuration on the cluster.
- Configuration files for accessing the cluster would be duplicated (though this also has the benefit that different user accounts can be used).
The new extension should include functions for submitting, monitoring and aborting jobs. Some sort of notification system must be implemented so that the extension that submitted the job is notified when the job has finished.
Reggie also uses the cluster for some simple operations not related to the Open Grid Scheduler. Examples are the "Check data files" functionality, which uses a single node to execute a script immediately, and the "Auto-analyze" functionality for monitoring sequencing progress, which is piggy-backed onto the Open Grid Scheduler monitoring but executes a different script.
It will not be easy to untangle this and decide which functionality should go where. We need to think about the details a bit more and we also need more support in the BASE core http://base.thep.lu.se/ticket/2027.
See also #905.
Change History (52)
comment:1 by , 8 years ago
Component: | not classified → net.sf.basedb.opengrid |
---|---|
Milestone: | → Open Grid Scheduler service v1.0 |
comment:2 by , 8 years ago
Description: | modified (diff) |
---|
comment:3 by , 8 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:4 by , 8 years ago
comment:5 by , 8 years ago
comment:6 by , 8 years ago
comment:7 by , 8 years ago
comment:8 by , 8 years ago
(In [4067]) References #904: Implement an Open Grid Scheduler service extension
Implemented support for connecting to a remote server via SSH and executing a command on it. Most functionality is taken from the existing Reggie implementation with some refactoring for setting up connection settings. The code may undergo more refactoring in the future when more functionality is being developed.
comment:9 by , 8 years ago
comment:10 by , 8 years ago
comment:11 by , 8 years ago
comment:12 by , 8 years ago
(In [4124]) References #904: Implement an Open Grid Scheduler service extension
Re-factored the file transfer code a bit to make setting/getting file metadata more predictable. The old implementation didn't set metadata until after the transfer; it should now happen before the transfer.
Added generic file transfer implementations for InputStream and OutputStream instances.
comment:13 by , 8 years ago
comment:14 by , 8 years ago
(In [4130]) References #904: Implement an Open Grid Scheduler service extension
Started to work on the job submission to the cluster. Most of the code is carried over from Reggie. So far the job script and any files will be transferred to the cluster but the actual job submission is not yet implemented.
comment:15 by , 8 years ago
comment:16 by , 8 years ago
(In [4203]) References #904: Implement an Open Grid Scheduler service extension
Started to link jobs submitted to a cluster with jobs registered in BASE already in the JobDefinition. This will cause OpenGridSession.submitJobs() to update everything that is required on the BASE job in order to set up status updates, abort handling etc.
The new classes in the net.sf.basedb.opengrid.service package should take care of status updates and aborting jobs, but they are so far only skeletons without functionality.
comment:17 by , 8 years ago
(In [4205]) References #904: Implement an Open Grid Scheduler service extension
The service can now abort jobs. Lots of remaining work for getting status updates. We also need to figure out how clusters should be configured and how to keep track of them. Should the configuration be made in this extension or by other extensions that are actually using the clusters? It's important to solve problems related to starting/stopping the service and how updates are handled when everything needs to be reconfigured.
comment:18 by , 8 years ago
(In [4212]) References #904: Implement an Open Grid Scheduler service extension
Started to work on getting status information from waiting/running jobs using the qstat command. Parsing the XML data requires the JDOM library. Also added support for getting the current date+time from the cluster since we may need to correct for time differences between the BASE server and the cluster. The CmdResult class has been refactored to support returning a result that has been parsed and generated from the stdout data (e.g. a Date or a list of JobStatus information).
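The idea of a command result that also carries a parsed value can be sketched roughly as below. This is only an illustration under several assumptions: `ParsedCmdResult`, `SimpleJobStatus` and `QstatParser` are hypothetical names and not the extension's actual classes, the JDK's built-in DOM parser is used instead of JDOM to keep the example dependency-free, and the element names (`job_list`, `JB_job_number`, `state`) are assumed to match what `qstat -xml` emits on the cluster.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical stand-in for CmdResult: raw stdout plus a value parsed from it. */
final class ParsedCmdResult<T> {
    final int exitStatus;
    final String stdout;
    final T result;

    ParsedCmdResult(int exitStatus, String stdout, Function<String, T> parser) {
        this.exitStatus = exitStatus;
        this.stdout = stdout;
        // Only parse when the command succeeded; otherwise leave the result empty.
        this.result = exitStatus == 0 ? parser.apply(stdout) : null;
    }
}

/** Hypothetical job status record; the real JobStatus class may look different. */
final class SimpleJobStatus {
    final String jobId;
    final String state;

    SimpleJobStatus(String jobId, String state) {
        this.jobId = jobId;
        this.state = state;
    }
}

final class QstatParser {
    /** Extracts one status entry per job_list element (element names are assumptions). */
    static List<SimpleJobStatus> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList jobs = doc.getElementsByTagName("job_list");
            List<SimpleJobStatus> out = new ArrayList<>();
            for (int i = 0; i < jobs.getLength(); i++) {
                Element job = (Element) jobs.item(i);
                out.add(new SimpleJobStatus(text(job, "JB_job_number"), text(job, "state")));
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse qstat output", e);
        }
    }

    /** Returns the trimmed text of the first matching descendant, or null if absent. */
    private static String text(Element parent, String tag) {
        NodeList n = parent.getElementsByTagName(tag);
        return n.getLength() == 0 ? null : n.item(0).getTextContent().trim();
    }
}
```

A caller would then create something like `new ParsedCmdResult<>(exitStatus, stdout, QstatParser::parse)` and read `result` directly instead of parsing stdout again.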
comment:19 by , 8 years ago
(In [4222]) References #904: Implement an Open Grid Scheduler service extension
Job progress is now updated as the job is executing. Finished jobs are detected and registered correctly with ERROR or DONE. Error messages from stderr and result processing from successful jobs have not been implemented yet. We must also remember to limit how often status updates are requested from the server. Doing it too often (e.g. every 5s) will result in multiple 'qacct' calls that fail after a job has ended since the information has not been recorded yet.
comment:20 by , 8 years ago
(In [4226]) References #904: Implement an Open Grid Scheduler service extension
Limiting status updates to once every minute (30 seconds when debugging). Do not set it to a shorter interval since it may cause jobs to be reported as not found due to the delay for finished jobs to appear in the 'qacct' result.
Aborting jobs is still done every 5 seconds.
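One way to combine a fast abort loop with slower status updates is a small throttle that the 5-second timer consults before asking for status. The sketch below is an assumption about how this could be structured; `UpdateThrottle` is a hypothetical helper and not a class from the extension.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: lets an action run at most once per interval. */
final class UpdateThrottle {
    private final long intervalMillis;
    private long lastRun;
    private boolean hasRun;

    UpdateThrottle(long interval, TimeUnit unit) {
        this.intervalMillis = unit.toMillis(interval);
    }

    /**
     * Returns true and records the time if at least one interval has passed
     * since the last accepted call; otherwise returns false.
     */
    synchronized boolean tryAcquire(long nowMillis) {
        if (hasRun && nowMillis - lastRun < intervalMillis) {
            return false;
        }
        lastRun = nowMillis;
        hasRun = true;
        return true;
    }
}
```

On every 5-second tick the service could process abort requests unconditionally and only request status when `tryAcquire(System.currentTimeMillis())` returns true, with the throttle created as `new UpdateThrottle(60, TimeUnit.SECONDS)` (or 30 seconds when debugging).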
comment:21 by , 8 years ago
(In [4229]) References #904: Implement an Open Grid Scheduler service extension
The stderr file is now retrieved for jobs that fail.
Added a 'fail immediately' option (on by default) to the JobDefinition class. This is the easiest way to make sure errors are reported back to BASE unless implementing error handling in the bash script itself.
comment:22 by , 8 years ago
comment:23 by , 8 years ago
comment:24 by , 8 years ago
(In [4254]) References #904: Implement an Open Grid Scheduler service extension
Started to re-design the configuration system. The service part of the extension will now read settings for Open Grid clusters from opengrid-config.xml.
The idea is to somehow link the settings with job agents to be able to control usage permissions to the clusters. Even if only a single cluster exists this would be useful for example to schedule jobs for two different BASE projects to different user accounts on the Open Grid system.
comment:25 by , 8 years ago
(In [4255]) References #904: Implement an Open Grid Scheduler service extension
Added support for linking Open Grid clusters to job agents defined in BASE. Simply set the <job-agent-id> setting in opengrid-config.xml to the same string as the External ID field of a job agent in BASE. This will link the cluster to the job agent. The logged in user must have USE permission on the job agent in order to use the Open Grid cluster. No other settings on the job agent are used. Open Grid clusters without a link can be used by any user.
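The access rule described above can be expressed as a small predicate. This is a simplified sketch and not the extension's code: `ClusterAccess` is a hypothetical name, and the set of external IDs is assumed to have been collected beforehand from the job agents on which the logged in user has USE permission.

```java
import java.util.Set;

/** Hypothetical model of the cluster to job agent link. */
final class ClusterAccess {
    /**
     * @param jobAgentId the <job-agent-id> value configured for the cluster,
     *        or null if the cluster is not linked to a job agent
     * @param usableAgentExternalIds external IDs of the job agents the
     *        current user has USE permission on (assumed to be precomputed)
     */
    static boolean canUse(String jobAgentId, Set<String> usableAgentExternalIds) {
        // Clusters without a link can be used by any user.
        if (jobAgentId == null) {
            return true;
        }
        // Linked clusters require USE permission on the matching job agent.
        return usableAgentExternalIds.contains(jobAgentId);
    }
}
```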
comment:26 by , 8 years ago
comment:27 by , 8 years ago
comment:28 by , 8 years ago
(In [4263]) References #904: Implement an Open Grid Scheduler service extension
The node name is now a separate property of the job. See http://base.thep.lu.se/ticket/2050
comment:29 by , 8 years ago
(In [4264]) References #904: Implement an Open Grid Scheduler service extension
Added configuration options for temporary/working folders. A job should typically use the ${TMPDIR} directory for storing data that is only needed during the job execution and that can be removed after the job has finished. It is up to the job scripts to actually use the temporary folder. It is provided here as a service only. It is also possible to set up a different folder for debugging which is not cleaned up afterwards.
comment:30 by , 8 years ago
comment:31 by , 8 years ago
(In [4267]) References #904: Implement an Open Grid Scheduler service extension
Added a configuration option JobConfig.setCreatePrivateFiles() to make the script generator automatically include umask -S u=rwx,g=,o=. This makes sure that all files that are created by the job script remain private. The option is ON by default.
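As a rough illustration of what such a script generator step might look like, the sketch below prepends the umask line when the option is enabled. `ScriptPrologue` and its `build` method are hypothetical, and the `#!/bin/bash` first line is an assumption; only the umask command itself is taken from the changeset description.

```java
/** Hypothetical script generator step; not the extension's actual implementation. */
final class ScriptPrologue {
    /**
     * Builds a job script from a body, optionally making all files created
     * by the script private to the owning user.
     */
    static String build(boolean createPrivateFiles, String body) {
        StringBuilder sb = new StringBuilder("#!/bin/bash\n"); // assumed shebang
        if (createPrivateFiles) {
            // Same command as mentioned in the changeset description.
            sb.append("umask -S u=rwx,g=,o=\n");
        }
        sb.append(body);
        if (!body.endsWith("\n")) {
            sb.append('\n');
        }
        return sb.toString();
    }
}
```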
comment:32 by , 8 years ago
(In [4270]) References #904: Implement an Open Grid Scheduler service extension
Improved handling of job work and temporary folders. When submitting jobs, ${WD} is set to the work folder (used for getting files in/out between BASE and the cluster) and ${TMPDIR} is set to a temporary location that the job can use. The ${TMPDIR} is typically local to the node the job is executing on and not accessible from the outside. Files will automatically be removed after a job completes. A debug option can be specified to use a different folder (if configured in opengrid-config.xml) that is accessible from the outside and not deleted.
Also added a utility method OpenGridSession.getJobFileAsString() that is intended to retrieve a text file that was created by the job. It is typically used for parsing useful results and storing them in BASE.
comment:33 by , 8 years ago
comment:34 by , 8 years ago
(In [4278]) References #904: Implement an Open Grid Scheduler service extension
Re-factored the OpenGridCluster/OpenGridSession classes and moved generic SSH functionality to other classes, AbstractHost/AbstractSession and RemoteHost/RemoteSession. The ConnectionInfo has been extended to allow a FileServer as a parameter. The combined changes allow us to connect to any server with generic SSH functionality while keeping the Open Grid specific functionality in separate classes.
comment:35 by , 8 years ago
comment:36 by , 8 years ago
comment:37 by , 8 years ago
comment:38 by , 8 years ago
comment:39 by , 8 years ago
comment:40 by , 8 years ago
comment:41 by , 8 years ago
comment:42 by , 8 years ago
comment:43 by , 8 years ago
(In [4302]) References #904: Implement an Open Grid Scheduler service extension
Jobs that are aborted are no longer immediately set to ERROR status. This allows the regular error handling and job completion procedures to be used and makes it possible for other extensions to catch and trigger some action also when a job is manually aborted (and not just when the job crashes). It takes a bit longer since we have to wait an extra round for the signal/async processing to take place. In the meantime the status is updated to 99% with a message that the abort request has been registered.
comment:44 by , 8 years ago
comment:45 by , 8 years ago
(In [4310]) References #904: Implement an Open Grid Scheduler service extension
Added CmdResult.throwExceptionIfNonZeroExitStatus() to make it easier to implement error handling that doesn't depend on if the error happens due to SSH connection/transport problems or due to command execution problems.
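The benefit is that callers can wrap connecting and executing in a single try/catch block. A minimal sketch of the pattern, assuming a simplified result class (`SimpleCmdResult`, its fields and the choice of `RuntimeException` are all hypothetical; the real CmdResult may differ):

```java
/** Hypothetical, simplified command result; not the extension's CmdResult. */
final class SimpleCmdResult {
    private final int exitStatus;
    private final String stderr;

    SimpleCmdResult(int exitStatus, String stderr) {
        this.exitStatus = exitStatus;
        this.stderr = stderr;
    }

    /**
     * Turns a failed command into an exception so that callers can handle
     * transport errors and command errors in the same catch block.
     * Returns this result unchanged when the exit status is zero.
     */
    SimpleCmdResult throwExceptionIfNonZeroExitStatus() {
        if (exitStatus != 0) {
            throw new RuntimeException(
                "Command failed with exit status " + exitStatus + ": " + stderr);
        }
        return this;
    }
}
```

Returning the result itself lets the check be chained directly onto whatever call executes the command.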
comment:46 by , 8 years ago
comment:47 by , 8 years ago
comment:48 by , 8 years ago
comment:49 by , 8 years ago
comment:50 by , 8 years ago
comment:51 by , 8 years ago
comment:52 by , 8 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
(In [4059]) References #904: Adding root folder for the Open Grid Scheduler service extension.