Opened 2 years ago

Closed 22 months ago

#904 closed task (fixed)

Implement an Open Grid Scheduler service extension

Reported by: nicklas Owned by: nicklas
Priority: critical Milestone: Open Grid Scheduler service v1.0
Component: net.sf.basedb.opengrid Keywords:
Cc:

Description (last modified by nicklas)

Reggie already has a lot of functionality for submitting and monitoring jobs on an Open Grid Scheduler cluster. In the future we may want to do the same from other extensions. For a number of reasons it is not a good idea to simply duplicate the functionality from Reggie:

  • Each extension would need its own monitoring service. It would be better to have a single service for this.
  • Each service needs to be able to identify which jobs it should monitor. This could, for example, be done by defining multiple queues on the cluster, but it would be better to not require any special configuration on the cluster.
  • Configuration files for accessing the cluster would be duplicated (though this also has the benefit that different user accounts can be used).

The new extension should include functions for submitting, monitoring and aborting jobs. Some sort of notification system must be implemented so that the extension that submitted the job is notified when the job has finished.

Reggie is also using the cluster for some simple operations not related to the Open Grid Scheduler. Examples are the "Check data files" functionality, which uses a single node to execute a script immediately, and the "Auto-analyze" functionality for monitoring sequencing progress, which is piggy-backed onto the Open Grid Scheduler monitoring but executes a different script.

It will not be easy to untangle this and decide which functionality should go where. We need to think about the details a bit more and we also need more support in the BASE core: http://base.thep.lu.se/ticket/2027.

See also #905.

Change History (52)

comment:1 Changed 2 years ago by nicklas

  • Component changed from not classified to net.sf.basedb.opengrid
  • Milestone set to Open Grid Scheduler service v1.0

comment:2 Changed 2 years ago by nicklas

  • Description modified (diff)

comment:3 Changed 2 years ago by nicklas

  • Owner changed from jari to nicklas
  • Status changed from new to assigned

comment:4 Changed 2 years ago by nicklas

(In [4059]) References #904: Adding root folder for the Open Grid Scheduler service extension.

comment:5 Changed 2 years ago by nicklas

(In [4060]) References #904: Adding trunk and tags folders for the Open Grid Scheduler service extension.

comment:6 Changed 2 years ago by nicklas

(In [4061]) References #904: Adding license, readme and some other standard files.

comment:7 Changed 2 years ago by nicklas

(In [4062]) References #904: Adding build files with minimal configuration for generating a JAR file.

comment:8 Changed 2 years ago by nicklas

(In [4067]) References #904: Implement an Open Grid Scheduler service extension

Implemented the possibility to connect to a remote server via SSH and execute a command on it. Most functionality is taken from the existing Reggie implementation with some refactoring for setting up connection settings. The code may undergo more refactoring in the future when more functionality is being developed.

comment:9 Changed 2 years ago by nicklas

(In [4121]) References #904: Implement an Open Grid Scheduler service extension

Added support for uploading files to the remote server.

comment:10 Changed 2 years ago by nicklas

(In [4122]) References #904: Implement an Open Grid Scheduler service extension

Added support for downloading files from the remote server.

comment:11 Changed 2 years ago by nicklas

(In [4123]) References #904: Implement an Open Grid Scheduler service extension

Added file transfer support to/from remote servers for files on the BASE file system.

Added support for downloading a file on a remote server directly to a web browser without having to store it locally.

comment:12 Changed 2 years ago by nicklas

(In [4124]) References #904: Implement an Open Grid Scheduler service extension

Re-factored the file transfer code a bit to make setting/getting file metadata more predictable. The old implementation didn't set metadata until after the transfer; it should now happen before the transfer.

Added generic file transfer implementations for InputStream and OutputStream instances.

comment:13 Changed 2 years ago by nicklas

(In [4126]) References #904: Implement an Open Grid Scheduler service extension

Re-factored cluster configuration. Connection information is kept in one configuration object and other cluster settings are kept in a ClusterConfig object. So far, the path to the job folder is the only setting here.

comment:14 Changed 2 years ago by nicklas

(In [4130]) References #904: Implement an Open Grid Scheduler service extension

Started to work on the job submission to the cluster. Most of the code is carried over from Reggie. So far the job script and any files will be transferred to the cluster but the actual job submission is not yet implemented.

comment:15 Changed 2 years ago by nicklas

(In [4197]) References #904: Implement an Open Grid Scheduler service extension

Submitting jobs to the cluster is now working. There is still no way to check status or to get information about running or finished jobs.

comment:16 Changed 2 years ago by nicklas

(In [4203]) References #904: Implement an Open Grid Scheduler service extension

Started to link jobs submitted to a cluster with jobs registered in BASE already in the JobDefinition. This will cause OpenGridSession.submitJobs() to update everything that is required on the BASE job in order to set up status updates, abort handling etc.

The new classes in the net.sf.basedb.opengrid.service package should take care of status updates and aborting jobs, but they are so far only skeletons without functionality.

comment:17 Changed 2 years ago by nicklas

(In [4205]) References #904: Implement an Open Grid Scheduler service extension

The service can now abort jobs. Lots of remaining work for getting status updates. We also need to figure out how clusters should be configured and how to keep track of them. Should the configuration be made in this extension or by other extensions that are actually using the clusters? It's important to solve problems related to starting/stopping the service and how updates are handled when everything needs to be reconfigured.

comment:18 Changed 2 years ago by nicklas

(In [4212]) References #904: Implement an Open Grid Scheduler service extension

Started to work on getting status information from waiting/running jobs using the qstat command. Parsing the XML data requires the JDom library. Also added support for getting the current date+time from the cluster since we may need to correct for time differences between the BASE server and the cluster. The CmdResult class has been refactored to support returning a result that has been parsed and generated from the stdout data (e.g. a Date or a list of JobStatus information).
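
A minimal sketch of the parsing step, for illustration only. It uses the XML parser bundled with the JDK instead of JDom so that it is self-contained. The classes QstatJobStatus and QstatXmlParser are hypothetical names and not the classes in this extension, and the element names (job_list, JB_job_number, state) are assumed to match what 'qstat -xml' typically prints.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical value class; not the JobStatus class of the extension.
final class QstatJobStatus {
    final String jobId;
    final String state;

    QstatJobStatus(String jobId, String state) {
        this.jobId = jobId;
        this.state = state;
    }
}

// Hypothetical parser for the stdout of a 'qstat -xml' style command.
final class QstatXmlParser {

    // Converts the XML text into one QstatJobStatus per <job_list> element.
    static List<QstatJobStatus> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList jobs = doc.getElementsByTagName("job_list");
            List<QstatJobStatus> result = new ArrayList<>();
            for (int i = 0; i < jobs.getLength(); i++) {
                Element job = (Element) jobs.item(i);
                result.add(new QstatJobStatus(text(job, "JB_job_number"), text(job, "state")));
            }
            return result;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse qstat output", ex);
        }
    }

    // Returns the trimmed text of the first descendant element with the given tag, or null.
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }
}
```

Wrapping all parser exceptions in one unchecked exception keeps the calling code simple; whether the real code does the same is not known from this ticket.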

comment:19 Changed 2 years ago by nicklas

(In [4222]) References #904: Implement an Open Grid Scheduler service extension

Job progress is now updated as the job is executing. Finished jobs are detected and registered correctly with ERROR or DONE. Error messages from stderr and result processing for successful jobs have not been implemented yet. We must also remember to limit how often status updates are requested from the server. Doing it too often (e.g. every 5s) will result in multiple 'qacct' calls that fail after a job has ended since the information has not been recorded yet.

comment:20 Changed 2 years ago by nicklas

(In [4226]) References #904: Implement an Open Grid Scheduler service extension

Limiting status updates to once every minute (30 seconds when debugging). Do not set it to a shorter interval since it may cause jobs to be reported as not found due to the delay for finished jobs to appear in the 'qacct' result.

Aborting jobs is still done every 5 seconds.
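
The interval limit can be sketched roughly as below. This is only an illustration under assumptions: the class StatusUpdateThrottle and its method are hypothetical names, and the real service may track the time of the last update differently.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper that lets a periodic task run at most once per interval.
final class StatusUpdateThrottle {
    private final Duration minInterval;
    private Instant lastRun; // null until the first update has been allowed

    StatusUpdateThrottle(Duration minInterval) {
        this.minInterval = minInterval;
    }

    // Returns true (and remembers 'now') if at least minInterval has passed
    // since the last allowed update; otherwise returns false and changes nothing.
    synchronized boolean tryAcquire(Instant now) {
        if (lastRun != null && Duration.between(lastRun, now).compareTo(minInterval) < 0) {
            return false;
        }
        lastRun = now;
        return true;
    }
}
```

A service loop that wakes up every 5 seconds could then always process abort requests, but only ask the cluster for status when tryAcquire(Instant.now()) returns true for a throttle created with Duration.ofSeconds(60).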

comment:21 Changed 2 years ago by nicklas

(In [4229]) References #904: Implement an Open Grid Scheduler service extension

The stderr file is now retrieved for jobs that fail.

Added a 'fail immediately' option (on by default) to the JobDefinition class. This is the easiest way to make sure errors are reported back to BASE unless error handling is implemented in the bash script itself.

comment:22 Changed 2 years ago by nicklas

(In [4234]) References #904: Implement an Open Grid Scheduler service extension

Added JobConfig which is used for setting parameters to 'qsub' when submitting jobs to a cluster.

comment:23 Changed 2 years ago by nicklas

(In [4245]) References #904: Implement an Open Grid Scheduler service extension

Implemented an extension point that is called to notify other extensions (eg. the extension that submitted the job) that a job has been completed.

comment:24 Changed 2 years ago by nicklas

(In [4254]) References #904: Implement an Open Grid Scheduler service extension

Started to re-design the configuration system. The service part of the extension will now read settings for Open Grid clusters from opengrid-config.xml.

The idea is to somehow link the settings with job agents to be able to control usage permissions to the clusters. Even if only a single cluster exists this would be useful for example to schedule jobs for two different BASE projects to different user accounts on the Open Grid system.

comment:25 Changed 2 years ago by nicklas

(In [4255]) References #904: Implement an Open Grid Scheduler service extension

Added support for linking Open Grid clusters to job agents defined in BASE. Simply set the <job-agent-id> setting in opengrid-config.xml to the same string as the External ID field of a job agent in BASE. This will link the cluster to the job agent. The logged in user must have USE permission on the job agent in order to use the Open Grid cluster. No other settings on the job agent are used. Open Grid clusters without a link can be used by any user.

comment:26 Changed 2 years ago by nicklas

(In [4257]) References #904: Implement an Open Grid Scheduler service extension

Added support for getting some cluster information as JSON data.

Added a couple more options/shortcuts for getting information about the cluster by executing some commands (uname and qstat).

comment:27 Changed 2 years ago by nicklas

(In [4258]) References #904: Implement an Open Grid Scheduler service extension

More JSON data instead of simple strings when returning cluster information.

comment:28 Changed 2 years ago by nicklas

(In [4263]) References #904: Implement an Open Grid Scheduler service extension

The node name is now a separate property of the job. See http://base.thep.lu.se/ticket/2050

comment:29 Changed 2 years ago by nicklas

(In [4264]) References #904: Implement an Open Grid Scheduler service extension

Added configuration options for temporary/working folders. A job should typically use the ${TMPDIR} directory for storing data that is only needed during the job execution and that can be removed after the job has finished. It is up to the job scripts to actually use the temporary folder. It is provided here as a service only. It is also possible to set up a different folder for debugging which is not cleaned up afterwards.

comment:30 Changed 2 years ago by nicklas

(In [4265]) References #904: Implement an Open Grid Scheduler service extension

Added isDefined() method for checking if a cluster with a given id has been defined in the configuration.

comment:31 Changed 2 years ago by nicklas

(In [4267]) References #904: Implement an Open Grid Scheduler service extension

Added a configuration option JobConfig.setCreatePrivateFiles() to make the script generator automatically include umask -S u=rwx,g=,o=. This makes sure that all files that are created by the job script remain private. The option is ON by default.
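
To illustrate the idea, here is a minimal sketch of a script generator that adds the umask line when the option is enabled. The class JobScriptSketch and its methods are hypothetical and do not reflect the actual generator in this extension; the '#!/bin/bash' first line is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical script builder; not the generator used by the extension.
final class JobScriptSketch {
    private boolean createPrivateFiles = true; // ON by default, as described above
    private final List<String> commands = new ArrayList<>();

    JobScriptSketch setCreatePrivateFiles(boolean createPrivateFiles) {
        this.createPrivateFiles = createPrivateFiles;
        return this;
    }

    JobScriptSketch addCommand(String command) {
        commands.add(command);
        return this;
    }

    // Builds the script text; the umask line comes before all job commands
    // so that every file created by the job is private to the user.
    String build() {
        StringBuilder script = new StringBuilder("#!/bin/bash\n");
        if (createPrivateFiles) {
            script.append("umask -S u=rwx,g=,o=\n");
        }
        for (String command : commands) {
            script.append(command).append('\n');
        }
        return script.toString();
    }
}
```

For example, new JobScriptSketch().addCommand("echo hello > out.txt").build() would produce a script where out.txt is created without group or other permissions.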

comment:32 Changed 2 years ago by nicklas

(In [4270]) References #904: Implement an Open Grid Scheduler service extension

Improved handling of job work and temporary folders. When submitting jobs the ${WD} is set to the work folder (used for getting files in/out between BASE and the cluster) and ${TMPDIR} is set to a temporary location that the job can use. The ${TMPDIR} is typically local to the node the job is executing on and not accessible from the outside. Files will automatically be removed after a job completes. A debug option can be specified to use a different folder (if configured in opengrid-config.xml) that is accessible from the outside and not deleted.

Also added a utility method OpenGridSession.getJobFileAsString() that is intended to retrieve a text file that was created by the job. Typically used for parsing some useful results and storing them in BASE.

comment:33 Changed 2 years ago by nicklas

(In [4275]) References #904: Implement an Open Grid Scheduler service extension

Added support for configuring some nodes in a cluster to be used for special actions. For example, checking that data files are present and valid.

comment:34 Changed 2 years ago by nicklas

(In [4278]) References #904: Implement an Open Grid Scheduler service extension

Re-factored the OpenGridCluster/OpenGridSession class and moved generic SSH functionality to other classes AbstractHost/AbstractSession and RemoteHost/RemoteSession. The ConnectionInfo has been extended to allow a FileServer as a parameter. The combined changes allow us to connect to any server with generic SSH functionality while keeping the Open Grid specific functionality in separate classes.

comment:35 Changed 2 years ago by nicklas

(In [4284]) References #904: Implement an Open Grid Scheduler service extension

Added the possibility to register "external" status updater implementations for things that need to use the cluster but are not cluster jobs. For example we can piggy-back checks for the NextSeq sequencing runs.

comment:36 Changed 23 months ago by nicklas

(In [4290]) References #904: Implement an Open Grid Scheduler service extension

Aborting jobs was not working since an incorrect job id was sent to the cluster.

comment:37 Changed 23 months ago by nicklas

(In [4294]) References #904: Implement an Open Grid Scheduler service extension

Documented the configuration file.

comment:38 Changed 23 months ago by nicklas

(In [4296]) References #904: Implement an Open Grid Scheduler service extension

Trying to make the documentation more complete and accurate.

comment:39 Changed 23 months ago by nicklas

(In [4297]) References #904: Implement an Open Grid Scheduler service extension

Implemented a check that prevents overwriting qsub options that are set and required by the job definition.

Updated some documentation.

comment:40 Changed 23 months ago by nicklas

(In [4298]) References #904: Implement an Open Grid Scheduler service extension

Generate API documentation with javadoc.

comment:41 Changed 23 months ago by nicklas

(In [4299]) References #904: Implement an Open Grid Scheduler service extension

Removed references to test code that is no longer present.

comment:42 Changed 23 months ago by nicklas

(In [4301]) References #904: Implement an Open Grid Scheduler service extension

Fixed a lot of javadoc errors and warnings.

comment:43 Changed 22 months ago by nicklas

(In [4302]) References #904: Implement an Open Grid Scheduler service extension

Jobs that are aborted are no longer immediately set to ERROR status. This allows the regular error handling and job completion procedures to be used and makes it possible for other extensions to catch and trigger some action also when a job is manually aborted (and not just when the job crashes). It takes a bit longer since we have to wait an extra round for the signal/async processing to take place. In the meantime the status is updated to 99% with a message that the abort request has been registered.

comment:44 Changed 22 months ago by nicklas

(In [4309]) References #904: Implement an Open Grid Scheduler service extension

We should not try to connect to an Open Grid cluster for non-Open Grid jobs that hold no reference to an existing cluster.

comment:45 Changed 22 months ago by nicklas

(In [4310]) References #904: Implement an Open Grid Scheduler service extension

Added CmdResult.throwExceptionIfNonZeroExitStatus() to make it easier to implement error handling that doesn't depend on if the error happens due to SSH connection/transport problems or due to command execution problems.
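
A rough sketch of the idea, not the actual CmdResult class. CmdResultSketch, its factory methods and the way a transport problem is stored are all assumptions made for illustration. The point is that the caller gets one kind of exception regardless of where the problem happened.

```java
// Hypothetical, simplified stand-in for a command result.
final class CmdResultSketch {
    private final int exitStatus;
    private final String stdout;
    private final String stderr;
    private final Exception transportError; // null when the command actually ran

    private CmdResultSketch(int exitStatus, String stdout, String stderr, Exception transportError) {
        this.exitStatus = exitStatus;
        this.stdout = stdout;
        this.stderr = stderr;
        this.transportError = transportError;
    }

    // The command ran on the remote host and returned an exit status.
    static CmdResultSketch completed(int exitStatus, String stdout, String stderr) {
        return new CmdResultSketch(exitStatus, stdout, stderr, null);
    }

    // The command never ran, for example because the connection failed.
    static CmdResultSketch transportFailure(Exception cause) {
        return new CmdResultSketch(-1, "", "", cause);
    }

    String getStdout() {
        return stdout;
    }

    // One call covers both failure types; returns this so calls can be chained.
    CmdResultSketch throwExceptionIfNonZeroExitStatus() {
        if (transportError != null) {
            throw new IllegalStateException("Connection problem: " + transportError.getMessage(), transportError);
        }
        if (exitStatus != 0) {
            throw new IllegalStateException("Command failed with exit status " + exitStatus + ": " + stderr);
        }
        return this;
    }
}
```

Calling code can then be written as result.throwExceptionIfNonZeroExitStatus().getStdout() and use a single catch block for both kinds of failure.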

comment:46 Changed 22 months ago by nicklas

(In [4314]) References #904: Implement an Open Grid Scheduler service extension

Added an extension that inserts a "Configure" button in the services lists page for the Open Grid service. The page will display a list with some information about the configured clusters.

comment:47 Changed 22 months ago by nicklas

(In [4315]) References #904: Implement an Open Grid Scheduler service extension

Added a page for displaying detailed configuration settings for a single cluster.

comment:48 Changed 22 months ago by nicklas

(In [4317]) References #904: Implement an Open Grid Scheduler service extension

Load job agent information for Open Grid clusters that have been linked to a job agent and provide shortcut links to regular BASE pages (view/edit/share).

comment:49 Changed 22 months ago by nicklas

(In [4318]) References #904: Implement an Open Grid Scheduler service extension

Some minor changes and cleanup in the list/view pages.

comment:50 Changed 22 months ago by nicklas

(In [4319]) References #904: Implement an Open Grid Scheduler service extension

Updated to latest versions of SSHj and Bouncy Castle.

comment:51 Changed 22 months ago by nicklas

(In [4324]) References #904: Implement an Open Grid Scheduler service extension

Errors thrown by job completion handlers were only logged, causing the jobs to be marked as DONE (Job completed). We want to rethrow them so that the job is marked with ERROR and the stack trace is made available.
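
The difference between logging only and rethrowing can be sketched as follows. All names here (JobCompletionSketch, notifyHandlers, finish) are hypothetical and the status is kept as a plain string for simplicity; the real code works with BASE jobs and extension points.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical model of a job whose completion handlers may fail.
final class JobCompletionSketch {
    String status = "EXECUTING";
    String stackTrace; // only set when a handler failed

    // Calls each handler; a failure is logged and then rethrown
    // instead of being swallowed after logging.
    static void notifyHandlers(List<Consumer<String>> handlers, String jobId) {
        for (Consumer<String> handler : handlers) {
            try {
                handler.accept(jobId);
            } catch (RuntimeException ex) {
                System.err.println("Completion handler failed for job " + jobId + ": " + ex);
                throw ex;
            }
        }
    }

    // Marks the job as DONE only if all handlers succeeded; otherwise the job
    // is marked with ERROR and the stack trace is kept for display.
    void finish(List<Consumer<String>> handlers, String jobId) {
        try {
            notifyHandlers(handlers, jobId);
            status = "DONE";
        } catch (RuntimeException ex) {
            status = "ERROR";
            StringWriter trace = new StringWriter();
            ex.printStackTrace(new PrintWriter(trace, true));
            stackTrace = trace.toString();
        }
    }
}
```

Without the rethrow in notifyHandlers, finish() would always reach the DONE assignment, which is the behaviour described as the bug above.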

comment:52 Changed 22 months ago by nicklas

  • Resolution set to fixed
  • Status changed from assigned to closed