Changes between Version 14 and Version 15 of net.sf.basedb.opengrid/using


Timestamp:
Feb 1, 2017, 1:59:01 PM (8 years ago)
Author:
Nicklas Nordborg
Comment:

Minor updates to the documentation

  • net.sf.basedb.opengrid/using

    v14 v15  
    88== The use case ==
    99
    10 The use case in this tutorial is that we are going to implement an extension that wants to display a web page inside BASE for starting a job on an Open Grid Cluster. We will concentrate on the Open Grid part of this and will not care about how the extension interacts with BASE. You may want to read more about this in the [http://base.thep.lu.se/chrome/site/latest/html/developer/extensions/index.html BASE documentation]. We will assume that you have been able to display a web page which allows the user to select input parameters for the job. On this web page you also want the user to be able to select an Open Grid Cluster that the job should be submitted to. The list below is an overview of the main steps that are needed.
     10The use case in this tutorial is that we are going to implement an extension that displays a web page inside BASE for starting a job on an Open Grid Cluster. We focus on the Open Grid part of this and care less about how the extension interacts with BASE. You may want to read more about this in the [http://base.thep.lu.se/chrome/site/latest/html/developer/extensions/index.html BASE documentation]. We assume that you have already implemented an extension that can display a web page that allows the user to specify input parameters for the job. On this web page you also want the user to be able to select an Open Grid Cluster that the job should be submitted to. The list below is an overview of the main steps that are needed.
    1111
    1212 1. Display a selection list with available Open Grid Clusters
    13  2. The web page submits the information to a servlet
     13 2. Submit the job parameters and other information to a servlet
    1414 3. The servlet creates a job script and submits it to the cluster
    15  4. When the job is completed we want to get a notification so that we can import files and other data back into BASE
     15 4. Get a notification once the job has been completed, so that we can import files and other data back into BASE
    1616
    1717== Enumerating Open Grid Clusters ==
    1818
    19 In this step we want to display a simple selection list on the web page that allows the user to select a pre-defined Open Grid Cluster.
    20 We recommend that the list is populated by !JavaScript code which uses AJAX, a servlet and JSON to retreive the information. You'll need to implement all of this in your own extension package.
      19In this step we want to display a selection list on the web page that allows the user to select a pre-defined Open Grid Cluster. We recommend that the list is populated by !JavaScript code that uses AJAX, a servlet and JSON to retrieve the information. You need to implement everything on the browser side and most of the servlet in your own extension package.
    2120
    2221The `OpenGridService` class is typically the starting point for a lot of actions. From this class it is possible to get information about and access all clusters that have been defined in the `opengrid-config.xml` file. The service is a singleton; use the `OpenGridService.getInstance()` method to get the instance. Note! It is important that the service is actually running inside BASE. Check on the '''Administrate->Services''' page that this is the case; otherwise you will get an empty service.
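A minimal sketch of what the servlet side might look like is shown below. The method for listing all defined clusters and the `getId()`/`getName()` getters are assumptions (check the actual `OpenGridService` and `OpenGridCluster` API), and the JSON classes are assumed to be available in your extension.

{{{
#!java
// Sketch only: 'getClusters()', 'getId()' and 'getName()' are assumed
// method names, and the JSON classes are assumed to be available
OpenGridService service = OpenGridService.getInstance();

JSONArray json = new JSONArray();
for (OpenGridCluster cluster : service.getClusters( ... ))
{
   JSONObject c = new JSONObject();
   c.put("id", cluster.getId());
   c.put("name", cluster.getName());
   json.add(c);
}
// Write 'json' to the servlet response; the JavaScript code on the
// web page parses it and populates the selection list
}}}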
     
    6766== Creating a job script ==
    6867
    69 In it's simplest form a job script is only a string with one or more (bash) commands to execute. For example, `pwd; ls` is a valid job script that will print the current directory and then list all files in it. To help creating longer and more complex scripts the `ScriptBuilder` class can be used. The `cmd()`, `echo()` and `comment()` methods are more or less self-describing. It is possible to start a command in the background with `bkgr()`, but not that this must be paired with a `waitForProcess()` otherwise the job script may finish before the commmand that is running in the background which may cause unpredictable results. The `progress()` method is a very useful method for jobs that are expected to take a long time to run. The method will write progress information to the `{$WD}/progress` file which will be picked up by the Open Grid Service and reported back to the BASE job that is acting as a proxy.
    70 
    71 When creating a job script there are a few useful variables that has been set up:
    72 
    73  * `{$WD}`: A randomly generated subdirectory in the `<job-folder>` directory. The directory contains the job script. This is also the current working directory when the job is started and the directory that is used for communicating data to/from the BASE server. Data in this directory is preserved after the job has finished. When running post-job code this folder can be found by calling `OpenGridCluster.getWorkFolder()`. Files can be transferred to the BASE server with `OpenGridSession.downloadFile()`, `OpenGridSession.readFile()` or `OpenGridSession.getJobFileAsString()`. The latter method is the simplest one to use for parsing out interesting data from text result files.
    74  * `{$TMPDIR}`: A temporary working directory that is typically only available on the node the job is running on. Unless the job is started in debug mode, this directory is deleted soon after the has been completed.
    75  * `{NSLOTS}`: The number of slots that has been assigned to this job. If the job is starting a multi-threaded analysis program it is common practice to not use more threads that what this value specifies. Note that a single node may run more than one job at the same time and that one slot typically corresponds to one cpu core.
    76 
    77 In the code example below we assume that we have FASTQ files stored on a file server on the network. The FASTQ files are going to be aligned with Tophat and we have a wrapper script that sets all parameters except the number of threads and the location of the FASTQ files. After Tophat we have a second post-alignment script that does some stuff and save the result in a subdirectory.
      68In its simplest form a job script is only a string with one or more (bash) commands to execute. For example, `pwd; ls` is a valid job script that prints the current directory and then lists all files in it. To help you create longer and more complex scripts the `ScriptBuilder` class can be used. The `cmd()`, `echo()` and `comment()` methods are more or less self-describing. It is possible to start a command in the background with `bkgr()`, but note that this must be paired with a `waitForProcess()` call; otherwise the job script may finish before the command that is running in the background, which may cause unpredictable results. The `ScriptBuilder.progress()` method is very useful for jobs that are expected to take a long time to run. The method writes progress information to the `{$WD}/progress` file. This information is picked up by the Open Grid Service and reported back to the BASE job that is acting as a proxy.
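A minimal sketch of the `bkgr()`/`waitForProcess()` pairing is shown below. The exact arguments of the two methods and the `ScriptBuilder` constructor are assumptions (check the `ScriptBuilder` API), and the commands themselves are hypothetical.

{{{
#!java
ScriptBuilder script = new ScriptBuilder(); // assumption: default constructor
script.comment("Start a long-running command in the background");
script.bkgr("./prepare-reference.sh");   // hypothetical command; argument form is an assumption
script.cmd("./prepare-input-files.sh");  // hypothetical command that runs in parallel
// Always pair bkgr() with waitForProcess() so that the job script
// does not finish before the background command has completed
script.waitForProcess( ... );
script.progress(50, "Preparations done");
}}}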
     69
     70When creating a job script you may find the following variables useful:
     71
      72 * `{$WD}`: A randomly generated subdirectory in the `<job-folder>` directory. The directory contains the job script and other data for the current job. This is also the current working directory when the job is started and the directory that is used for communicating data to/from the BASE server. Data in this directory is preserved after the job has finished, and the folder can then be found by calling `OpenGridCluster.getWorkFolder()`. Files can be transferred to the BASE server with `OpenGridSession.downloadFile()`, `OpenGridSession.readFile()` or `OpenGridSession.getJobFileAsString()`. The latter method is the simplest one to use for parsing out interesting data from text result files.
     73 * `{$TMPDIR}`: A temporary working directory that is typically only available on the node the job is running on. Unless the job is started in debug mode, this directory is deleted soon after the job has finished.
      74 * `{NSLOTS}`: The number of slots that have been assigned to this job. If the job is starting a multi-threaded analysis program it is common practice to not use more threads than this value specifies. Note that a single node may run more than one job at the same time and that one slot typically corresponds to one CPU core.
     75
      76In the code example below we assume that we have FASTQ files stored on a file server on the network. We want to align the FASTQ files with Tophat and we have a wrapper script that sets most of the parameters. We only need to provide the number of threads and the location of the FASTQ files. After Tophat we have a second post-alignment script that does some additional processing and saves the result in a subdirectory (`{$TMPDIR}/result`).
    7877
    7978{{{
     
    9493
    9594// Now we only need to copy the results back to our file server.
    96 // Remember that the {$TMPDIR} is cleaned automatically so we don't
    97 // have to mess with that
     95// Remember that the {$TMPDIR} is cleaned automatically so we can
     96// leave that as it is
    9897script.progress(90, "Copying analyzed data back to file server");
    9998script.cmd("cp {$TMPDIR}/result/* /path/to/resultfiles/");
     
    106105== Submitting a job ==
    107106
    108 When the job script has been generated it is time to submit the job to the cluster. For this, you'll need a couple of more objects. The first object is a `JobConfig` instance. Use this for setting various options that are related to the Open Grid [http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html qsub] command. In most cases the default settings should work, but you can for example use the `JobConfig.setPriority()` to change the priority (-p) or `JobConfig.setQsubOption()` to set almost any other option. Some options are set automatically by the job submission procedure and are ignored (-S, -terse, -N, -wd, -o, -e).
      107When the job script has been generated it is time to submit the job to the cluster. For this, you need a couple more objects. The first object is a `JobConfig` instance. Use this for setting various options that are related to the Open Grid [http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html qsub] command. In most cases the default settings should work, but you can for example use `JobConfig.setPriority()` to change the priority (-p) or `JobConfig.setQsubOption()` to set almost any other option. Some options are set automatically by the job submission procedure and are ignored (-S, -terse, -N, -wd, -o, -e).
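As a minimal sketch of just this configuration step (the constructor and the exact method signatures are assumptions; the complete submission example follows further down):

{{{
#!java
JobConfig config = new JobConfig();         // assumption: default constructor
config.setPriority(-100);                   // corresponds to the qsub -p option
config.setQsubOption("-l", "mem_free=8G");  // assumption: option name + value form
}}}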
    109108
    110109You also need a BASE Job item that is an `OTHER` type job. It is recommended that the job is set up so that it can easily be identified later when notification about its completion is sent out. Remember that during the time a job executes on the Open Grid Cluster almost anything can happen on the BASE server, including a restart. Do not rely on information that is stored in memory about jobs that have been submitted to the cluster since this information may not be there when the job completes. We recommend using one or more of `Job.setName()`, `Job.setPluginVersion()` and `Job.setItemSubtype()` to be able to identify the job in a reliable manner. We will explain why this is important in the ''Getting notified when a job completes'' section below.
    111110
    112 Now it is time to create a `JobDefinition` object. This is basically a compilation containing the job script, the job configuration and the BASE job item. The `JobDefinition` is also used for uploading data files that are needed by the job. Read more about this in the ''Advanced usage'' section below.
      111The last object you need is a `JobDefinition` object. It basically bundles the job script, the job configuration and the BASE job item. The `JobDefinition` is also used for uploading data files that are needed by the job. Read more about this in the ''Advanced usage'' section below.
    113112
    114113The final step is to connect to the Open Grid Cluster and submit the job. If we assume that you know the ID of the cluster you can simply use the `OpenGridService.getClusterById()` method and then `OpenGridCluster.connect()` to create an `OpenGridSession` instance that is connected to the cluster. Then, use the `OpenGridSession.qsub()` method to submit the job. Note that this method needs a `List<JobDefinition>` input parameter. If you have multiple jobs to submit it is a lot quicker to submit all of them in one go instead of making multiple calls to the `OpenGridSession.qsub()` method.
     
    134133
    135134// Create a new BASE job and set properties so that we can identify
    136 // it later. The 'null' paramters creates an 'OTHER' type job.
      135// it later. The 'null' parameters create an 'OTHER' type job.
    137136Job job = Job.getNew(dc, null, null, null);
    138137job.setName("My analysis");
     
    172171== Getting notified when a job completes ==
    173172
    174 One important feature is that other extensions can get notified when a job running on the cluster has ended. This is implemented in an asynchronous manner and it should not matter if the BASE server is updated or restarted or otherwise modified while a job is running. In the background there are three parts that work together to make this feature work.
     173One important feature is that extensions can get notified when a job running on the cluster has ended. This is implemented in an asynchronous manner and it should not matter if the BASE server is updated or restarted or otherwise modified while a job is running. In the background there are three parts that work together to make this feature work.
    175174
    176175 * The BASE system for requesting job progress information about external jobs has been set up to send requests to the `OpenGridService` whenever it wants new information about a job. This is the reason why it is important to create a BASE job item as a proxy for the Open Grid Cluster jobs. Without it, no progress information is requested and we never get to know when the job has ended.
    177176 * The `OpenGridService` is polling each registered cluster at regular intervals, typically once every minute, but it may be more or less often depending on whether there are any known jobs executing or not. The `OpenGridSession.qstat()` and `OpenGridSession.qacct()` methods are used for this and will detect waiting, running and completed jobs. For running jobs, the service downloads the `progress` file (see `ScriptBuilder.progress()` above) and updates the progress information in the BASE database.
    178  * Once a job has been detected as completed the service will initiate the job completion sequence. This is implemented as a custom extension point (`net.sf.basedb.opengrid.job-complete`) that receive messages about completed jobs. Extensions that want to get notified should extend this extension point. Note that all registered extensions are notified about all jobs. It doesn't matter which extension that originally submitted the job to the cluster. Notifications are sent both for successful and failed jobs. '''Each extension is responsible for filtering and ignoring notifications about jobs that is of no interest to them'''. This is why it is important to set name, plug-in version, etc. on the job when submitting it. We recommend that this filtering step is done in the `ActionFactory` implementation that is registered for the `net.sf.basedb.opengrid.job-complete` extension point. Note that a single notification may handle more than one completed job. Thus, the `prepareContext()` method is called once and without any information about the jobs while the the `getActions()` method is called once for every job.
     177 * Once a job has been detected as completed the service will initiate the job completion sequence. This is implemented as a custom extension point (`net.sf.basedb.opengrid.job-complete`) that receives messages about completed jobs. Extensions that want to get notified should extend this extension point. Note that all registered extensions are notified about all jobs! It doesn't matter which extension originally submitted the job to the cluster. Notifications are sent both for successful and failed jobs. '''Each extension is responsible for filtering and ignoring notifications about jobs that are of no interest to it'''. This is why it is important to set name, plug-in version, etc. on the job when submitting it. We recommend that this filtering step is done in the `ActionFactory` implementation that is registered for the `net.sf.basedb.opengrid.job-complete` extension point. Note that a single notification may handle more than one completed job. Thus, the `prepareContext()` method is called once and without any information about the jobs while the `getActions()` method is called once for every job.
    179178
    180179{{{
     
    233232}}}
    234233
    235 The `ActionFactory.getActions()` implementation should not do anything except checking if the job is of interest or not. It should return `null` if it is not interested in the job, and an implementation of the `JobCompletionHandler` interface otherwise. This interface defines a single method: `JobCompletionHandler.jobCompleted(SessionControl, OpenGridSession, Job, JobStaus)`. The `Job` and `JobStatus` objects are the same as in the `ActionFactory`, but in this method you also get access to a `SessionControl` instance and a connected `OpenGridSession` to the cluster the job was running on. The `OpenGridSession` can for example be used to download and parse result files. The `SessionControl` can be used to access BASE and update items and/or annotations. The good thing about the `SessionControl` is that it has been automatically configured so that the owner of the job is already logged in and a project (if any is specified on the job) is set as the active project (in the `ActionFactory` the session control is a generic one with the root user logged in).
      234The `ActionFactory.getActions()` implementation should not do anything except check if the job is of interest or not. It should return `null` if it is not interested in the job, and an instance implementing the `JobCompletionHandler` interface otherwise. This interface defines a single method: `JobCompletionHandler.jobCompleted(SessionControl, OpenGridSession, Job, JobStatus)`. The `Job` and `JobStatus` objects are the same as in the `ActionFactory`, but in this method you also get access to a `SessionControl` instance and a connected `OpenGridSession` to the cluster the job was running on. The `OpenGridSession` can for example be used to download and parse result files. The `SessionControl` can be used to access BASE and update items and/or annotations. The good thing about the `SessionControl` is that it has been automatically configured so that the owner of the job is already logged in and a project (if any is specified on the job) is set as the active project (in the `ActionFactory` the session control is a generic one with the root user logged in).
    236235
    237236Do not update the `Job` item since this may interfere with the updates to the job made by the Open Grid extension. The method may return a string to set the status message of the job, or throw an exception to set the job status to ERROR.
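A minimal sketch of a completion handler, under the assumptions above, is shown below. Imports are omitted since the exact package names depend on the API, the result file name is hypothetical, and the arguments to `getJobFileAsString()` are left out because the exact signature is not shown here.

{{{
#!java
public class MyJobCompletionHandler
   implements JobCompletionHandler
{
   @Override
   public String jobCompleted(SessionControl sc, OpenGridSession session, Job job, JobStatus status)
   {
      // Use the connected session to read result files from the job's
      // work folder, for example a (hypothetical) text file with metrics
      String metrics = session.getJobFileAsString( ... "metrics.txt" ... );

      // Use the SessionControl (already logged in as the job owner and with
      // the job's project active) to store the parsed information in BASE
      // ...

      // Do not update the Job item itself; return a status message instead,
      // or throw an exception to set the job status to ERROR
      return "Metrics imported";
   }
}
}}}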
     
    276275=== Uploading data files as part of a !JobDefinition ===
    277276
    278 The `JobDefinition` that is used for submitting a job to an Open Grid Cluster has the ability to upload files that are needed for the job. This is done by calling the `JobDefinition.addFile()` which need an `UploadSource` object as a parameter. The `UploadSource` is an interface but we have provided several implementations that wraps, for example, a `String`, a BASE `File` item or an `InputStream`.
      277The `JobDefinition` that is used for submitting a job to an Open Grid Cluster has the ability to upload files that are needed for the job. This is done by calling the `JobDefinition.addFile()` method with an `UploadSource` parameter. The `UploadSource` is an interface but we have provided several implementations that wrap, for example, a `String`, a BASE `File` item or an `InputStream`.
    279278
    280279Note that calling the `JobDefinition.addFile()` method doesn't start the upload immediately. The upload happens in the `OpenGridSession.qsub()` method. The file is placed in the subfolder of the `<job-folder>` that has been created for the job (the `{$WD}` folder).
    281280
     281
     282{{{
     283#!java
     284JobDefinition jobDef = ...
     285String data = ... // A string with 'data' that should be uploaded
     286
     287UploadSource src = new StringUploadSource("data.csv", data);
     288jobDef.addFile(src);
     289
     290// Uploads the data to {$WD}/data.csv
     291session.qsub( ... jobDef ...);
     292}}}
     293
    282294=== Connecting to non-Open Grid servers ===
    283295
    284 The connection made to Open Grid Clusters is a regular SSH connection. There is really nothing that is special about the connection itself. This means that it is possible to connect to more or less any server that supports SSH. It doesn't matter if the server is running an Open Grid Cluster or not. Note that servers that are defined in the `opengrid-config.xml` are expected to be Open Grid Cluster servers and the `OpenGridService` implementation will try to call Open Grid commands on them.
      296Connections that are made to Open Grid Clusters are regular SSH connections. There is really nothing special about the connection itself. This means that it is possible to connect to more or less any server that supports SSH. It doesn't matter if the server is running an Open Grid Cluster or not. Note that servers that are defined in the `opengrid-config.xml` are expected to be Open Grid Cluster servers and the `OpenGridService` implementation will try to call Open Grid commands on them.
    285297
    286298However, it is possible to programmatically create a `ConnectionInfo` instance and use it for creating a `RemoteHost` object. With this you can connect to the server by calling the `RemoteHost.connect()` method which returns a `RemoteSession` object. It is very similar to what can be done with `OpenGridCluster`/`OpenGridSession` objects, except that the special methods for calling Open Grid Cluster commands are not available.
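A minimal sketch, following the placeholder style used in the other examples, is shown below. How the `ConnectionInfo` and `RemoteHost` instances are created, and what arguments `connect()` takes, depends on the actual API.

{{{
#!java
// Sketch only; creation of the objects below depends on the actual API
ConnectionInfo info = ...   // host address, port, SSH fingerprint, etc.
RemoteHost host = ...       // created from the connection information

RemoteSession session = host.connect( ... );
// The session can be used much like an OpenGridSession, except that the
// Open Grid specific methods (qsub, qstat, qacct, ...) are not available
}}}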
     
    290302=== Tracking non-Open Grid jobs ===
    291303
    292 Sometimes there are other things going on that are not Open Grid jobs that would be interesting to track. One example is the sequencing progress of a sequencer machine. In this case we want to know when the sequencing has been completed and then start analysis jobs (as Open Grid Cluster jobs). A simple bash script has been implemented (http://baseplugins.thep.lu.se/browser/other/pipeline/trunk/nextseq_status.sh) that checks if all result files from the sequencing are present on the file server or not. We want to run this script at regular intervals. When all data is present, run some checks and start the analysis jobs. There are three steps to consider:
      304Sometimes there are other processes, not related to Open Grid jobs, that would be interesting to track. One example is the sequencing progress of a sequencer machine. In this case we want to know when the sequencing has been completed and then start analysis jobs (as Open Grid Cluster jobs). A simple bash script has been implemented (http://baseplugins.thep.lu.se/browser/other/pipeline/trunk/nextseq_status.sh) that checks if all result files from the sequencing are present on the file server or not. We want to run this script at regular intervals. When all data is present, we run some checks to validate the sequence data and, if everything looks good, we start the analysis pipeline. There are three steps to consider:
    293305
    294306 * The sequencing process should be represented by a BASE job item as a proxy. Progress reporting needs to be set up using the extension mechanism implemented in the BASE core. This needs to be implemented completely by the other extension. It is not possible to re-use the setup the Open Grid package uses. In the code example below the `SequencingSignalHandler` class is assumed to take care of this.
     
    333345               
    334346      JobIdentifier jobId = new JobIdentifier(clusterId, flowCellId, job.getId());
     347      // Register a callback that eventually will call the
     348      // getJobStatus() method below
    335349      OpenGridService.getInstance().asyncJobStatusUpdate(jobId, this);
    336350   }
     
    366380}}}
    367381
    368  * When the sequencing has been completed (`status == Job.Status.DONE`) the normal job completion routines in the Open Grid extension which will notify all registered `JobCompletionHandler` implementations. The other extension simply need to extend the `JobCompletionHandler` implementation to be able to detect the sequencing job and then do whatever needs to be done with that.
      382 * When the sequencing has been completed (`status == Job.Status.DONE`) the normal job completion routines in the Open Grid extension notify all registered `JobCompletionHandler` implementations. The other extension simply needs to extend its `JobCompletionHandler` implementation to be able to detect the sequencing job and then do whatever needs to be done with it.
    369383
    370384{{{
     
    416430}}}
    417431
     432Now, every time the Open Grid Scheduler service is restarted, BASE calls the `MyEventHandler.handleEvent()` method.
    418433