Opened 9 years ago

Closed 9 years ago

#682 closed enhancement (fixed)

Improve performance when submitting jobs to cluster

Reported by: Nicklas Nordborg
Owned by: Nicklas Nordborg
Priority: major
Milestone: Reggie v2.17
Component: net.sf.basedb.reggie
Keywords:
Cc:

Description

Submitting multiple jobs to the cluster can take a long time. For example, submitting align jobs for an entire plate may take several minutes.

After some initial investigation, the minimum time it takes to submit a single command to the cluster via SSH seems to be around 0.25 seconds. One align job requires 3 commands and one file transfer, so each submitted job takes at least one second (4 round trips at roughly 0.25 seconds each).

When submitting several jobs, there are things that can be improved:

  • Two commands are used to make sure that the job working directory is empty and exists: rm -rf <dir> and mkdir <dir>. When submitting multiple jobs we could do all of this in a single command.
  • Sending the job scripts opens a new SCPFileTransfer every time. I have not checked if it is possible to re-use the same instance for transferring multiple files and if that is faster.
  • It should also be possible to register jobs on the cluster in a batch. It should just be a matter of keeping track of the order in the output file so that we assign the correct job id to each job.
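The first and last points above can be sketched roughly as follows. This is only an illustration, not the Reggie implementation: the class and method names are made up, and it assumes that the output file contains exactly one job id per line, in the same order as the jobs were registered.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; none of these names exist in Reggie.
class BatchSubmitSketch {

    // Wrap a path in single quotes so the remote shell treats it as one word.
    static String shellQuote(String path) {
        return "'" + path.replace("'", "'\\''") + "'";
    }

    // Build ONE shell command that empties and recreates every work directory,
    // so that only a single SSH round trip is needed for the whole batch.
    static String buildPrepareCommand(List<String> workDirs) {
        List<String> parts = new ArrayList<>();
        for (String dir : workDirs) {
            String quoted = shellQuote(dir);
            parts.add("rm -rf " + quoted + " && mkdir " + quoted);
        }
        return String.join(" && ", parts);
    }

    // Pair each job with a job id by position.
    // ASSUMPTION: one id per output line, in the same order as the jobs.
    static Map<String, String> assignJobIds(List<String> jobNames, List<String> outputLines) {
        if (jobNames.size() != outputLines.size()) {
            throw new IllegalArgumentException(
                "Expected " + jobNames.size() + " job ids but got " + outputLines.size());
        }
        Map<String, String> idsByJob = new LinkedHashMap<>();
        for (int i = 0; i < jobNames.size(); i++) {
            idsByJob.put(jobNames.get(i), outputLines.get(i).trim());
        }
        return idsByJob;
    }
}
```

Failing on a count mismatch is deliberate here: if a registration fails in the middle of the batch, the remaining ids would otherwise be shifted and assigned to the wrong jobs.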

Change History (4)

comment:1 by Nicklas Nordborg, 9 years ago

Further investigations:

  • Batching all rm and mkdir into one SSH command takes more or less the same time as executing a single command. We save a lot of time by doing this.
  • Switching to a single SCPFileTransfer instance made no improvement, but switching to a single SFTPClient cut file transfer times to a few milliseconds per file. We can save a lot of time by doing this.

With everything in place I think we could get down to 1% of the original time for a batch of ~100 jobs!
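A back-of-the-envelope check of the 1% estimate, using the figures from this ticket. The 0.25 seconds per SSH command is the measured value from the description. The 5 ms per SFTP transfer is an assumed value for "a few milliseconds", and the model assumes that the batched version needs only two SSH commands in total (one to prepare the directories, one to register the jobs). The class name is hypothetical.

```java
// Rough cost model only; the constants are estimates, not measurements of the final code.
class SubmitTimeEstimate {

    // Measured in the ticket description: about 0.25 s per SSH command.
    static final double SSH_COMMAND_SECONDS = 0.25;

    // ASSUMPTION: "a few milliseconds" per file is taken as 5 ms.
    static final double SFTP_TRANSFER_SECONDS = 0.005;

    // Original approach: 3 commands + 1 file transfer per job,
    // each costing about one SSH round trip.
    static double unbatchedSeconds(int jobs) {
        return jobs * 4 * SSH_COMMAND_SECONDS;
    }

    // Batched approach (assumed): one command for all rm/mkdir, one command
    // for registering all jobs, plus one fast SFTP transfer per job.
    static double batchedSeconds(int jobs) {
        return 2 * SSH_COMMAND_SECONDS + jobs * SFTP_TRANSFER_SECONDS;
    }

    // Batched time as a fraction of the original time.
    static double ratio(int jobs) {
        return batchedSeconds(jobs) / unbatchedSeconds(jobs);
    }
}
```

For 100 jobs this model gives about 100 seconds before and about 1 second after, which is consistent with the 1% figure under these assumptions.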

Last edited 9 years ago by Nicklas Nordborg

comment:2 by Nicklas Nordborg, 9 years ago

(In [2872]) References #682: Improve performance when submitting jobs to cluster

Changed job submission so that it performs better when submitting multiple jobs.

All file transfers should now use SFTP instead of SCP.

comment:3 by Nicklas Nordborg, 9 years ago

(In [2873]) References #682: Improve performance when submitting jobs to cluster

Changed the implementation for aborting jobs so that the abort signal is received and stored until the next service timer cycle (in OpenGridService). The actual aborting is then performed in OpenGridCluster.abortJobs().

This makes the GUI for aborting jobs much more responsive, since we don't have to wait for each 'qdel' command to be processed. Instead, the job is set to the 'Aborting' state, which changes to 'Error' once the job has been aborted on the cluster.
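The deferred-abort pattern described above can be sketched like this. It is a simplified stand-in and not the code in OpenGridService or OpenGridCluster: the class, the enum and all method names are hypothetical, and the actual 'qdel' call is replaced by a comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical illustration of "store the signal now, abort on the next timer cycle".
class DeferredAbortSketch {

    enum State { RUNNING, ABORTING, ERROR }

    private final Map<String, State> states = new ConcurrentHashMap<>();
    private final Queue<String> pendingAborts = new ConcurrentLinkedQueue<>();

    void register(String jobId) {
        states.put(jobId, State.RUNNING);
    }

    // Called from the GUI: only records the request and returns immediately.
    void requestAbort(String jobId) {
        // Atomically move RUNNING -> ABORTING so a job is queued at most once.
        if (states.replace(jobId, State.RUNNING, State.ABORTING)) {
            pendingAborts.add(jobId);
        }
    }

    // Called on each service timer cycle: drains the queue and aborts in one go.
    // Returns the job ids that were handled in this cycle.
    List<String> onTimerCycle() {
        List<String> batch = new ArrayList<>();
        String jobId;
        while ((jobId = pendingAborts.poll()) != null) {
            batch.add(jobId);
        }
        // A real implementation would run 'qdel' on the cluster for the batch here.
        for (String id : batch) {
            states.put(id, State.ERROR);
        }
        return batch;
    }

    State stateOf(String jobId) {
        return states.get(jobId);
    }
}
```

The point of the design is that requestAbort() never touches the network, so the GUI stays responsive no matter how many jobs are aborted at once.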

comment:4 by Nicklas Nordborg, 9 years ago

Resolution: fixed
Status: new → closed

(In [2881]) Fixes #682: Run picard AddOrReplaceReadGroups after alignment

NOT TRUE, [2881] FIXES #684

Last edited 9 years ago by Nicklas Nordborg