Opened 10 years ago
Closed 10 years ago
#682 closed enhancement (fixed)
Improve performance when submitting jobs to cluster
Reported by: | Nicklas Nordborg | Owned by: | Nicklas Nordborg |
---|---|---|---|
Priority: | major | Milestone: | Reggie v2.17 |
Component: | net.sf.basedb.reggie | Keywords: | |
Cc: |
Description
Submitting multiple jobs to the cluster can take a long time. For example, submitting align jobs for an entire plate may take several minutes.
After some initial investigation, it seems that the minimum time to submit a single command via SSH to the cluster is around 0.25 seconds. One align job requires 3 commands and one file transfer, so it takes at least one second per job submitted.
When submitting several jobs, there are things that can be improved:
- Two commands are used to make sure that the job working directory is empty and exists: rm -rf <dir> and mkdir <dir>. When submitting multiple jobs we could do all of this in a single command.
- Sending the job scripts opens a new SCPFileTransfer every time. I have not checked if it is possible to re-use the same instance for transferring multiple files and if that is faster.
- Registering jobs on the cluster should also be possible to do in a batch. It should just be a matter of keeping track of the order in the output file so we assign the correct job id to each job.
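As a rough illustration of the first and last points, the sketch below builds one shell command that clears and recreates several working directories, and pairs job names with job ids purely by position. It is not the Reggie implementation: the class and method names are hypothetical, the directory paths are made up, and it assumes that the registration output contains exactly one job id per non-empty line, in submission order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only.
class BatchSubmitSketch
{
  // Quote a value for a POSIX shell: wrap it in single quotes and
  // escape any embedded single quote as '\''.
  static String shellQuote(String s)
  {
    return "'" + s.replace("'", "'\\''") + "'";
  }

  // Build ONE command line that empties and recreates all working
  // directories, so only a single SSH round trip is needed.
  static String buildPrepareCommand(List<String> workDirs)
  {
    if (workDirs.isEmpty())
    {
      throw new IllegalArgumentException("No working directories given");
    }
    StringBuilder rm = new StringBuilder("rm -rf");
    StringBuilder mkdir = new StringBuilder("mkdir");
    for (String dir : workDirs)
    {
      String quoted = shellQuote(dir);
      rm.append(' ').append(quoted);
      mkdir.append(' ').append(quoted);
    }
    return rm + " && " + mkdir;
  }

  // Pair each job with a job id by position. ASSUMPTION: the output
  // has one job id per non-empty line, in the order the jobs were submitted.
  static Map<String, String> assignJobIds(List<String> jobNames, List<String> outputLines)
  {
    List<String> ids = new ArrayList<>();
    for (String line : outputLines)
    {
      String trimmed = line.trim();
      if (!trimmed.isEmpty()) ids.add(trimmed);
    }
    if (ids.size() != jobNames.size())
    {
      throw new IllegalStateException(
        "Expected " + jobNames.size() + " job ids but found " + ids.size());
    }
    Map<String, String> result = new LinkedHashMap<>();
    for (int i = 0; i < jobNames.size(); i++)
    {
      result.put(jobNames.get(i), ids.get(i));
    }
    return result;
  }

  public static void main(String[] args)
  {
    // Made-up paths and ids, only to show the shape of the result.
    System.out.println(buildPrepareCommand(Arrays.asList("/tmp/job1", "/tmp/job2")));
    System.out.println(assignJobIds(
      Arrays.asList("align-1", "align-2"), Arrays.asList("5001", "5002")));
  }
}
```

Failing on a count mismatch is a deliberate choice in this sketch: with positional matching, a missing or extra line would otherwise silently shift every following job id.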
Change History (4)
comment:2 by , 10 years ago
comment:3 by , 10 years ago
(In [2873]) References #682: Improve performance when submitting jobs to cluster
Changed implementation for aborting jobs so that the signal is received and stored until the next service timer cycle (in OpenGridService). The actual aborting is then performed in OpenGridCluster.abortJobs().
This makes the GUI for aborting jobs much more responsive since we don't have to wait for each 'qdel' command to be processed. Instead the job is set to 'Aborting' state which will change to 'Error' once the job has been aborted on the cluster.
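The deferred-abort idea can be sketched roughly as follows. This is a minimal illustration, not the code from [2873]: the class, its methods and the in-memory state map are hypothetical, and the place where the real service would run 'qdel' on the cluster is only marked with a comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: abort requests are queued immediately and
// processed together on the next timer cycle.
class DeferredAbortSketch
{
  private final ConcurrentLinkedQueue<String> pendingAborts = new ConcurrentLinkedQueue<>();
  private final Map<String, String> states = new ConcurrentHashMap<>();

  // Called from the GUI: returns at once, no cluster command is run here.
  void requestAbort(String jobId)
  {
    states.put(jobId, "Aborting");
    pendingAborts.add(jobId);
  }

  // Called by the service timer: drains everything queued so far
  // and aborts it as one batch. Returns the ids that were processed.
  List<String> onTimerCycle()
  {
    List<String> batch = new ArrayList<>();
    String jobId;
    while ((jobId = pendingAborts.poll()) != null)
    {
      batch.add(jobId);
    }
    if (!batch.isEmpty()) abortJobs(batch);
    return batch;
  }

  // Placeholder for the slow part. A real implementation would issue
  // 'qdel' on the cluster here; this sketch only updates the state.
  private void abortJobs(List<String> jobIds)
  {
    for (String jobId : jobIds)
    {
      states.put(jobId, "Error");
    }
  }

  String getState(String jobId)
  {
    return states.get(jobId);
  }
}
```

A caller would invoke requestAbort() from the request-handling code and onTimerCycle() from the periodic service task, so the user only waits for the queue insertion.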
comment:4 by , 10 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
(In [2881]) Fixes #682: Run picard AddOrReplaceReadGroups after alignment
Further investigations:
- Combining rm and mkdir into one SSH command takes more or less the same time as executing a single command. We save a lot of time by doing this.
- Re-using the same SCPFileTransfer instance didn't make any improvements, but switching to using a single SFTPClient cut file transfer times to a few milliseconds. We can save a lot of time by doing this.
With everything in place I think we could get down to 1% of the original time for a batch of ~100 jobs!