Opened 10 years ago
Closed 10 years ago
#758 closed defect (fixed)
Open grid scheduler jobs with no job id in BASE
Reported by: | Nicklas Nordborg | Owned by: | Nicklas Nordborg |
---|---|---|---|
Priority: | major | Milestone: | Reggie v3.2.1 |
Component: | net.sf.basedb.reggie | Keywords: | |
Cc: |
Description
For some reason a (cufflinks) job was submitted to the cluster but no job id was returned. The job was queued and executed successfully but we could not get any information about it without the job id. Trying to abort the job on the BASE side didn't work either since this triggered a command to the cluster to abort a job with an unknown id (the cluster probably answered with an error message).
The status checking and abort functionality for jobs should be updated to not send commands to the cluster if the job id is not known.
It would also be good if this could be checked before registering the job but I need to check if it is possible to get information about the error in this case.
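A rough sketch of what a check before registering could look like, assuming the submit step can hand back a job id (possibly null or empty) for each job name. The class and method names here (SubmitCheck, findJobsWithoutId) are made up for illustration and are not existing Reggie code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Reggie.
final class SubmitCheck {

    // Returns one error message for each job whose job id is null or blank.
    // An empty list means every submitted job got a job id.
    static List<String> findJobsWithoutId(Map<String, String> jobIdByName) {
        List<String> errors = new ArrayList<>();
        for (Map.Entry<String, String> entry : jobIdByName.entrySet()) {
            String jobId = entry.getValue();
            if (jobId == null || jobId.trim().isEmpty()) {
                errors.add("No job id returned for job: " + entry.getKey());
            }
        }
        return errors;
    }
}
```

The caller could then refuse to register any job that shows up in the returned list and pass the messages on to the user.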
(In [3194]) Fixes #758: Open grid scheduler jobs with no job id in BASE
This was a bit tricky but I think it has been fixed now. It is still a bit unclear what caused the issue to begin with. The main suspect is a threading issue in the SshHost.executeCmd() method, which starts two threads for reading stdout/stderr but doesn't make sure that the threads have finished before trying to read the result. This could have caused some output from the SSH command to be missed.
As an extra safety measure, the places that use a job-id to get information from the cluster have been modified so that they simulate an error when the job-id is missing. The calling code should then handle this using the normal error handling. This is done in:
OpenGridCluster.abortJob()
OpenGridCluster.updateJobStatus()
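The guard in these two methods can be pictured roughly as below. This is only a sketch under assumptions: CommandRunner, CmdResult and the "abort-job" command string are placeholders invented for the example, not the actual Reggie classes or the real cluster command.

```java
// Hypothetical types for illustration; not the actual Reggie implementation.
final class JobIdGuard {

    // Stand-in for whatever sends a command to the cluster (assumption).
    interface CommandRunner {
        CmdResult run(String command);
    }

    // Stand-in for the result of a remote command (assumption).
    static final class CmdResult {
        final int exitStatus;
        final String stdout;
        final String stderr;

        CmdResult(int exitStatus, String stdout, String stderr) {
            this.exitStatus = exitStatus;
            this.stdout = stdout;
            this.stderr = stderr;
        }

        boolean failed() {
            return exitStatus != 0;
        }
    }

    // If the job id is unknown, return a simulated error result instead of
    // contacting the cluster; otherwise run the (placeholder) abort command.
    static CmdResult abortJob(CommandRunner cluster, String jobId) {
        if (jobId == null || jobId.trim().isEmpty()) {
            return new CmdResult(-1, "", "No job id is known for this job");
        }
        return cluster.run("abort-job " + jobId);
    }
}
```

Since the simulated result looks like any other failed command, the caller needs no special case for a missing job id.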
In addition to this, OpenGridCluster.submitJobs() also checks that a job-id is returned for each submitted job. This is handled in the servlets and returned as error messages to the user.
Hopefully we will never see this issue again, but if we do it should be handled better and it should be possible to abort the jobs and clean up.
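For reference, the suspected threading problem and its usual remedy can be sketched as follows. This is not the actual SshHost.executeCmd() code. CommandOutputReader and Drainer are hypothetical names, and plain InputStreams stand in for the SSH channel's stdout/stderr. The point is only that both reader threads are joined before their buffers are read.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not the actual SshHost implementation.
final class CommandOutputReader {

    // Reads one stream to its end on a separate thread.
    private static final class Drainer extends Thread {
        private final InputStream in;
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        private IOException failure;

        Drainer(InputStream in) {
            this.in = in;
        }

        @Override
        public void run() {
            byte[] chunk = new byte[4096];
            try {
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
            } catch (IOException e) {
                failure = e;
            }
        }

        // Only call this after join() has returned.
        String text() throws IOException {
            if (failure != null) {
                throw failure;
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    // Returns {stdout, stderr}. The join() calls make sure both reader
    // threads have finished, so no output is missed when the text is read.
    static String[] readBoth(InputStream stdout, InputStream stderr)
            throws IOException, InterruptedException {
        Drainer out = new Drainer(stdout);
        Drainer err = new Drainer(stderr);
        out.start();
        err.start();
        out.join();
        err.join();
        return new String[] { out.text(), err.text() };
    }
}
```

Without the two join() calls, the result could be read while a reader thread is still running, which matches the kind of lost output suspected here.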