Opened 10 years ago

Closed 10 years ago

#758 closed defect (fixed)

Open grid scheduler jobs with no job id in BASE

Reported by: Nicklas Nordborg
Owned by: Nicklas Nordborg
Priority: major
Milestone: Reggie v3.2.1
Component: net.sf.basedb.reggie
Keywords:
Cc:

Description

For some reason a (cufflinks) job was submitted to the cluster but no job id was returned. The job was queued and executed successfully but we could not get any information about it without the job id. Trying to abort the job on the BASE side didn't work either since this triggered a command to the cluster to abort a job with an unknown id (the cluster probably answered with an error message).

The status checking and abort functionality for jobs should be updated to not send commands to the cluster if the job id is not known.

It would also be good if this could be checked before registering the job but I need to check if it is possible to get information about the error in this case.

Change History (1)

comment:1 by Nicklas Nordborg, 10 years ago

Resolution: fixed
Status: new → closed

(In [3194]) Fixes #758: Open grid scheduler jobs with no job id in BASE

This was a bit tricky but I think it has been fixed now. It is still a bit unclear what caused the issue to begin with. The main suspect is a threading issue in the SshHost.executeCmd() method, which starts two threads for reading stdout/stderr but doesn't make sure that the threads have finished before trying to read the result. This could have caused some output from the SSH command to be missed.
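The suspected race can be illustrated with a minimal sketch. This is not the actual SshHost code; the class and method names (StreamJoinSketch, StreamReader, readBoth) are hypothetical and plain in-memory streams stand in for the SSH channel. The point is only that join() is called on both reader threads before their buffers are read.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the Reggie/BASE implementation.
class StreamJoinSketch {

    /** Copies everything from one stream into a buffer on its own thread. */
    static final class StreamReader extends Thread {
        private final InputStream in;
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        private volatile IOException error;

        StreamReader(InputStream in) {
            this.in = in;
        }

        @Override
        public void run() {
            byte[] chunk = new byte[4096];
            try {
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
            } catch (IOException ex) {
                error = ex;
            }
        }

        /** Must only be called after join() has returned for this thread. */
        String result() throws IOException {
            if (error != null) throw error;
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    /** Reads both streams in parallel and waits for both readers to finish. */
    static String[] readBoth(InputStream stdout, InputStream stderr)
            throws IOException, InterruptedException {
        StreamReader out = new StreamReader(stdout);
        StreamReader err = new StreamReader(stderr);
        out.start();
        err.start();
        // Without these two calls the buffers may still be incomplete
        // when result() is called, so output could be missed.
        out.join();
        err.join();
        return new String[] { out.result(), err.result() };
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample output standing in for the cluster's response.
        InputStream stdout = new ByteArrayInputStream(
                "job 12345 submitted\n".getBytes(StandardCharsets.UTF_8));
        InputStream stderr = new ByteArrayInputStream(new byte[0]);
        String[] result = readBoth(stdout, stderr);
        System.out.print(result[0]);
    }
}
```

Since join() establishes a happens-before relationship with everything the reader thread did, the buffers can be read without further synchronization once both calls have returned.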

As an extra safety measure, the places that use a job-id to get information from the cluster have been modified so that they simulate an error when the job-id is missing, which should then be handled by the calling code using the normal error handling. This is done in:

  • OpenGridCluster.abortJob()
  • OpenGridCluster.updateJobStatus()
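
The guard described above could look roughly like the following sketch. It is an assumption about the general shape only: the types JobIdGuardSketch, CmdResult and ClusterShell are hypothetical, the real method signatures in OpenGridCluster are not shown in this ticket, and the qdel command string is purely illustrative.

```java
// Hypothetical sketch, not the Reggie/BASE implementation.
class JobIdGuardSketch {

    /** Hypothetical result of running a command on the cluster. */
    static final class CmdResult {
        final int exitStatus;
        final String stderr;

        CmdResult(int exitStatus, String stderr) {
            this.exitStatus = exitStatus;
            this.stderr = stderr;
        }
    }

    /** Hypothetical stand-in for the component that talks to the cluster. */
    interface ClusterShell {
        CmdResult run(String command);
    }

    static boolean hasJobId(String jobId) {
        return jobId != null && !jobId.trim().isEmpty();
    }

    /**
     * Aborts a job. If no job id is known, nothing is sent to the cluster.
     * Instead a failed result is simulated so that the caller's normal
     * error handling takes over.
     */
    static CmdResult abortJob(ClusterShell shell, String jobId) {
        if (!hasJobId(jobId)) {
            return new CmdResult(-1, "No job id is known for this job");
        }
        // Illustrative command string only.
        return shell.run("qdel " + jobId.trim());
    }

    public static void main(String[] args) {
        ClusterShell shell = cmd -> new CmdResult(0, "");
        CmdResult result = abortJob(shell, null);
        System.out.println(result.exitStatus + ": " + result.stderr);
    }
}
```

Returning a simulated failure instead of throwing a new kind of exception means the calling code does not need a separate path for the missing-id case.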

In addition to this, OpenGridCluster.submitJobs() also checks that a job-id is returned for each submitted job. This is handled in the servlets and returned as error messages to the user.
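
A check of this kind might be sketched as follows. The class SubmitCheckSketch, the method findJobsWithoutId and the map of job name to returned job-id are all hypothetical and only illustrate the idea of collecting one error message per job that came back without an id.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the Reggie/BASE implementation.
class SubmitCheckSketch {

    /**
     * Takes a map of job name to the job id returned by the cluster and
     * returns one error message for every job that has no id.
     */
    static List<String> findJobsWithoutId(Map<String, String> submitted) {
        List<String> errors = new ArrayList<>();
        for (Map.Entry<String, String> entry : submitted.entrySet()) {
            String jobId = entry.getValue();
            if (jobId == null || jobId.trim().isEmpty()) {
                errors.add("No job id was returned for job '" + entry.getKey() + "'");
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        // Made-up job names and ids.
        Map<String, String> submitted = new LinkedHashMap<>();
        submitted.put("cufflinks-1", "1001");
        submitted.put("cufflinks-2", null);
        for (String error : findJobsWithoutId(submitted)) {
            System.out.println(error);
        }
    }
}
```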

Hopefully we will never see this issue again, but if we do, it should be handled better and it should be possible to abort the jobs and clean up.
