Opened 11 years ago

Closed 10 years ago

#614 closed task (fixed)

Improve error handling when executing jobs on the cluster

Reported by: Nicklas Nordborg Owned by: Nicklas Nordborg
Priority: major Milestone: Reggie v2.16
Component: net.sf.basedb.reggie Keywords:
Cc:

Description

Error handling is important for the functionality of the Reggie<->Cluster integration. In the ideal world all scripts should work as expected and produce some results, and when they don't, a sensible error message should be given and made visible through the Reggie/BASE interface.

The cluster wrapper currently assumes that scripts that are successful return exit code=0 and scripts that fail return exit code != 0. If the script fails, the contents of the 'stderr' stream are used as the error message. Unfortunately, not all scripts/programs we use follow this pattern. The first step is to investigate the programs and see if we can make them behave more like we want. The (possibly incomplete) list of programs:

  • picard
  • trimmomatic
  • bowtie2
  • tophat
  • samtools
  • the pipeline scripts
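A minimal sketch of how each program could be checked against the expected pattern (exit code plus what ends up on stdout and stderr). The `probe` function is a hypothetical helper for illustration and is not part of the pipeline:

```shell
#!/bin/sh
# Hypothetical helper (not part of the pipeline): run a command and report
# its exit code and how many bytes it wrote to stdout and stderr.
probe() {
    out_file=$(mktemp) || return 1
    err_file=$(mktemp) || { rm -f "$out_file"; return 1; }

    # Run the command with each stream captured in its own temporary file.
    "$@" > "$out_file" 2> "$err_file"
    status=$?

    # 'tr' strips the padding that some 'wc' implementations add.
    out_bytes=$(wc -c < "$out_file" | tr -d ' ')
    err_bytes=$(wc -c < "$err_file" | tr -d ' ')

    echo "exit=$status stdout_bytes=$out_bytes stderr_bytes=$err_bytes"
    rm -f "$out_file" "$err_file"
    return 0
}
```

Running it once with valid input and once with deliberately broken input (for example a missing file) should show whether a program follows the exit code=0 / exit code != 0 convention and whether stderr is reserved for error messages.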

Change History (14)

comment:1 by Nicklas Nordborg, 11 years ago

Picard

The exit code for Picard seems to behave as expected.

The standard Picard distribution writes error messages and other information to stderr, which is very annoying when running in debug mode since it produces a very large file, and if there is an error at the end it is beyond the maximum size that BASE can store. Other data, such as FASTQ files and demultiplex metrics, are written to separate files controlled from the command line and don't cause any problems.

To solve the logging issue we have modified the Picard 'htsjdk' module so that only error messages are written to stderr and all other messages to stdout. See https://github.com/nnordborg/htsjdk/commit/4b4d89908ea32c1231e4ec4fdeae936b675d935e

This modification is included in our fork of Picard: https://github.com/nnordborg/picard/tree/lorry

comment:2 by Nicklas Nordborg, 11 years ago

Trimmomatic

The exit code for Trimmomatic seems to behave as expected.

Trimmomatic writes error messages and other information to stderr. This information includes statistics about the trimming and the number of reads that have passed. We currently redirect all output to 'trimmomatic.out', which is saved in the job folder on the cluster (prime) so that we can parse some numbers from it and add them as annotations to the bioassays (PT_READS). This means that error messages from Trimmomatic are currently not stored in BASE, and if there is an error in this step the user must check the 'trimmomatic.out' file for more information.
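As an illustration of parsing a number from 'trimmomatic.out', here is a rough sketch. It assumes paired-end mode and a summary line of the form "Input Read Pairs: N Both Surviving: M (..%) ...". The function name `surviving_pairs` is hypothetical, and this is not necessarily how Reggie extracts the PT_READS value:

```shell
#!/bin/sh
# Hypothetical helper: print the "Both Surviving" count from a Trimmomatic
# paired-end summary line. The exact line format is an assumption.
surviving_pairs() {
    sed -n 's/^Input Read Pairs: [0-9][0-9]* Both Surviving: \([0-9][0-9]*\) .*/\1/p' "$1" |
        head -n 1
}
```

If the summary line is missing (for example because Trimmomatic failed), the function prints nothing, so an empty result is a hint to look in 'trimmomatic.out' for an error message.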

comment:3 by Nicklas Nordborg, 11 years ago

bowtie2

May not detect all errors. For example, if the FASTQ input files are missing, it will run as normal and produce 0-length FASTQ output files. Otherwise it seems to give exit code != 0 if there is an error.
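One possible guard against the missing-input case is to check the files explicitly before and after the bowtie2 call. This is only a sketch; `require_nonempty` and the variable names in the usage comment are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: fail unless every named file exists and is non-empty.
require_nonempty() {
    for f in "$@"; do
        # '-s' is true only if the file exists and has a size greater than zero.
        if [ ! -s "$f" ]; then
            echo "Missing or empty file: $f" >&2
            return 1
        fi
    done
    return 0
}

# Possible usage around the bowtie2 step (file variables are placeholders):
#   require_nonempty "$input_r1" "$input_r2" || exit 1
#   ... run bowtie2 ...
#   require_nonempty "$output_r1" "$output_r2" || exit 1
```

Since the message goes to stderr and the exit code is != 0, a failed check would be reported the same way as any other script error.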

All output (except the FASTQ files) goes to stderr, including the metrics that we are interested in importing back to BASE as annotations. But note that bowtie2 is started via PE_filter.sh, which changes things a bit.

PE_filter.sh

Pipeline script for starting bowtie2. The current configuration redirects all output to filter.out, but the output in this case is normally limited to the command line that starts bowtie2. All output from bowtie2 is redirected to a separate file based on the name of the library (for example 1106394.1.l.r.m.c.lib.g6_R0_fastq.gz.out).

So error messages from bowtie2 will go to one file, and errors from the PE_filter.sh script will go to another.

comment:4 by Nicklas Nordborg, 11 years ago

I have tested one idea with a wrapper script that starts a program (e.g. trimmomatic). The wrapper redirects both stdout and stderr to a temporary file. If the program exits with 0, the temporary file is copied to stdout. If the program exits with any other value, the temporary file is copied to stderr. This should ensure that error messages always end up in the global stderr, and then we only have to redirect stdout to the usual file. For example:

./stdwrap.sh ./trimmomatic [parameters] > trimmomatic.out

The drawback is that all output is going to a temporary file at first and it is more difficult to view partial results as the program is working.
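The wrapper logic described above can be sketched roughly as follows. This is written as a function for illustration only; the actual stdwrap.sh may be implemented differently:

```shell
#!/bin/sh
# Sketch of the wrapper idea, not the actual stdwrap.sh.
stdwrap() {
    tmp_file=$(mktemp) || return 1

    # Capture both stdout and stderr of the wrapped program in one file.
    "$@" > "$tmp_file" 2>&1
    status=$?

    if [ "$status" -eq 0 ]; then
        cat "$tmp_file"        # success: everything goes to stdout
    else
        cat "$tmp_file" >&2    # failure: everything goes to stderr
    fi

    rm -f "$tmp_file"
    # Pass the program's exit code on to the caller.
    return "$status"
}
```

In a standalone script the last line would be something like `stdwrap "$@"`, so that the script exits with the same code as the wrapped program.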

comment:5 by Nicklas Nordborg, 11 years ago

(In [2538]) References #614: Improve error handling when executing jobs on the cluster

Use the new wrapper script for trimmomatic. See http://baseplugins.thep.lu.se/browser/other/pipeline/trunk/stdwrap.sh

comment:6 by Nicklas Nordborg, 10 years ago

(In [2552]) References #614: Improve error handling when executing jobs on the cluster

Got rid of the problematic PE_filter.sh script for running bowtie. The servlet now generates an equivalent call to bowtie, but output files are named a bit differently, so the tophat step following it is not currently working.

comment:7 by Nicklas Nordborg, 10 years ago

tophat

All interesting output goes to files in a folder given as an argument to tophat. Progress information and error messages go to stderr, and exit code != 0 if there is an error. It seems that it is not possible to stop tophat from writing progress information, so we need to use the wrapper script to get rid of the progress information from global stderr when tophat completes successfully. If there is an error, it may also mean that stderr is full of irrelevant information and that the summary shown in BASE does not include the actual error message. The "Stack trace" tab on the job information dialog should show more information.

comment:8 by Nicklas Nordborg, 10 years ago

(In [2553]) References #614: Improve error handling when executing jobs on the cluster

Started to incorporate the alignment step into the main script. The call to tophat seems to be working, but there are several additional steps from the tophat_single.sh script that remain.

comment:9 by Nicklas Nordborg, 10 years ago

(In [2554]) References #614: Improve error handling when executing jobs on the cluster

Added call to picard MarkDuplicates. It is using the unmodified picard version since the script has not set the PicardDir option. This option is currently specified in the <demux> section. I think we should re-arrange some configuration options so that they can be re-used in different places.

comment:10 by Nicklas Nordborg, 10 years ago

(In [2555]) References #614: Improve error handling when executing jobs on the cluster

Re-organized the configuration file which should make it easier to re-use some configuration settings and also to understand when the different settings are used.

comment:11 by Nicklas Nordborg, 10 years ago

samtools

Seems like only error messages go to stderr and exit code != 0 in this case. Interesting output either goes to stdout or to a given file. Hurray!! We don't have to use the wrapper script!

comment:12 by Nicklas Nordborg, 10 years ago

(In [2556]) References #614: Improve error handling when executing jobs on the cluster

Running samtools index ... which was the last part from the tophat script. Created a new statistics script, alignment_statistics.sh, which replaces statistics_tophat.sh. The new script works on a single folder given as an argument to the script instead of using a file with paths to BAM files.

Since this script only uses 'samtools', no redirection wrapper is needed.

comment:13 by Nicklas Nordborg, 10 years ago

(In [2559]) References #614: Improve error handling when executing jobs on the cluster

Copying result files to the project archive. All *.out files are also included, and they are also copied to the job folder so BASE can import some information.

comment:14 by Nicklas Nordborg, 10 years ago

Resolution: fixed
Status: new → closed