Opened 10 years ago
Closed 10 years ago
#614 closed task (fixed)
Improve error handling when executing jobs on the cluster
Reported by: | Nicklas Nordborg | Owned by: | Nicklas Nordborg |
---|---|---|---|
Priority: | major | Milestone: | Reggie v2.16 |
Component: | net.sf.basedb.reggie | Keywords: | |
Cc: |
Description
Error handling is important for the functionality of the Reggie<->Cluster integration. In the ideal world all scripts should work as expected and produce some results, and when they don't a sensible error message should be given and made visible through the Reggie/BASE interface.
The cluster wrapper currently assumes that scripts that are successful return exit code 0 and scripts that fail return a non-zero exit code. If the script fails, the contents of the 'stderr' stream are used as the error message. Unfortunately, not all scripts/programs we use follow this pattern. The first step is to investigate the programs and see if we can make them behave more like we want. The (possibly incomplete) list of programs:
- picard
- trimmomatic
- bowtie2
- tophat
- samtools
- the pipeline scripts
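The convention the wrapper relies on can be sketched as a small shell helper. This is a hypothetical illustration of the assumed behaviour, not the actual Reggie wrapper code; the `run_step` name is made up.

```shell
# Hypothetical sketch of the convention the cluster wrapper assumes:
# exit code 0 = success; anything else = failure, with the contents of
# the stderr stream used as the error message.
run_step() {
    errfile=$(mktemp) || return 1
    "$@" 2> "$errfile"            # run the step, capturing only stderr
    code=$?
    if [ "$code" -ne 0 ]; then
        # failure: surface the captured stderr as the error message
        echo "Step failed (exit $code): $(cat "$errfile")" >&2
    fi
    rm -f "$errfile"
    return "$code"
}
```

A program that writes non-error output to stderr, or exits 0 on failure, breaks this scheme; that is exactly what the investigation below looks for.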
Change History (14)
comment:1 by , 10 years ago
comment:2 by , 10 years ago
Trimmomatic
The exit code for Trimmomatic seems to behave as expected.
Trimmomatic writes error messages and other information to stderr. This information includes statistics about the trimming and the number of reads that have passed. We currently redirect all output to 'trimmomatic.out', which is saved in the job folder on the cluster (prime) so that we can parse some numbers from it and add them as annotations to the bioassays (PT_READS). This means that error messages from Trimmomatic are currently not stored in BASE, and if there is an error in this step the user must check the 'trimmomatic.out' file for more information.
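The parsing step mentioned above could look roughly like this. The summary-line format is an assumption for illustration only, not a verified quote of Trimmomatic's output, and the variable names are made up.

```shell
# Hypothetical sketch of pulling a read count out of trimmomatic.out for
# the PT_READS annotation. The summary-line format below is assumed.
sample_line="Input Read Pairs: 1000 Both Surviving: 900 (90.00%)"
pt_reads=$(printf '%s\n' "$sample_line" \
    | sed -n 's/.*Both Surviving: \([0-9][0-9]*\).*/\1/p')
echo "$pt_reads"    # prints "900"
```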
comment:3 by , 10 years ago
bowtie2
May not detect all errors. For example, if the FASTQ input files are missing, it will run as normal and produce 0-length FASTQ output files. Otherwise it seems to give a non-zero exit code if there is an error.
All output (except the FASTQ files) goes to stderr, including the metrics that we are interested in importing back to BASE as annotations. But note that bowtie2 is started via PE_filter.sh, which changes things a bit.
PE_filter.sh
Pipeline script for starting bowtie2. The current configuration redirects all output to filter.out, but the output in this case is normally limited to the command line that starts bowtie2. All output from bowtie2 is redirected to a separate file based on the name of the library (for example 1106394.1.l.r.m.c.lib.g6_R0_fastq.gz.out).
So error messages from bowtie2 will go to one file, and errors from the PE_filter.sh script will go to another.
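One way to catch the silent-failure mode described above (bowtie2 exiting 0 but producing 0-length output files) would be a post-check on the expected outputs. This helper is a hypothetical sketch, not part of the pipeline, and the file names in the usage line are invented.

```shell
# Hypothetical helper: treat a step as failed if any expected output file
# is missing or zero-length, since bowtie2 can exit 0 in that situation.
check_outputs() {
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "Empty or missing output file: $f" >&2
            return 1
        fi
    done
    return 0
}
# e.g.: bowtie2 ... && check_outputs filtered_R1.fastq.gz filtered_R2.fastq.gz
```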
comment:4 by , 10 years ago
I have tested one idea with a wrapper script that starts a program (e.g. trimmomatic). The wrapper redirects both stdout and stderr to a temporary file. If the program exits with 0, the temporary file is copied to stdout. If the program exits with any other value, the temporary file is copied to stderr. This should ensure that error messages always end up in the global stderr, and then we only have to redirect stdout to the usual file. For example:
./stdwrap.sh ./trimmomatic [parameters] > trimmomatic.out
The drawback is that all output is going to a temporary file at first and it is more difficult to view partial results as the program is working.
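The idea can be sketched as a shell function like the one below. This is a minimal sketch of the described behaviour; the actual stdwrap.sh in the pipeline repository may differ in detail.

```shell
# Minimal sketch of the stdwrap idea: buffer both streams in a temporary
# file; on success the buffer goes to stdout, on failure it goes to
# stderr, and the wrapped program's exit code is preserved either way.
stdwrap() {
    tmp=$(mktemp) || return 1
    "$@" > "$tmp" 2>&1        # capture stdout and stderr together
    code=$?
    if [ "$code" -eq 0 ]; then
        cat "$tmp"            # success: everything ends up in stdout
    else
        cat "$tmp" >&2        # failure: everything ends up in stderr
    fi
    rm -f "$tmp"
    return "$code"
}
```

With this in place, only stdout needs to be redirected to the usual per-step file, and the global stderr stays clean unless the program actually fails.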
comment:5 by , 10 years ago
(In [2538]) References #614: Improve error handling when executing jobs on the cluster
Use the new wrapper script for trimmomatic. See http://baseplugins.thep.lu.se/browser/other/pipeline/trunk/stdwrap.sh
comment:6 by , 10 years ago
(In [2552]) References #614: Improve error handling when executing jobs on the cluster
Got rid of the problematic PE_filter.sh script for running bowtie2. The servlet now generates an equivalent call to bowtie2, but the output files are named a bit differently, so the tophat step that follows is not currently working.
comment:7 by , 10 years ago
tophat
All interesting output goes to files in a folder given as an argument to tophat. Progress information and error messages go to stderr, and the exit code is non-zero if there is an error. It seems it is not possible to stop tophat from writing progress information, so we need to use the wrapper script to keep the progress information out of the global stderr when tophat completes successfully. If there is an error, it may also mean that stderr is full of irrelevant information and that the summary shown in BASE does not include the actual error message. The "Stack trace" tab on the job information dialog should show more information.
comment:8 by , 10 years ago
comment:9 by , 10 years ago
(In [2554]) References #614: Improve error handling when executing jobs on the cluster
Added a call to picard MarkDuplicates. It is using the unmodified picard version since the script has not set the PicardDir option. This option is currently specified in the <demux> section. I think we should re-arrange some configuration options so that they can be re-used in different places.
comment:10 by , 10 years ago
comment:11 by , 10 years ago
samtools
It seems that only error messages go to stderr, and the exit code is non-zero in that case. Interesting output goes either to stdout or to a given file. Hurray! We don't have to use the wrapper script!
comment:12 by , 10 years ago
(In [2556]) References #614: Improve error handling when executing jobs on the cluster
Running samtools index ..., which was the last part from the tophat script. Created a new statistics script, alignment_statistics.sh, which replaces statistics_tophat.sh. The new script works on a single folder given as an argument to the script instead of using a file with paths to BAM files.
Since this script only uses 'samtools', no redirection wrapper is needed.
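The folder-based approach could be structured roughly like this. The sketch is hypothetical (the real alignment_statistics.sh may differ), and the per-file command is parameterized so the pattern can be shown without samtools installed.

```shell
# Hypothetical sketch of the folder-based pattern: iterate over the BAM
# files found in a directory argument and run a command (e.g.
# "samtools index") on each, instead of reading paths from a list file.
for_each_bam() {
    dir=$1; shift
    for bam in "$dir"/*.bam; do
        [ -e "$bam" ] || continue    # empty folder: glob did not match
        "$@" "$bam" || return 1      # stop on the first failing file
    done
    return 0
}
# e.g.: for_each_bam /path/to/jobfolder samtools index
```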
comment:13 by , 10 years ago
comment:14 by , 10 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Picard
The exit code for Picard seems to behave as expected.
The standard Picard distribution writes error messages and other information to stderr, which is very annoying when running in debug mode since it produces a very large file, and if there is an error at the end it is beyond the maximum size that BASE can store. Other data, such as FASTQ files and demultiplex metrics, are written to separate files controlled from the command line and don't cause any problems.
To solve the logging issue we have modified the Picard 'htsjdk' module so that only error messages are written to stderr and all other messages go to stdout. See https://github.com/nnordborg/htsjdk/commit/4b4d89908ea32c1231e4ec4fdeae936b675d935e
This modification is included in our fork of Picard: https://github.com/nnordborg/picard/tree/lorry