Opened 11 years ago

Closed 10 years ago

Last modified 10 years ago

#533 closed task (fixed)

Add secondary analysis section to Reggie

Reported by: Nicklas Nordborg Owned by: Nicklas Nordborg
Priority: critical Milestone: Reggie v2.16
Component: net.sf.basedb.reggie Keywords:
Cc:

Description (last modified by Nicklas Nordborg)

This is the master ticket for adding secondary analysis registration functionality to Reggie. Secondary analysis covers the steps from sequencing until expression values have been generated, including demux and alignment against a reference genome.

Note! Primary analysis is the base calling performed by the Illumina software during the sequencing.

The pipeline will be something like this. See the other tickets (to be created) for more information about each step:

  1. (#545) Register sequencing as ended. Part of the "Library preparation wizards" section and done by someone in the lab.
  2. (#546) Confirm sequencing as completed. First wizard in the "Secondary analysis wizards" section. Used to decide if the sequenced data is ok or not. If ok, continue with demuxing, otherwise flag pools for re-sequencing.
  3. (#547) Start demux and merge. This wizard starts the demux and merge operations.
  4. (#548) Register demux and merge as ended. At the end we have one "MergedSequences" item for each "Library" from the flow cells that were sequenced. A count of the number of reads for each library must be recorded and is used to determine if the library needs to be re-sequenced or not. FASTQ files for each library are stored on the server.
  5. (#593) Start filtering and alignment. Bowtie and TopHat are used to first filter and then align against a pre-defined set of transcripts.
  6. (#595) Register filtering and alignment as ended. At the end we have one "AlignedSequences" item for each "Library" from the flow cells that were sequenced. BAM files for each library are stored on the server.

The remaining issues are postponed to a later release.

  1. Start feature extraction. Cufflinks is used to calculate expression values.
  2. Register feature extraction. At the end we have one "RawBioAssay" item for each "Library" from the flow cells that were sequenced. FPKM files are uploaded to BASE and imported into the database.

Attachments (2)

secondary-analysis-overview-v1.pdf (72.6 KB ) - added by Nicklas Nordborg 11 years ago.
Overview PDF v1
secondary-analysis-overview-v2.pdf (71.5 KB ) - added by Nicklas Nordborg 11 years ago.
Second version of the overview


Change History (47)

comment:1 by Nicklas Nordborg, 11 years ago

Milestone: Reggie v2.x → Reggie v2.15

comment:2 by Nicklas Nordborg, 11 years ago

Description: modified (diff)
Summary: Add primary analysis section to Reggie → Add secondary analysis section to Reggie

comment:3 by Nicklas Nordborg, 11 years ago

Description: modified (diff)

comment:4 by Nicklas Nordborg, 11 years ago

Description: modified (diff)

comment:5 by Nicklas Nordborg, 11 years ago

Description: modified (diff)

comment:6 by Nicklas Nordborg, 11 years ago

Description: modified (diff)

Attachment secondary-analysis-overview-v1.pdf added by Nicklas Nordborg, 11 years ago

Overview PDF v1

comment:7 by Nicklas Nordborg, 11 years ago

(In [2196]) References #533: Add secondary analysis section to Reggie

Re-arranged the index page to make room for the "Secondary analysis wizards" section. The page is now divided in three columns with the new section as the middle column.

Added link to "Register sequencing ended" wizard as the last library preparation wizard. Created two new annotations so that the counter can show a correct value.

Attachment secondary-analysis-overview-v2.pdf added by Nicklas Nordborg, 11 years ago

Second version of the overview

comment:8 by Nicklas Nordborg, 11 years ago

(In [2225]) References #533: Register sequencing as ended

New version of the 'sequencing ended' wizard. Added an option to specify if the first base report failed. If so, the 'sequencing startup' wizard is enabled again for the flow cell. Most of the other changes follow from the fact that there is only a single parent FlowCell for each SequencingRun.

comment:9 by Nicklas Nordborg, 11 years ago

Milestone: Reggie v2.15 → Reggie v2.16

comment:10 by Nicklas Nordborg, 10 years ago

(In [2292]) References #533: Add secondary analysis section to Reggie

Added SSHJ lib.

comment:11 by Nicklas Nordborg, 10 years ago

(In [2293]) References #533: Add secondary analysis section to Reggie

Adding some classes for keeping track of SSH/Open Grid Scheduler hosts. With a properly set up configuration file (reggie-ogs-hosts.xml) it seems to be possible to connect and execute a simple command on the remote server. There is not much error handling yet.

comment:12 by Nicklas Nordborg, 10 years ago

Status: new → assigned

comment:13 by Nicklas Nordborg, 10 years ago

(In [2295]) References #533: Add secondary analysis section to Reggie

Added a page listing connected OGS clusters.

Added some more error handling when sending commands to servers via SSH.

comment:14 by Nicklas Nordborg, 10 years ago

(In [2296]) References #533: Add secondary analysis section to Reggie

Added a service extension for OpenGridService so that we can control it from the web interface (e.g. reload settings after a configuration change).

Added OpenGridSignalHandlerFactory for taking care of ABORT and STATUS updates for jobs on a cluster. The demux servlet is used as a test bed but is currently just faking communication with the cluster.

comment:15 by Nicklas Nordborg, 10 years ago

(In [2297]) References #533: Add secondary analysis section to Reggie

Added a "console"-like page for manually executing commands via ssh on a remote server.

comment:16 by Nicklas Nordborg, 10 years ago

(In [2298]) References #533: Add secondary analysis section to Reggie

Reorganize a few files.

comment:17 by Nicklas Nordborg, 10 years ago

(In [2299]) References #533: Add secondary analysis section to Reggie

Test for adding a "real" fake job to the cluster queue. The job id is stored as the Job.externalId in BASE and in the 'signal handler'.

The OpenGridSignalHandler class has been updated with "real" support for the "ABORT" signal. Seems to work, but error handling is non-existent.

Added logging to make debugging easier.

comment:18 by Nicklas Nordborg, 10 years ago

(In [2300]) References #533: Add secondary analysis section to Reggie

First attempts to keep track of job status. Added OpenGridCluster.submitJob() to start jobs.

Added OpenGridService.jobStatusTimer and OpenGridCluster.updateJobStatus() that are used for regular checking with the cluster. Sending 'qstat -xml' should get a list of queued and running jobs. Internally we keep a list of job ids that the BASE server has been interested in, and if some of those are not listed by 'qstat' it is probably because the job has ended. We need to get information about that using 'qacct' (to be implemented) and possibly other commands. For debugging purposes we simply set those to DONE since they would otherwise be left in EXECUTING state forever.
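
The bookkeeping described above is essentially a set difference between the tracked job ids and the ids that 'qstat' reports. A minimal Java sketch, assuming the ids have already been parsed out of the 'qstat -xml' output; the class and method names are hypothetical and are not taken from the Reggie code:

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper for illustration; not the actual OpenGridCluster code. */
final class TrackedJobs
{
  // Ids of jobs that the BASE server has asked about
  private final Set<String> tracked = new LinkedHashSet<>();

  void track(String jobId)
  {
    tracked.add(jobId);
  }

  /**
   * Returns the tracked ids that are missing from the ids listed by qstat.
   * Those jobs have probably ended, so they are no longer tracked after
   * this call and can be looked up with qacct instead.
   */
  Set<String> removeEnded(Collection<String> listedByQstat)
  {
    Set<String> ended = new LinkedHashSet<>(tracked);
    ended.removeAll(listedByQstat);
    tracked.removeAll(ended);
    return ended;
  }
}
```

Ids that 'qstat' lists but that were never tracked are simply ignored, which matters when other users submit jobs to the same cluster.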

comment:19 by Nicklas Nordborg, 10 years ago

(In [2301]) References #533: Add secondary analysis section to Reggie

Avoid duplicating status checks with the cluster (when 'qacct' has been implemented) if the BASE server submits status update requests more frequently than the status is actually updated on our end (e.g. if getJobStatus() is called by another thread while updateJobStatus() is being executed by the timer thread).
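
One way to get this behaviour is to let callers read a cached status and only contact the cluster when a minimum interval has passed. The sketch below is an assumption about how such throttling could look, not the code from this changeset; `ThrottledStatus` and its parameters are hypothetical names, and the current time is passed in to keep the example easy to test:

```java
import java.util.function.Supplier;

/** Hypothetical cache around an expensive status check; for illustration only. */
final class ThrottledStatus<T>
{
  private final long minIntervalMillis;
  private final Supplier<T> clusterCheck;
  private T cached;
  private long lastCheckMillis;
  private boolean hasValue;

  ThrottledStatus(long minIntervalMillis, Supplier<T> clusterCheck)
  {
    this.minIntervalMillis = minIntervalMillis;
    this.clusterCheck = clusterCheck;
  }

  /**
   * Returns the cached status unless it is older than the minimum interval.
   * The method is synchronized, so a thread that calls it while another
   * thread is refreshing waits and then reuses the fresh value.
   */
  synchronized T get(long nowMillis)
  {
    if (!hasValue || nowMillis - lastCheckMillis >= minIntervalMillis)
    {
      cached = clusterCheck.get();
      lastCheckMillis = nowMillis;
      hasValue = true;
    }
    return cached;
  }
}
```

With a 60 second interval, any number of status requests within the same minute results in a single round trip to the cluster.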

comment:20 by Nicklas Nordborg, 10 years ago

(In [2302]) References #533: Add secondary analysis section to Reggie

Get job results via 'qacct' command.

comment:21 by Nicklas Nordborg, 10 years ago

(In [2305]) References #533: Add secondary analysis section to Reggie

Generating a script for 'qsub' on the BASE server side that is transmitted to the cluster via scp before it is added to the queue. Need to think a bit about naming conventions and where to place the stdout and stderr files.

comment:22 by Nicklas Nordborg, 10 years ago

(In [2309]) References #533: Add secondary analysis section to Reggie

Using the new ability to access BASE from a service extension (http://base.thep.lu.se/ticket/1799) to update job status.

Improved error handling a bit. If a job fails the first part of stderr is used as an error message to the user.

comment:23 by Nicklas Nordborg, 10 years ago

(In [2310]) References #533: Add secondary analysis section to Reggie

Appending node name to the cluster name when the job is running. The 't' status appears for a short time between 'qw' and 'r' and we handle it the same as 'qw'.
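
The state handling can be sketched as a small lookup. This is only an illustration: the `Status` enum is a local stand-in, and mapping 'qw' to WAITING and 'r' to EXECUTING is an assumption, since the changeset only says that 't' is treated like 'qw':

```java
/** Hypothetical mapping from qstat state codes to a job status. */
final class QstatStates
{
  // Local stand-in for the job status used on the BASE side
  enum Status { WAITING, EXECUTING, UNKNOWN }

  static Status toStatus(String qstatState)
  {
    if (qstatState == null) return Status.UNKNOWN;
    switch (qstatState)
    {
      case "qw": // queued and waiting
      case "t":  // seen briefly between 'qw' and 'r'; handled like 'qw'
        return Status.WAITING;
      case "r":  // running on a node
        return Status.EXECUTING;
      default:   // any other code is left for the caller to decide
        return Status.UNKNOWN;
    }
  }
}
```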

comment:24 by Nicklas Nordborg, 10 years ago

(In [2323]) References #533: Add secondary analysis section to Reggie

Adding possibility to set configuration values for programs used on the cluster.

comment:25 by Nicklas Nordborg, 10 years ago

(In [2324]) References #533: Add secondary analysis section to Reggie

Creating a separate working folder for each job. The configuration file needs to specify a base directory in the attribute 'job-folder'. Each job gets a subdirectory in this folder: <job-folder>/job-name. The auto-generated job script is always named 'job.sh' and the output streams 'stdout' and 'stderr'.
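
The resulting layout can be sketched as follows. The `JobFolder` class is hypothetical, and the use of '/' as separator assumes that the cluster uses Unix-style paths; only the file names 'job.sh', 'stdout' and 'stderr' come from the description above:

```java
/** Hypothetical helper that builds the remote paths for a single job. */
final class JobFolder
{
  private final String directory;

  /**
   * @param jobFolder Value of the 'job-folder' attribute in the configuration
   * @param jobName Name of the job, used as the name of the subdirectory
   */
  JobFolder(String jobFolder, String jobName)
  {
    // Avoid a double slash if the configured folder already ends with one
    String base = jobFolder.endsWith("/") ? jobFolder : jobFolder + "/";
    this.directory = base + jobName;
  }

  String directory() { return directory; }
  String script() { return directory + "/job.sh"; }
  String stdout() { return directory + "/stdout"; }
  String stderr() { return directory + "/stderr"; }
}
```

For a made-up base directory '/home/base/jobs' and a job named 'demux-1', the script ends up at '/home/base/jobs/demux-1/job.sh'.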

comment:26 by Nicklas Nordborg, 10 years ago

(In [2325]) References #533: Add secondary analysis section to Reggie

Waiting 10 seconds after parsing qstat information before calling qacct since it has happened that recently finished jobs are not seen by qacct (e.g. it returns error: job id <id> not found). I hope 10 seconds is enough.

comment:27 by Nicklas Nordborg, 10 years ago

(In [2336]) References #533: Add secondary analysis section to Reggie

Last checkin [2325] didn't solve the problem of missing jobs. It seems there is a delay before information about finished jobs is written to the log file (https://arc.liv.ac.uk/pipermail/gridengine-users/2005-August/006253.html). The default value for the flush_time seems to be 15 seconds.

Instead of changing the Thread.sleep time, the code now accepts a first "job not found" error but if it happens a second time (which it shouldn't since we only check once per minute) the job is set to error status.
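
The "tolerate one miss" rule can be sketched like this. The class and enum names are hypothetical and the real code may keep this state elsewhere, for example on the job object itself:

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical bookkeeping for "job not found" answers from qacct. */
final class NotFoundTolerance
{
  enum Outcome { RETRY_LATER, ERROR }

  // Ids of jobs that qacct has failed to find exactly once
  private final Set<String> missedOnce = new HashSet<>();

  /** Call when qacct reports that the job id was not found. */
  Outcome jobNotFound(String jobId)
  {
    // Set.add() returns true only the first time the id is added
    return missedOnce.add(jobId) ? Outcome.RETRY_LATER : Outcome.ERROR;
  }

  /** Call when qacct returned information about the job. */
  void jobFound(String jobId)
  {
    missedOnce.remove(jobId);
  }
}
```

Since the status is checked once per minute and the flush delay is around 15 seconds, a second miss is a strong sign that something is really wrong with the job.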

comment:28 by Nicklas Nordborg, 10 years ago

Description: modified (diff)

comment:29 by Nicklas Nordborg, 10 years ago

(In [2376]) References #533: Add secondary analysis section to Reggie

Use 'x' instead of 'dx' as suffix for DemuxedSequences.

comment:30 by Nicklas Nordborg, 10 years ago

(In [2390]) References #533: Add secondary analysis section to Reggie

Use the word 'confirm' instead of 'register'

comment:31 by Nicklas Nordborg, 10 years ago

(In [2392]) References #533: Add secondary analysis section to Reggie

Added counter for unconfirmed alignment results.

comment:32 by Nicklas Nordborg, 10 years ago

(In [2400]) References #533: Add secondary analysis section to Reggie

Change '_end' in jsp names to '_confirm'.

comment:33 by Nicklas Nordborg, 10 years ago

(In [2420]) References #533, #547, #548, #593, #595. Renamed FilteredSequences subtype to MaskedSequences and the related software and protocol type. Renamed annotations NumReads to READS and PassedFilterReads to PF_READS and added new annotations for the number of reads at the masked (PM_READS) and aligned (ALIGNED_PAIRS) levels.

Lots of related changes in the code to make class and variable names match the new names.

comment:34 by Nicklas Nordborg, 10 years ago

(In [2510]) References #533: Add secondary analysis section to Reggie

Include reggie-ogs-hosts.xml in distribution.

comment:35 by Nicklas Nordborg, 10 years ago

(In [2512]) References #533: Add secondary analysis section to Reggie

Added support for using InputStream/OutputStream when reading/writing files to remote hosts.

comment:36 by Nicklas Nordborg, 10 years ago

(In [2534]) References #533 and BASE ticket http://base.thep.lu.se/ticket/1824. Call Services.restart() so that the BASE core can catch any exceptions.

comment:37 by Nicklas Nordborg, 10 years ago

(In [2562]) References #533: Add secondary analysis section to Reggie

Installing an SFTP file server item 'ProjectArchive'. This contains login information for connecting to the file server where the 'project archive' is located. The file server can then be used to create external file links to FASTQ and other data files and then link them to the corresponding bioassay items.

comment:38 by Nicklas Nordborg, 10 years ago

(In [2563]) References #533: Add secondary analysis section to Reggie

Use fingerprint instead of BASE64-encoded public key to verify connections to SSH servers. Fingerprints are shorter, easier to handle and more compatible with the new feature in BASE FileServer items.
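
For reference, a fingerprint can be derived from the BASE64-encoded key that was used before. The sketch below assumes the classic MD5 format (16 hex pairs separated by colons); the changeset does not say which fingerprint format Reggie or BASE expects, and `SshFingerprint` is a hypothetical name:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

/** Hypothetical helper; assumes MD5 colon-separated fingerprints. */
final class SshFingerprint
{
  /**
   * @param base64PublicKey The BASE64 part of a public key, without the
   *   key type prefix and without the trailing comment
   * @return The MD5 digest of the decoded key as colon-separated hex pairs
   */
  static String md5(String base64PublicKey)
  {
    byte[] keyBlob = Base64.getDecoder().decode(base64PublicKey);
    byte[] digest;
    try
    {
      digest = MessageDigest.getInstance("MD5").digest(keyBlob);
    }
    catch (NoSuchAlgorithmException ex)
    {
      // MD5 is required to be available on every Java platform
      throw new IllegalStateException(ex);
    }
    StringBuilder sb = new StringBuilder();
    for (byte b : digest)
    {
      if (sb.length() > 0) sb.append(':');
      sb.append(String.format("%02x", b & 0xff));
    }
    return sb.toString();
  }
}
```

The result is always 47 characters long regardless of the key size, which is what makes it easier to store and compare than the full key.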

comment:39 by Nicklas Nordborg, 10 years ago

(In [2564]) References #533: Add secondary analysis section to Reggie

Forgot this file as part of [2562].

comment:40 by Nicklas Nordborg, 10 years ago

(In [2565]) References #533: Add secondary analysis section to Reggie

Installing file types for FASTQ/BAM and associating them with the MergedSequences and AlignedSequences bioassay type.

comment:41 by Nicklas Nordborg, 10 years ago

Description: modified (diff)

comment:42 by Nicklas Nordborg, 10 years ago

Resolution: fixed
Status: assigned → closed

comment:43 by Nicklas Nordborg, 10 years ago

(In [2704]) References #533: Add secondary analysis section to Reggie

Failure to download result files should result in error status for the job.

comment:44 by Nicklas Nordborg, 10 years ago

(In [2709]) References #533: Add secondary analysis section to Reggie

Making the installation wizard more robust in case items have been created in a non-optimal order. E.g. data file type items and some subtypes must be created in a special order, but a developer may already have created some items that are not in the expected order.

comment:45 by Nicklas Nordborg, 10 years ago

(In [2715]) References #533: Add secondary analysis section to Reggie

Installation wizard should only warn if username, etc. is not configured on ProjectArchive file server.
