Opened 11 years ago
Last modified 10 years ago
#533 closed task
Add secondary analysis section to Reggie — at Version 28
Reported by: | Nicklas Nordborg | Owned by: | Nicklas Nordborg |
---|---|---|---|
Priority: | critical | Milestone: | Reggie v2.16 |
Component: | net.sf.basedb.reggie | Keywords: | |
Cc: |
Description
This is the master ticket for adding secondary analysis registration functionality to Reggie. Secondary analysis covers the steps from sequencing until expression values have been generated, including demux and alignment against a reference genome.
Note! Primary analysis is the base calling performed by the Illumina software during the sequencing.
The pipeline will be something like this. See the other tickets (to be created) for more information about each step:
- (#545) Register sequencing as ended. Part of the "Library preparation wizards" section and done by someone in the lab.
- (#546) Confirm sequencing as completed. First wizard in the "Secondary analysis wizards" section. Used to decide if the sequenced data is ok or not. If ok, continue with demuxing, otherwise flag pools for re-sequencing.
- (#547) Start demux and merge. This wizard starts the demux and merge operations.
- (#548) Register demux and merge as ended. At the end we have one "MergedSequences" item for each "Library" from the flow cells that were sequenced. The number of reads for each library must be recorded and is used to determine whether the library needs to be re-sequenced. FASTQ files for each library are stored on the server.
- (#593) Start filtering and alignment. Bowtie and TopHat are used to first filter and then align against a pre-defined set of transcripts.
- Register filtering and alignment as ended. At the end we have one "AlignedSequences" item for each "Library" from the flow cells that were sequenced. BAM files for each library are stored on the server.
- Start feature extraction. Cufflinks is used to calculate expression values.
- Register feature extraction. At the end we have one "RawBioAssay" item for each "Library" from the flow cells that were sequenced. FPKM files are uploaded to BASE and imported into the database.
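As a rough summary of the list above, the three "register ... as ended" steps each produce one item per library plus a file type. The sketch below only restates that mapping; the enum and its names are hypothetical and are not taken from the Reggie code base.

```java
/**
 * Hypothetical summary of the secondary analysis stages described above.
 * The enum itself is an illustration, not part of Reggie.
 */
enum AnalysisStage {
    DEMUX_AND_MERGE("MergedSequences", "FASTQ"),
    FILTER_AND_ALIGN("AlignedSequences", "BAM"),
    FEATURE_EXTRACTION("RawBioAssay", "FPKM");

    private final String itemType;
    private final String fileType;

    AnalysisStage(String itemType, String fileType) {
        this.itemType = itemType;
        this.fileType = fileType;
    }

    /** Name of the item created for each library when the stage has ended. */
    String itemType() {
        return itemType;
    }

    /** Kind of file the stage leaves behind for each library. */
    String fileType() {
        return fileType;
    }
}
```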
Change History (30)
comment:1 by , 11 years ago
Milestone: | Reggie v2.x → Reggie v2.15 |
---|
comment:2 by , 11 years ago
Description: | modified (diff) |
---|---|
Summary: | Add primary analysis section to Reggie → Add secondary analysis section to Reggie |
comment:3 by , 11 years ago
Description: | modified (diff) |
---|
comment:4 by , 11 years ago
Description: | modified (diff) |
---|
comment:5 by , 11 years ago
Description: | modified (diff) |
---|
comment:6 by , 11 years ago
Description: | modified (diff) |
---|
by , 11 years ago
Attachment: | secondary-analysis-overview-v1.pdf added |
---|
comment:7 by , 11 years ago
(In [2196]) References #533: Add secondary analysis section to Reggie
Re-arranged the index page to make room for the "Secondary analysis wizards" section. The page is now divided in three columns with the new section as the middle column.
Added link to "Register sequencing ended" wizard as the last library preparation wizard. Created two new annotations so that the counter can show a correct value.
by , 11 years ago
Attachment: | secondary-analysis-overview-v2.pdf added |
---|
Second version of the overview
comment:8 by , 11 years ago
(In [2225]) References #533: Register sequencing as ended
New version of the 'sequencing ended' wizard. Added an option to specify if the first base report failed. If so, the 'sequencing startup' wizard is enabled again for the flow cell. Other changes are mostly related to the fact that there is only a single parent FlowCell for each SequencingRun.
comment:9 by , 11 years ago
Milestone: | Reggie v2.15 → Reggie v2.16 |
---|
comment:10 by , 11 years ago
comment:11 by , 11 years ago
(In [2293]) References #533: Add secondary analysis section to Reggie
Adding some classes for keeping track of SSH/Open grid scheduler hosts. With a properly configured configuration file (reggie-ogs-hosts.xml) it seems to be possible to connect and execute a simple command on the remote server. Not much error handling though.
comment:12 by , 11 years ago
Status: | new → assigned |
---|
comment:13 by , 11 years ago
comment:14 by , 11 years ago
(In [2296]) References #533: Add secondary analysis section to Reggie
Added a service extension for OpenGridService so that we can control it from the web interface (e.g. reload settings after a configuration change).
Added OpenGridSignalHandlerFactory for taking care of ABORT and STATUS updates for jobs on a cluster. The demux servlet is used as a test bed but is currently just faking communication with the cluster.
comment:15 by , 11 years ago
comment:16 by , 11 years ago
comment:17 by , 11 years ago
(In [2299]) References #533: Add secondary analysis section to Reggie
Test for adding a "real" fake job to the cluster queue. The job id is stored as the Job.externalId in BASE and in the 'signal handler'.
The OpenGridSignalHandler class has been updated with "real" support for the "ABORT" signal. Seems to work, but error handling is non-existent.
Added logging to make debugging easier.
comment:18 by , 11 years ago
(In [2300]) References #533: Add secondary analysis section to Reggie
First attempts to keep track of job status. Added OpenGridCluster.submitJob() to start jobs.
Added OpenGridService.jobStatusTimer and OpenGridCluster.updateJobStatus(), which are used for regular checking with the cluster. Sending 'qstat -xml' should get a list of queued and running jobs.
Internally we keep a list of job ids that the BASE server has been interested in. If some of those are not listed by 'qstat', it is probably because the job has ended. We need to get information about that using 'qacct' (to be implemented) and possibly other commands. For debugging purposes we simply set those to DONE since they would otherwise be left in EXECUTING state forever.
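The "tracked but not listed" check described above is essentially a set difference. A minimal sketch, assuming the job ids have already been parsed out of the 'qstat -xml' output; the class and method names are hypothetical and not the ones used in Reggie:

```java
import java.util.LinkedHashSet;
import java.util.Set;

/**
 * Hypothetical helper, not Reggie's actual code: works out which tracked
 * job ids are no longer reported by the cluster.
 */
final class MissingJobFinder {

    private MissingJobFinder() {}

    /**
     * @param trackedIds ids of jobs the BASE server has asked about
     * @param listedIds  ids assumed to be parsed from 'qstat -xml' output
     * @return tracked ids that are absent from the qstat listing
     */
    static Set<String> findMissing(Set<String> trackedIds, Set<String> listedIds) {
        // Copy first so that the caller's set of tracked ids is left untouched.
        Set<String> missing = new LinkedHashSet<>(trackedIds);
        missing.removeAll(listedIds);
        return missing;
    }
}
```

The returned ids are the candidates that would need a follow-up lookup (for example with 'qacct') before a final status can be set.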
comment:19 by , 11 years ago
(In [2301]) References #533: Add secondary analysis section to Reggie
Avoid duplicating status checks with the cluster (when 'qacct' has been implemented) if the BASE server is submitting status update requests with a higher frequency than the actual status updating is happening on our end (e.g. if getJobStatus() is called by another thread while updateJobStatus() is executed by the timer thread).
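One way to get this effect is to serialize the check and skip it when a recent one has already run. The sketch below illustrates that idea only; the class, the minimum-interval parameter and the injected clock are assumptions made for the example and do not describe how the changeset actually implements it.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch: runs the (expensive) cluster check at most once per
 * interval, even if several threads ask for a status update at once.
 */
final class ThrottledStatusUpdater {

    private final long minIntervalMillis;
    private final LongSupplier clock;     // injected so the sketch is testable
    private final Runnable clusterCheck;  // stands in for the real status check
    private long lastCheckMillis;
    private boolean checkedOnce;

    ThrottledStatusUpdater(long minIntervalMillis, LongSupplier clock, Runnable clusterCheck) {
        this.minIntervalMillis = minIntervalMillis;
        this.clock = clock;
        this.clusterCheck = clusterCheck;
    }

    /**
     * @return true if the cluster was contacted, false if a recent check was reused
     */
    synchronized boolean updateIfStale() {
        long now = clock.getAsLong();
        if (checkedOnce && now - lastCheckMillis < minIntervalMillis) {
            // Another caller (e.g. the timer thread) checked recently.
            return false;
        }
        clusterCheck.run();
        lastCheckMillis = now;
        checkedOnce = true;
        return true;
    }
}
```

Because the method is synchronized, a caller that arrives while a check is running waits for it to finish and then sees that the result is fresh, instead of starting a second check.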
comment:20 by , 11 years ago
comment:21 by , 11 years ago
comment:22 by , 11 years ago
(In [2309]) References #533: Add secondary analysis section to Reggie
Using the new ability to access BASE from a service extension (http://base.thep.lu.se/ticket/1799) to update job status.
Improved error handling a bit. If a job fails the first part of stderr is used as an error message to the user.
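Using "the first part of stderr" as the message could look roughly like the sketch below. The helper name and the length limit are hypothetical; the changeset does not say how much of stderr is kept.

```java
/**
 * Hypothetical sketch: shortens captured stderr output so that it can be
 * shown to the user as an error message.
 */
final class ErrorMessages {

    private ErrorMessages() {}

    /**
     * @param stderr   captured stderr output, may be null
     * @param maxChars assumed upper limit for the message length
     * @return the trimmed output, cut off after maxChars characters
     */
    static String firstPart(String stderr, int maxChars) {
        String trimmed = stderr == null ? "" : stderr.trim();
        if (trimmed.length() <= maxChars) {
            return trimmed;
        }
        return trimmed.substring(0, maxChars);
    }
}
```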
comment:23 by , 11 years ago
comment:24 by , 11 years ago
comment:25 by , 11 years ago
(In [2324]) References #533: Add secondary analysis section to Reggie
Creating a separate working folder for each job. The configuration file needs to specify a base directory in the attribute 'job-folder'. Each job gets a subdirectory in this folder: <job-folder>/job-name. The auto-generated job script is always named 'job.sh' and the output streams are named 'stdout' and 'stderr'.
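The folder layout can be sketched as below. The class is hypothetical, and it assumes that 'stdout' and 'stderr' are files placed next to 'job.sh' inside the job's own subdirectory.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Hypothetical sketch of the per-job working folder layout:
 * <job-folder>/job-name/{job.sh, stdout, stderr}
 */
final class JobFolderLayout {

    final Path folder;
    final Path script;
    final Path stdout;
    final Path stderr;

    /**
     * @param jobFolder value of the 'job-folder' attribute in the configuration
     * @param jobName   name of the job, used as the subdirectory name
     */
    JobFolderLayout(String jobFolder, String jobName) {
        this.folder = Paths.get(jobFolder).resolve(jobName);
        this.script = folder.resolve("job.sh");
        // Assumption: the output streams are files in the same folder.
        this.stdout = folder.resolve("stdout");
        this.stderr = folder.resolve("stderr");
    }
}
```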
comment:26 by , 11 years ago
comment:27 by , 11 years ago
(In [2336]) References #533: Add secondary analysis section to Reggie
Last checkin [2325] didn't solve the problem of missing jobs. Seems like there is a delay before information about finished jobs is written to the log file (https://arc.liv.ac.uk/pipermail/gridengine-users/2005-August/006253.html). The default value for the flush_time seems to be 15 seconds.
Instead of changing the Thread.sleep time, the code now accepts a first "job not found" error, but if it happens a second time (which it shouldn't since we only check once per minute) the job is set to error status.
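The "tolerate one miss, fail on the second" rule can be sketched with a per-job counter. All names below are hypothetical and only illustrate the rule described above, not the actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: accepts the first "job not found" result for a job
 * and reports an error only if the job is still not found on the next check.
 */
final class NotFoundTracker {

    enum Outcome { RETRY_LATER, ERROR }

    private final Map<String, Integer> misses = new HashMap<>();

    /** Called when the cluster has no information about the job. */
    Outcome jobNotFound(String jobId) {
        int count = misses.merge(jobId, 1, Integer::sum);
        return count >= 2 ? Outcome.ERROR : Outcome.RETRY_LATER;
    }

    /** Called when information about the job was found; clears earlier misses. */
    void jobFound(String jobId) {
        misses.remove(jobId);
    }
}
```

With checks running once per minute and a flush delay of about 15 seconds, a second miss in a row should be rare, which is why it is treated as a real error in this sketch.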
comment:28 by , 11 years ago
Description: | modified (diff) |
---|
Overview PDF v1