#547 closed task (fixed)
Start Demux and Merge
Reported by: | Nicklas Nordborg | Owned by: | Nicklas Nordborg |
---|---|---|---|
Priority: | major | Milestone: | Reggie v2.16 |
Component: | net.sf.basedb.reggie | Keywords: | |
Cc: |
Description (last modified by )
Part of #533.
This wizard is started after confirming a successful sequencing. The wizard should show SequencingRun items that have a SequencingResult=Succes annotation but no child DemuxedSequences item. It should also be possible to manually select a SequencingRun if the demux needs to be re-done with different parameters.
The wizard creates one DemuxedSequences item so that we can keep track of the demux parameters and software:
- Name of item: SequencingRunNNN.dx, SequencingRunNNN.dx2, ...
- Software and protocol: (Type=Demuxing)
- ReadString: Annotation automatically calculated from the SequencingCycles annotation on the parent SequencingRun item
The wizard also creates one MergedSequences item for each Library that has been sequenced:
- Name of item: <lib-name>.g, <lib-name>.g2
- Software and protocol: (Type=Merging)
The MergedSequences items are child items to the DemuxedSequences item.
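The naming scheme above (.dx, .dx2, ... and .g, .g2, ...) can be sketched as follows. This is only an illustration, not the Reggie implementation: the function name `next_item_name` is hypothetical, and picking the first unused counter is an assumption, since the ticket only lists the pattern of names.

```python
def next_item_name(base, suffix, existing_names):
    """Return the first unused name in the series base.suffix, base.suffix2, ...

    Assumption: the counter starts at 2 for the second item, matching the
    '.dx, .dx2' and '.g, .g2' examples in the ticket description.
    """
    candidate = f"{base}.{suffix}"
    counter = 1
    while candidate in existing_names:
        counter += 1
        candidate = f"{base}.{suffix}{counter}"
    return candidate


# Example: a re-demux of a run that already has a '.dx' item.
existing = {"SequencingRun001.dx"}
print(next_item_name("SequencingRun001", "dx", existing))
```

The same helper covers MergedSequences names when called with the library name and the suffix "g".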
This should give us enough information to be able to start the demux and merge scripts.
Parameters for the demux program:
- Data files folder (= DataFilesFolder annotation on SequencingRun)
- Sample sheet (exported from BASE and saved by operator to correct place on cluster). To be able to export the sample sheet the following info is needed
  - Read string (= ReadString annotation on DemuxedSequences)
  - Project (user input; default=generated from current project in BASE)
  - Sequencing center (user input; default=LuBMC)
  - Width (user input; default=100)
  - Concentration (user input; default=12pM)
- Read string (=
In the first version the wizard could generate scripts for running the demux. One demux command is required for each flow cell. In a future version it would be nice to submit the job to the cluster automatically.
Parameters for the merge program:
Currently no information about this.
Change History (55)
comment:1 by , 11 years ago
Description: | modified (diff) |
---|
comment:2 by , 11 years ago
Milestone: | Reggie v2.15 → Reggie v2.16 |
---|
comment:3 by , 11 years ago
(In [2267]) References #547: Start Demux and Merge
First prototype of the 'start demux and merge' wizard. It allows the user to select one or more SequencingRun items and configure (some) parameters for the demux. The final registration creates DemuxedSequences (1 per selected sequencing run) and MergedSequences (1 per unique library) items.
Final output is a list of shell commands to run, and these are also recorded as Job items. Don't know how useful this is at the moment. I guess we need to know more about how the result of the demux/merge is supposed to get back to BASE.
comment:4 by , 11 years ago
comment:5 by , 11 years ago
Status: | new → assigned |
---|
comment:6 by , 11 years ago
comment:7 by , 11 years ago
(In [2333]) References #547: Start Demux and Merge
Export and upload sample sheet file to cluster. Currently the export uses default settings, but in the future it needs to use information from the DemuxedSequences and MergedSequences items (e.g. ReadString, OmitLanes, names, etc.).
Moved the job submission to a separate transaction after creating the main items, since the exporter needs committed information to do its work.
comment:8 by , 11 years ago
(In [2348]) References #547: Start Demux and Merge
FlowCellSampleSheetExporter can now use information about demuxed/merged sequences when generating the sample sheet file. Instead of library names it uses the MergedSequences names, which avoids overwriting existing files if a library is re-sequenced (or a flow cell is re-demultiplexed).
The real demux command is now generated and sent to the job cluster (with --debug flag). Seems to work, but problems are most likely not caught automatically, since the script seems to report success (exit status=0) even if it fails. But #548 still needs to implement a manual inspection before the result is accepted, so this can probably be caught at that level.
comment:9 by , 11 years ago
(In [2356]) References #547: Start Demux and Merge
Create a single job on the cluster for demultiplexing and merging the selected flow cells. The reason is that then we can use a temporary location for storing BAM files between the demultiplex and merge step. The merged FASTQ files are uploaded to the "project_archive" folder and the BAM files are deleted. (Note! the merge is not yet implemented)
More configuration options in the reggie-ogs.xml file to allow detailed control of program paths and parameters.
comment:10 by , 11 years ago
(In [2359]) References #547: Start Demux and Merge
Added call to consolidate_bamfiles2fastq.sh for converting and merging BAM files to FASTQ files. The FASTQ files are currently left in the "temporary" folder where the BAM files are created. Still need to implement 'rsync' to move them to the final location.
Also prepared for some progress reporting back to BASE by writing log statements to a file in the job folder.
comment:11 by , 11 years ago
(In [2361]) References #547: Start Demux and Merge
Implemented primitive progress reporting using a file that is parsed for a percentage and current status message.
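A minimal sketch of such a progress parser is shown below. The file format is not described in this ticket, so the layout used here (one status per line, a percentage followed by a message, last line wins) and the name `parse_progress` are assumptions made for illustration only.

```python
import re

# Hypothetical line layout: "<percent>[%] <status message>", e.g. "45% Merging FASTQ files".
PROGRESS_LINE = re.compile(r"^\s*(\d{1,3})%?\s+(.*\S)\s*$")


def parse_progress(text):
    """Return (percent, message) from the last line that looks like a progress entry.

    Returns None when no line matches, so the caller can keep the previous status.
    """
    for line in reversed(text.splitlines()):
        match = PROGRESS_LINE.match(line)
        if match:
            percent = min(int(match.group(1)), 100)  # clamp malformed values
            return percent, match.group(2)
    return None
```

Reading the file from the end means only the most recent status is reported, which is enough for a percentage and a single status message.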
Query the cluster for the current time to calculate the time difference between the BASE and cluster server. The time difference is used to correct the start and end time for the job.
Implemented final rsync of FASTQ files to the TransferDir directory.
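The clock correction mentioned in this comment can be illustrated with a small sketch. It assumes both clocks are read at (nearly) the same moment and ignores network latency; the function names are hypothetical and not taken from the Reggie code.

```python
from datetime import datetime, timedelta


def clock_offset(base_now, cluster_now):
    """Offset of the cluster clock relative to the BASE server clock."""
    return cluster_now - base_now


def to_base_time(cluster_time, offset):
    """Convert a timestamp reported by the cluster to BASE server time."""
    return cluster_time - offset


# Example: the cluster clock runs 42 seconds ahead of the BASE server.
offset = clock_offset(datetime(2014, 3, 1, 12, 0, 0), datetime(2014, 3, 1, 12, 0, 42))
job_end_on_cluster = datetime(2014, 3, 1, 13, 0, 42)
print(to_base_time(job_end_on_cluster, offset))
```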
comment:12 by , 11 years ago
comment:14 by , 11 years ago
(In [2365]) References #547: Start Demux and Merge
The generated script for demux now collects information about number of reads/passed filter statistics in the 'demultiplex_metrics.txt' file in the job folder.
Added a JobCompletionHandler interface for callbacks after a job has finished. Implemented a handler for the demux job that parses 'demultiplex_metrics.txt' and adds NumReads and PassedFilterReads annotations to merged items.
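As a rough sketch of what such a handler has to do, the snippet below parses a metrics file into per-barcode read counts. The exact layout of 'demultiplex_metrics.txt' is not given in this ticket; the code assumes a Picard-style metrics table (lines starting with '#' are headers, followed by a tab-separated table with BARCODE_NAME, READS and PF_READS columns). The function name is hypothetical.

```python
import csv
import io


def parse_demux_metrics(text):
    """Map barcode name -> (READS, PF_READS).

    Assumption: '#'-prefixed and blank lines are metadata; the first remaining
    line is a tab-separated header row.
    """
    rows = [line for line in text.splitlines()
            if line.strip() and not line.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(rows)), delimiter="\t")
    return {
        row["BARCODE_NAME"]: (int(row["READS"]), int(row["PF_READS"]))
        for row in reader
    }
```

The resulting dictionary could then be matched against the MergedSequences names to set the two annotations.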
comment:15 by , 11 years ago
(In [2372]) References #547: Start Demux and Merge
Better error handling when starting the demux job. If something happens during sample sheet export or if job fails to be queued, the job is set in error state.
Added setting the '-pe' option to qsub. This is needed to prevent too many jobs from starting on the same node at the same time. Without this option a job uses a single slot and since most nodes have 16 or 24 slots this would cause disk reading/writing to be very slow.
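For illustration, building the submit command with a parallel environment request could look like the sketch below. `qsub -pe <environment> <slots>` is the Grid Engine syntax; the environment name "smp", the slot count and the script name used in the example are placeholders, not values from the Reggie configuration.

```python
def build_qsub_command(script_path, parallel_env, slots):
    """Return the argument list for submitting a job that reserves several slots.

    Reserving more than one slot keeps the scheduler from packing many
    disk-heavy jobs onto the same node.
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    return ["qsub", "-pe", parallel_env, str(slots), script_path]


# Placeholder values; the real names come from the cluster configuration.
print(build_qsub_command("demux.sh", "smp", 16))
```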
comment:16 by , 11 years ago
comment:17 by , 11 years ago
comment:18 by , 11 years ago
comment:19 by , 11 years ago
comment:20 by , 11 years ago
comment:21 by , 11 years ago
(In [2386]) References #547: Start Demux and Merge
Added project name parameter so we have more control of where BAM files and FASTQ files are stored. This means we can find demultiplex_metrics*.txt files in the correct folder (not Undetermined) and that FASTQ files can go elsewhere if needed.
The <fastq-archive> configuration option was replaced with <project-archive> and the project name should no longer be included in the path.
Some minor changes to the generated scripts (comments and white-space).
comment:22 by , 11 years ago
(In [2388]) References #547: Start Demux and Merge
Storing the FASTQ folder as a DataFilesFolder annotation on MergedSequences items so that we can find them in the filter step. The path is relative to the <project-archive> folder from the configuration file.
This setting was moved out of the <consolidate_bamfiles2fast> section since it is more of a global setting used in multiple steps.
comment:23 by , 11 years ago
comment:24 by , 11 years ago
(In [2420]) References #533, #547, #548, #593, #595. Renamed the FilteredSequences subtype to MaskedSequences and the related software and protocol type. Renamed annotations NumReads to READS and PassedFilterReads to PF_READS and added new annotations for the number of reads at the masked (PM_READS) and aligned (ALIGNED_PAIRS) levels.
Lots of related changes in the code to make class and variable names match the new names.
comment:25 by , 11 years ago
comment:26 by , 11 years ago
comment:27 by , 11 years ago
comment:28 by , 11 years ago
comment:29 by , 11 years ago
comment:30 by , 11 years ago
(In [2441]) References #547: Start Demux and Merge
- Do not start multiple demux processes at the same time since it seems to slow down things
- Use 'set -e' to make sure the job script exits when a command fails. Otherwise it may run to the end and report a successful demuxing, report expected READS and PF_READS values, and even produce some FASTQ files, but they only have partial data in them.
- Do not redirect stderr from picard to the *.out file. Better if it ends up in the global stderr file so BASE can show some error information besides the exit code
comment:31 by , 11 years ago
comment:32 by , 11 years ago
comment:33 by , 11 years ago
comment:34 by , 11 years ago
comment:35 by , 11 years ago
comment:36 by , 11 years ago
comment:37 by , 11 years ago
(In [2464]) References #547: Start Demux and Merge
Do not set TILE_LIMIT for ExtractIlluminaBarcodes. It doesn't save much time and there should now be enough disk space on test node (and we don't have to fix support for this in Picard, since it is not in the default distribution).
Set default log level to WARNING.
comment:38 by , 11 years ago
(In [2497]) References #547: Start Demux and Merge
Added trimmomatic to the script. This is called after the merge. Since the merge and the copy to project_archive previously happened at the same time, this needed some bigger changes.
After the demux all FASTQ files are in the FASTQ folder (we can no longer use the COMPRESS_OUTPUTS option since trimmomatic doesn't like concatenated gz files).
The second step is to merge and compress the FASTQ files into the 'fastq.merged' directory.
The third step runs trimmomatic, which saves the modified FASTQ files (compressed) in the 'fastq.trimmomatic' directory. Options for trimmomatic are configured in the reggie-ogs-hosts.xml file. Information about what has happened is written to a log file 'trimmomatic.out' in the job folder. When the job has been completed this information is parsed and the number of surviving reads is stored in the PT_READS annotation.
Last step is to simply copy the generated files to the project_archive.
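Parsing the trimmomatic log for surviving reads could be sketched as below. It assumes paired-end mode, where trimmomatic prints a summary line starting with "Input Read Pairs:"; which of the reported counts ends up in PT_READS is not stated in this ticket, so returning the "Both Surviving" count is an assumption, and the function name is hypothetical.

```python
import re

# Summary line printed by trimmomatic in paired-end mode, for example:
# Input Read Pairs: 1000 Both Surviving: 900 (90.00%) Forward Only Surviving: ...
SUMMARY = re.compile(r"Input Read Pairs:\s*(\d+)\s+Both Surviving:\s*(\d+)")


def parse_trimmomatic_summary(log_text):
    """Return input and surviving pair counts, or None if no summary line is found."""
    match = SUMMARY.search(log_text)
    if match is None:
        return None
    return {
        "input_pairs": int(match.group(1)),
        "both_surviving": int(match.group(2)),
    }
```

Returning None for a missing summary line lets the caller treat an incomplete log as an error instead of silently storing a zero.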
comment:39 by , 11 years ago
comment:40 by , 10 years ago
(In [2560]) References #547 and #593. Do not use more threads than the number of slots that have been assigned by the queue system.
The number of slots that have been assigned is present in the NSLOTS environment variable and this is compared to the number of cores on the node. The smaller number is selected.
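The selection rule can be sketched as follows. The generated job scripts are shell scripts, so this Python version only illustrates the logic; the fallback to the core count when NSLOTS is missing or not a number is an assumption, and the function name is hypothetical.

```python
import os


def thread_count(environ=os.environ, cores=None):
    """Return the smaller of the assigned slots (NSLOTS) and the node's core count."""
    if cores is None:
        cores = os.cpu_count() or 1
    try:
        slots = int(environ.get("NSLOTS", ""))
    except ValueError:
        # Assumption: NSLOTS unset or invalid means we are not running under the queue.
        return cores
    return max(1, min(slots, cores))


# Example: 4 slots assigned on a 16-core node.
print(thread_count({"NSLOTS": "4"}, cores=16))
```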
comment:41 by , 10 years ago
comment:42 by , 10 years ago
comment:43 by , 10 years ago
comment:44 by , 10 years ago
comment:45 by , 10 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
comment:46 by , 10 years ago
comment:47 by , 10 years ago
comment:48 by , 10 years ago
comment:49 by , 10 years ago
(In [2636]) References #571 and #547 and #548. The confirm demux wizard has been updated to the new script and style pattern.
Also added an option to delete items after a failed job and fixed the script so that it makes sure that the job, work and archive folders are empty before trying to use them.
comment:50 by , 10 years ago
comment:51 by , 10 years ago
comment:52 by , 10 years ago
(In [2680]) References #547: Start Demux and Merge
Calling bowtie to align a small portion of the reads (default 100000, can be configured in reggie-ogs-hosts.xml). From the generated SAM file we can then estimate the average and standard deviation of the fragment size.
This information is imported into BASE as FragmentSizeAvg and FragmentSizeStdev annotations on MergedSequences items.
The information is also saved alongside the fastq files in the project archive.
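One way to estimate these two values from a SAM file is sketched below. It uses the TLEN column (field 9) of properly paired alignments and counts each pair once by keeping only positive values. Whether the Reggie script uses the same filter, or the sample rather than the population standard deviation, is not stated in this ticket, so treat both as assumptions; the function name is hypothetical.

```python
import statistics


def fragment_size_stats(sam_lines):
    """Return (mean, stdev) of fragment sizes, or None with fewer than two pairs."""
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete alignment record
        flag = int(fields[1])
        tlen = int(fields[8])
        # 0x2 = read mapped in a proper pair; positive TLEN selects one mate per pair.
        if flag & 0x2 and tlen > 0:
            sizes.append(tlen)
    if len(sizes) < 2:
        return None
    return statistics.mean(sizes), statistics.stdev(sizes)
```

Called on an open file, for example `fragment_size_stats(open("aligned.sam"))`, it reads the alignments line by line; the file name here is a placeholder.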
Updated information about parameters for the demux program.