Opened 11 years ago

Closed 10 years ago

Last modified 10 years ago

#547 closed task (fixed)

Start Demux and Merge

Reported by: Nicklas Nordborg
Owned by: Nicklas Nordborg
Priority: major
Milestone: Reggie v2.16
Component: net.sf.basedb.reggie
Keywords:
Cc:

Description (last modified by Nicklas Nordborg)

Part of #533.

This wizard is started after confirming a successful sequencing. The wizard should show SequencingRun items that have a SequencingResult=Success annotation but no child DemuxedSequences item. It should also be possible to manually select a SequencingRun if the demux needs to be redone with different parameters.

The wizard creates one DemuxedSequences item so that we can keep track of the demux parameters and software:

  • Name of item: SequencingRunNNN.dx, SequencingRunNNN.dx2 ...
  • Software and protocol: (Type=Demuxing)
  • ReadString: Annotation automatically calculated from SequencingCycles annotation on the parent SequencingRun item

The wizard also creates one MergedSequences for each Library that has been sequenced:

  • Name of item: <lib-name>.g, <lib-name>.g2
  • Software and protocol: (Type=Merging)

The MergedSequences items are child items to the DemuxedSequences item.
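The naming scheme above (first item gets the plain suffix, re-runs get a numbered one) can be sketched as follows. This is only an illustration in Python, not the Reggie implementation; the function name `next_item_name` and the idea of passing in the set of already used names are assumptions.

```python
def next_item_name(base_name, suffix, existing_names):
    """Return '<base_name>.<suffix>', or '<base_name>.<suffix>N' (N = 2, 3, ...)
    if earlier names are already taken.

    Hypothetical helper: 'existing_names' is assumed to be the set of item
    names that already exist for this parent item.
    """
    candidate = f"{base_name}.{suffix}"
    counter = 1
    while candidate in existing_names:
        counter += 1
        candidate = f"{base_name}.{suffix}{counter}"
    return candidate


# First demux of a sequencing run, then a re-demux of the same run.
first = next_item_name("SequencingRun001", "dx", set())
second = next_item_name("SequencingRun001", "dx", {first})
```

The same helper would cover the MergedSequences names by using the library name and the suffix "g".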

This should give us enough information to be able to start the demux and merge scripts.

Parameters for the demux program:

  • Data files folder (=DataFilesFolder annotation on SequencingRun)
  • Sample sheet (exported from BASE and saved by the operator to the correct place on the cluster). To be able to export the sample sheet, the following information is needed:
    • Read string (=ReadString annotation on DemuxedSequences)
    • Project (user input; default=generated from current project in BASE)
    • Sequencing center (user input; default=LuBMC)
    • Width (user input; default=100)
    • Concentration (user input; default=12pM)

In the first version the wizard could generate scripts for running the demux. One demux command is required for each flow cell. In a future version it would be nice to submit the job to the cluster automatically.

Parameters for the merge program:

Currently no information about this.

Change History (55)

comment:1 by Nicklas Nordborg, 11 years ago

Description: modified (diff)

Updated information about parameters for the demux program.

comment:2 by Nicklas Nordborg, 11 years ago

Milestone: Reggie v2.15 → Reggie v2.16

comment:3 by Nicklas Nordborg, 11 years ago

(In [2267]) References #547: Start Demux and Merge

First prototype of the 'start demux and merge' wizard. It allows the user to select one or more SequencingRun items and configure (some) parameters for the demux. The final registration creates DemuxedSequences (1 per selected sequencing run) and MergedSequences (1 per unique library) items.

Final output is a list of shell commands to run, and these are also recorded as Job items. Don't know how useful this is at the moment. I guess we need to know more about how the result of the demux/merge is supposed to get back to BASE.

comment:4 by Nicklas Nordborg, 11 years ago

(In [2268]) References #547: Start Demux and Merge

Forgot this file in last checkin [2267].

comment:5 by Nicklas Nordborg, 11 years ago

Status: new → assigned

comment:6 by Nicklas Nordborg, 11 years ago

(In [2320]) References #547: Start Demux and Merge

Added option to select Open Grid cluster and omit lanes when demuxing.

comment:7 by Nicklas Nordborg, 11 years ago

(In [2333]) References #547: Start Demux and Merge

Export and upload sample sheet file to cluster. Currently the export uses default settings, but in the future it needs to use information from the DemuxedSequences and MergedSequences items (e.g. ReadString, OmitLanes, names, etc.).

Moved the job submission to a separate transaction after creating main items, since the exporter needs committed information to do its work.

comment:8 by Nicklas Nordborg, 11 years ago

(In [2348]) References #547: Start Demux and Merge

FlowCellSampleSheetExporter can now use information about demuxed/merged sequences when generating the sample sheet file. Instead of library names it uses the MergedSequences names, which avoids overwriting existing files if a library is re-sequenced (or a flow cell is re-demultiplexed).

The real demux command is now generated and sent to the job cluster (with --debug flag). Seems to work, but problems are most likely not caught automatically, since the script seems to report success (exitstatus=0) even if it fails. But #548 still needs to implement a manual inspection before the result is accepted, so this can probably be caught at that level.

comment:9 by Nicklas Nordborg, 11 years ago

(In [2356]) References #547: Start Demux and Merge

Create a single job on the cluster for demultiplexing and merging the selected flow cells. The reason is that then we can use a temporary location for storing BAM files between the demultiplex and merge step. The merged FASTQ files are uploaded to the "project_archive" folder and the BAM files are deleted. (Note! the merge is not yet implemented)

More configuration options in the reggie-ogs.xml file to allow detailed control of program paths and parameters.

comment:10 by Nicklas Nordborg, 11 years ago

(In [2359]) References #547: Start Demux and Merge

Added call to consolidate_bamfiles2fastq.sh for converting and merging BAM files to FASTQ files. The FASTQ files are currently left at the "temporary" folder where the BAM files are created. Still need to implement 'rsync' to move them to final location.

Also prepared for some progress reporting back to BASE by writing log statements to a file in the job folder.

comment:11 by Nicklas Nordborg, 11 years ago

(In [2361]) References #547: Start Demux and Merge

Implemented primitive progress reporting using a file that is parsed for a percentage and current status message.
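A minimal sketch of this kind of progress parsing is shown below. The ticket does not describe the file format, so the layout used here (each line starts with a percentage, optionally followed by '%', and then a status message) is an assumption made for the example, as is the function name `parse_progress`.

```python
import re

# Assumed line layout (not taken from the ticket): "<percent>[%] <message>"
PROGRESS_LINE = re.compile(r"^\s*(\d{1,3})\s*%?\s+(.*\S)\s*$")


def parse_progress(text):
    """Return (percent, message) from the last parseable line, or None.

    The last matching line wins, since a job script would typically
    append a new line each time it reaches a new step.
    """
    for line in reversed(text.splitlines()):
        match = PROGRESS_LINE.match(line)
        if match:
            percent = min(int(match.group(1)), 100)  # clamp odd values
            return percent, match.group(2)
    return None


status = parse_progress("5 Starting\n40% Demultiplexing lane 3\n")
```

Returning None for an empty or unreadable file lets the caller keep the previously known progress instead of resetting it.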

Query the cluster for the current time to calculate the time difference between the BASE and cluster server. The time difference is used to correct the start and end time for the job.
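The clock correction can be illustrated with a small sketch. It assumes both clocks are read at (nearly) the same moment and ignores the round-trip delay of the query; the function names are hypothetical and not part of Reggie.

```python
from datetime import datetime


def clock_offset(base_now, cluster_now):
    """Offset to add to a cluster timestamp to express it in BASE server time."""
    return base_now - cluster_now


def to_base_time(cluster_time, offset):
    """Convert a timestamp reported by the cluster to BASE server time."""
    return cluster_time + offset


# Example: the cluster clock is 30 seconds behind the BASE server.
offset = clock_offset(datetime(2014, 1, 1, 12, 0, 0), datetime(2014, 1, 1, 11, 59, 30))
job_start = to_base_time(datetime(2014, 1, 1, 11, 0, 0), offset)
```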

Implemented final rsync of FASTQ files to the TransferDir directory.

comment:12 by Nicklas Nordborg, 11 years ago

(In [2362]) References #547: Start Demux and Merge

Added ScriptBuilder for more stable generation of shell scripts.

Added configuration options for the consolidate BAM -> FASTQ step.

comment:13 by Nicklas Nordborg, 11 years ago

(In [2363]) References #547: Start Demux and Merge

Cleaning up a bit.

comment:14 by Nicklas Nordborg, 11 years ago

(In [2365]) References #547: Start Demux and Merge

The generated script for demux now collects information about number of reads/passed filter statistics in the 'demultiplex_metrics.txt' file in the job folder.

Added a JobCompletionHandler interface for callbacks after a job has finished. Implemented a handler for the demux job that parses the 'demultiplex_metrics.txt' file and adds NumReads and PassedFilterReads annotations to merged items.

comment:15 by Nicklas Nordborg, 11 years ago

(In [2372]) References #547: Start Demux and Merge

Better error handling when starting the demux job. If something happens during sample sheet export or if job fails to be queued, the job is set in error state.

Added setting the '-pe' option to qsub. This is needed to prevent too many jobs from starting on the same node at the same time. Without this option a job uses a single slot and since most nodes have 16 or 24 slots this would cause disk reading/writing to be very slow.

comment:16 by Nicklas Nordborg, 11 years ago

(In [2373]) References #547: Start Demux and Merge

Summarize number of reads/passed filter per flow cell and put that as annotations on the DemuxedSequences items.

comment:17 by Nicklas Nordborg, 11 years ago

(In [2377]) References #547: Start Demux and Merge

Added 'debug' option to the wizard.

comment:18 by Nicklas Nordborg, 11 years ago

(In [2378]) References #547: Start Demux and Merge

Added 'priority' option to the wizard.

comment:19 by Nicklas Nordborg, 11 years ago

(In [2384]) References #547: Start Demux and Merge

For now, always use the tmp folder as the BAM folder, since trying to set them to different locations messed things up.

Try to read progress the first time the job enters EXECUTING status.

comment:20 by Nicklas Nordborg, 11 years ago

(In [2385]) References #547: Start Demux and Merge

Debug option is automatically selected if debug flag is set or not running on the production server (=https protocol).

comment:21 by Nicklas Nordborg, 11 years ago

(In [2386]) References #547: Start Demux and Merge

Added project name parameter so we have more control of where BAM files and FASTQ files are stored. This means we can find demultiplex_metrics*.txt files in the correct folder (not Undetermined) and that FASTQ files can go elsewhere if needed.

The <fastq-archive> configuration option was replaced with <project-archive> and the project name should no longer be included in the path.

Some minor changes to the generated scripts (comments and white-space).

comment:22 by Nicklas Nordborg, 11 years ago

(In [2388]) References #547: Start Demux and Merge

Storing FASTQ folder as DataFilesFolder annotation on MergedSequences items so that we can find them in the filter step. The path is relative to the <project-archive> folder from the configuration file.

This setting was moved out of the <consolidate_bamfiles2fast> section since it is more of a global setting used in multiple steps.

comment:23 by Nicklas Nordborg, 11 years ago

(In [2391]) References #547: Start Demux and Merge

Disable all form controls in the last step.

comment:24 by Nicklas Nordborg, 11 years ago

(In [2420]) References #533, #547, #548, #593, #595. Renamed FilteredSequences subtype to MaskedSequences and the related software and protocol type. Renamed annotations NumReads to READS and PassedFilterReads to PF_READS and added new annotation for number of reads on the masked (PM_READS) and aligned level (ALIGNED_PAIRS).

Lots of related changes in the code to make class and variable names match the new names.

comment:25 by Nicklas Nordborg, 11 years ago

(In [2422]) References #547: Start Demux and Merge

Export barcode files needed by Picard instead of the illumina wrapper. Early exit in the script since we need to change a lot of things in the script to use Picard.

comment:26 by Nicklas Nordborg, 11 years ago

(In [2429]) References #547: Start Demux and Merge

Switched from illumina2bam_wrapper to picard for demultiplexing.

comment:27 by Nicklas Nordborg, 11 years ago

(In [2434]) References #547: Start Demux and Merge

Adding PF_NNNN_PCT, PF_UNUSED_PCT and DEMUX_WARNINGS annotation to be used for troubleshooting problems with the demux.

comment:28 by Nicklas Nordborg, 11 years ago

(In [2435]) References #547: Start Demux and Merge

Saving demultiplex_metrics file to the BASE file system and attaching it to the DemuxedSequences item.

comment:29 by Nicklas Nordborg, 11 years ago

(In [2436]) References #547: Start Demux and Merge

Do not skip last base in barcode read when demultiplexing NextSeq data.

comment:30 by Nicklas Nordborg, 10 years ago

(In [2441]) References #547: Start Demux and Merge

  • Do not start multiple demux processes at the same time since it seems to slow down things
  • Use 'set -e' to make sure the job script exits when a command fails. Otherwise it may run to the end and report a successful demuxing, report expected READS and PF_READS values, and even produce some FASTQ files, but they only have partial data in them.
  • Do not redirect stderr from picard to the *.out file. Better if it ends up in the global stderr file so BASE can show some error information besides the exit code
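The second and third points can be illustrated with a small script generator. This is a Python sketch and not the ScriptBuilder class in Reggie; the class below, its method names, and the command `run_demux` are all made up for the example. It puts 'set -e' first in the script and redirects only stdout, so stderr ends up in the job's own stderr file.

```python
import shlex


class ShellScriptSketch:
    """Hypothetical helper that collects lines for a generated job script."""

    def __init__(self, exit_on_error=True):
        self.lines = ["#!/bin/sh"]
        if exit_on_error:
            # Abort the whole script as soon as one command fails, so a
            # failed demux is not followed by steps that report success.
            self.lines.append("set -e")

    def cmd(self, *args, stdout=None):
        # Quote every argument so spaces or shell metacharacters in
        # values cannot change the meaning of the command line.
        line = " ".join(shlex.quote(str(arg)) for arg in args)
        if stdout is not None:
            # Only stdout is redirected; stderr is left alone on purpose.
            line += " > " + shlex.quote(stdout)
        self.lines.append(line)
        return self

    def build(self):
        return "\n".join(self.lines) + "\n"


script = (
    ShellScriptSketch()
    .cmd("run_demux", "--input", "my folder", stdout="demux.out")  # placeholder command
    .build()
)
```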

comment:31 by Nicklas Nordborg, 10 years ago

(In [2443]) References #547: Start Demux and Merge

Generate .gz compressed files directly from Picard.

Do not start 'cat' processes in background since it will not go any faster.

Added <picard-memory> configuration option.

comment:32 by Nicklas Nordborg, 10 years ago

(In [2445]) References #547: Start Demux and Merge

Use 'getFile' instead of 'getNew' to avoid exception if the file already exists.

comment:33 by Nicklas Nordborg, 10 years ago

(In [2446]) References #547: Start Demux and Merge

Adding 'rm <tmp-folder>' back to script when not debugging.

comment:34 by Nicklas Nordborg, 10 years ago

(In [2455]) References #547: Start Demux and Merge

Prevent the 'find' command from choking on permission denied errors.

comment:35 by Nicklas Nordborg, 10 years ago

(In [2456]) References #547: Start Demux and Merge

Trim whitespace from some values parsed from qstat/qacct output. Don't know why it is there sometimes but not always.

comment:36 by Nicklas Nordborg, 10 years ago

(In [2457]) References #547: Start Demux and Merge

Get rid of problematic characters in the sequencer name

comment:37 by Nicklas Nordborg, 10 years ago

(In [2464]) References #547: Start Demux and Merge

Do not set TILE_LIMIT for ExtractIlluminaBarcodes. It doesn't save much time and there should now be enough disk space on test node (and we don't have to fix support for this in Picard, since it is not in the default distribution).

Set default log level to WARNING.

comment:38 by Nicklas Nordborg, 10 years ago

(In [2497]) References #547: Start Demux and Merge

Added trimmomatic to the script. This is called after the merge. Since the merge and the copy to project_archive previously happened at the same time, this needed some bigger changes.

After the demux all FASTQ files are in the FASTQ folder (we can no longer use the COMPRESS_OUTPUTS option since trimmomatic doesn't like concatenated gz files).

The second step is to merge and compress the FASTQ files into the 'fastq.merged' directory.

The third step runs trimmomatic, which saves the modified FASTQ files (compressed) in the 'fastq.trimmomatic' directory. Options for trimmomatic are configured in the reggie-ogs-hosts.xml file. Information about what has happened is written to a log file 'trimmomatic.out' in the job folder. When the job has been completed this information is parsed and the number of surviving reads is stored in the PT_READS annotation.

Last step is to simply copy the generated files to the project_archive.
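As an illustration of the log parsing in the third step, the sketch below extracts read counts from the summary line that Trimmomatic prints in paired-end mode ("Input Read Pairs: ... Both Surviving: ..."). Whether 'trimmomatic.out' contains exactly this line, and which of the counts ends up in PT_READS, is not stated in the ticket; using the "Both Surviving" count is an assumption, and the function name is hypothetical.

```python
import re

# Start of the paired-end summary line; the remaining fields are ignored.
SUMMARY = re.compile(r"Input Read Pairs:\s*(\d+)\s+Both Surviving:\s*(\d+)")


def surviving_pairs(log_text):
    """Return input and surviving pair counts from a trimmomatic log, or None."""
    match = SUMMARY.search(log_text)
    if match is None:
        return None
    return {
        "input_pairs": int(match.group(1)),
        "both_surviving": int(match.group(2)),  # assumed source for PT_READS
    }


example_log = (
    "Input Read Pairs: 1000 Both Surviving: 900 (90.00%) "
    "Forward Only Surviving: 50 (5.00%) Reverse Only Surviving: 30 (3.00%) "
    "Dropped: 20 (2.00%)\n"
)
counts = surviving_pairs(example_log)
```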

comment:39 by Nicklas Nordborg, 10 years ago

(In [2513]) References #547: Start Demux and Merge

Removing debug messages.

comment:40 by Nicklas Nordborg, 10 years ago

(In [2560]) References #547 and #593. Do not use more threads than the number of slots that have been assigned by the queue system.

The number of slots that have been assigned is present in the NSLOTS environment variable and this is compared to the number of cores on the node. The smaller number is selected.
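The rule can be written down in a few lines. The sketch below is in Python rather than in the generated shell script, and the fallback to a single thread when NSLOTS is missing or invalid is an assumption (the ticket does not say what happens in that case).

```python
import os


def thread_count(environ=None, cores=None):
    """Return min(NSLOTS, number of cores), but never less than 1."""
    environ = os.environ if environ is None else environ
    cores = cores if cores is not None else (os.cpu_count() or 1)
    try:
        slots = int(environ.get("NSLOTS", ""))
    except ValueError:
        # Assumption: without a usable NSLOTS value, stay on one thread.
        return 1
    return max(1, min(slots, cores))


# 8 slots assigned on a 16-core node gives 8 threads.
threads = thread_count({"NSLOTS": "8"}, cores=16)
```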

comment:41 by Nicklas Nordborg, 10 years ago

(In [2566]) References #547: Start Demux and Merge

Specify project name in the configuration file instead of in the wizard. The 'debug' parameter now results in a subdirectory inside the project directory instead of a sibling directory.

comment:42 by Nicklas Nordborg, 10 years ago

(In [2579]) References #547: Start Demux and Merge

Use temporary filter in selection dialog to avoid destroying existing filter in normal browsing.

comment:43 by Nicklas Nordborg, 10 years ago

(In [2581]) References #547: Start Demux and Merge

Use a different number of tiles when debugging HiSeq and NextSeq data, since one tile of HiSeq data contains many more sequences and takes much longer to process than one tile of NextSeq data.

comment:44 by Nicklas Nordborg, 10 years ago

(In [2582]) References #547 and #593.

Use the temp folder provided by the grid scheduler by default. The benefit is that it will automatically be cleaned up when the job finishes. For debugging, it is possible to set a different folder, since it may be desirable to keep all intermediate data files.

comment:45 by Nicklas Nordborg, 10 years ago

Resolution: fixed
Status: assigned → closed

comment:46 by Nicklas Nordborg, 10 years ago

(In [2626]) References #547 and #593. Use the "Send message when plugin completes" setting from the web client with jobs.

comment:47 by Nicklas Nordborg, 10 years ago

(In [2627]) References #547: Start Demux and Merge

Sequencing runs are sorted by the first pool name.

comment:48 by Nicklas Nordborg, 10 years ago

(In [2628]) References #547: Start Demux and Merge

Display a warning if the selected sequencing runs don't have the same pools.

comment:49 by Nicklas Nordborg, 10 years ago

(In [2636]) References #571 and #547 and #548. The confirm demux wizard has been updated to the new script and style pattern.

Also added an option to delete items after a failed job and fixed the script so that it makes sure that the job, work and archive folders are empty before trying to use them.

comment:50 by Nicklas Nordborg, 10 years ago

(In [2654]) References #547: Start Demux and Merge

Added 'ulimit' option for the demux script. If no limit is specified this command is not included in the script and the default limit on the server is used.

comment:51 by Nicklas Nordborg, 10 years ago

(In [2655]) References #547: Start Demux and Merge

Reset the AutoProcessing annotation when starting a demux, since otherwise the sequencing runs flagged for re-processing will not disappear from the list.

comment:52 by Nicklas Nordborg, 10 years ago

(In [2680]) References #547: Start Demux and Merge

Calling bowtie to align a small portion of the reads (default 100000, can be configured in reggie-ogs-hosts.xml). From the generated SAM file we can then estimate the average and standard deviation of the fragment size.

This information is imported into BASE as FragmentSizeAvg and FragmentSizeStdev annotations on MergedSequences items.

The information is also saved alongside the fastq files in the project archive.

comment:53 by Nicklas Nordborg, 10 years ago

(In [2689]) References #547: Start Demux and Merge

Splitting Trimmomatic into two separate runs. First one should filter on Illumina adapters only, and the second one all other filters.

Added annotation 'ADAPTER_READS' to store the number of reads filtered out in the first trimmomatic step.

comment:54 by Nicklas Nordborg, 10 years ago

(In [2729]) References #547, #593 and #631. Check that values from the database such as paths and names of items are safe to use in scripts. The check is currently very strict and only allows letters, digits, dot, underscore and hyphen.
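A whitelist check of this kind could look like the sketch below. Limiting "letters" to ASCII and rejecting empty values are assumptions, and the function name `check_safe` is hypothetical.

```python
import re

# Letters (assumed ASCII only), digits, dot, underscore and hyphen.
SAFE_VALUE = re.compile(r"[A-Za-z0-9._-]+")


def check_safe(value, label="value"):
    """Return the value unchanged if it is safe to use in a script, else raise."""
    if not SAFE_VALUE.fullmatch(value):
        raise ValueError(f"Unsafe {label}: {value!r}")
    return value


name = check_safe("SequencingRun001.dx2", "item name")
```

Rejecting everything outside a small whitelist is simpler to reason about than trying to escape every character that is special to the shell.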

comment:55 by Nicklas Nordborg, 10 years ago

(In [2730]) References #547: Start Demux and Merge

Add 'rm' commands in the generated script at strategic points to remove files that are no longer needed. This should decrease the amount of disk space that is needed.
