Opened 11 years ago

Closed 10 years ago

Last modified 10 years ago

#593 closed task (fixed)

Start masking and alignment

Reported by: Nicklas Nordborg
Owned by: Nicklas Nordborg
Priority: major
Milestone: Reggie v2.16
Component: net.sf.basedb.reggie
Keywords:
Cc:

Description (last modified by Nicklas Nordborg)

Part of #533.

The wizard is started after data from a sequencing run has been successfully demultiplexed and merged. In the first step, MergedSequences items should be selected. The items need to be annotated with AnalysisResult=Successful and AutoProcessing!=Disable (the latter may be set, e.g., when some libraries have been manually deselected in the "demux ended" wizard).

For each selected item, the wizard creates one MaskedSequences child item:

  • Name of item: <lib-name>.g.k, <lib-name>.g.k2
  • Software and protocol: (Type=Masking)

The wizard also creates one AlignedSequences grandchild item:

  • Name of item: <lib-name>.g.k.a, <lib-name>.g.k.a2
  • Software and protocol: (Type=Alignment)
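
The naming pattern in the two lists above can be sketched as a small shell helper. This is only an illustration and not the code used by Reggie: the function name numbered_name is hypothetical, and it assumes that the number is left out for the first item and appended from the second item onwards, as the examples suggest.

```shell
# numbered_name PARENT LETTER N
# Prints PARENT.LETTER when N is 1 and PARENT.LETTERN when N is 2 or higher.
# Hypothetical helper; the numbering rule is inferred from the examples above.
numbered_name() {
  if [ "$3" -ge 2 ]; then
    printf '%s.%s%s\n' "$1" "$2" "$3"
  else
    printf '%s.%s\n' "$1" "$2"
  fi
}

# Example: second masking of a merged item, first alignment of that masking.
masked=$(numbered_name "MyLib.g" k 2)
aligned=$(numbered_name "$masked" a 1)
```

Here "MyLib.g" stands in for the name of a MergedSequences item; it is a made-up value.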

Parameters for the filter step:

  • Target genome that filters away unwanted sequences. This could be hardcoded into the script, a configuration setting, or user selectable in the wizard.
  • Other parameters ????

Parameters for the alignment step:

  • Target genome to align against. This could be hardcoded into the script, a configuration setting, or user selectable in the wizard.
  • Location where the final result files should be stored. This should probably be a configuration setting.
  • Other parameters ????

After the alignment is done, some information may be imported back to BASE. It would be nice to have the number of aligned sequences, and possibly some other information that is not yet decided.

Change History (27)

comment:1 by Nicklas Nordborg, 11 years ago

Status: new → assigned

comment:2 by Nicklas Nordborg, 11 years ago

(In [2375]) References #593: Start filter and alignment

Started to implement this wizard. The index page shows the count and the wizard displays the merged sequences waiting for alignment. Manual selection is possible. Registration does nothing.

comment:3 by Nicklas Nordborg, 11 years ago

(In [2389]) References #593: Start filter and alignment

Added 'job priority' and 'debug' options to wizard. Started with the servlet for creating the FilteredSequences and AlignedSequences items.

A job script is generated and submitted to the cluster, but except for copying FASTQ files to the node this currently does nothing since not everything is in place on the cluster.

comment:4 by Nicklas Nordborg, 11 years ago

(In [2394]) References #593: Start filter and alignment

Generate a script that is working for the filter step. The filtered_<lib-name>.out file is parsed for number of remaining reads which is stored as NumReads annotation on the FilteredSequences item.

comment:5 by Nicklas Nordborg, 11 years ago

(In [2397]) References #593: Start filter and alignment

Now running tophat and syncing files back to project_archive. Using tophat_single.sh instead of tophat.sh so we don't have to mess with samplesheet.csv.

comment:6 by Nicklas Nordborg, 11 years ago

(In [2399]) References #593: Start filter and alignment

Running statistics_tophat.sh to get some information about aligned reads that we can import back to BASE (as NumReads annotation on AlignedSequences).

comment:7 by Nicklas Nordborg, 11 years ago

(In [2414]) References #593: Start filter and alignment

Removed 'filter_' prefix in the 'PE_filter' script.

comment:8 by Nicklas Nordborg, 11 years ago

Summary: Start filter and alignment → Start masking and alignment

comment:9 by Nicklas Nordborg, 11 years ago

Description: modified (diff)

comment:10 by Nicklas Nordborg, 11 years ago

(In [2420]) References #533, #547, #548, #593, #595. Renamed FilteredSequences subtype to MaskedSequences and the related software and protocol type. Renamed annotations NumReads to READS and PassedFilterReads to PF_READS and added new annotations for the number of reads on the masked (PM_READS) and aligned (ALIGNED_PAIRS) levels.

Lots of related changes in the code to make class and variable names match the new names.

comment:11 by Nicklas Nordborg, 10 years ago

(In [2460]) References #593: Start masking and alignment

Improved error handling in script. Check that data folder exists before starting.

comment:12 by Nicklas Nordborg, 10 years ago

(In [2535]) References #593: Start masking and alignment

Reset AutoProcessing annotation when starting an alignment so that the bioassay disappears from the "Start masking and alignment" count and list.

Also use DISTINCT when counting or loading the list since otherwise the same bioassay will appear multiple times after a re-alignment.

comment:13 by Nicklas Nordborg, 10 years ago

(In [2560]) References #547 and #593. Do not use more threads than the number of slots that have been assigned by the queue system.

The number of assigned slots is available in the NSLOTS environment variable and is compared to the number of cores on the node. The smaller number is selected.
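
A minimal sketch of that selection, for illustration only. The function name pick_threads is hypothetical, the core count is passed in as an argument so the logic can be tried outside the cluster, and falling back to the core count when NSLOTS is unset is an assumption.

```shell
# pick_threads CORES
# Prints the smaller of $NSLOTS (slots assigned by the queue system) and CORES.
# Assumption: if NSLOTS is unset or empty, the core count is used as-is.
# The body runs in a subshell so the helper variables do not leak.
pick_threads() (
  cores="$1"
  slots="${NSLOTS:-$cores}"
  if [ "$slots" -lt "$cores" ]; then
    echo "$slots"
  else
    echo "$cores"
  fi
)

# Possible use in a job script (nproc is the GNU coreutils tool):
#   THREADS=$(pick_threads "$(nproc)")
```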

comment:14 by Nicklas Nordborg, 10 years ago

(In [2568]) References #593: Start masking and alignment

Changes related to [2566] to ensure that paths to data files are correct.

comment:15 by Nicklas Nordborg, 10 years ago

(In [2580]) References #593: Start masking and alignment

Use temporary filter in selection dialog to avoid destroying existing filter in normal browsing.

comment:16 by Nicklas Nordborg, 10 years ago

(In [2582]) References #547 and #593.

Use the temp folder provided by the grid scheduler by default. The benefit is that it will automatically be cleaned up when the job finishes. For debugging, it is possible to set a different folder, since it may be desirable to keep all intermediate data files.
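
As an illustration, the choice of folder could look like the sketch below. It assumes that the scheduler exposes its per-job temp folder in the TMPDIR environment variable; DEBUG_WORK_DIR and choose_work_dir are hypothetical names, not settings from Reggie.

```shell
# choose_work_dir
# Prints the folder to use for intermediate files.
# Assumption: the scheduler sets TMPDIR to a per-job folder that it removes
# when the job ends. DEBUG_WORK_DIR is a hypothetical override for debugging.
choose_work_dir() {
  if [ -n "${DEBUG_WORK_DIR:-}" ]; then
    printf '%s\n' "$DEBUG_WORK_DIR"
  else
    printf '%s\n' "${TMPDIR:-/tmp}"
  fi
}
```

With the override set, the intermediate files end up outside the scheduler-managed folder and therefore have to be cleaned up manually.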

comment:17 by Nicklas Nordborg, 10 years ago

Resolution: fixed
Status: assigned → closed

comment:18 by Nicklas Nordborg, 10 years ago

(In [2625]) References #593: Start masking and alignment

Add -f flag to 'mv' command to ensure that the old 'accepted_hits.bam' file is replaced.

comment:19 by Nicklas Nordborg, 10 years ago

(In [2626]) References #547 and #593. Use the "Send message when plugin completes" setting from the web client with jobs.

comment:20 by Nicklas Nordborg, 10 years ago

(In [2630]) References #593: Start masking and alignment

Save aligned data from tophat to 'kN.a' subfolder instead of 'tophat.kN' (N is only used if 2 or higher).

comment:21 by Nicklas Nordborg, 10 years ago

(In [2633]) References #593 and #595. Added "delete items created by failed jobs" option to alignment confirmation wizard.

This will delete MaskedSequences and AlignedSequences items so that the database is not filled up with uninteresting items.

Re-starting the alignment will create new items with the same names, so the script sent to the cluster has been modified to make sure that the folders it is going to use are empty before it starts adding data to them. E.g.:

mkdir -p folder
rm -rf folder/*
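
When the folder name comes from a variable, the same two commands are usually wrapped with a guard, since an empty variable would otherwise expand to 'rm -rf /*'. The sketch below is an illustration and not the generated script; ensure_empty_dir is a hypothetical name. Like the commands above, it leaves hidden files (names starting with a dot) in place.

```shell
# ensure_empty_dir DIR
# Creates DIR if needed and removes its (non-hidden) contents.
# Refuses to run for an empty path or for "/".
# The body runs in a subshell, so 'exit 1' only ends the function.
ensure_empty_dir() (
  dir="$1"
  [ -n "$dir" ] && [ "$dir" != "/" ] || {
    echo "ensure_empty_dir: refusing to use '$dir'" >&2
    exit 1
  }
  mkdir -p -- "$dir" && rm -rf -- "$dir"/*
)
```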

comment:22 by Nicklas Nordborg, 10 years ago

(In [2635]) References #593 and #595. Changes in [2633] that ensure job folders are empty also deleted sample sheet files uploaded by the demux script. The job definition has been modified so that files that are needed by the job must be part of the definition and uploaded at the same time as the job is sent to the cluster.

comment:23 by Nicklas Nordborg, 10 years ago

(In [2684]) References #593: Start masking and alignment

Use the average fragment size and standard deviation calculated by bowtie as input parameters to tophat. Also corrects for read-string.

comment:24 by Nicklas Nordborg, 10 years ago

(In [2697]) References #593: Start masking and alignment

Adding 'fix_tophat_unmapped_reads.py' to the generated script.

comment:25 by Nicklas Nordborg, 10 years ago

(In [2706]) References #593 and #595. Parse 'accepted_hits_picardmetrics.csv' and read out three values:

  • READ_PAIRS_EXAMINED
  • READ_PAIR_DUPLICATES
  • PERCENT_DUPLICATION (FRACTION_DUPLICATION)

The values are stored in annotations on the AlignedSequences item. Note that what Picard says is a percentage is actually a fraction.

comment:26 by Nicklas Nordborg, 10 years ago

(In [2727]) References #593: Start masking and alignment

Switches to hg38 in the default configuration.

comment:27 by Nicklas Nordborg, 10 years ago

(In [2729]) References #547, #593 and #631. Check that values from the database such as paths and names of items are safe to use in scripts. The check is currently very strict and only allows letters, digits, dot, underscore and hyphen.
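
The whitelist can be illustrated with a short shell sketch. The check described above is applied to database values before they are used in a script, so this is not the actual implementation. The function name is_safe_for_script is hypothetical, and rejecting the empty string is an assumption.

```shell
# is_safe_for_script VALUE
# Succeeds only if VALUE consists of letters, digits, dot, underscore and
# hyphen. Assumption: an empty value is rejected as well.
is_safe_for_script() {
  case "$1" in
    '' | *[!A-Za-z0-9._-]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Example use before a value is inserted into a generated script:
name="MyLib.g.k2"
if ! is_safe_for_script "$name"; then
  echo "Unsafe value: $name" >&2
fi
```

Because the pattern rejects everything outside the whitelist, characters such as spaces, quotes, semicolons and slashes never reach the script.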
