#997 closed task (fixed)

Implement new secondary analysis pipeline

Reported by: Nicklas Nordborg Owned by: Nicklas Nordborg
Priority: critical Milestone: Reggie v4.12
Component: net.sf.basedb.reggie Keywords:
Cc:

Description

The new secondary analysis pipeline should be implemented. It starts with MergedSequences. The first step is similar to the old Mask+Align step but uses Hisat instead of Tophat. The post-processing scripts that are run afterwards to collect statistics should not have to be changed, but this needs to be verified. Just as for the original pipeline, there is a breakpoint after the alignment step, which means we have to store result files back to the project archive.

Auto-confirm rules should be considered but we may want to start with manual confirmation.

The second step is to calculate expression values with Stringtie. We need a new raw data type to be able to separate this from the cufflinks data. We need to investigate what kinds of files Stringtie produces, which of them we should define as file types, and which should only be generically linked.

Since the new pipeline is going to live alongside the legacy pipeline we need a way to separate items. We can, for example, define new subtypes for items belonging to the new pipeline. While this makes it relatively easy to implement the new pipeline there are some drawbacks in other areas:

  • The current structure (RawBioAssay -> AlignedSequences -> MaskedSequences -> MergedSequences) is built into a lot of other places like the case summary, yellow label wizard, release exporter, etc. Introducing new subtypes will require changes in several other places to make them behave as we want.
  • We also need new subtypes for protocols and software.
  • If we add more pipelines in the future there is going to be a "subtype explosion" which will make the first point above even more complex to handle.

Another possibility is to keep and re-use the current subtypes and maybe use an annotation to indicate which pipeline an item belongs to. We still need to check the case summary, yellow label wizard, etc., but I think fewer changes are needed. More care is needed when implementing the new pipeline, since we really have to be sure that items are not mixed up and suddenly start to be processed by the incorrect pipeline.

Change History (18)

comment:1 Changed 19 months ago by Nicklas Nordborg

(In [4587]) References #997: Implement new secondary analysis pipeline

Adding configuration settings for the Hisat alignment. Options go into the <align-hisat> tag and are similar to options already present for the legacy tophat alignment.

comment:2 Changed 19 months ago by Nicklas Nordborg

(In [4588]) References #997: Implement new secondary analysis pipeline

Adding a new Job subtype "HisatAlign" to be used for the new pipeline.

comment:3 Changed 19 months ago by Nicklas Nordborg

(In [4589]) References #997: Implement new secondary analysis pipeline

Updated the script so that it can work with either 'accepted_hits.bam' or 'alignment.bam'. Also, if there is no 'unmapped.bam' file it will simply skip it.
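
The fallback described above could look roughly like the following. This is only a sketch in Python and not the actual script: the function name pick_alignment_files and the order of preference between the two BAM file names are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional, Tuple

# File names are taken from the comment above.
# The order of preference between them is an assumption.
ALIGNMENT_CANDIDATES = ("alignment.bam", "accepted_hits.bam")


def pick_alignment_files(folder: Path) -> Tuple[Path, Optional[Path]]:
    """Return (aligned BAM, unmapped BAM or None) for a result folder."""
    for name in ALIGNMENT_CANDIDATES:
        candidate = folder / name
        if candidate.is_file():
            aligned = candidate
            break
    else:
        # Neither of the two known file names exists
        raise FileNotFoundError(f"No alignment BAM found in {folder}")

    # A missing 'unmapped.bam' is not an error, it is simply skipped
    unmapped = folder / "unmapped.bam"
    return aligned, (unmapped if unmapped.is_file() else None)
```

A caller would then only include the unmapped reads in the statistics when the second value is not None.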

comment:4 Changed 19 months ago by Nicklas Nordborg

(In [4590]) References #997: Implement new secondary analysis pipeline

Added HisatAlignJobCreator which should be used to create and submit scripts for Hisat alignment to the cluster. This class is very similar to the old AlignJobCreator that existed before the old pipeline merged all steps into a single job.

Input parameters to the job creator are a Software, a Protocol and a list of MergedSequences items. One job is created for each merged item. It will use options from the <align-hisat> configuration section and supports "ParameterSet" via the annotation on the Software item. The script will create one child AlignedSequences item for each MergedSequences item.

After copying the FASTQ files from the project archive to the cluster node it will run Hisat (no masking step). The main file generated is aligned/alignment.bam. After that it will also run picard MarkDuplicates and picard AddOrReplaceReadGroups. The last step is to extract some statistics.

When the job has been completed Reggie will import some values and link files to the aligned sequences item. The major difference is that no PM_READS value can be set since the masking step is missing. Hisat also produces a different set of files.

Note! There is not yet any way to invoke this pipeline via the Reggie web interface.

comment:5 Changed 19 months ago by Nicklas Nordborg

(In [4591]) References #997: Implement new secondary analysis pipeline

Samtools updated to version 1.4.

comment:6 Changed 19 months ago by Nicklas Nordborg

(In [4592]) References #997: Implement new secondary analysis pipeline

Removed the AddOrReplaceReadGroups step since it is possible to include it in the Hisat alignment. FlowCellId is a new parameter. If a sample has been sequenced in multiple sequencing runs the id is created by concatenating the flow cell id from all runs.
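
As an illustration of how the read group could be passed directly to the aligner, here is a sketch in Python. It assumes the HISAT2 command line, which accepts --rg-id and --rg. The underscore separator, the read group fields (SM, PU) and the function names are assumptions and are not taken from the Reggie code.

```python
from typing import List, Sequence


def flow_cell_id(run_ids: Sequence[str], separator: str = "_") -> str:
    """Concatenate the flow cell ids from all sequencing runs.

    The separator is an assumption; the ticket only says the ids are concatenated.
    """
    if not run_ids:
        raise ValueError("at least one flow cell id is required")
    return separator.join(run_ids)


def read_group_args(sample: str, run_ids: Sequence[str]) -> List[str]:
    """Build read group arguments for the aligner (hypothetical field choice)."""
    rg_id = flow_cell_id(run_ids)
    return ["--rg-id", rg_id, "--rg", f"SM:{sample}", "--rg", f"PU:{rg_id}"]
```

With the read group written by the aligner itself, the separate picard step is not needed just to set it.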

comment:7 Changed 19 months ago by Nicklas Nordborg

(In [4593]) References #997: Implement new secondary analysis pipeline

Creating an index file with samtools to replace the one that was previously created as a "side-effect" by the AddOrReplaceReadGroups step.

comment:8 Changed 19 months ago by Nicklas Nordborg

(In [4594]) References #997: Implement new secondary analysis pipeline

Started to implement a wizard for manually starting Hisat alignment. It is basically the same as the old wizard that started Tophat alignment.

There is not yet support for detecting which merged sequences should be aligned. Protocols and software are defaulted to the Tophat alignment since there is not yet any way to tell the difference.

comment:9 Changed 19 months ago by Nicklas Nordborg

(In [4595]) References #997: Implement new secondary analysis pipeline

Added a Hisat confirmation wizard. Since there is no next step defined yet (StringTie) it will simply import some statistics about the alignment and set the "Successful" annotation on the aligned items.

comment:10 Changed 19 months ago by Nicklas Nordborg

(In [4596]) References #997: Implement new secondary analysis pipeline

Added 'Legacy pipeline' and 'Hisat pipeline' item lists to the installation. The lists are now used by the "Start ... alignment" wizards to display the items waiting for alignment. The counters on the start page now also use the lists.

Note that auto-confirm and the manual confirmation in the demux step should add items to the lists. This is not yet implemented. Items are also not removed from the lists.

comment:11 Changed 19 months ago by Nicklas Nordborg

(In [4597]) References #997: Implement new secondary analysis pipeline

The "Confirm demux and merge completed" can now add items to the Hisat/Legacy pipeline item lists.

comment:12 Changed 19 months ago by Nicklas Nordborg

(In [4598]) References #997: Implement new secondary analysis pipeline

Items are removed from the Hisat/Legacy pipeline lists when a job has been registered.

Items are re-added to the lists when the "re-align" option is selected in the confirmation wizard.
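
The life cycle of an item in the two lists can be summarized with a small in-memory model. This is a sketch in Python only; the class and method names are hypothetical, and the real lists are item lists stored on the server rather than Python sets.

```python
class PipelineLists:
    """In-memory stand-in for the 'Legacy pipeline' and 'Hisat pipeline' item lists."""

    def __init__(self):
        self.lists = {"Legacy pipeline": set(), "Hisat pipeline": set()}

    def demux_confirmed(self, item: str, pipeline: str) -> None:
        # Confirming demux and merge puts the item in the waiting list
        self.lists[pipeline].add(item)

    def job_registered(self, item: str, pipeline: str) -> None:
        # Once an alignment job is registered the item is no longer waiting
        self.lists[pipeline].discard(item)

    def realign_requested(self, item: str, pipeline: str) -> None:
        # Selecting "re-align" in the confirmation wizard puts the item back
        self.lists[pipeline].add(item)

    def waiting(self, pipeline: str) -> list:
        # What a "Start ... alignment" wizard or start page counter would show
        return sorted(self.lists[pipeline])
```

The counter on the start page then corresponds to the length of waiting(...) for each list.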

comment:13 Changed 19 months ago by Nicklas Nordborg

(In [4599]) References #997: Implement new secondary analysis pipeline

The auto-confirmation after the demux step now creates Hisat jobs as well. There is not yet any way to get different project default software/protocol items for the Hisat and Tophat pipelines, so both paths get the same software/protocol items.

This needs to be solved to make it possible to tell the difference between alignments made with Tophat vs. Hisat.

comment:14 Changed 19 months ago by Nicklas Nordborg

(In [4600]) References #997: Implement new secondary analysis pipeline

Added auto-confirm implementation for the Hisat alignment. It is basically the same as the old Tophat auto-confirm implementation. It uses the same levels for flagging the RNA (<5M) and stopping the pipeline (<1M). In any case, there is not yet any next step to proceed to.
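
The two levels can be expressed as a simple decision function. This sketch in Python assumes that the levels apply to a read count from the alignment; the function name and the returned labels are hypothetical, and the ticket does not say whether a stopped item is also flagged.

```python
FLAG_LIMIT = 5_000_000  # below this the RNA is flagged (<5M)
STOP_LIMIT = 1_000_000  # below this the pipeline is stopped (<1M)


def auto_confirm_decision(read_count: int) -> str:
    """Return 'stop', 'flag' or 'ok' for the given read count."""
    if read_count < STOP_LIMIT:
        return "stop"
    if read_count < FLAG_LIMIT:
        return "flag"
    return "ok"
```

For example, auto_confirm_decision(3_200_000) falls between the two levels and gives 'flag'.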

comment:15 Changed 19 months ago by Nicklas Nordborg

(In [4601]) References #997: Implement new secondary analysis pipeline

Updated the release exporter so that it works even if there is no MaskedSequences item. It will still add an empty line (except for the raw bioassay name) to the cohortMasked.txt.
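
A sketch of the row handling in Python, assuming a tab-separated file with the raw bioassay name in the first column. The function name and the column layout are assumptions and not the exporter's actual code.

```python
from typing import Optional, Sequence


def masked_row(raw_bioassay_name: str,
               masked_values: Optional[Sequence[str]],
               num_columns: int) -> str:
    """Build one line for cohortMasked.txt.

    If there is no MaskedSequences item (masked_values is None) the line
    contains only the raw bioassay name followed by empty columns.
    """
    if masked_values is None:
        masked_values = [""] * num_columns
    return "\t".join([raw_bioassay_name, *masked_values])
```

Keeping the otherwise empty line means the file still has one row per raw bioassay.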

comment:16 Changed 19 months ago by Nicklas Nordborg

(In [4602]) References #997: Implement new secondary analysis pipeline

Added 'AlignmentType' annotation type. It can be used on Software and Protocol items with the Alignment subtype. The annotation type has two possible values: 'Tophat' and 'Hisat'.

The "Start Tophat and Cufflinks" wizard will only consider alignment software/protocols that have been annotated with "Tophat".

The "Start Hisat alignment" wizard will only consider alignment software/protocols that have been annotated with "Hisat".

The auto-confirmation after a demux has also been updated to only look for software and protocols with the proper annotation when creating and registering alignment jobs.

Existing alignment software and protocol items should be updated to "AlignmentType=Tophat".

New software and protocol items should be created for the Hisat pipeline.
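
The selection rule used by the wizards and the auto-confirmation can be sketched as a filter on the annotation value. The Python below uses a hypothetical Item class as a stand-in for Software and Protocol items; it is not the BASE or Reggie API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Item:
    """Hypothetical stand-in for a Software or Protocol item."""
    name: str
    subtype: str
    annotations: Dict[str, str] = field(default_factory=dict)


def alignment_items(items: List[Item], alignment_type: str) -> List[Item]:
    """Keep items with the Alignment subtype and a matching AlignmentType annotation."""
    return [
        item for item in items
        if item.subtype == "Alignment"
        and item.annotations.get("AlignmentType") == alignment_type
    ]
```

In this sketch an item without the annotation matches neither 'Tophat' nor 'Hisat', which illustrates why existing items need to be updated.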

comment:17 Changed 19 months ago by Nicklas Nordborg

(In [4603]) References #997: Implement new secondary analysis pipeline

Added alignment software to case summary.

comment:18 Changed 18 months ago by Nicklas Nordborg

Resolution: fixed
Status: new → closed