Opened 7 years ago
Closed 7 years ago
#997 closed task (fixed)
Implement new secondary analysis pipeline
Reported by: | Nicklas Nordborg | Owned by: | Nicklas Nordborg |
---|---|---|---|
Priority: | critical | Milestone: | Reggie v4.12 |
Component: | net.sf.basedb.reggie | Keywords: | |
Cc: |
Description
The new secondary analysis pipeline should be implemented. It starts with MergedSequences. The first step is similar to the old Mask+Align step but uses Hisat instead of Tophat. The post-processing scripts that are run afterwards for collecting statistics should not have to be changed, but this needs to be verified. Just as for the original pipeline, there is a breakpoint after the alignment step, which means we have to store result files back to the project archive.
Auto-confirm rules should be considered but we may want to start with manual confirmation.
The second step is to calculate expression values with Stringtie. We need a new raw data type to be able to separate this from the Cufflinks data. We need to investigate what kinds of files Stringtie produces, which of them we should define as file types, and which should only be generically linked.
Since the new pipeline is going to live alongside the legacy pipeline we need a way to separate items. We can, for example, define new subtypes for items belonging to the new pipeline. While this makes it relatively easy to implement the new pipeline there are some drawbacks in other areas:
- The current structure (RawBioAssay -> AlignedSequences -> MaskedSequences -> MergedSequences) is built into a lot of other places like the case summary, yellow label wizard, release exporter, etc. Introducing new subtypes will require changes in several other places to make them behave as we want to.
- We also need new subtypes for protocols and software.
- If we add more pipelines in the future there is going to be a "subtype explosion" which will make the first point above even more complex to handle.
Another possibility is to keep and re-use the current subtypes and maybe use an annotation to indicate which pipeline an item belongs to. We still need to check the case summary, yellow label wizard, etc., but I think fewer changes are needed. More care is needed when implementing the new pipeline since we really have to be sure that items are not mixed up and suddenly start to be processed by the incorrect pipeline.
Change History (18)
comment:1 by , 7 years ago
comment:2 by , 7 years ago
(In [4588]) References #997: Implement new secondary analysis pipeline
Adding a new Job subtype "HisatAlign" to be used for the new pipeline.
comment:3 by , 7 years ago
comment:4 by , 7 years ago
(In [4590]) References #997: Implement new secondary analysis pipeline
Added HisatAlignJobCreator, which should be used to create and submit scripts for Hisat alignment to the cluster. This class is very similar to the old AlignJobCreator that existed before the old pipeline merged all steps into a single job.
Input parameters to the job creator are a Software, a Protocol and a list of MergedSequences items. One job is created for each merged item. It will use options from the <align-hisat> configuration section and supports "ParameterSet" via the annotation on the Software item. The script will create one child AlignedSequences item for each MergedSequences item.
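The one-job-per-merged-item rule can be sketched as follows. This is only an illustration in Python, not the actual HisatAlignJobCreator code: the AlignJob class, the create_hisat_jobs function, the item names and the ".a" suffix for the child item are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlignJob:
    """Hypothetical stand-in for one Hisat alignment job."""
    merged_name: str    # the MergedSequences item used as input
    software: str       # the Software item selected for the alignment
    protocol: str       # the Protocol item selected for the alignment
    aligned_name: str   # child AlignedSequences item (naming is assumed)


def create_hisat_jobs(software, protocol, merged_items):
    """Create one job, with one child aligned item, per merged item."""
    return [
        AlignJob(
            merged_name=merged,
            software=software,
            protocol=protocol,
            aligned_name=merged + ".a",  # assumed naming scheme
        )
        for merged in merged_items
    ]


jobs = create_hisat_jobs("Hisat", "Hisat alignment", ["S1.m", "S2.m"])
for job in jobs:
    print(job.merged_name, "->", job.aligned_name)
```

Every job shares the same software and protocol, while the input and output items differ per job.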
After copying the FASTQ files from the project archive to the cluster node it will run Hisat (no masking step). The main file generated is aligned/alignment.bam. After that it will also run picard MarkDuplicates and picard AddOrReplaceReadGroups. The last step is to extract some statistics.
When the job has been completed Reggie will import some values and link files to the aligned sequences item. The major difference is that no PM_READS value can be set since the masking step is missing. Hisat also produces a different set of files.
Note! There is not yet any way to invoke this pipeline via the Reggie web interface.
comment:5 by , 7 years ago
comment:6 by , 7 years ago
(In [4592]) References #997: Implement new secondary analysis pipeline
Removed the AddOrReplaceReadGroups step since it is possible to include it in the Hisat alignment. FlowCellId is a new parameter. If a sample has been sequenced in multiple sequencing runs the id is created by concatenating the flow cell id from all runs.
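A minimal sketch of how the combined id could be built, assuming the flow cell ids are joined in run order. The function name, the example ids and the optional separator are hypothetical; the changeset does not say whether a separator is used.

```python
def combined_flow_cell_id(flow_cell_ids, separator=""):
    """Concatenate the flow cell ids from all sequencing runs of a sample.

    The ids are joined in the order given. The separator is an assumption
    and defaults to plain concatenation.
    """
    return separator.join(flow_cell_ids)


# One sequencing run: the id is used as-is.
print(combined_flow_cell_id(["FC1"]))
# Two sequencing runs: the ids are concatenated.
print(combined_flow_cell_id(["FC1", "FC2"]))
```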
comment:7 by , 7 years ago
comment:8 by , 7 years ago
(In [4594]) References #997: Implement new secondary analysis pipeline
Started to implement a wizard for manually starting Hisat alignment. It is basically the same as the old wizard that started Tophat alignment.
There is not yet support for detecting which merged sequences should be aligned. Protocols and software are defaulted to the Tophat alignment since there is not yet any way to tell the difference.
comment:9 by , 7 years ago
comment:10 by , 7 years ago
(In [4596]) References #997: Implement new secondary analysis pipeline
Added 'Legacy pipeline' and 'Hisat pipeline' item lists to the installation. The lists are now used by the "Start ... alignment" wizards to display the items waiting for alignment. The counters on the start page now also use the lists.
Note that auto-confirm and the manual confirmation in the demux step should add items to the lists. This is not yet implemented. Items are also not removed from the lists.
comment:11 by , 7 years ago
(In [4597]) References #997: Implement new secondary analysis pipeline
The "Confirm demux and merge completed" wizard can now add items to the Hisat/Legacy pipeline item lists.
comment:12 by , 7 years ago
(In [4598]) References #997: Implement new secondary analysis pipeline
Items are removed from the Hisat/Legacy pipeline lists when a job has been registered.
Items are re-added to the lists when the "re-align" option is selected in the confirmation wizard.
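The life cycle of an item on the pipeline lists can be sketched as below. This is an illustration in Python and not Reggie code: the PipelineLists class and its method names are hypothetical, and only the two list names come from the installation.

```python
class PipelineLists:
    """Hypothetical model of the 'Legacy pipeline' and 'Hisat pipeline' item lists."""

    def __init__(self):
        self.lists = {"Legacy pipeline": set(), "Hisat pipeline": set()}

    def confirm_demux(self, item, pipelines):
        """Confirming demux and merge adds the item to the selected lists."""
        for name in pipelines:
            self.lists[name].add(item)

    def job_registered(self, item, pipeline):
        """Registering an alignment job removes the item from that list."""
        self.lists[pipeline].discard(item)

    def realign(self, item, pipeline):
        """Selecting 're-align' in the confirmation wizard re-adds the item."""
        self.lists[pipeline].add(item)

    def waiting_count(self, pipeline):
        """Number of items waiting for alignment, as a start page counter would show."""
        return len(self.lists[pipeline])


lists = PipelineLists()
lists.confirm_demux("S1.m", ["Hisat pipeline"])
lists.job_registered("S1.m", "Hisat pipeline")
lists.realign("S1.m", "Hisat pipeline")
print(lists.waiting_count("Hisat pipeline"))
```

Using a set per list means that adding the same item twice, for example after a repeated re-align, does not inflate the counters.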
comment:13 by , 7 years ago
(In [4599]) References #997: Implement new secondary analysis pipeline
The auto-confirmation after the demux step now creates Hisat jobs as well. There is not yet any way to get different project default software/protocol items for the Hisat and Tophat pipelines, so both paths get the same software/protocol items.
This needs to be solved to make it possible to tell the difference between alignments made with Tophat and with Hisat.
comment:14 by , 7 years ago
(In [4600]) References #997: Implement new secondary analysis pipeline
Added auto-confirm implementation for the Hisat alignment. It is basically the same as the old Tophat auto-confirm implementation. It uses the same levels for flagging the RNA (<5M) and stopping the pipeline (<1M). In any case, there is not yet any next step to proceed to.
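The two levels can be sketched as a simple decision function. This is an illustration only: it assumes that "M" means million and that the value being checked is a count from the alignment statistics; the function name and the returned labels are hypothetical.

```python
FLAG_BELOW = 5_000_000   # below this level the RNA is flagged
STOP_BELOW = 1_000_000   # below this level the pipeline is stopped


def auto_confirm_decision(count):
    """Return 'stop', 'flag' or 'continue' for a given count."""
    if count < STOP_BELOW:
        return "stop"
    if count < FLAG_BELOW:
        return "flag"
    return "continue"


for count in (500_000, 3_000_000, 20_000_000):
    print(count, auto_confirm_decision(count))
```

The stop level is checked first, since a count below 1M is also below 5M.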
comment:15 by , 7 years ago
comment:16 by , 7 years ago
(In [4602]) References #997: Implement new secondary analysis pipeline
Added 'AlignmentType' annotation type. It can be used on Software and Protocol items with the Alignment subtype. The annotation type has two possible values: 'Tophat' and 'Hisat'.
The "Start Tophat and Cufflinks" wizard will only consider alignment software/protocols that have been annotated with "Tophat".
The "Start Hisat alignment" wizard will only consider alignment software/protocols that have been annotated with "Hisat".
The auto-confirmation after a demux has also been updated to only look for software and protocols with the proper annotation when creating and registering alignment jobs.
Existing alignment software and protocol items should be updated to "AlignmentType=Tophat".
New software and protocol items should be created for the Hisat pipeline.
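How the annotation separates the two pipelines can be sketched as a filter. This is an illustration in Python, not Reggie code: items are modelled as plain dictionaries, and the function name and example item names are hypothetical. Only the annotation name and its two values come from the changeset.

```python
ALIGNMENT_TYPES = ("Tophat", "Hisat")


def select_by_alignment_type(items, alignment_type):
    """Keep only the items annotated with the requested AlignmentType.

    Items without the annotation are skipped for both values, which is why
    existing items need to be updated to 'Tophat'.
    """
    if alignment_type not in ALIGNMENT_TYPES:
        raise ValueError("Unknown AlignmentType: " + alignment_type)
    return [item for item in items if item.get("AlignmentType") == alignment_type]


software = [
    {"name": "Legacy aligner", "AlignmentType": "Tophat"},
    {"name": "New aligner", "AlignmentType": "Hisat"},
    {"name": "Not yet annotated"},
]
print([s["name"] for s in select_by_alignment_type(software, "Hisat")])
```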
comment:17 by , 7 years ago
comment:18 by , 7 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
(In [4587]) References #997: Implement new secondary analysis pipeline
Adding configuration settings for the Hisat alignment. Options go into the <align-hisat> tag and are similar to the options already present for the legacy Tophat alignment.