#1199 closed task (fixed)

Implement Variant calling pipeline

Reported by: Nicklas Nordborg Owned by: Nicklas Nordborg
Priority: major Milestone: Reggie v4.24
Component: net.sf.basedb.reggie Keywords:
Cc:

Description (last modified by Nicklas Nordborg)

Updated description

Variant calling is started from Hisat alignment. The first step is a "raw" variant calling using mosdepth and VarDict. The variant calling only depends on the same reference genome that was used for alignment. We save two files (variants-callable.bed and variants-raw.vcg.gz) from this step and attaches them to the same Hisat alignment.

The seconds step is an annotation step and a filtering step. Multiple programs are used to extract and merge data from a lot of external sources (for example dbSnp, Cosmic, Swegene, etc.) This step creates a child rawbioassay item with two more files. An annotated VCF (variants-annotated.vcf.gz) that is the same as the raw VCF with added information from the other database, and a filtered VCF (variants-filtered.vcf) with the variants that survive filtering.

The variants calling wizard can be executed in two modes: raw-only mode will do only the first step and the full mode will do both steps. In full mode, the raw variant calling can be skipped if it has already been done before.

Original description

The starting point can probably be Hisat alignments. It is not yet sure what kind of data that is going to be produced. We will at least get a vcf file containing found variants. It might be possible to attach this data to the current alignment item, but it may also be better to create a child item for this. A child item is better if we get a lot of data files and is also required if we want to re-run the analysis without overwriting the first result.

We can start building the framework for hooking into auto-confirmation and the manual wizards. It can probably be very similar to how the framework for the mBAF pipeline is implemented (#1068).

Change History (41)

comment:1 Changed 11 months ago by Nicklas Nordborg

In 5683:

References #1199: Implement Variant calling pipeline

Defined a new item list "Variant Calling Pipeline". This should contain alignments (from Hisat) that should be processed with by the variant calling pipeline.

Auto-confirmation after Hisat will automatically add items to the list and so will manual confirmation if the "StringTie?" option is selected.

The pipeline can be disabled from auto-analysis by setting the AutoProcessing annotation to Disable.

A link and a counter has been added to the first page.

comment:2 Changed 11 months ago by Nicklas Nordborg

In 5684:

References #1199: Implement Variant calling pipeline

Added a skeleton for the variant calling job creator. It will currently not do anything besides copying the bam file from the the alignment.

A new job type has been defined and the code has been hooked into the auto-confirmation system (JobCompletionHandlerFactory and HistAlignAutoConfirmer).

A software subtype for variant calling already exists (used by the mBAF analysis). This is re-used, but a new value for the VariantCallType annotation is needed. We call this VariantCall for now, but it might be changed later.

comment:3 Changed 11 months ago by Nicklas Nordborg

In 5685:

References #1199: Implement Variant calling pipeline

Added wizard for starting variant calling jobs. It is more or less the same structure as in the mBAF wizard.

comment:4 Changed 11 months ago by Nicklas Nordborg

In 5687:

References #1199: Implement Variant calling pipeline

Implemented script for the first step in the variant calling pipeline. This step is calculating callable regions which are the regions in the alignment that has a depth of at least 5 or more. The important output file is the 'callable.bed' that is currently created in the 'mosdepth' folder. The file is not yet copied back to the project archive.

Added entries in reggie-config.xml for paths to 'bedtools' and to 'mosdepth'. 'zcat' and 'awk' are also used but it is expected that we can use whatever the system provides so there is no explicit path to them.

comment:5 Changed 11 months ago by Nicklas Nordborg

In 5690:

References #1199: Implement Variant calling pipeline

Added call to 'vardict' in the script. Options and paths are currently hardcoded.

comment:6 Changed 11 months ago by Nicklas Nordborg

In 5691:

References #1199: Implement Variant calling pipeline

Most options should now be in reggie-config.xml. A new section <variant-call> has been added with options for the mosdepth, VarDict and var2vcf_valid.pl programs.

Final output is in the file variants.vcf.gz. It is currently left in the temporary working directory.

comment:7 Changed 11 months ago by Nicklas Nordborg

In 5692:

References #1199: Implement Variant calling pipeline

Copy the variants-all.vcf.gz back to the run archive folder for the alignment.

Auto-confirmation is implemented in the same way as for the mBAF pipeline by pre-linking the result file to the job. If the job fails this will be picked up by the auto-confirmation service which re-schedules the alignment for variant calling.

The VCF file produced is different from what we have seen before in that no ID has been assigned to each data line. The simplest way to get around this was to use the line number as a replacement. This works for now since we are only using the count to generate a message, but if we want to compare files in some way we need a different approach. Chromosome+position is a candidate but there can be duplicates with different REF and ALT values so it is not 100% safe.

comment:8 Changed 11 months ago by Nicklas Nordborg

In 5693:

References #1199: Implement Variant calling pipeline

Added a step that calls 'vcfanno' for annotating the vcf file from multiple databases in one go. The command is simple. All important options go into a configuration file that specify the databases and fields to use for annotations. See https://github.com/brentp/vcfanno for more information.

Tested with a small subset of dbSnp to assign rsId numbers to matching variants.

comment:9 Changed 11 months ago by Nicklas Nordborg

In 5696:

References #1199: Implement Variant calling pipeline

Added second annotation step (snpEff). The original script simply pipes the output from vcfanno to snpEff but this complicates error handling (since vcfanno writes messages to stderr that are not error messages). So I think it is better to implement two separate steps with a temporary file inbetween.

comment:10 Changed 11 months ago by Nicklas Nordborg

In 5697:

References #1199: Implement Variant calling pipeline

Improve error handling by using a wrapper script so that output written too stderr by 'vcfanno' and 'snpEff' is redirected to a log file if the commands are successful. It is only if they fail that the output is written to the real stderr.

comment:11 Changed 11 months ago by Nicklas Nordborg

In 5698:

References #1199: Implement Variant calling pipeline

Added wrapper script for directing stderr to different places depending on exit status of command.

comment:12 Changed 11 months ago by Nicklas Nordborg

In 5699:

References #1199: Implement Variant calling pipeline

Added filtering step with SnpSift.

comment:13 Changed 11 months ago by Nicklas Nordborg

In 5701:

References #1199: Implement Variant calling pipeline

Reorganised some configuration options as databases and other things get installed at proper locations.

Also added a few script lines that output some statistics to 'stats.out' so that we don't have to parse the vcf files from Reggie.

comment:14 Changed 11 months ago by Nicklas Nordborg

In 5704:

References #1199: Implement Variant calling pipeline

All programs (mosdepth and vcfanno) should now be configured to their installed locations.

Store the final filtered vcf uncompressed since it is typically only a few kb in size.

comment:15 Changed 11 months ago by Nicklas Nordborg

In 5705:

References #1199: Implement Variant calling pipeline

Moving the 't' suffix to a property of 'Rawdatatype' to make it possible to introduce more raw data types with other suffixes.

comment:16 Changed 11 months ago by Nicklas Nordborg

In 5706:

References #1199: Implement Variant calling pipeline

Defined VariantCall raw data type and 3 annotations:

  • CallableBases: Total number of bases that are callable
  • VariantsAll: Total number of variants found
  • VariantsPassedFilter: Number of variants after filtering

comment:17 Changed 11 months ago by Nicklas Nordborg

In 5709:

References #1199: Implement Variant calling pipeline

Added python script for calculating GC and DNA STRENGTH values and a toml file that can be used by vcfanno to bring back the information as annotations to the original (raw) VCF file.

comment:18 Changed 11 months ago by Nicklas Nordborg

In 5710:

References #1199: Implement Variant calling pipeline

Changed the generated script to calculate GC and DNA strength around all raw variants. A special python script has been implemented to help with this. The calculated values are put back into the raw VCF file with help from vcfanno.

Outline of the procedure:

  1. From the raw VCF file an awk command is used to generate a BED file with ranges that span -50 to +50 base pairs from each variant location. The BED file also include orignal position and REF/ALT columns since we need that to be able to generate a final VCF file.
  2. bedtools getfasta is used to add a column with the sequence from -50 to +50 to the BED file.
  3. gc_stat.py is used to calculate GC and DNA strength from the information in the BED and saves it as a VCF file.
  4. The VCF file is indexed (tabix) and then vcfanno is used to copy the GC and DNA strength information back into the raw VCF file.

comment:19 Changed 11 months ago by Nicklas Nordborg

In 5711:

References #1199: Implement Variant calling pipeline

Implemented calculation of distance to nearest G5 SNP in the generated script. This means that we no longer have to pull that from the "mut_stats" information. The implementation requries that two additional files with pre-calculated data is present in the dbsnp data directory:

  • dbsnp_g5_snv.bed.gz: Locations of SNP entries marked with G5 and VC=SNV.
  • dbsnp_g5_div.bed.gz: Locations of SNP enties marked with G5 and VC<>SNV.

Both files can be generated from the main dbSnp VCF file. A script for this has been added to pipelien repository: http://baseplugins.thep.lu.se/browser/other/pipeline/trunk/calc_dbsnp_g5_dist.sh

comment:20 Changed 11 months ago by Nicklas Nordborg

In 5712:

References #1199: Implement Variant calling pipeline

Added helper script for pre-calculating locations of G5 SNPs in dbSnp.

comment:21 Changed 11 months ago by Nicklas Nordborg

In 5713:

References #1199: Implement Variant calling pipeline

Added options in the wizard interface:

  • Select if only the raw variant calling step should be performed or if the full pipeline with annotating and filtering should be executed
  • If raw variants already exists there is an option to skip this step

Actual implementation of the options have not been done yet.

comment:22 Changed 11 months ago by Nicklas Nordborg

In 5714:

References #1199: Implement Variant calling pipeline

Implemented a check for existing raw variants and if the other options (rawOnly=TRUE and skipRaw=TRUE) are set so that nothing should be done the alignment is skipped. This is reported in the wizard.

comment:23 Changed 11 months ago by Nicklas Nordborg

In 5715:

References #1199: Implement Variant calling pipeline

Implemented support for re-using an existing raw variant calling file.

comment:24 Changed 11 months ago by Nicklas Nordborg

In 5716:

References #1199: Implement Variant calling pipeline

Implemented support for doing raw variant calling only.

comment:25 Changed 11 months ago by Nicklas Nordborg

In 5717:

References #1199: Implement Variant calling pipeline

Re-factored the implementation that calculates distances to G5 entries in dbSnp (introduced in [5711]). The new implementation will create a single BED file (g5-dist-2.bed) that contains matched distances for both the VC=SNV and VC<>SNV entries and additional information (REF and ALT). This BED file is then converted to a VCF file by a custom python script (g5_dist.py) with DIST_SNP and DIST_DIV annotations. The final annotation step can then use the VCF file instead of the two BED files. The main reason for the change is that we get a nicer description in the header about the DIST_SNP and DIST_DIV annotations.

comment:26 Changed 11 months ago by Nicklas Nordborg

In 5718:

References #1199: Implement Variant calling pipeline

Helper script and updated toml file for the changes in [5717].

comment:27 Changed 11 months ago by Nicklas Nordborg

In 5719:

References #1199: Implement Variant calling pipeline

Added documentation and set some svn properties on the python scripts.

comment:28 Changed 11 months ago by Nicklas Nordborg

In 5720:

References #1199: Implement Variant calling pipeline

Checked the implementation to create child raw bioassay for the filtered part of the variant calling.

This means that if we run the pipeline in raw-only mode, the raw variant calling files will be attached to the alignment and no child raw bioassay is created.

If we run the pipeline in filtering mode, a child raw bioassay is created and the filtered vcf file is attached to that.

comment:29 Changed 11 months ago by Nicklas Nordborg

In 5721:

References #1199: Implement Variant calling pipeline

Renamed the raw variant file to variants-raw.vcf.gz. Fixed annotations so that they attach to the correct item and are imported from the statistics:

  • VariantsRaw and CallableBases: Belong to the Hisat alignment
  • VariantsFiltered: Belong to the VariantCall raw bioassay

comment:30 Changed 11 months ago by Nicklas Nordborg

In 5722:

References #1199: Implement Variant calling pipeline

Updated the "Secondary analysis cleanup" wizard to also support VariantCall raw bioassays.

comment:31 Changed 11 months ago by Nicklas Nordborg

In 5723:

References #1199: Implement Variant calling pipeline

Auto-confirmation need to be handled differently if the variant calling is raw-only mode or full mode. The current implementation can be used for the raw-only mode which simply put the alignment back into the variant calling pipeline list.

comment:32 Changed 11 months ago by Nicklas Nordborg

In 5724:

References #1199: Implement Variant calling pipeline

Auto-confirmation for the full variant calling should now be implemented. This will behave more like the confirmation for StringTie and Cufflinks which means that we also need a manual confirmation wizard to set the AnalysisResult annotation for the raw bioassays in case there is a problem. The index page has been prepared for this but the wizard is not yet implemented.

comment:33 Changed 11 months ago by Nicklas Nordborg

In 5725:

References #1199: Implement Variant calling pipeline

Added RNAseq/Hisat/VariantCall value to the pipeline annotation.

comment:34 Changed 11 months ago by Nicklas Nordborg

In 5728:

References #1199: Implement Variant calling pipeline

Implemented wizard for manual confirmation of variant calling.

comment:35 Changed 11 months ago by Nicklas Nordborg

In 5729:

References #1199: Implement Variant calling pipeline

Added support for parsing the data from INFO column in VCF files. An InfoFactory implementation is needed that knows how to extract data into a more usable format. A simple implementation that extract everything into string key/value pairs have been provided.

comment:36 Changed 11 months ago by Nicklas Nordborg

In 5730:

References #1199: Implement Variant calling pipeline

Added a dialog for displaying some information about the discovered variants. So far, the Gene name and an optional link to Cosmic is extracted and displayed. There is a lot more information and upon request it would be possible to display more.

comment:37 Changed 10 months ago by Nicklas Nordborg

In 5731:

References #1199: Implement Variant calling pipeline

Added a variant calling section to the case summary.

comment:38 Changed 10 months ago by Nicklas Nordborg

Status: newaccepted

comment:39 Changed 10 months ago by Nicklas Nordborg

In 5735:

References #1199: Implement Variant calling pipeline

Removed version number from variant annotations. This was used on Cosmic and dbSnp annotations but would make it harder to write code in a future-proof way. For example, we already display the Cosmic ID in one dialog and it is better to have that under a generic name.

comment:40 Changed 10 months ago by Nicklas Nordborg

In 5736:

References #1199: Implement Variant calling pipeline

Fixed some documentation and changes the auto-confirmation from Hisat to start raw-only variant calling.

comment:41 Changed 10 months ago by Nicklas Nordborg

Description: modified (diff)
Resolution: fixed
Status: acceptedclosed
Note: See TracTickets for help on using tickets.