Opened 2 months ago

Last modified 26 hours ago

#1575 new task

Impute genotypes from OncoArray data

Reported by: Nicklas Nordborg Owned by: Nicklas Nordborg
Priority: major Milestone: Reggie v4.54
Component: net.sf.basedb.reggie Keywords:
Cc:

Description

We have some samples that have been genotyped on the OncoArray platform. It should be possible to implement an analysis that imputes genotypes for many more positions.

We have already run some tests with Shapeit5 (https://odelaneau.github.io/shapeit5/) and Impute5 (https://jmarchini.org/software/#impute-5), which seem to be working well.

Reference files can be downloaded from: https://github.com/odelaneau/shapeit4/tree/master/maps and https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/

Once we have the imputed genotypes we can use them to calculate a Polygenic Risk Score (PRS). See other ticket to be created...

Change History (27)

comment:1 by Nicklas Nordborg, 2 months ago

In 7599:

References #1575: Impute genotypes from OncoArray data

Added a new section in the "Other analysis" tab.

comment:2 by Nicklas Nordborg, 2 months ago

In 7600:

References #1575: Impute genotypes from OncoArray data

Created a new item list: Genotype imputation pipeline

comment:3 by Nicklas Nordborg, 2 months ago

In 7601:

References #1575: Impute genotypes from OncoArray data

Started to implement the wizard for starting genotype imputation. It is possible to select raw bioassays with pipeline=DNA/Genotyping, but jobs are not yet submitted to the cluster.

comment:4 by Nicklas Nordborg, 2 months ago

In 7602:

References #1575: Impute genotypes from OncoArray data

Started to implement job submission. The current script simply copies the input vcf to the output.

comment:5 by Nicklas Nordborg, 2 months ago

In 7603:

References #1575: Impute genotypes from OncoArray data

Added a container definition with ShapeIT and Impute5. It's a little bit more messy than the typical container definition since the two programs are not available in any repository. We use the pre-compiled binary versions.

Added a section to the script that pre-processes the VCF file to add the AC and AN fields that are required by the other software.
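
For readers unfamiliar with these fields: AC is the count of alternate alleles and AN is the total number of called alleles at a site. The sketch below only illustrates that arithmetic for a biallelic site; the function names are hypothetical and this is not the code used in the changeset, which most likely relies on an existing VCF tool.

```python
def allele_counts(genotypes):
    """Return (AC, AN) for a biallelic site from GT strings such as '0/1'."""
    ac = 0  # number of ALT alleles seen in called genotypes
    an = 0  # total number of called alleles
    for gt in genotypes:
        # Phased ('|') and unphased ('/') genotypes are counted the same way.
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue  # missing allele: counted in neither AC nor AN
            an += 1
            if allele != "0":
                ac += 1
    return ac, an


def add_ac_an(info, genotypes):
    """Append AC and AN to an INFO string ('.' means the field is empty)."""
    ac, an = allele_counts(genotypes)
    tags = f"AC={ac};AN={an}"
    return tags if info in ("", ".") else f"{info};{tags}"
```

With the genotypes 0/1, 1/1, ./. and 0|0 this gives AC=3 and AN=6, since the missing genotype contributes to neither count.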

comment:6 by Nicklas Nordborg, 2 months ago

In 7604:

References #1575: Impute genotypes from OncoArray data

Implemented the phasing step with ShapeIT. Since the program is not so good at multi-threading, we start one process per chromosome instead, limited to the number of assigned threads. The wrapper code is very similar to the code in the WGS variant calling. A full run takes ~1.5 hours to complete this step (using 8 threads).
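
The "one process per chromosome, capped at the thread count" pattern can be sketched as follows. This is a minimal illustration in Python rather than the actual wrapper script; `run_per_chromosome` and the `run_one` callback are hypothetical names, and a real worker would launch the external phasing program (for example with `subprocess.run`) and return its exit status.

```python
from concurrent.futures import ThreadPoolExecutor


def run_per_chromosome(chromosomes, run_one, max_parallel=8):
    """Call run_one(chrom) for each chromosome, at most max_parallel at a time.

    Returns a dict mapping chromosome name to the value run_one returned.
    """
    chromosomes = list(chromosomes)
    # The pool size is the cap: no more than max_parallel workers run at once,
    # and a new chromosome starts as soon as a worker becomes free.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = pool.map(run_one, chromosomes)  # results keep input order
        return dict(zip(chromosomes, results))
```

Collecting the per-chromosome return values makes it possible to fail the whole step if any single chromosome failed.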

Last edited 2 months ago by Nicklas Nordborg

comment:7 by Nicklas Nordborg, 8 weeks ago

In 7605:

References #1575: Impute genotypes from OncoArray data

Implemented the impute step with IMPUTE5, using the same procedure of starting multiple processes.

comment:8 by Nicklas Nordborg, 8 weeks ago

In 7606:

References #1575: Impute genotypes from OncoArray data

Merging results into a single VCF file. Importing number of imputed genotypes as an annotation.

comment:9 by Nicklas Nordborg, 8 weeks ago

In 7607:

References #1575: Impute genotypes from OncoArray data

Added auto-confirm implementation for the imputation.

comment:10 by Nicklas Nordborg, 7 weeks ago

In 7608:

References #1575: Impute genotypes from OncoArray data

Changed the reference panel to the phase3 panel that has been lifted over from the hg19 version.

comment:11 by Nicklas Nordborg, 5 weeks ago

In 7610:

References #1575: Impute genotypes from OncoArray data

Started to redesign the "Start imputation wizard" so that the user selects an item list containing all items that should be imputed together.

comment:12 by Nicklas Nordborg, 5 weeks ago

In 7611:

References #1575: Impute genotypes from OncoArray data

Updated the job submission so that it works with an item list. The analysis script still assumes a single sample so it will crash.

comment:13 by Nicklas Nordborg, 5 weeks ago

In 7612:

References #1575: Impute genotypes from OncoArray data

Removed the Genotype imputation pipeline list since it will not be used.

comment:14 by Nicklas Nordborg, 5 weeks ago

In 7613:

References #1575: Impute genotypes from OncoArray data

Updated the script so that it copies all VCF files and merges them to a single big VCF.

comment:15 by Nicklas Nordborg, 5 weeks ago

In 7614:

References #1575: Impute genotypes from OncoArray data

Changed our container to use Beagle instead of ShapeIT and Impute5. Installation is a lot easier since Beagle is available on bioconda.

comment:16 by Nicklas Nordborg, 5 weeks ago

In 7615:

References #1575: Impute genotypes from OncoArray data

Changed to Beagle in the script. Since Beagle works well with multi-threading we don't have to start processes in the background which makes the script a lot simpler.

comment:17 by Nicklas Nordborg, 5 weeks ago

In 7616:

References #1575: Impute genotypes from OncoArray data

Added two more steps to the script. The first step splits the imputed result files into a set of files for each sample. The second step concatenates the files that belong to a single sample so that the end result is one VCF with imputed data per sample.
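
The bookkeeping behind the concatenate step can be sketched like this. It assumes, purely for illustration, that the split step writes one file per sample and chromosome named `<sample>.<chrom>.vcf.gz` and that sample names contain no dots; the ticket states neither the naming scheme nor the chunking, so both are hypothetical.

```python
from collections import defaultdict

# Chromosome order used when concatenating (an assumption for this sketch).
CHROM_ORDER = [f"chr{i}" for i in range(1, 23)] + ["chrX"]


def group_by_sample(filenames):
    """Map sample name -> its per-chromosome files in chromosome order."""
    groups = defaultdict(list)
    for name in filenames:
        sample, chrom = name.split(".")[:2]
        # Store the chromosome rank so the files can be sorted numerically
        # (chr2 before chr10) instead of alphabetically.
        groups[sample].append((CHROM_ORDER.index(chrom), name))
    return {s: [n for _, n in sorted(parts)] for s, parts in groups.items()}
```

Each value in the returned dict is the ordered input list for one concatenation job, which yields one VCF per sample.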

comment:18 by Nicklas Nordborg, 5 weeks ago

In 7617:

References #1575: Impute genotypes from OncoArray data

Save result files back to the project archive.

comment:19 by Nicklas Nordborg, 5 weeks ago

In 7618:

References #1575: Impute genotypes from OncoArray data

Use multiple processes when concatenating result files per sample.

comment:20 by Nicklas Nordborg, 4 weeks ago

In 7619:

References #1575: Impute genotypes from OncoArray data

Fixes an issue with a missing header definition for the END tag that is used in the INFO field. This seems to be a bug in the Beagle version we are using (5.4 22Jul22.46e). The bug seems to be fixed in 5.4 01Mar24.d36 but this version is not yet available on Bioconda.
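
The kind of workaround described here can be sketched as below: add the INFO/END definition to the VCF header when it is absent. This is an illustration, not the changeset's code; the function name is hypothetical and the Description text follows the wording of the VCF specification, which may differ from what the fixed Beagle version writes.

```python
END_HEADER = (
    '##INFO=<ID=END,Number=1,Type=Integer,'
    'Description="End position of the variant described in this record">'
)


def ensure_end_header(lines):
    """Return VCF lines with an INFO/END definition added if it is missing."""
    lines = list(lines)
    if any(line.startswith("##INFO=<ID=END,") for line in lines):
        return lines  # already defined, nothing to do
    for i, line in enumerate(lines):
        if line.startswith("#CHROM"):
            # Meta-information lines must come before the #CHROM column header.
            return lines[:i] + [END_HEADER] + lines[i:]
    raise ValueError("no #CHROM header line found")
```

The function is idempotent, so it is safe to apply to files produced by a Beagle version that already writes the definition.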

comment:21 by Nicklas Nordborg, 4 weeks ago

In 7620:

References #1575: Impute genotypes from OncoArray data

Update the auto-confirmation handler so that all items in the imputed list are added to the PRS list.

comment:22 by Nicklas Nordborg, 4 weeks ago

In 7624:

References #1575: Impute genotypes from OncoArray data

The split step is now done with multiple parallel processes. The default is to use 8, but this can be changed by setting the SplitJobsLimit variable. The ConcatJobsLimit variable can be used to control the number of jobs in the concat step.
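
As an illustration of a limit that defaults to 8 but can be overridden, the sketch below reads the value from the environment. How the real script receives SplitJobsLimit and ConcatJobsLimit is not stated in this ticket, so the environment lookup and the `jobs_limit` helper are assumptions; the default for the concat step is also not given here.

```python
import os


def jobs_limit(name, default=8, env=os.environ):
    """Read a positive integer job limit from env, or fall back to default."""
    raw = env.get(name, "").strip()
    if not raw:
        return default  # variable unset or empty: use the default
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 1:
        raise ValueError(f"{name} must be at least 1, got {value}")
    return value
```

For example, `jobs_limit("SplitJobsLimit")` returns 8 unless the variable is set, and the result can be used as the size of a worker pool.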

comment:23 by Nicklas Nordborg, 4 weeks ago

In 7625:

References #1575: Impute genotypes from OncoArray data

We need to remove some temporary files since we will run out of disk space otherwise. We still need >1TB for the 1995 samples.

comment:24 by Nicklas Nordborg, 3 weeks ago

In 7645:

References #1575: Impute genotypes from OncoArray data

Changed the script so that the concat step saves the result files directly to the project archive. This avoids the disk space problem and is quicker since we can skip one file copy step.

comment:25 by Nicklas Nordborg, 3 weeks ago

In 7646:

References #1575: Impute genotypes from OncoArray data

Added an option to save the full imputed VCF files to the job folder.

comment:26 by Nicklas Nordborg, 10 days ago

In 7654:

References #1575: Impute genotypes from OncoArray data

We need the pipefail option or some piped commands may fail without detection.
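
Background on why this matters: without pipefail, a shell pipeline reports only the exit status of its last command, so a failure earlier in the pipe goes unnoticed. The sketch below reproduces that situation in Python with two stand-in commands (both hypothetical) and checks every stage explicitly, which is the behaviour pipefail gives the shell script.

```python
import subprocess
import sys


def run_pipeline(first_cmd, second_cmd):
    """Run first_cmd | second_cmd and return both exit codes."""
    first = subprocess.Popen(first_cmd, stdout=subprocess.PIPE)
    second = subprocess.Popen(second_cmd, stdin=first.stdout,
                              stdout=subprocess.DEVNULL)
    first.stdout.close()  # let `first` see a closed pipe if `second` exits early
    second_code = second.wait()
    first_code = first.wait()
    return first_code, second_code


# Hypothetical stand-ins: a producer that fails and a consumer that succeeds.
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
consumer = [sys.executable, "-c", "import sys; sys.stdin.read()"]

first_code, second_code = run_pipeline(failing, consumer)
# Without pipefail the shell would report only second_code (success here);
# with pipefail the pipeline's status is non-zero if any stage failed.
pipeline_ok = first_code == 0 and second_code == 0
```

Here the consumer exits with 0 even though the producer failed, so only the combined check flags the problem.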

comment:27 by Nicklas Nordborg, 26 hours ago

In 7673:

References #1575: Impute genotypes from OncoArray data

Documented default values for some parameters.
