Opened 6 years ago

Closed 6 years ago

#1001 closed enhancement (fixed)

Add HaplotypeCaller to Hisat align step

Reported by: Nicklas Nordborg Owned by: Nicklas Nordborg
Priority: major Milestone: Reggie v4.13
Component: net.sf.basedb.reggie Keywords:
Cc:

Description (last modified by Nicklas Nordborg)

In the Hisat alignment pipeline we should also run HaplotypeCaller to check genotypes of 51 known SNPs.

The script below has been used during the testing phase. We need to re-arrange so that some things go as options into reggie-config.xml.

We also need to check that error handling is working as expected.

snp.list and snp.vcf are two files containing information about the 51 SNPs we are interested in. We should investigate if we can do without the snp.list.
[UPDATE] No, we need it, since otherwise the HaplotypeCaller will check the entire BAM file which will take about one 1 hour instead of ~10 seconds.
[UPDATE 2] It seems like the -L parameter also accepts a VCF-file. Thus we can (re-)use the snp.vcf and there is no need for the snp.list file.

JAVA=/usr/bin/java
JAR=GenomeAnalysisTK.jar
REF=hg38.analysisSet.fa
DBSNP=snp.vcf
INTERVALS=snp.list
INPUT_BAM=$1
OUTPUT_VCF=$2
OPTIONS="-stand_call_conf 20 --filter_reads_with_N_cigar --genotyping_mode GENOTYPE_GIVEN_ALLELES --annotation AlleleBalance"

${JAVA} -jar ${JAR} -T HaplotypeCaller -R ${REF} 
  -L ${INTERVALS} --dbsnp ${DBSNP} --alleles ${DBSNP} ${OPTIONS} 
  -I ${INPUT_BAM} -o ${OUTPUT_VCF}

The VCF file that is produced should either be uploaded to BASE or we should extract enough information from it to be able to do pair-wise comparisons later on (but this can also be done the first time a sample is included to the pair-wise all-against-all comparison).

Some information should be extracted immediately and added as annotations on the AlignedSequences item:

  • GT_COUNT: Number of genotypes that we could get from the aligned data. If this value is too low we will not be able to pair-wise comparisons. Typically, we will get 45 or more from most data files. The main reason for getting a lower count is that there are not enough aligned pairs (excluding duplicates).
  • GT_HET_COUNT: Numer of genotypes that are heterozygous. An unexptected high count (>35) may indicate a contamination.

Change History (10)

comment:1 by Nicklas Nordborg, 6 years ago

Description: modified (diff)

comment:2 by Nicklas Nordborg, 6 years ago

Description: modified (diff)

comment:3 by Nicklas Nordborg, 6 years ago

Description: modified (diff)

comment:4 by Nicklas Nordborg, 6 years ago

(In [4618]) References #1001: Add HaplotypeCaller to Hisat align step

The VCF file should now be created. No data is parsed out of it yet.

comment:5 by Nicklas Nordborg, 6 years ago

(In [4619]) References #1001: Add HaplotypeCaller to Hisat align step

Added VCF MIME type and file subtype. The VCF file is also copied to the BASE files server after the job has finished.

comment:6 by Nicklas Nordborg, 6 years ago

(In [4620]) References #1001: Add HaplotypeCaller to Hisat align step

Added QC_GenoTypeCount and QC_GenoTypeHET annotation types.

The "Confirm Hisat alignment" wizard has been updated to display the genotype QC values. No warnings are issued due to low count or high HET values.

Added parser implementation for extracting information from VCF files. A popup dialog can be opened from the "Confirm Hisat alignment" wizard to display some data from the VCF.

comment:7 by Nicklas Nordborg, 6 years ago

(In [4621]) References #1001: Add HaplotypeCaller to Hisat align step

Add genotype QC information to the case summary and a link to the VCF statistics dialog.

comment:8 by Nicklas Nordborg, 6 years ago

(In [4622]) References #1001: Add HaplotypeCaller to Hisat align step

VCF statistics is now imported after the Hisat alignment job has ended.

comment:9 by Nicklas Nordborg, 6 years ago

Status: newassigned

comment:10 by Nicklas Nordborg, 6 years ago

Resolution: fixed
Status: assignedclosed
Note: See TracTickets for help on using tickets.