Opened 7 years ago

Last modified 7 years ago

#1001 closed enhancement

Add HaplotypeCaller to Hisat align step — at Initial Version

Reported by: Nicklas Nordborg Owned by: Nicklas Nordborg
Priority: major Milestone: Reggie v4.13
Component: net.sf.basedb.reggie Keywords:
Cc:

Description

In the Hisat alignment pipeline we should also run HaplotypeCaller to check genotypes of 51 known SNPs.

The script below has been used during the testing phase. We need to re-arrange so that some things go as options into reggie-config.xml.

We also need to check that error handling is working as expected.

snp.list and snp.vcf are two files containing information about the 51 SNPs we are interested in. We should investigate if we can do without the snp.list.

JAVA=/usr/bin/java
JAR=GenomeAnalysisTK.jar
REF=hg38.analysisSet.fa
DBSNP=snp.vcf
INTERVALS=snp.list
INPUT_BAM=$1
OUTPUT_VCF=$2
OPTIONS="-stand_call_conf 20 --filter_reads_with_N_cigar --genotyping_mode GENOTYPE_GIVEN_ALLELES --annotation AlleleBalance"

${JAVA} -jar ${JAR} -T HaplotypeCaller -R ${REF} 
  -L ${INTERVALS} --dbsnp ${DBSNP} --alleles ${DBSNP} ${OPTIONS} 
  -I ${INPUT_BAM} -o ${OUTPUT_VCF}

The VCF file that is produced should either be uploaded to BASE or we should extract enough information from it to be able to do pair-wise comparisons later on (but this can also be done the first time a sample is included to the pair-wise all-against-all comparison).

Some information should be extracted immediately and added as annotations on the AlignedSequences item:

  • GT_COUNT: Number of genotypes that we could get from the aligned data. If this value is too low we will not be able to pair-wise comparisons. Typically, we will get 45 or more from most data files. The main reason for getting a lower count is that there are not enough aligned pairs (excluding duplicates).
  • GT_HET_COUNT: Numer of genotypes that are heterozygous. An unexptected high count (>35) may indicate a contamination.

Change History (0)

Note: See TracTickets for help on using tickets.