Opened 6 years ago

Last modified 6 years ago

#1001 closed enhancement

Add HaplotypeCaller to Hisat align step — at Version 2

Reported by: Nicklas Nordborg Owned by: Nicklas Nordborg
Priority: major Milestone: Reggie v4.13
Component: net.sf.basedb.reggie Keywords:
Cc:

Description (last modified by Nicklas Nordborg)

In the Hisat alignment pipeline we should also run HaplotypeCaller to check genotypes of 51 known SNPs.

The script below has been used during the testing phase. We need to re-arrange so that some things go as options into reggie-config.xml.

We also need to check that error handling is working as expected.

snp.list and snp.vcf are two files containing information about the 51 SNPs we are interested in. We should investigate if we can do without the snp.list. [UPDATE] No, we need it, since otherwise the HaplotypeCaller will check the entire BAM file which will take about one 1 hour instead of ~10 seconds. [UPDATE 2] It seems like the -L parameter also accepts a VCF-file. Thus we can (re-)use the snp.vcf and there is no need for the snp.list file.

JAVA=/usr/bin/java
JAR=GenomeAnalysisTK.jar
REF=hg38.analysisSet.fa
DBSNP=snp.vcf
INTERVALS=snp.list
INPUT_BAM=$1
OUTPUT_VCF=$2
OPTIONS="-stand_call_conf 20 --filter_reads_with_N_cigar --genotyping_mode GENOTYPE_GIVEN_ALLELES --annotation AlleleBalance"

${JAVA} -jar ${JAR} -T HaplotypeCaller -R ${REF} 
  -L ${INTERVALS} --dbsnp ${DBSNP} --alleles ${DBSNP} ${OPTIONS} 
  -I ${INPUT_BAM} -o ${OUTPUT_VCF}

The VCF file that is produced should either be uploaded to BASE or we should extract enough information from it to be able to do pair-wise comparisons later on (but this can also be done the first time a sample is included to the pair-wise all-against-all comparison).

Some information should be extracted immediately and added as annotations on the AlignedSequences item:

  • GT_COUNT: Number of genotypes that we could get from the aligned data. If this value is too low we will not be able to pair-wise comparisons. Typically, we will get 45 or more from most data files. The main reason for getting a lower count is that there are not enough aligned pairs (excluding duplicates).
  • GT_HET_COUNT: Numer of genotypes that are heterozygous. An unexptected high count (>35) may indicate a contamination.

Change History (2)

comment:1 by Nicklas Nordborg, 6 years ago

Description: modified (diff)

comment:2 by Nicklas Nordborg, 6 years ago

Description: modified (diff)
Note: See TracTickets for help on using tickets.