id,summary,reporter,owner,description,type,status,priority,milestone,component,resolution,keywords,cc 1001,Add HaplotypeCaller to Hisat align step,Nicklas Nordborg,Nicklas Nordborg,"In the Hisat alignment pipeline we should also run !HaplotypeCaller to check genotypes of 51 known SNPs. The script below has been used during the testing phase. We need to re-arrange so that some things go as options into reggie-config.xml. We also need to check that error handling is working as expected. `snp.list` and `snp.vcf` are two files containing information about the 51 SNPs we are interested in. We should investigate if we can do without the `snp.list`. [[BR]]'''[UPDATE] No, we need it, since otherwise the !HaplotypeCaller will check the entire BAM file which will take about one 1 hour instead of ~10 seconds.''' [[BR]]'''[UPDATE 2] It seems like the `-L` parameter also accepts a VCF-file. Thus we can (re-)use the `snp.vcf` and there is no need for the `snp.list` file.''' {{{ JAVA=/usr/bin/java JAR=GenomeAnalysisTK.jar REF=hg38.analysisSet.fa DBSNP=snp.vcf INTERVALS=snp.list INPUT_BAM=$1 OUTPUT_VCF=$2 OPTIONS=""-stand_call_conf 20 --filter_reads_with_N_cigar --genotyping_mode GENOTYPE_GIVEN_ALLELES --annotation AlleleBalance"" ${JAVA} -jar ${JAR} -T HaplotypeCaller -R ${REF} -L ${INTERVALS} --dbsnp ${DBSNP} --alleles ${DBSNP} ${OPTIONS} -I ${INPUT_BAM} -o ${OUTPUT_VCF} }}} The VCF file that is produced should either be uploaded to BASE or we should extract enough information from it to be able to do pair-wise comparisons later on (but this can also be done the first time a sample is included to the pair-wise all-against-all comparison). Some information should be extracted immediately and added as annotations on the !AlignedSequences item: * GT_COUNT: Number of genotypes that we could get from the aligned data. If this value is too low we will not be able to pair-wise comparisons. Typically, we will get 45 or more from most data files. The main reason for getting a lower count is that there are not enough aligned pairs (excluding duplicates). * GT_HET_COUNT: Numer of genotypes that are heterozygous. An unexptected high count (>35) may indicate a contamination. ",enhancement,closed,major,Reggie v4.13,net.sf.basedb.reggie,fixed,,