GATK HaplotypeCaller

Docker Image: broadinstitute/gatk:4.5.0.0

GATK (Genome Analysis Toolkit) HaplotypeCaller is the Broad Institute’s flagship variant caller, part of the GATK4 suite. Unlike simple pileup-based callers, HaplotypeCaller performs local de novo assembly of haplotypes in active regions of the genome. This makes it particularly strong at detecting indels and complex variants that simpler callers miss.


Approach	Local assembly + Pair-HMM + Bayesian genotyping
Strengths	Indels, complex variants, multi-allelic sites
Parameters	25 across 9 categories
Compute	High — Pair-HMM is the bottleneck

How It Works

HaplotypeCaller operates in four main stages:

1. Active Region Determination

The caller scans the genome and identifies active regions: intervals that show evidence of variation. Regions with no such evidence are skipped entirely. The cutoff is `active_probability_threshold`, the minimum probability a site must have to be considered “active.”
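
As a rough illustration of the thresholding idea, the sketch below groups consecutive sites whose activity probability meets the threshold into intervals. The function name and the input list of per-site probabilities are hypothetical, and the real tool does more than this (for example, it also applies the region size limits described under Hyperparameters), so treat it as a simplified picture rather than GATK's implementation.

```python
def find_active_regions(activity_probs, threshold=0.002):
    """Return half-open (start, end) intervals of consecutive active sites."""
    regions = []
    start = None
    for pos, prob in enumerate(activity_probs):
        if prob >= threshold:
            if start is None:
                start = pos  # open a new region at the first active site
        elif start is not None:
            regions.append((start, pos))  # close the region at the first inactive site
            start = None
    if start is not None:
        regions.append((start, len(activity_probs)))
    return regions


# Toy per-site probabilities (made up for illustration).
probs = [0.0, 0.0001, 0.01, 0.5, 0.003, 0.0, 0.0, 0.2]
regions = find_active_regions(probs)
print(regions)
```

Lowering `threshold` makes more sites qualify, which widens or adds regions and increases the work done downstream.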

2. Local Assembly

Within each active region, HaplotypeCaller builds a De Bruijn-like assembly graph from the reads. Paths through this graph represent the candidate haplotypes (sequences of variants) supported by the read data. The assembler uses configurable k-mer sizes and pruning thresholds to balance sensitivity against noise.
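
To make the graph-building and pruning idea concrete, here is a minimal sketch that counts k-mer to k-mer edges across reads and drops edges with too little support. It is a toy model, not GATK's assembler: the function name is hypothetical, the reference sequence is not threaded through the graph, and pruning is applied per edge rather than with GATK's own logic.

```python
from collections import Counter


def build_pruned_edges(reads, k=3, min_pruning=2):
    """Count edges between consecutive k-mers and keep well-supported ones."""
    edge_counts = Counter()
    for read in reads:
        # Each pair of overlapping k-mers (offset by one base) forms an edge.
        for i in range(len(read) - k):
            edge_counts[(read[i:i + k], read[i + 1:i + k + 1])] += 1
    # Keep only edges seen in at least `min_pruning` read positions.
    return {edge: n for edge, n in edge_counts.items() if n >= min_pruning}


# Two reads agree; the third carries a single-read difference (likely noise).
reads = ["ACGTA", "ACGTA", "ACGCA"]
edges = build_pruned_edges(reads)
print(edges)
```

With `min_pruning=1` the single-read path through `CGC` survives, which illustrates the sensitivity versus noise trade-off described for `min_pruning`.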

3. Pairwise HMM (Pair-HMM)

Each read is realigned against every candidate haplotype using a Pair Hidden Markov Model. This computes the likelihood of observing each read given each haplotype, accounting for base quality scores, mapping quality, and gap penalties. This is the most computationally expensive step.
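
The following is a simplified forward-algorithm sketch of a Pair-HMM, meant only to show how a read likelihood given a haplotype can be computed from base qualities and gap penalties. Several choices are assumptions made for illustration: the gap-open penalty of 45, the uniform start position along the haplotype, and insertion emissions treated as probability 1. It works in plain probabilities rather than log space and ignores mapping quality.

```python
def phred_to_prob(q):
    """Convert a Phred-scaled value to a probability."""
    return 10 ** (-q / 10)


def pair_hmm_likelihood(read, quals, hap, gap_open_q=45, gap_ext_q=10):
    """Approximate P(read | haplotype) with a three-state forward pass."""
    n, m = len(read), len(hap)
    gap_open = phred_to_prob(gap_open_q)
    gap_ext = phred_to_prob(gap_ext_q)
    match_to_match = 1 - 2 * gap_open
    gap_to_match = 1 - gap_ext

    # match[i][j], ins[i][j], dele[i][j]: read prefix i aligned to hap prefix j.
    match = [[0.0] * (m + 1) for _ in range(n + 1)]
    ins = [[0.0] * (m + 1) for _ in range(n + 1)]
    dele = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        dele[0][j] = 1.0 / m  # the read may start anywhere on the haplotype

    for i in range(1, n + 1):
        err = phred_to_prob(quals[i - 1])
        for j in range(1, m + 1):
            emit = 1 - err if read[i - 1] == hap[j - 1] else err / 3
            match[i][j] = emit * (
                match_to_match * match[i - 1][j - 1]
                + gap_to_match * (ins[i - 1][j - 1] + dele[i - 1][j - 1])
            )
            ins[i][j] = gap_open * match[i - 1][j] + gap_ext * ins[i - 1][j]
            dele[i][j] = gap_open * match[i][j - 1] + gap_ext * dele[i][j - 1]

    # The read may end anywhere on the haplotype.
    return sum(match[n][j] + ins[n][j] for j in range(1, m + 1))


read, quals = "ACGT", [30, 30, 30, 30]
lik_ref = pair_hmm_likelihood(read, quals, "TTACGTTT")  # read matches exactly
lik_alt = pair_hmm_likelihood(read, quals, "TTACCTTT")  # one mismatch
print(lik_ref, lik_alt)
```

The nested loop over read and haplotype positions, repeated for every read and every candidate haplotype, is why this stage dominates runtime.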

4. Genotyping

Using the Pair-HMM likelihoods as input, the caller applies Bayesian genotyping with configurable priors (heterozygosity, indel heterozygosity) to determine the most likely genotype at each variant site.
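
The role of the prior can be sketched with a textbook diploid, biallelic model in which the priors for hom-ref, het, and hom-alt are 1 - 1.5θ, θ, and θ/2 for heterozygosity θ. This is an assumption chosen for clarity and is not claimed to be the exact model GATK4 uses; the function name and the example likelihoods are hypothetical.

```python
GENOTYPES = ["0/0", "0/1", "1/1"]


def genotype_posteriors(log10_likelihoods, heterozygosity=0.001):
    """Combine log10 genotype likelihoods with a simple heterozygosity prior."""
    priors = [1 - 1.5 * heterozygosity, heterozygosity, heterozygosity / 2]
    joint = [10 ** ll * p for ll, p in zip(log10_likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]


def call_genotype(log10_likelihoods, heterozygosity=0.001):
    """Return the genotype with the highest posterior probability."""
    post = genotype_posteriors(log10_likelihoods, heterozygosity)
    return GENOTYPES[post.index(max(post))]


# Strong evidence for a het: the likelihood overwhelms the small prior.
strong = call_genotype([-10.0, -1.0, -12.0])
# Weak evidence for a het: the prior keeps the call at hom-ref.
weak = call_genotype([-1.0, -0.5, -3.0])
print(strong, weak)
```

Raising `heterozygosity` shifts prior mass toward variant genotypes, so weaker evidence is enough to produce a variant call.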

Hyperparameters

Quality Filtering

Parameter	Range	Default	Description
`min_base_quality_score`	0-50	10	Minimum base quality to consider a base for calling. Bases below this are ignored entirely. Lower values increase sensitivity (more bases contribute) but may introduce noise from low-quality bases. Higher values improve specificity at the cost of losing some real signal.
`min_mapping_quality_score`	0-60	20	Minimum mapping quality for a read to be used. Reads with mapQ below this are discarded. mapQ 0 means the read maps equally well to multiple locations. Higher values ensure reads are confidently mapped, reducing false positives from mismapped reads. Lower values retain more data but risk noise in repetitive regions.
`base_quality_score_threshold`	0-50	18	Bases with quality below this threshold are “floored” to the minimum quality score. Acts as a soft filter — doesn’t remove bases but reduces their influence. Works in tandem with `min_base_quality_score`.

Calling Confidence

Parameter	Range	Default	Description
`standard_min_confidence_threshold_for_calling`	0-100	30.0	Minimum phred-scaled confidence to emit a variant call. A value of 30 means 99.9% confidence. Lower values increase sensitivity — more variants called, but also more false positives. Higher values increase precision.
`emit_ref_confidence`	NONE / GVCF / BP_RESOLUTION	NONE	Controls reference confidence output. `NONE` outputs only variant sites. `GVCF` emits reference blocks with confidence bands. `BP_RESOLUTION` emits every position.
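
For reference, the Phred scale maps a score Q to an error probability of 10^(-Q/10). The small helper below (a hypothetical name, not part of GATK) shows how the threshold values in the table translate to confidence levels.

```python
def phred_to_confidence(q):
    """Confidence implied by a Phred-scaled score: 1 - 10^(-Q/10)."""
    return 1 - 10 ** (-q / 10)


for q in (10, 20, 30):
    print(q, round(phred_to_confidence(q), 4))
```

Each additional 10 points on the Phred scale reduces the implied error probability by a factor of ten.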

PCR Error Model

Parameter	Values	Default	Description
`pcr_indel_model`	NONE / HOSTILE / AGGRESSIVE / CONSERVATIVE	CONSERVATIVE	Controls how aggressively the caller filters PCR-induced indel artifacts. `NONE` disables the model. `HOSTILE` is the most aggressive filter (assumes many indels are PCR artifacts). `CONSERVATIVE` applies light filtering. `AGGRESSIVE` is in between.

Assembly Graph

Parameter	Range	Default	Description
`min_pruning`	1-10	2	Minimum number of supporting reads for a path in the assembly graph to survive pruning. Lower values keep more paths, increasing sensitivity to low-frequency or low-coverage variants but also noise. Higher values require more evidence, improving precision.
`max_alternate_alleles`	1-20	6	Maximum number of alternate alleles to genotype at a single site. Higher values allow complex multi-allelic sites but increase compute. Lower values force the caller to pick the top alleles.
`min_dangling_branch_length`	1-20	4	Minimum length of a “dangling” branch (a path that doesn’t connect back to the reference) to attempt recovery via Smith-Waterman alignment. Lower values recover more potential variants from partial read evidence, improving sensitivity. Higher values reduce noise from short, unreliable branches.
`recover_all_dangling_branches`	true/false	false	When enabled, attempts to recover ALL dangling branches regardless of length. Increases sensitivity but may introduce false positives from noise in the assembly graph.
`max_num_haplotypes_in_population`	8-512	128	Maximum number of candidate haplotypes from the assembly graph to evaluate. Higher values explore more possibilities, potentially finding rare or complex variants, but increase compute time significantly. Lower values are faster but may miss variants in complex regions.
`adaptive_pruning_initial_error_rate`	0.0001-0.1	0.001	Starting error rate for the probabilistic adaptive pruning model. This controls how aggressively low-support paths are pruned. Higher values assume more sequencing errors, pruning more aggressively (fewer false positives, potentially fewer true variants). Lower values are more permissive.
`pruning_lod_threshold`	0.5-10.0	2.302585	Log-odds threshold for adaptive pruning. Paths with support below this threshold (relative to the error model) are pruned. The default of ~2.3 corresponds to ln(10), or 10:1 odds. Higher values prune more aggressively. Lower values keep more paths. This interacts with `adaptive_pruning_initial_error_rate`.
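
Because `pruning_lod_threshold` is a natural-log odds value, exponentiating it gives the odds ratio it represents. The snippet below is plain arithmetic to check the default and a few other points in the range; the helper name is hypothetical.

```python
import math


def lod_to_odds(lod):
    """Convert a natural-log odds threshold to an odds ratio."""
    return math.exp(lod)


default_odds = lod_to_odds(2.302585)
for lod in (0.5, 2.302585, 4.60517, 10.0):
    print(lod, round(lod_to_odds(lod), 2))
```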

Active Region Determination

Parameter	Range	Default	Description
`active_probability_threshold`	0.0001-0.05	0.002	Minimum probability for a locus to be considered “active” (worth investigating for variants). Acts as an early filter in the pipeline. Lower values examine more sites, increasing sensitivity at the cost of runtime. Higher values skip marginal sites.
`min_assembly_region_size`	1-300	50	Minimum size (in bp) of an assembly region. Regions smaller than this are extended. Smaller values allow finer-grained analysis. Larger values provide more context for assembly but may dilute signal in isolated variant sites.
`max_assembly_region_size`	100-1000	300	Maximum size of an assembly region. Regions larger than this are split. Larger values capture more context for complex variants but increase memory and compute. Smaller values may miss variants that span region boundaries.
`assembly_region_padding`	0-500	100	Extra context bases added around each assembly region. Provides flanking sequence for the assembler. Higher values give more context, which helps with variants near region boundaries. Lower values reduce data processed per region.
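
The three size parameters can be pictured as a small post-processing step on each candidate region: extend it if it is too small, split it if it is too large, then add padding on both sides. The sketch below follows that description only; the function name, the symmetric extension, and the fixed-size splitting are assumptions, and GATK's actual region logic may choose boundaries differently.

```python
def shape_regions(start, end, contig_len, min_size=50, max_size=300, padding=100):
    """Return (core_start, core_end, padded_start, padded_end) tuples, half-open."""
    size = end - start
    if size < min_size:
        # Extend a too-small region, roughly symmetrically, up to min_size.
        grow = min_size - size
        start = max(0, start - grow // 2)
        end = min(contig_len, start + min_size)

    shaped = []
    pos = start
    while pos < end:
        # Split a too-large region into chunks of at most max_size.
        chunk_end = min(pos + max_size, end)
        padded_start = max(0, pos - padding)
        padded_end = min(contig_len, chunk_end + padding)
        shaped.append((pos, chunk_end, padded_start, padded_end))
        pos = chunk_end
    return shaped


small = shape_regions(1000, 1010, contig_len=10_000)
large = shape_regions(0, 700, contig_len=10_000)
print(small)
print(large)
```

In this toy version the last chunk of a split can end up smaller than `min_size`; a production implementation would need to handle that case.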

Pair-HMM / Likelihood Computation

Parameter	Range	Default	Description
`pair_hmm_gap_continuation_penalty`	1-30	10	Flat penalty (Phred-scale) for extending a gap in the Pair-HMM alignment. Higher values penalize longer indels more, reducing sensitivity to long indels but improving specificity. Lower values make the model more tolerant of long gaps.
`phred_scaled_global_read_mismapping_rate`	10-60	45	Global assumed probability that any read is mismapped (Phred-scaled). A value of 45 corresponds to a probability of 10^-4.5, or about 1 in 32,000 reads assumed mismapped. Lower values make the caller more skeptical of read evidence, which is useful in repetitive regions. Higher values trust reads more.

Genotyping Priors

Parameter	Range	Default	Description
`heterozygosity`	0.0001-0.01	0.001	Prior probability of SNP heterozygosity. The default 0.001 matches average human SNP diversity (~1 het per 1000 bases). Higher values make the caller more willing to call heterozygous SNPs (increased sensitivity). Lower values make it more conservative.
`indel_heterozygosity`	0.00001-0.001	0.000125	Prior probability of indel heterozygosity. Similar to `heterozygosity` but for indels. Higher values increase indel sensitivity.
`sample_ploidy`	1-10	2	Assumed ploidy. For human germline calling on autosomes, this should be 2 (diploid). Changing this fundamentally alters the genotyping model.
`contamination_fraction_to_filter`	0.0-0.5	0.0	Estimated fraction of reads from contaminating DNA. When set above 0, the caller downsamples reads supporting alternate alleles proportionally. Useful if you suspect cross-sample contamination is inflating false positives.
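
One way to see why `sample_ploidy` and `max_alternate_alleles` affect compute is to count the possible unordered genotypes at a site: with ploidy P and n alleles (reference plus alternates), there are C(P + n - 1, P) of them. The helper below is a hypothetical illustration of that combinatorial formula, not GATK code.

```python
import math


def genotype_count(ploidy, num_alt_alleles):
    """Number of unordered genotypes for a given ploidy and alt allele count."""
    num_alleles = num_alt_alleles + 1  # include the reference allele
    return math.comb(ploidy + num_alleles - 1, ploidy)


diploid_biallelic = genotype_count(2, 1)
diploid_max_alts = genotype_count(2, 6)
tetraploid_max_alts = genotype_count(4, 6)
print(diploid_biallelic, diploid_max_alts, tetraploid_max_alts)
```

The count grows quickly with either parameter, which is why higher ploidy or more alternate alleles increases the genotyping workload.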

Downsampling

Parameter	Range	Default	Description
`max_reads_per_alignment_start`	0-1000	50	Maximum reads retained per alignment start position. 0 disables downsampling. At high coverage, many reads start at the same position. Downsampling reduces compute while retaining most information.
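
The effect of this cap can be sketched as grouping reads by start position and randomly keeping at most N per group. The read representation as `(start, name)` tuples, the function name, and the use of simple random sampling are assumptions for illustration; GATK's own downsampler may select reads differently.

```python
import random
from collections import defaultdict


def downsample_by_start(reads, max_per_start=50, seed=0):
    """Keep at most `max_per_start` reads per alignment start; 0 disables."""
    if max_per_start == 0:
        return list(reads)
    by_start = defaultdict(list)
    for read in reads:
        by_start[read[0]].append(read)  # read[0] is the alignment start
    rng = random.Random(seed)  # seeded for reproducible output
    kept = []
    for start in sorted(by_start):
        group = by_start[start]
        if len(group) > max_per_start:
            group = rng.sample(group, max_per_start)
        kept.extend(group)
    return kept


# 120 reads pile up at position 100; only 2 start at position 200.
reads = [(100, f"r{i}") for i in range(120)] + [(200, "a"), (200, "b")]
kept = downsample_by_start(reads, max_per_start=50)
print(len(kept))
```

Only over-represented start positions lose reads, so sparse positions keep all of their evidence.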

Read Filtering

Parameter	Values	Default	Description
`dont_use_soft_clipped_bases`	true/false	false	When true, soft-clipped portions of reads are ignored. Soft clips often contain misaligned or adapter sequence that can create false variant calls. Enabling may improve precision in regions where reads have significant soft clipping. Disabling (default) uses all bases, which can help detect variants near read boundaries.
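
As a usage sketch, the parameters above could be turned into a `gatk HaplotypeCaller` command line, for example to run inside the `broadinstitute/gatk:4.5.0.0` image. The mapping from this page's snake_case names to GATK flags (underscores to hyphens, with `min_mapping_quality_score` mapped to `--minimum-mapping-quality`) is an assumption that should be checked against `gatk HaplotypeCaller --help`, and the file names are placeholders. The code only builds and prints the command; it does not run it.

```python
import shlex

# Assumed exceptions to the underscore-to-hyphen naming rule.
FLAG_OVERRIDES = {"min_mapping_quality_score": "--minimum-mapping-quality"}


def build_haplotypecaller_args(reference, bam, output, params):
    """Build an argument list for `gatk HaplotypeCaller` from a parameter dict."""
    args = ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam, "-O", output]
    for name, value in params.items():
        flag = FLAG_OVERRIDES.get(name, "--" + name.replace("_", "-"))
        # Booleans are passed as lowercase "true"/"false" values.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        args += [flag, text]
    return args


params = {
    "min_pruning": 2,
    "min_mapping_quality_score": 20,
    "pcr_indel_model": "CONSERVATIVE",
    "dont_use_soft_clipped_bases": False,
}
args = build_haplotypecaller_args("ref.fasta", "sample.bam", "sample.vcf.gz", params)
print(shlex.join(args))
```

Keeping the parameters in a dictionary makes it straightforward to sweep one value at a time while holding the rest at their defaults.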
