Docker Image: broadinstitute/gatk:4.5.0.0
GATK (Genome Analysis Toolkit) HaplotypeCaller is the Broad Institute’s flagship variant caller, part of the GATK4 suite. Unlike simple pileup-based callers, HaplotypeCaller performs local de novo assembly of haplotypes in active regions of the genome. This makes it particularly strong at detecting indels and complex variants that simpler callers miss.
Approach: Local assembly + Pair-HMM + Bayesian genotyping
Strengths: Indels, complex variants, multi-allelic sites
Parameters: 25 across 9 categories
Compute: High (Pair-HMM is the bottleneck)

How It Works

HaplotypeCaller operates in four main stages:

1. Active Region Determination

The caller scans the genome and identifies active regions: intervals that show evidence of variation. Regions with no such evidence are skipped entirely. The active_probability_threshold parameter sets the minimum probability a site must reach to be considered “active.”
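As a rough illustration of the thresholding idea, the sketch below turns a per-site activity profile into contiguous candidate intervals. It is a toy model, not GATK's implementation: the function name and the example profile are hypothetical, and it ignores the region size and padding settings described under Active Region Determination.

```python
def find_active_regions(activity, threshold=0.002):
    """Return half-open (start, end) intervals of contiguous sites
    whose activity probability is at least `threshold`."""
    regions = []
    start = None
    for pos, prob in enumerate(activity):
        if prob >= threshold:
            if start is None:
                start = pos  # open a new region
        elif start is not None:
            regions.append((start, pos))  # close the current region
            start = None
    if start is not None:
        regions.append((start, len(activity)))
    return regions


# Hypothetical per-site activity probabilities.
profile = [0.0, 0.0001, 0.01, 0.3, 0.05, 0.0005, 0.0, 0.004, 0.0]
regions = find_active_regions(profile)  # [(2, 5), (7, 8)]
```

Lowering the threshold to 0.0001 widens the first interval to (1, 6), which mirrors the sensitivity versus runtime trade-off of the parameter.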

2. Local Assembly

Within each active region, HaplotypeCaller builds a De Bruijn-like assembly graph from the reads. This graph captures all possible haplotypes (sequences of variants) supported by the read data. The assembler uses configurable kmer sizes and pruning thresholds to balance sensitivity against noise.
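The graph-building and pruning idea can be sketched with a small k-mer graph. This is a simplified illustration under stated assumptions: the function names are hypothetical, edges are pruned one by one on raw read support, and the reference path is not treated specially, all of which differ from the real assembler.

```python
from collections import Counter


def build_kmer_graph(reads, k):
    """Count how many times each edge (k-mer -> next k-mer) is seen in the reads."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k):
            edges[(read[i:i + k], read[i + 1:i + k + 1])] += 1
    return edges


def prune(edges, min_support):
    """Keep only edges supported by at least `min_support` observations."""
    return {edge: count for edge, count in edges.items() if count >= min_support}


# Two reads agree; the third carries a single-base difference (T -> G).
reads = ["ACGTAC", "ACGTAC", "ACGGAC"]
graph = build_kmer_graph(reads, k=3)
pruned = prune(graph, min_support=2)
```

With `min_support=2` the three edges unique to the third read are dropped, which is the sensitivity versus noise trade-off that the pruning threshold controls.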

3. Pairwise HMM (Pair-HMM)

Each read is realigned against every candidate haplotype using a Pair Hidden Markov Model. This computes the likelihood of observing each read given each haplotype, accounting for base quality scores, gap penalties, and a global read mismapping rate. This is the most computationally expensive step.
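To make the likelihood computation concrete, here is a minimal forward-algorithm sketch for a three-state pair HMM (match, insertion, deletion). It is an illustration under assumptions, not GATK's optimized implementation: the function name is hypothetical, the gap-open penalty of 45 is an assumed value, the read may start at any haplotype position with equal probability, and everything is computed in plain probability space, so it only suits short sequences.

```python
def pair_hmm_likelihood(read, quals, hap, gap_open_phred=45, gap_ext_phred=10):
    """Approximate P(read | haplotype) with a simple pair-HMM forward pass."""
    n, m = len(read), len(hap)
    delta = 10 ** (-gap_open_phred / 10)  # probability of opening a gap
    eps = 10 ** (-gap_ext_phred / 10)     # probability of continuing a gap

    # match[i][j], ins[i][j], dele[i][j]: probability of emitting the first
    # i read bases with read base i aligned at (or just after) hap base j.
    match = [[0.0] * (m + 1) for _ in range(n + 1)]
    ins = [[0.0] * (m + 1) for _ in range(n + 1)]
    dele = [[0.0] * (m + 1) for _ in range(n + 1)]

    # Free start: the read may begin anywhere along the haplotype.
    for j in range(m + 1):
        dele[0][j] = 1.0 / m

    for i in range(1, n + 1):
        err = 10 ** (-quals[i - 1] / 10)  # base error probability from quality
        for j in range(1, m + 1):
            emit = 1 - err if read[i - 1] == hap[j - 1] else err / 3
            match[i][j] = emit * (
                (1 - 2 * delta) * match[i - 1][j - 1]
                + (1 - eps) * (ins[i - 1][j - 1] + dele[i - 1][j - 1])
            )
            ins[i][j] = delta * match[i - 1][j] + eps * ins[i - 1][j]
            dele[i][j] = delta * match[i][j - 1] + eps * dele[i][j - 1]

    # Free end: the read may finish at any haplotype position.
    return sum(match[n][j] + ins[n][j] for j in range(1, m + 1))


read = "ACGTACGT"
quals = [30] * len(read)
like_ref = pair_hmm_likelihood(read, quals, "TTACGTACGTTT")  # read matches exactly
like_alt = pair_hmm_likelihood(read, quals, "TTACGAACGTTT")  # one mismatch
```

The read fits the first haplotype far better than the second, and it is this per-read, per-haplotype likelihood table that feeds the genotyping stage. The double loop over read and haplotype positions, repeated for every read and haplotype pair, is why this stage dominates runtime.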

4. Genotyping

Using the Pair-HMM likelihoods as input, the caller applies Bayesian genotyping with configurable priors (heterozygosity, indel heterozygosity) to determine the most likely genotype at each variant site.
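The genotyping step can be illustrated for a single biallelic, diploid site. The sketch below is a simplification under stated assumptions: the function name is hypothetical, and the prior scheme (heterozygous = heterozygosity, homozygous-alt = half of that, homozygous-ref = the remainder) is a common textbook choice rather than GATK's exact model.

```python
import math


def genotype_posteriors(read_likelihoods, heterozygosity=0.001):
    """read_likelihoods: list of (P(read | ref), P(read | alt)) pairs.
    Returns posterior probabilities for genotypes 0/0, 0/1 and 1/1."""
    priors = {
        "0/0": 1 - 1.5 * heterozygosity,
        "0/1": heterozygosity,
        "1/1": heterozygosity / 2,
    }
    log_post = {}
    for genotype, prior in priors.items():
        alt_copies = genotype.count("1")
        total = math.log(prior)
        for p_ref, p_alt in read_likelihoods:
            # Each read is drawn from one of the two chromosome copies.
            total += math.log((2 - alt_copies) / 2 * p_ref + alt_copies / 2 * p_alt)
        log_post[genotype] = total

    # Normalize in log space to avoid underflow.
    peak = max(log_post.values())
    weights = {g: math.exp(v - peak) for g, v in log_post.items()}
    norm = sum(weights.values())
    return {g: w / norm for g, w in weights.items()}


# Ten reads favor the reference allele and ten favor the alternate allele.
reads = [(0.99, 0.001)] * 10 + [(0.001, 0.99)] * 10
posteriors = genotype_posteriors(reads)
best = max(posteriors, key=posteriors.get)  # "0/1"
```

Raising `heterozygosity` shifts prior mass toward the variant genotypes, which is why higher values make the caller more willing to call variants on weak evidence.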

Hyperparameters

Quality Filtering

| Parameter | Range | Default | Description |
| --- | --- | --- | --- |
| min_base_quality_score | 0-50 | 10 | Minimum base quality to consider a base for calling. Bases below this are ignored entirely. Lower values increase sensitivity (more bases contribute) but may introduce noise from low-quality bases. Higher values improve specificity at the cost of losing some real signal. |
| min_mapping_quality_score | 0-60 | 20 | Minimum mapping quality for a read to be used. Reads with mapQ below this are discarded. mapQ 0 means the read maps equally well to multiple locations. Higher values ensure reads are confidently mapped, reducing false positives from mismapped reads. Lower values retain more data but risk noise in repetitive regions. |
| base_quality_score_threshold | 0-50 | 18 | Bases with quality below this threshold are “floored” to the minimum quality score. Acts as a soft filter: it doesn’t remove bases but reduces their influence. Works in tandem with min_base_quality_score. |
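The difference between the hard cutoff and the soft floor can be sketched as two small helpers. Both function names are hypothetical, and the floor value of 6 is an assumption (the table only says "the minimum quality score"); the sketch also makes no claim about the order in which the tool applies the two settings.

```python
def drop_low_quality(bases, quals, min_base_quality=10):
    """Hard cutoff: discard bases whose quality is below the minimum."""
    return [(b, q) for b, q in zip(bases, quals) if q >= min_base_quality]


def floor_base_qualities(quals, threshold=18, floor=6):
    """Soft filter: keep every base, but reduce qualities below
    `threshold` to `floor` (assumed value) so they carry less weight."""
    return [q if q >= threshold else floor for q in quals]


bases = "ACGT"
quals = [35, 17, 20, 8]
kept = drop_low_quality(bases, quals)   # the Q8 base is removed
floored = floor_base_qualities(quals)   # [35, 6, 20, 6]
```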

Calling Confidence

| Parameter | Range | Default | Description |
| --- | --- | --- | --- |
| standard_min_confidence_threshold_for_calling | 0-100 | 30.0 | Minimum phred-scaled confidence to emit a variant call. A value of 30 means 99.9% confidence. Lower values increase sensitivity: more variants called, but also more false positives. Higher values increase precision. |
| emit_ref_confidence | NONE / GVCF / BP_RESOLUTION | NONE | Controls reference confidence output. NONE outputs only variant sites. GVCF emits reference blocks with confidence bands. BP_RESOLUTION emits every position. |
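The link between a phred-scaled threshold and a confidence level is a one-line conversion, shown here as a small sketch (the helper names are illustrative):

```python
import math


def phred_to_error(q):
    """Phred-scaled value -> probability that the call is wrong."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Probability of error -> Phred-scaled value."""
    return -10 * math.log10(p)


confidence = 1 - phred_to_error(30.0)  # 0.999, i.e. 99.9% confidence
```

Each step of 10 on the Phred scale cuts the error probability by a factor of ten, so a threshold of 20 corresponds to 99% confidence and 40 to 99.99%.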

PCR Error Model

| Parameter | Values | Default | Description |
| --- | --- | --- | --- |
| pcr_indel_model | NONE / HOSTILE / AGGRESSIVE / CONSERVATIVE | CONSERVATIVE | Controls how aggressively the caller filters PCR-induced indel artifacts. NONE disables the model. HOSTILE is the most aggressive filter (assumes many indels are PCR artifacts). CONSERVATIVE applies light filtering. AGGRESSIVE is in between. |

Assembly Graph

| Parameter | Range | Default | Description |
| --- | --- | --- | --- |
| min_pruning | 1-10 | 2 | Minimum number of supporting reads for a path in the assembly graph to survive pruning. Lower values keep more paths, increasing sensitivity to low-frequency or low-coverage variants but also noise. Higher values require more evidence, improving precision. |
| max_alternate_alleles | 1-20 | 6 | Maximum number of alternate alleles to genotype at a single site. Higher values allow complex multi-allelic sites but increase compute. Lower values force the caller to pick the top alleles. |
| min_dangling_branch_length | 1-20 | 4 | Minimum length of a “dangling” branch (a path that doesn’t connect back to the reference) to attempt recovery via Smith-Waterman alignment. Lower values recover more potential variants from partial read evidence, improving sensitivity. Higher values reduce noise from short, unreliable branches. |
| recover_all_dangling_branches | true/false | false | When enabled, attempts to recover all dangling branches, including those that fork after splitting from the reference, which are not recovered by default. Increases sensitivity but may introduce false positives from noise in the assembly graph. |
| max_num_haplotypes_in_population | 8-512 | 128 | Maximum number of candidate haplotypes from the assembly graph to evaluate. Higher values explore more possibilities, potentially finding rare or complex variants, but increase compute time significantly. Lower values are faster but may miss variants in complex regions. |
| adaptive_pruning_initial_error_rate | 0.0001-0.1 | 0.001 | Starting error rate for the probabilistic adaptive pruning model. This controls how aggressively low-support paths are pruned. Higher values assume more sequencing errors, pruning more aggressively (fewer false positives, potentially fewer true variants). Lower values are more permissive. |
| pruning_lod_threshold | 0.5-10.0 | 2.302585 | Log-odds threshold for adaptive pruning. Paths with support below this threshold (relative to the error model) are pruned. The default of ~2.3 corresponds to ln(10), or 10:1 odds. Higher values prune more aggressively. Lower values keep more paths. This interacts with adaptive_pruning_initial_error_rate. |
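The relationship between the pruning_lod_threshold value and an odds ratio is simple arithmetic, sketched below. The helper name is illustrative, and the sketch only converts units; it does not reproduce the adaptive pruning model itself.

```python
import math

DEFAULT_LOD = math.log(10)  # natural log of 10, about 2.302585


def lod_to_odds(lod):
    """Natural-log odds -> plain odds ratio (real path : sequencing error)."""
    return math.exp(lod)


# Odds implied by the bounds of the documented range and by the default.
odds_by_lod = {lod: lod_to_odds(lod) for lod in (0.5, DEFAULT_LOD, 10.0)}
```

At the lower bound of 0.5 a path needs odds of only about 1.6:1 to survive, while the upper bound of 10.0 demands roughly 22,000:1.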

Active Region Determination

| Parameter | Range | Default | Description |
| --- | --- | --- | --- |
| active_probability_threshold | 0.0001-0.05 | 0.002 | Minimum probability for a locus to be considered “active” (worth investigating for variants). Acts as an early filter in the pipeline. Lower values examine more sites, increasing sensitivity at the cost of runtime. Higher values skip marginal sites. |
| min_assembly_region_size | 1-300 | 50 | Minimum size (in bp) of an assembly region. Regions smaller than this are extended. Smaller values allow finer-grained analysis. Larger values provide more context for assembly but may dilute signal in isolated variant sites. |
| max_assembly_region_size | 100-1000 | 300 | Maximum size (in bp) of an assembly region. Regions larger than this are split. Larger values capture more context for complex variants but increase memory and compute. Smaller values may miss variants that span region boundaries. |
| assembly_region_padding | 0-500 | 100 | Extra context bases added around each assembly region. Provides flanking sequence for the assembler. Higher values give more context, which helps with variants near region boundaries. Lower values reduce data processed per region. |

Pair-HMM / Likelihood Computation

| Parameter | Range | Default | Description |
| --- | --- | --- | --- |
| pair_hmm_gap_continuation_penalty | 1-30 | 10 | Flat penalty (Phred-scale) for extending a gap in the Pair-HMM alignment. Higher values penalize longer indels more, reducing sensitivity to long indels but improving specificity. Lower values make the model more tolerant of long gaps. |
| phred_scaled_global_read_mismapping_rate | 10-60 | 45 | Global assumed probability that any read is mismapped (Phred-scaled). A value of 45 means roughly 1 in 31,600 reads is assumed mismapped (10^-4.5 ≈ 3.2 × 10^-5). Lower values make the caller more skeptical of read evidence, which is useful in repetitive regions. Higher values trust reads more. |
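One way a global mismapping rate can enter the likelihood model is as a cap on how strongly any single read may favor one haplotype over another. The sketch below assumes that interpretation; the function name is hypothetical and the example likelihoods are made up.

```python
def cap_read_likelihoods(log10_likelihoods, mismapping_phred=45):
    """Given one read's log10 likelihoods across haplotypes, raise every
    value to at least (best + cap), where cap = -mismapping_phred / 10."""
    cap = -mismapping_phred / 10
    best = max(log10_likelihoods)
    return [max(value, best + cap) for value in log10_likelihoods]


# One read scored against three haplotypes.
capped = cap_read_likelihoods([-2.0, -9.5, -5.0])          # [-2.0, -6.5, -5.0]
skeptical = cap_read_likelihoods([-2.0, -9.5, -5.0], 20)   # [-2.0, -4.0, -4.0]
```

With the lower setting of 20, the read distinguishes the haplotypes by at most two orders of magnitude, so it contributes less evidence. That is the sense in which lower values make the caller more skeptical.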

Genotyping Priors

| Parameter | Range | Default | Description |
| --- | --- | --- | --- |
| heterozygosity | 0.0001-0.01 | 0.001 | Prior probability of SNP heterozygosity. The default 0.001 matches average human SNP diversity (~1 het per 1000 bases). Higher values make the caller more willing to call heterozygous SNPs (increased sensitivity). Lower values make it more conservative. |
| indel_heterozygosity | 0.00001-0.001 | 0.000125 | Prior probability of indel heterozygosity. Similar to heterozygosity but for indels. Higher values increase indel sensitivity. |
| sample_ploidy | 1-10 | 2 | Assumed ploidy. For human germline calling on autosomes, this should be 2 (diploid). Changing this fundamentally alters the genotyping model. |
| contamination_fraction_to_filter | 0.0-0.5 | 0.0 | Estimated fraction of reads from contaminating DNA. When set above 0, the caller downsamples reads supporting alternate alleles proportionally. Useful if you suspect cross-sample contamination is inflating false positives. |

Downsampling

| Parameter | Range | Default | Description |
| --- | --- | --- | --- |
| max_reads_per_alignment_start | 0-1000 | 50 | Maximum reads retained per alignment start position. 0 disables downsampling. At high coverage, many reads start at the same position. Downsampling reduces compute while retaining most information. |
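Per-start downsampling can be sketched as follows. This is an illustrative simplification: the function name and the (start, name) read representation are hypothetical, and a seeded random sample stands in for whatever selection strategy the tool actually uses.

```python
import random
from collections import defaultdict


def downsample_by_start(reads, max_per_start=50, seed=0):
    """reads: list of (alignment_start, read_name) tuples.
    Keep at most `max_per_start` reads per start position; 0 disables."""
    if max_per_start == 0:
        return list(reads)

    by_start = defaultdict(list)
    for read in reads:
        by_start[read[0]].append(read)

    rng = random.Random(seed)  # seeded so the result is reproducible
    kept = []
    for start in sorted(by_start):
        group = by_start[start]
        if len(group) > max_per_start:
            group = rng.sample(group, max_per_start)
        kept.extend(group)
    return kept


# 80 reads pile up at position 100; a single read starts at 101.
reads = [(100, f"r{i}") for i in range(80)] + [(101, "x")]
kept = downsample_by_start(reads)  # 50 reads at 100, plus the read at 101
```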

Read Filtering

| Parameter | Values | Default | Description |
| --- | --- | --- | --- |
| dont_use_soft_clipped_bases | true/false | false | When true, soft-clipped portions of reads are ignored. Soft clips often contain misaligned or adapter sequence that can create false variant calls. Enabling may improve precision in regions where reads have significant soft clipping. Disabling (default) uses all bases, which can help detect variants near read boundaries. |
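The parameter names in the tables above look like snake_case forms of GATK command-line arguments. The sketch below builds a HaplotypeCaller command on that assumption, which this page does not confirm: it converts underscores to hyphens, maps min_mapping_quality_score to the read-filter argument `--minimum-mapping-quality`, and uses hypothetical file names. Check `gatk HaplotypeCaller --help` inside the image before relying on any flag.

```python
# Assumed exceptions to the simple underscore-to-hyphen rule.
FLAG_OVERRIDES = {"min_mapping_quality_score": "--minimum-mapping-quality"}


def to_flag(name):
    """Map a snake_case parameter name to an assumed GATK long flag."""
    return FLAG_OVERRIDES.get(name, "--" + name.replace("_", "-"))


def build_command(reference, bam, output, params):
    """Return an argument list suitable for subprocess.run()."""
    cmd = ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam, "-O", output]
    for name, value in params.items():
        if isinstance(value, bool):
            if value:  # boolean flags are emitted only when enabled
                cmd.append(to_flag(name))
        else:
            cmd += [to_flag(name), str(value)]
    return cmd


cmd = build_command(
    "ref.fasta", "sample.bam", "out.vcf.gz",  # hypothetical file names
    {
        "min_pruning": 3,
        "min_mapping_quality_score": 30,
        "dont_use_soft_clipped_bases": True,
        "recover_all_dangling_branches": False,
    },
)
print(" ".join(cmd))
```

Returning a list rather than a single string avoids shell-quoting problems when the command is later passed to `subprocess.run`.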