Docker Image: broadinstitute/gatk:4.5.0.0
GATK (Genome Analysis Toolkit) HaplotypeCaller is the Broad Institute’s flagship variant caller, part of the GATK4 suite. Unlike simple pileup-based callers, HaplotypeCaller performs local de novo assembly of haplotypes in active regions of the genome. This makes it particularly strong at detecting indels and complex variants that simpler callers miss.
| | |
|---|---|
| Approach | Local assembly + Pair-HMM + Bayesian genotyping |
| Strengths | Indels, complex variants, multi-allelic sites |
| Parameters | 25 across 9 categories |
| Compute | High; Pair-HMM is the bottleneck |
How It Works
HaplotypeCaller operates in four main stages:
1. Active Region Determination
The caller scans the genome and identifies active regions — intervals where there’s evidence of variation. Regions with no evidence of variants are skipped entirely. This is controlled by the active_probability_threshold — the minimum probability a site must have to be considered “active.”
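
As a rough illustration of the thresholding idea only, the sketch below marks sites whose activity probability meets active_probability_threshold and merges adjacent active sites into intervals. The function name active_regions and the input probabilities are hypothetical, and the sketch leaves out any smoothing, padding, or size limits the real tool applies.

```python
from typing import List, Tuple


def active_regions(probs: List[float], threshold: float = 0.002) -> List[Tuple[int, int]]:
    """Return half-open [start, end) runs of sites whose probability meets the threshold."""
    regions: List[Tuple[int, int]] = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i  # open a new active run
        elif start is not None:
            regions.append((start, i))  # close the run at the first inactive site
            start = None
    if start is not None:
        regions.append((start, len(probs)))
    return regions


# Hypothetical per-site activity probabilities
probs = [0.0, 0.0001, 0.5, 0.9, 0.001, 0.0, 0.3]
print(active_regions(probs))  # [(2, 4), (6, 7)]
```
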
2. Local Assembly
Within each active region, HaplotypeCaller builds a De Bruijn-like assembly graph from the reads. This graph captures all possible haplotypes (sequences of variants) supported by the read data. The assembler uses configurable kmer sizes and pruning thresholds to balance sensitivity against noise.
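
To make the kmer and pruning idea concrete, here is a toy sketch that counts edges between consecutive kmers and drops edges with too little read support. It is a simplification under stated assumptions: the helper names build_graph and prune are hypothetical, and the real assembler prunes and threads paths in a more involved way than this per-edge count.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]


def build_graph(reads: Iterable[str], k: int) -> Counter:
    """Count how many times each (kmer -> next kmer) edge is seen across reads."""
    edges: Counter = Counter()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for a, b in zip(kmers, kmers[1:]):
            edges[(a, b)] += 1
    return edges


def prune(edges: Counter, min_pruning: int) -> Dict[Edge, int]:
    """Keep only edges supported by at least min_pruning observations."""
    return {edge: n for edge, n in edges.items() if n >= min_pruning}


# Two reads agree; one read carries a single-base difference (T -> G)
reads = ["ACGTAC", "ACGTAC", "ACGGAC"]
graph = build_graph(reads, k=3)
kept = prune(graph, min_pruning=2)
print(len(graph), len(kept))  # 6 3
```

With min_pruning=2 the singly supported branch disappears; with min_pruning=1 it would survive as an alternative path, which is the sensitivity versus noise trade-off described above.
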
3. Pairwise HMM (Pair-HMM)
Each read is realigned against every candidate haplotype using a Pair Hidden Markov Model. This computes the likelihood of observing each read given each haplotype, accounting for base quality scores, mapping quality, and gap penalties. This is the most computationally expensive step.
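
The following is a minimal textbook-style Pair-HMM forward pass, included only to show the shape of the computation. It is not GATK's implementation: the function names, the gap-open default of Phred 45, the uniform start over haplotype positions, and the use of plain probabilities (rather than log or scaled arithmetic) are all assumptions made for this sketch. The gap_cont_q argument plays the role of a gap continuation penalty.

```python
from typing import Sequence


def phred_to_prob(q: float) -> float:
    """Convert a Phred-scaled value to a probability."""
    return 10 ** (-q / 10)


def pair_hmm_likelihood(read: str, quals: Sequence[int], hap: str,
                        gap_open_q: float = 45.0, gap_cont_q: float = 10.0) -> float:
    """Forward-algorithm likelihood of a read given a haplotype (toy model)."""
    n, m = len(read), len(hap)
    delta = phred_to_prob(gap_open_q)  # assumed gap-open probability
    eps = phred_to_prob(gap_cont_q)    # gap-continuation probability
    match = [[0.0] * (m + 1) for _ in range(n + 1)]
    ins = [[0.0] * (m + 1) for _ in range(n + 1)]
    dele = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        dele[0][j] = 1.0 / m  # the read may start anywhere along the haplotype
    for i in range(1, n + 1):
        err = phred_to_prob(quals[i - 1])
        for j in range(1, m + 1):
            # Emission: base quality sets the chance of a mismatch
            emit = 1.0 - err if read[i - 1] == hap[j - 1] else err / 3.0
            match[i][j] = emit * (
                (1 - 2 * delta) * match[i - 1][j - 1]
                + (1 - eps) * (ins[i - 1][j - 1] + dele[i - 1][j - 1])
            )
            ins[i][j] = delta * match[i - 1][j] + eps * ins[i - 1][j]
            dele[i][j] = delta * match[i][j - 1] + eps * dele[i][j - 1]
    # The read may end anywhere along the haplotype
    return sum(match[n][j] + ins[n][j] for j in range(1, m + 1))


read, quals = "ACGT", [30, 30, 30, 30]
same = pair_hmm_likelihood(read, quals, "TTACGTTT")  # haplotype contains the read
diff = pair_hmm_likelihood(read, quals, "TTACCTTT")  # haplotype differs by one base
print(same > diff)  # True
```

The nested loop over read and haplotype positions, repeated for every read and haplotype pair, is why this stage dominates runtime.
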
4. Genotyping
Using the Pair-HMM likelihoods as input, the caller applies Bayesian genotyping with configurable priors (heterozygosity, indel heterozygosity) to determine the most likely genotype at each variant site.
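
A simplified sketch of the Bayesian step for a diploid, biallelic site is shown below. The prior used here (heterozygous = heterozygosity, homozygous alternate = heterozygosity / 2, homozygous reference = the remainder) is a common textbook simplification and an assumption of this example, as are the function name and the made-up read likelihoods; the real genotyper handles multiple alleles, ploidies, and samples.

```python
from typing import List, Sequence, Tuple


def genotype_posteriors(read_liks: Sequence[Tuple[float, float]],
                        heterozygosity: float = 0.001) -> List[float]:
    """Posterior for genotypes [hom-ref, het, hom-alt] from per-read (P(read|ref), P(read|alt))."""
    priors = [1 - 1.5 * heterozygosity, heterozygosity, heterozygosity / 2]
    unnormalized = []
    for alt_copies, prior in zip((0, 1, 2), priors):
        alt_fraction = alt_copies / 2
        lik = 1.0
        for l_ref, l_alt in read_liks:
            # Each read comes from one of the two chromosomes with equal chance
            lik *= (1 - alt_fraction) * l_ref + alt_fraction * l_alt
        unnormalized.append(prior * lik)
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical likelihoods: 5 reads favour ref, 5 reads favour alt
reads = [(0.99, 0.01)] * 5 + [(0.01, 0.99)] * 5
posteriors = genotype_posteriors(reads)
print(posteriors.index(max(posteriors)))  # 1 (heterozygous)
```

Raising heterozygosity shifts prior mass toward the variant genotypes, which is why higher values make the caller more willing to call a variant when the read evidence is borderline.
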
Hyperparameters
Quality Filtering
| Parameter | Range | Default | Description |
|---|---|---|---|
| min_base_quality_score | 0-50 | 10 | Minimum base quality to consider a base for calling. Bases below this are ignored entirely. Lower values increase sensitivity (more bases contribute) but may introduce noise from low-quality bases. Higher values improve specificity at the cost of losing some real signal. |
| min_mapping_quality_score | 0-60 | 20 | Minimum mapping quality for a read to be used. Reads with mapQ below this are discarded. mapQ 0 means the read maps equally well to multiple locations. Higher values ensure reads are confidently mapped, reducing false positives from mismapped reads. Lower values retain more data but risk noise in repetitive regions. |
| base_quality_score_threshold | 0-50 | 18 | Bases with quality below this threshold are “floored” to the minimum usable quality score (6 in GATK). Acts as a soft filter: it doesn’t remove bases but reduces their influence. Works in tandem with min_base_quality_score. |
Calling Confidence
| Parameter | Range | Default | Description |
|---|---|---|---|
| standard_min_confidence_threshold_for_calling | 0-100 | 30.0 | Minimum Phred-scaled confidence to emit a variant call. A value of 30 means 99.9% confidence (a 1 in 1,000 chance that the call is wrong). Lower values increase sensitivity: more variants are called, but also more false positives. Higher values increase precision. |
| emit_ref_confidence | NONE / GVCF / BP_RESOLUTION | NONE | Controls reference confidence output. NONE outputs only variant sites. GVCF emits reference blocks with confidence bands. BP_RESOLUTION emits every position. |
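
The Phred arithmetic behind the confidence threshold can be checked with a few lines. The helper names below are illustrative, not part of any tool.

```python
import math


def phred_to_error(q: float) -> float:
    """Probability of error for a Phred-scaled score q."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred-scaled score for an error probability p."""
    return -10 * math.log10(p)


for q in (10, 20, 30):
    confidence = 1 - phred_to_error(q)
    print(q, f"{confidence:.1%}")  # 10 90.0%, 20 99.0%, 30 99.9%
```
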
PCR Error Model
| Parameter | Values | Default | Description |
|---|---|---|---|
| pcr_indel_model | NONE / HOSTILE / AGGRESSIVE / CONSERVATIVE | CONSERVATIVE | Controls how aggressively the caller filters PCR-induced indel artifacts. NONE disables the model. HOSTILE is the most aggressive filter (assumes many indels are PCR artifacts). CONSERVATIVE applies light filtering. AGGRESSIVE is in between. |
Assembly Graph
| Parameter | Range | Default | Description |
|---|---|---|---|
| min_pruning | 1-10 | 2 | Minimum number of supporting reads for a path in the assembly graph to survive pruning. Lower values keep more paths, increasing sensitivity to low-frequency or low-coverage variants but also noise. Higher values require more evidence, improving precision. |
| max_alternate_alleles | 1-20 | 6 | Maximum number of alternate alleles to genotype at a single site. Higher values allow complex multi-allelic sites but increase compute. Lower values force the caller to pick the top alleles. |
| min_dangling_branch_length | 1-20 | 4 | Minimum length of a “dangling” branch (a path that doesn’t connect back to the reference) to attempt recovery via Smith-Waterman alignment. Lower values recover more potential variants from partial read evidence, improving sensitivity. Higher values reduce noise from short, unreliable branches. |
| recover_all_dangling_branches | true/false | false | When enabled, attempts to recover ALL dangling branches, including those that fork after splitting from the reference, which the assembler skips by default. Increases sensitivity but may introduce false positives from noise in the assembly graph. |
| max_num_haplotypes_in_population | 8-512 | 128 | Maximum number of candidate haplotypes from the assembly graph to evaluate. Higher values explore more possibilities, potentially finding rare or complex variants, but increase compute time significantly. Lower values are faster but may miss variants in complex regions. |
| adaptive_pruning_initial_error_rate | 0.0001-0.1 | 0.001 | Starting error rate for the probabilistic adaptive pruning model. This controls how aggressively low-support paths are pruned. Higher values assume more sequencing errors, pruning more aggressively (fewer false positives, potentially fewer true variants). Lower values are more permissive. |
| pruning_lod_threshold | 0.5-10.0 | 2.302585 | Log-odds threshold for adaptive pruning. Paths with support below this threshold (relative to the error model) are pruned. The default of ~2.3 corresponds to ln(10), or 10:1 odds. Higher values prune more aggressively. Lower values keep more paths. This interacts with adaptive_pruning_initial_error_rate. |
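
The relationship between the pruning_lod_threshold default and 10:1 odds is plain arithmetic on natural logarithms, as the short check below shows. The helper name lod_to_odds is illustrative only.

```python
import math


def lod_to_odds(lod: float) -> float:
    """Convert a natural-log odds value into an odds ratio."""
    return math.exp(lod)


print(round(math.log(10), 6))          # 2.302585
print(round(lod_to_odds(2.302585), 3))  # 10.0
```
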
Active Region Determination
| Parameter | Range | Default | Description |
|---|---|---|---|
| active_probability_threshold | 0.0001-0.05 | 0.002 | Minimum probability for a locus to be considered “active” (worth investigating for variants). Acts as an early filter in the pipeline. Lower values examine more sites, increasing sensitivity at the cost of runtime. Higher values skip marginal sites. |
| min_assembly_region_size | 1-300 | 50 | Minimum size (in bp) of an assembly region. Regions smaller than this are extended. Smaller values allow finer-grained analysis. Larger values provide more context for assembly but may dilute signal in isolated variant sites. |
| max_assembly_region_size | 100-1000 | 300 | Maximum size (in bp) of an assembly region. Regions larger than this are split. Larger values capture more context for complex variants but increase memory and compute. Smaller values may miss variants that span region boundaries. |
| assembly_region_padding | 0-500 | 100 | Extra context bases added around each assembly region. Provides flanking sequence for the assembler. Higher values give more context, which helps with variants near region boundaries. Lower values reduce data processed per region. |
Pair-HMM / Likelihood Computation
| Parameter | Range | Default | Description |
|---|---|---|---|
| pair_hmm_gap_continuation_penalty | 1-30 | 10 | Flat penalty (Phred-scaled) for extending a gap in the Pair-HMM alignment. Higher values penalize longer indels more, reducing sensitivity to long indels but improving specificity. Lower values make the model more tolerant of long gaps. |
| phred_scaled_global_read_mismapping_rate | 10-60 | 45 | Global assumed probability that any read is mismapped (Phred-scaled). A value of 45 corresponds to a probability of 10^-4.5, or roughly 1 in 31,600 reads assumed mismapped. Lower values make the caller more skeptical of read evidence, which is useful in repetitive regions. Higher values trust reads more. |
Genotyping Priors
| Parameter | Range | Default | Description |
|---|---|---|---|
| heterozygosity | 0.0001-0.01 | 0.001 | Prior probability of SNP heterozygosity. The default 0.001 matches average human SNP diversity (~1 het per 1000 bases). Higher values make the caller more willing to call heterozygous SNPs (increased sensitivity). Lower values make it more conservative. |
| indel_heterozygosity | 0.00001-0.001 | 0.000125 | Prior probability of indel heterozygosity. Similar to heterozygosity but for indels. Higher values increase indel sensitivity. |
| sample_ploidy | 1-10 | 2 | Assumed ploidy. For human germline calling on autosomes, this should be 2 (diploid). Changing this fundamentally alters the genotyping model. |
| contamination_fraction_to_filter | 0.0-0.5 | 0.0 | Estimated fraction of reads from contaminating DNA. When set above 0, the caller downsamples reads supporting alternate alleles proportionally. Useful if you suspect cross-sample contamination is inflating false positives. |
Downsampling
| Parameter | Range | Default | Description |
|---|---|---|---|
| max_reads_per_alignment_start | 0-1000 | 50 | Maximum reads retained per alignment start position. 0 disables downsampling. At high coverage, many reads start at the same position. Downsampling reduces compute while retaining most information. |
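
The effect of this cap can be pictured with a small sketch that groups reads by start position and randomly keeps at most a fixed number per group. This is an illustration under assumptions: the function name, the (start, name) tuple representation, and the use of random.sample are choices made for this example, not a description of the tool's internal downsampler.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Read = Tuple[int, str]  # (alignment start, read name); hypothetical representation


def downsample_by_start(reads: Sequence[Read], max_per_start: int = 50,
                        seed: int = 0) -> List[Read]:
    """Keep at most max_per_start reads per alignment start; 0 disables the cap."""
    if max_per_start == 0:
        return list(reads)
    rng = random.Random(seed)  # fixed seed so the toy example is repeatable
    by_start: Dict[int, List[Read]] = defaultdict(list)
    for read in reads:
        by_start[read[0]].append(read)
    kept: List[Read] = []
    for start in sorted(by_start):
        group = by_start[start]
        if len(group) > max_per_start:
            group = rng.sample(group, max_per_start)
        kept.extend(group)
    return kept


reads = [(100, f"r{i}") for i in range(120)] + [(200, f"s{i}") for i in range(10)]
print(len(downsample_by_start(reads)))                   # 60
print(len(downsample_by_start(reads, max_per_start=0)))  # 130
```
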
Read Filtering
| Parameter | Values | Default | Description |
|---|---|---|---|
| dont_use_soft_clipped_bases | true/false | false | When true, soft-clipped portions of reads are ignored. Soft clips often contain misaligned or adapter sequence that can create false variant calls. Enabling may improve precision in regions where reads have significant soft clipping. Disabling (default) uses all bases, which can help detect variants near read boundaries. |
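
For readers who want to try these settings directly, the sketch below assembles a gatk HaplotypeCaller argument list from the parameter names used on this page. The mapping is an assumption of this example: it converts each snake_case name to a dashed long flag and special-cases min_mapping_quality_score as --minimum-mapping-quality, so check the result against gatk HaplotypeCaller --help for your GATK version. The file names and the helper names to_flag and build_command are hypothetical.

```python
from typing import Dict, List, Union

Value = Union[int, float, str, bool]

# Assumed exception to the simple snake_case -> dashed-flag rule
FLAG_OVERRIDES = {"min_mapping_quality_score": "--minimum-mapping-quality"}


def to_flag(name: str) -> str:
    """Map a parameter name from this page to an assumed GATK long flag."""
    return FLAG_OVERRIDES.get(name, "--" + name.replace("_", "-"))


def build_command(reference: str, bam: str, out_vcf: str,
                  params: Dict[str, Value]) -> List[str]:
    """Build an argument list suitable for subprocess.run (not executed here)."""
    cmd = ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam, "-O", out_vcf]
    for name, value in params.items():
        flag = to_flag(name)
        if isinstance(value, bool):
            if value:
                cmd.append(flag)  # boolean options are passed as bare flags
        else:
            cmd += [flag, str(value)]
    return cmd


cmd = build_command(
    "ref.fasta", "sample.bam", "sample.vcf.gz",  # hypothetical paths
    {"min_pruning": 3, "pcr_indel_model": "NONE", "dont_use_soft_clipped_bases": True},
)
print(" ".join(cmd))
```

Building a list rather than a single string keeps each argument separate, which avoids shell quoting problems if the command is later run inside the broadinstitute/gatk container.
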