Docker Image: staphb/freebayes:1.3.7
FreeBayes is a Bayesian haplotype-based variant caller developed by Erik Garrison. FreeBayes uses a direct haplotype evaluation method — it considers all possible alleles at a locus simultaneously and computes Bayesian posterior probabilities for each genotype. It was designed to be fast, simple, and capable of calling SNPs, indels, MNPs (multi-nucleotide polymorphisms), and complex events in a single pass.
FreeBayes is notable for its rich set of Bayesian priors (population genetics, Hardy-Weinberg, allele balance) that can be individually toggled. This makes it the tool with the most “knobs” for controlling the statistical model.
| | |
|---|---|
| Approach | Bayesian haplotype evaluation with toggleable priors |
| Strengths | Fine-grained prior control, MNPs, complex events |
| Parameters | 22 across 9 categories |
| Compute | Medium |
How It Works
1. Candidate Identification
FreeBayes scans the BAM file in the target region and identifies positions where reads differ from the reference. It applies coverage and allele fraction filters to determine which sites are worth evaluating.
2. Haplotype Construction
At each candidate site, FreeBayes considers all observed alleles and constructs candidate haplotypes within a configurable window (max_complex_gap). Unlike GATK HaplotypeCaller, it does not perform full local assembly; instead, it directly evaluates the alleles observed in reads.
3. Bayesian Genotyping
For each candidate site, FreeBayes calculates:
- Data likelihood: Probability of observing the reads given each possible genotype, using base quality and, when use_mapping_quality is enabled, mapping quality
- Prior probability: Based on the Ewens Sampling Formula (controlled by theta), Hardy-Weinberg Equilibrium, allele balance expectations, and a binomial observation model
- Posterior probability: Combined likelihood × prior, used to call the genotype
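
The likelihood × prior step above can be illustrated with a deliberately simplified sketch for one diploid, biallelic site. This is not FreeBayes' actual model: the function name, the observation encoding, and the error model (a base is simply wrong with probability 10^(-Q/10)) are assumptions made for illustration.

```python
import math


def genotype_posteriors(observations, priors):
    """Toy posterior over diploid genotypes at a biallelic site.

    observations: list of (is_alt, phred_base_quality) tuples, qualities > 0.
    priors: dict mapping alt-allele copy number (0, 1, 2) to a prior > 0.
    Assumption: each base is wrong with probability 10 ** (-Q / 10).
    """
    log_post = {}
    for alt_copies, prior in priors.items():
        alt_fraction = alt_copies / 2
        log_like = 0.0
        for is_alt, qual in observations:
            err = 10 ** (-qual / 10)
            # Probability that this read shows the alt allele under the genotype.
            p_alt = alt_fraction * (1 - err) + (1 - alt_fraction) * err
            log_like += math.log(p_alt if is_alt else 1 - p_alt)
        log_post[alt_copies] = log_like + math.log(prior)

    # Normalise in log space to avoid underflow at high depth.
    peak = max(log_post.values())
    unnorm = {g: math.exp(v - peak) for g, v in log_post.items()}
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}


flat = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
reads = [(False, 30)] * 5 + [(True, 30)] * 5
posteriors = genotype_posteriors(reads, flat)
best = max(posteriors, key=posteriors.get)  # 1, i.e. heterozygous
```

Swapping the flat prior for a population-informed one shifts the posterior without touching the likelihood term.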
4. Output
Sites passing the posterior probability threshold (pvar) and quality filters are emitted as variant calls.
Hyperparameters
Quality Filtering
| Parameter | Range | Default | Description |
|---|---|---|---|
| min_mapping_quality | 0-60 | 1 | Minimum mapping quality for a read to be used. The default of 1 is permissive: it keeps almost everything except reads that map equally well everywhere (mapQ=0). Higher values filter ambiguously mapped reads and can reduce false positives in repetitive regions. |
| min_base_quality | 0-50 | 1 | Minimum base quality for an allele observation. The default of 1 is extremely permissive. Higher values filter low-quality bases that contribute noise. |
| base_quality_cap | 0-60 | 0 | Cap all base qualities at this value. 0 means disabled (no cap). Some instruments report overly optimistic quality scores. Setting a cap prevents any single base from having outsized influence. |
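
The three quality parameters can be pictured as a filter followed by a clamp. The sketch below is an illustration only: the function name and the `(mapping_quality, base_quality)` encoding are hypothetical, and the order in which FreeBayes applies the filter and the cap is an assumption.

```python
def usable_base_qualities(observations, min_mapping_quality=1,
                          min_base_quality=1, base_quality_cap=0):
    """Return the base qualities that survive filtering, capped if requested.

    observations: list of (mapping_quality, base_quality) tuples.
    A base_quality_cap of 0 means no cap, mirroring the table above.
    """
    kept = []
    for mapq, baseq in observations:
        if mapq < min_mapping_quality or baseq < min_base_quality:
            continue  # drop ambiguously mapped reads and low-quality bases
        if base_quality_cap > 0:
            baseq = min(baseq, base_quality_cap)
        kept.append(baseq)
    return kept


pileup = [(0, 35), (60, 40), (60, 0), (20, 12)]
capped = usable_base_qualities(pileup, base_quality_cap=30)  # [30, 12]
```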
Allele Detection Thresholds
| Parameter | Range | Default | Description |
|---|---|---|---|
| min_alternate_fraction | 0.0-1.0 | 0.05 | Minimum fraction of reads at a site supporting the alternate allele. Lower values increase sensitivity to mosaic or low-VAF variants. Higher values reduce noise. This is a critical sensitivity control. |
| min_alternate_count | 1-100 | 2 | Minimum absolute number of reads supporting the alternate allele. Acts as the hard floor for alt support. Lower values maximize sensitivity. Higher values provide stronger evidence requirements. |
| min_alternate_qsum | 0-10000 | 0 | Minimum sum of base qualities across all reads supporting the alternate allele. Acts as a quality-weighted version of min_alternate_count. Provides a smoother filter than a hard count threshold. |
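
A minimal single-sample sketch of how these three thresholds combine, assuming all of them must be satisfied for a site to be evaluated. The function name and inputs are hypothetical and do not reflect FreeBayes internals.

```python
def passes_alt_thresholds(alt_base_qualities, total_depth,
                          min_alternate_fraction=0.05,
                          min_alternate_count=2,
                          min_alternate_qsum=0):
    """Check alt-allele support at one site in one sample.

    alt_base_qualities: base qualities of the reads supporting the alt allele.
    total_depth: total number of reads covering the site.
    """
    if total_depth == 0:
        return False
    alt_count = len(alt_base_qualities)
    return (
        alt_count >= min_alternate_count
        and alt_count / total_depth >= min_alternate_fraction
        and sum(alt_base_qualities) >= min_alternate_qsum
    )


# Two alt reads out of 100 meet the count floor but not the 5% fraction.
low_vaf = passes_alt_thresholds([30, 30], total_depth=100)  # False
```

With min_alternate_qsum set to, say, 100, ten alt reads at Q5 (sum 50) would be rejected even though they pass the count and fraction checks.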
Coverage
| Parameter | Range | Default | Description |
|---|---|---|---|
| min_coverage | 0-1000 | 0 | Minimum total read depth to process a site. 0 means process everything. Higher values skip extremely low-coverage sites where calling is unreliable. |
Read Filtering
| Parameter | Range | Default | Description |
|---|---|---|---|
| mismatch_base_quality_threshold | 0-60 | 10 | Base quality threshold for counting mismatches in read-level filters. Only mismatches at bases with quality >= this value count toward read_max_mismatch_fraction. Lower values count more mismatches (stricter filtering). Higher values only count high-confidence mismatches. |
| read_max_mismatch_fraction | 0.0-1.0 | 1.0 | Maximum fraction of read bases that can be mismatches before excluding the read. The default of 1.0 disables this filter entirely. Lower values remove reads with a high proportion of mismatches, which often indicates misalignment or contamination. |
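
The interaction between the two read-filtering parameters can be sketched as follows. The function name and inputs are hypothetical; the logic simply follows the descriptions in the table above.

```python
def keep_read(mismatch_base_qualities, read_length,
              mismatch_base_quality_threshold=10,
              read_max_mismatch_fraction=1.0):
    """Decide whether a read survives the mismatch-fraction filter.

    mismatch_base_qualities: base qualities at the read's mismatching positions.
    Only mismatches with quality >= the threshold are counted.
    """
    counted = sum(
        1 for q in mismatch_base_qualities
        if q >= mismatch_base_quality_threshold
    )
    return counted / read_length <= read_max_mismatch_fraction


# A 100 bp read with 6 mismatches, 4 of them at confident bases.
qualities = [30, 30, 25, 12, 5, 3]
default_keeps = keep_read(qualities, 100)                                   # True
strict_keeps = keep_read(qualities, 100, read_max_mismatch_fraction=0.03)   # False
```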
Genotype Likelihood / Priors
These parameters control the Bayesian statistical model and affect how FreeBayes weighs evidence and applies priors.
| Parameter | Range | Default | Description |
|---|---|---|---|
| theta | 0.0-0.1 | 0.001 | Population-scaled mutation rate, used as the parameter for the Ewens Sampling Formula prior. Higher values make the caller believe variants are more common, increasing sensitivity (more calls). Lower values make it more conservative. The default of 0.001 matches typical human nucleotide diversity. |
| read_dependence_factor | 0.0-1.0 | 0.9 | Scaling factor for successive observations from the same position/strand. Models the non-independence of reads (due to PCR amplification). A value of 1.0 treats all reads as independent. Lower values discount redundant evidence more aggressively, reducing false positives from PCR bias. Higher values trust all reads equally. Strongly affects the sensitivity/specificity tradeoff. |
| pvar | 0.0-1.0 | 0.0 | Minimum posterior probability to report a variant. At 0.0, all sites passing filters are reported. Higher values act as a Bayesian quality gate, only reporting variants the model is confident about. |
| use_mapping_quality | true/false | false | Incorporate mapping quality into data likelihood calculations. When enabled, reads with lower mapping quality contribute less to the genotype likelihood. Enabling can improve accuracy in regions with ambiguous mappings. |
| harmonic_indel_quality | true/false | false | Use the harmonic mean of flanking base qualities for indels instead of the minimum. The harmonic mean is more nuanced than the minimum and can provide better indel quality estimates. |
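
To make the role of theta concrete, the sketch below evaluates the textbook Ewens Sampling Formula directly. How FreeBayes embeds this formula in its prior is not shown here; the function name and the way the allele configuration is encoded are choices made for this illustration.

```python
from math import factorial


def ewens_probability(class_counts, theta):
    """Ewens Sampling Formula for an allele-frequency configuration.

    class_counts[j - 1] is the number of distinct alleles observed exactly
    j times in the sample; the sample size is sum(j * class_counts[j - 1]).
    """
    n = sum(j * a for j, a in enumerate(class_counts, start=1))
    rising = 1.0
    for i in range(n):
        rising *= theta + i  # theta * (theta + 1) * ... * (theta + n - 1)
    prob = factorial(n) / rising
    for j, a in enumerate(class_counts, start=1):
        prob *= theta ** a / (j ** a * factorial(a))
    return prob


# Two sampled chromosomes (one diploid individual) at theta = 0.001:
p_same = ewens_probability([0, 1], 0.001)    # both copies carry the same allele
p_differ = ewens_probability([2, 0], 0.001)  # two distinct alleles
```

For two chromosomes the formula reduces to 1 / (1 + theta) and theta / (1 + theta), so raising theta directly raises the prior weight on a site being polymorphic.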
Prior Model Toggles
FreeBayes applies several Bayesian priors by default. Turning them off removes assumptions about population genetics.
| Parameter | Values | Default | Description |
|---|---|---|---|
| hwe_priors_off | true/false | false | Disable Hardy-Weinberg Equilibrium prior. HWE priors favor genotypes consistent with expected population frequencies (e.g., if alt allele frequency is 0.3, the prior favors het over hom-alt). |
| binomial_obs_priors_off | true/false | false | Disable binomial observation priors. These model the expected distribution of allele observations given a genotype (e.g., a het should show ~50% alt reads). Useful when observation distributions are systematically skewed. |
| allele_balance_priors_off | true/false | false | Disable allele balance probability prior. This prior penalizes genotypes where the observed allele balance doesn’t match expectations. Similar to binomial priors but operates at the aggregate level. |
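
The HWE example in the table (alt allele frequency 0.3) can be checked with the standard Hardy-Weinberg proportions for a diploid, biallelic site. The function name is hypothetical, and this shows only the textbook expectation, not how FreeBayes weights it.

```python
def hwe_genotype_frequencies(alt_freq):
    """Expected diploid genotype frequencies under Hardy-Weinberg Equilibrium."""
    ref_freq = 1 - alt_freq
    return {
        "hom_ref": ref_freq ** 2,
        "het": 2 * ref_freq * alt_freq,
        "hom_alt": alt_freq ** 2,
    }


expected = hwe_genotype_frequencies(0.3)
# het (about 0.42) outweighs hom_alt (about 0.09), as described in the table.
```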
Contamination
| Parameter | Range | Default | Description |
|---|---|---|---|
| prob_contamination | 0.0-1.0 | 0.0 | Prior probability that a read comes from contaminating DNA. Higher values raise the bar for calling heterozygous variants, as low-frequency alleles might be attributed to contamination. |
Population Genetics
| Parameter | Range | Default | Description |
|---|---|---|---|
| ploidy | 1-10 | 2 | Assumed ploidy. Use 2 for diploid human calling on autosomes. Changing this fundamentally alters the genotyping model. |
| use_best_n_alleles | 0-20 | 0 | Limit evaluation to the N best SNP alleles. 0 means evaluate all observed alleles. Lower values can speed up calling at multi-allelic sites without losing accuracy for typical biallelic variants. |
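
One way to see why ploidy and the number of alleles matter is to count the unordered genotypes that have to be scored. The sketch below enumerates them with the standard library; the function name is hypothetical, and the enumeration is generic combinatorics rather than FreeBayes code.

```python
from itertools import combinations_with_replacement


def enumerate_genotypes(alleles, ploidy=2):
    """List every unordered genotype for the given alleles and ploidy."""
    return list(combinations_with_replacement(alleles, ploidy))


diploid = enumerate_genotypes(["A", "G"], ploidy=2)
# [('A', 'A'), ('A', 'G'), ('G', 'G')]
tetraploid_count = len(enumerate_genotypes(["A", "G"], ploidy=4))      # 5
multiallelic_count = len(enumerate_genotypes(["A", "G", "T"], ploidy=2))  # 6
```

The genotype space grows with both ploidy and allele count, which is the cost that use_best_n_alleles is meant to bound.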
Haplotype / Complex Variants
| Parameter | Range | Default | Description |
|---|---|---|---|
| max_complex_gap | 0-100 | 3 | Maximum distance (in bp) between variants that can be grouped into a single complex allele (MNP or complex event). Higher values allow FreeBayes to call complex events spanning more bases. Lower values force variants to be called individually. |
| min_repeat_entropy | 0-4 | 1 | Minimum Shannon entropy (in bits) for a repeat to trigger repeat-aware calling. Lower values are more permissive (more regions treated as repeats). Higher values only flag highly repetitive regions. |
| min_repeat_size | 1-100 | 5 | Minimum total length (in bp) of a short tandem repeat region to trigger repeat-aware calling. Lower values apply repeat handling to shorter repeats. Higher values only activate for longer repeat tracts. Affects accuracy in homopolymer and STR regions. |
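
For readers unfamiliar with the entropy measure behind min_repeat_entropy, the sketch below computes the Shannon entropy of a sequence's base composition in bits. The function name is hypothetical, and how FreeBayes windows the sequence or applies the threshold is not covered.

```python
import math
from collections import Counter


def shannon_entropy_bits(sequence):
    """Shannon entropy (bits per base) of the base composition of a sequence."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


homopolymer = shannon_entropy_bits("AAAAAAAA")  # 0 bits: a single base repeated
dinucleotide = shannon_entropy_bits("ACACACAC")  # 1 bit: two bases, equally common
balanced = shannon_entropy_bits("ACGTACGT")      # 2 bits: all four bases equally common
```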
Algorithm
| Parameter | Range | Default | Description |
|---|---|---|---|
| genotyping_max_banddepth | 1-20 | 7 | Maximum depth of the banded genotype likelihood calculation. Controls how many alternative genotypes are evaluated per sample. Higher values allow more thorough exploration of the genotype space at multi-allelic sites. Lower values are faster but may miss the correct genotype at complex sites. |
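
The parameters in the tables above can be turned into a command line with a small helper. This sketch assumes each parameter name maps to a FreeBayes long option by replacing underscores with hyphens, and that boolean parameters are bare flags; check both assumptions against `freebayes --help` in the image you run. The helper name and the file paths are placeholders.

```python
def build_freebayes_command(reference, bam, params):
    """Build an argument list for freebayes from a parameter dict.

    Assumption: 'min_mapping_quality' becomes '--min-mapping-quality', and
    boolean parameters are passed as flags only when they are True.
    """
    cmd = ["freebayes", "--fasta-reference", reference, "--bam", bam]
    for name, value in params.items():
        flag = "--" + name.replace("_", "-")
        if isinstance(value, bool):
            if value:
                cmd.append(flag)
        else:
            cmd.extend([flag, str(value)])
    return cmd


command = build_freebayes_command(
    "ref.fa",
    "sample.bam",
    {"min_mapping_quality": 20, "min_alternate_fraction": 0.1,
     "hwe_priors_off": True, "use_mapping_quality": False},
)
```

The resulting list can be handed to `subprocess.run(command, stdout=vcf_file, check=True)`, since FreeBayes writes VCF to standard output by default.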