FreeBayes - Minos

Docker Image: staphb/freebayes:1.3.7

FreeBayes is a Bayesian haplotype-based variant caller developed by Erik Garrison. FreeBayes uses a direct haplotype evaluation method — it considers all possible alleles at a locus simultaneously and computes Bayesian posterior probabilities for each genotype. It was designed to be fast, simple, and capable of calling SNPs, indels, MNPs (multi-nucleotide polymorphisms), and complex events in a single pass. FreeBayes is notable for its rich set of Bayesian priors (population genetics, Hardy-Weinberg, allele balance) that can be individually toggled. This makes it the tool with the most “knobs” for controlling the statistical model.


Approach	Bayesian haplotype evaluation with toggleable priors
Strengths	Fine-grained prior control, MNPs, complex events
Parameters	24 across 10 categories
Compute	Medium

How It Works

1. Candidate Identification

FreeBayes scans the BAM file in the target region and identifies positions where reads differ from the reference. It applies coverage and allele fraction filters to determine which sites are worth evaluating.
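The filtering step above can be sketched as a simple predicate. This is a minimal illustration, not FreeBayes's actual code: the function name `is_candidate` is hypothetical, and the parameter names and defaults are taken from the Hyperparameters tables on this page.

```python
def is_candidate(depth, alt_count,
                 min_coverage=0,
                 min_alternate_count=2,
                 min_alternate_fraction=0.05):
    """Return True if a site has enough alt support to be evaluated.

    depth     -- total reads covering the site
    alt_count -- reads supporting the alternate allele
    """
    # Skip sites with no reads or with too little coverage.
    if depth == 0 or depth < min_coverage:
        return False
    # Hard floor on the absolute number of alt-supporting reads.
    if alt_count < min_alternate_count:
        return False
    # Fraction-based threshold on alt support.
    return alt_count / depth >= min_alternate_fraction


print(is_candidate(depth=100, alt_count=4))   # fraction 0.04 is below 0.05
print(is_candidate(depth=100, alt_count=6))   # passes count and fraction
print(is_candidate(depth=10, alt_count=1))    # fails the count floor
```

With the defaults, a site at 100x depth needs at least 5 alt reads to clear the fraction threshold, while a site at 10x depth is limited by the count floor of 2.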

2. Haplotype Construction

At each candidate site, FreeBayes considers all observed alleles and constructs candidate haplotypes within a configurable window (max_complex_gap). Unlike GATK HaplotypeCaller, it does not perform full local assembly; instead, it directly evaluates the alleles observed in reads.

3. Bayesian Genotyping

For each candidate site, FreeBayes calculates:

Data likelihood: Probability of observing the reads given each possible genotype, using base quality and mapping quality
Prior probability: Based on the Ewens Sampling Formula (controlled by theta), Hardy-Weinberg Equilibrium, allele balance expectations, and a binomial observation model
Posterior probability: Likelihood × prior, normalized across the candidate genotypes, used to call the genotype
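The three quantities above combine as posterior ∝ likelihood × prior. The toy example below shows that calculation for a single diploid, biallelic site. It is a deliberately simplified sketch: it uses one fixed `error_rate` instead of per-base and per-read qualities, the prior values are made up for illustration, and the function name `genotype_posteriors` is hypothetical.

```python
from math import comb


def genotype_posteriors(ref_count, alt_count, priors, error_rate=0.01):
    """Posterior probability of each diploid genotype at a biallelic site."""
    # Probability that a single read shows the alt allele under each genotype.
    alt_prob = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    n = ref_count + alt_count
    unnormalized = {}
    for genotype, p in alt_prob.items():
        # Binomial data likelihood of the observed read counts.
        likelihood = comb(n, alt_count) * p**alt_count * (1 - p)**ref_count
        unnormalized[genotype] = likelihood * priors[genotype]
    # Normalize so the posteriors sum to 1.
    total = sum(unnormalized.values())
    return {genotype: value / total for genotype, value in unnormalized.items()}


# Illustrative priors only: variants are assumed rare before seeing data.
example_priors = {"0/0": 0.998, "0/1": 0.0015, "1/1": 0.0005}
posteriors = genotype_posteriors(ref_count=6, alt_count=4, priors=example_priors)
print(max(posteriors, key=posteriors.get), posteriors)
```

Even with a prior that strongly favors the reference genotype, 4 alt reads out of 10 make the heterozygous genotype the most probable one, because the data likelihood under 0/0 is very small.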

4. Output

Sites passing the posterior probability threshold (pvar) and quality filters are emitted as variant calls.

Hyperparameters

Quality Filtering

Parameter	Range	Default	Description
`min_mapping_quality`	0-60	1	Minimum mapping quality for a read to be used. The default of 1 is permissive: it keeps almost everything except reads with mapQ=0, which typically map equally well to more than one location. Higher values filter ambiguously mapped reads and can reduce false positives in repetitive regions.
`min_base_quality`	0-50	1	Minimum base quality for an allele observation. The default of 1 is extremely permissive. Higher values filter low-quality bases that contribute noise.
`base_quality_cap`	0-60	0	Cap all base qualities at this value. 0 means disabled (no cap). Some instruments report overly optimistic quality scores. Setting a cap prevents any single base from having outsized influence.
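Base and mapping qualities are Phred-scaled, so a quality Q corresponds to an error probability of 10^(-Q/10). The sketch below shows that conversion and the effect of a cap. The helper names are hypothetical, and the cap logic only mirrors the table description (0 means no cap), not FreeBayes's internal code.

```python
def phred_to_error(quality):
    """Convert a Phred-scaled quality to an error probability."""
    return 10 ** (-quality / 10)


def apply_base_quality_cap(quality, base_quality_cap=0):
    """Cap a base quality; a cap of 0 means the cap is disabled."""
    if base_quality_cap > 0:
        return min(quality, base_quality_cap)
    return quality


for q in (10, 20, 30, 41):
    capped = apply_base_quality_cap(q, base_quality_cap=30)
    print(q, phred_to_error(q), capped, phred_to_error(capped))
```

With a cap of 30, a reported Q41 base is treated as having an error probability of 0.001 rather than roughly 0.00008, which limits how much any single observation can contribute.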

Allele Detection Thresholds

Parameter	Range	Default	Description
`min_alternate_fraction`	0.0-1.0	0.05	Minimum fraction of reads at a site supporting the alternate allele. Lower values increase sensitivity to mosaic or low-VAF variants. Higher values reduce noise. This is a critical sensitivity control.
`min_alternate_count`	1-100	2	Minimum absolute number of reads supporting the alternate allele. The hard floor for alt support. Lower values maximize sensitivity. Higher values provide stronger evidence requirements.
`min_alternate_qsum`	0-10000	0	Minimum sum of base qualities across all reads supporting the alternate allele. Acts as a quality-weighted version of `min_alternate_count`. Provides a smoother filter than a hard count threshold.

Coverage

Parameter	Range	Default	Description
`min_coverage`	0-1000	0	Minimum total read depth to process a site. 0 means process everything. Higher values skip extremely low-coverage sites where calling is unreliable.

Read Filtering

Parameter	Range	Default	Description
`mismatch_base_quality_threshold`	0-60	10	Base quality threshold for counting mismatches in read-level filters. Only mismatches at bases with quality >= this value count toward `read_max_mismatch_fraction`. Lower values count more mismatches (stricter filtering). Higher values only count high-confidence mismatches.
`read_max_mismatch_fraction`	0.0-1.0	1.0	Maximum fraction of read bases that can be mismatches before excluding the read. The default of 1.0 disables this filter entirely. Lower values remove reads with a high proportion of mismatches, which often indicates misalignment or contamination.
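The two read-filtering parameters interact: the quality threshold decides which mismatches are counted, and the fraction decides how many counted mismatches a read may carry. A minimal sketch, assuming the mismatch base qualities for a read are already known (the function name and input format are hypothetical):

```python
def passes_mismatch_filter(mismatch_qualities, read_length,
                           mismatch_base_quality_threshold=10,
                           read_max_mismatch_fraction=1.0):
    """Return True if the read is kept.

    mismatch_qualities -- base qualities at positions that mismatch the reference
    read_length        -- number of bases in the read
    """
    # Only mismatches at or above the quality threshold are counted.
    counted = sum(1 for q in mismatch_qualities
                  if q >= mismatch_base_quality_threshold)
    return counted / read_length <= read_max_mismatch_fraction


quals = [30, 5, 12]  # three mismatches in a 100 bp read
print(passes_mismatch_filter(quals, 100))                                   # default: filter disabled
print(passes_mismatch_filter(quals, 100, read_max_mismatch_fraction=0.01))  # 2 counted, 0.02 > 0.01
print(passes_mismatch_filter(quals, 100, mismatch_base_quality_threshold=20,
                             read_max_mismatch_fraction=0.01))              # 1 counted, 0.01 <= 0.01
```

Raising the quality threshold from 10 to 20 drops the Q12 mismatch from the count, which is enough to keep the read under a 1% limit.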

Genotype Likelihood / Priors

These parameters control the Bayesian statistical model and affect how FreeBayes weighs evidence and applies priors.

Parameter	Range	Default	Description
`theta`	0.0-0.1	0.001	Population-scaled mutation rate, used as the parameter for the Ewens Sampling Formula prior. Higher values make the caller believe variants are more common, increasing sensitivity (more calls). Lower values make it more conservative. The default of 0.001 matches typical human nucleotide diversity.
`read_dependence_factor`	0.0-1.0	0.9	Scaling factor for successive observations from the same position/strand. Models the non-independence of reads (for example, due to PCR amplification). A value of 1.0 treats all reads as independent. Lower values discount redundant evidence more aggressively, reducing false positives from PCR bias; values closer to 1.0 discount less. Strongly affects the sensitivity/specificity tradeoff.
`pvar`	0.0-1.0	0.0	Minimum posterior probability to report a variant. At 0.0, all sites passing filters are reported. Higher values act as a Bayesian quality gate, only reporting variants the model is confident about.
`use_mapping_quality`	true/false	false	Incorporate mapping quality into data likelihood calculations. When enabled, reads with lower mapping quality contribute less to the genotype likelihood. Enabling can improve accuracy in regions with ambiguous mappings.
`harmonic_indel_quality`	true/false	false	Use the harmonic mean of flanking base qualities for indels instead of the minimum. The harmonic mean is more nuanced than the minimum and can provide better indel quality estimates.
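To see why a larger `theta` makes variation more plausible a priori, the textbook Ewens Sampling Formula can be evaluated directly. The sketch below implements the standard formula for a sample of n gene copies; how FreeBayes applies it internally is not shown here, and the function name is hypothetical.

```python
from math import factorial, prod


def ewens_probability(partition, theta):
    """Ewens Sampling Formula.

    partition -- dict mapping j -> a_j, the number of distinct alleles
                 observed exactly j times (sum of j * a_j is the sample size n)
    theta     -- population-scaled mutation rate
    """
    n = sum(j * a for j, a in partition.items())
    # Rising factorial: theta * (theta + 1) * ... * (theta + n - 1)
    rising = prod(theta + i for i in range(n))
    term = prod(theta**a / (j**a * factorial(a)) for j, a in partition.items())
    return factorial(n) / rising * term


for theta in (0.001, 0.01):
    same = ewens_probability({2: 1}, theta)       # both copies carry the same allele
    different = ewens_probability({1: 2}, theta)  # the two copies differ
    print(theta, same, different)
```

For two gene copies the formula reduces to 1/(1+theta) for identical alleles and theta/(1+theta) for different alleles, so raising `theta` from 0.001 to 0.01 raises the prior probability of a difference roughly tenfold.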

Prior Model Toggles

FreeBayes applies several Bayesian priors by default. Turning them off removes assumptions about population genetics.

Parameter	Values	Default	Description
`hwe_priors_off`	true/false	false	Disable Hardy-Weinberg Equilibrium prior. HWE priors favor genotypes consistent with expected population frequencies (e.g., if alt allele frequency is 0.3, the prior favors het over hom-alt).
`binomial_obs_priors_off`	true/false	false	Disable binomial observation priors. These model the expected distribution of allele observations given a genotype (e.g., a het should show ~50% alt reads). Disabling them is useful when observation distributions are systematically skewed.
`allele_balance_priors_off`	true/false	false	Disable allele balance probability prior. This prior penalizes genotypes where the observed allele balance doesn’t match expectations. Similar to binomial priors but operates at the aggregate level.
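The Hardy-Weinberg example in the table can be checked with a few lines. Under HWE with alternate allele frequency q and reference frequency p = 1 - q, the expected genotype frequencies are p², 2pq, and q². This is the standard textbook calculation for a biallelic diploid site, not FreeBayes's prior code; the function name is hypothetical.

```python
def hwe_genotype_frequencies(alt_allele_frequency):
    """Expected diploid genotype frequencies under Hardy-Weinberg Equilibrium."""
    q = alt_allele_frequency
    p = 1.0 - q
    return {"hom_ref": p * p, "het": 2 * p * q, "hom_alt": q * q}


freqs = hwe_genotype_frequencies(0.3)
print(freqs)
```

At q = 0.3 the expected frequencies are 0.49 (hom-ref), 0.42 (het), and 0.09 (hom-alt), which is why the prior favors het over hom-alt at that allele frequency.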

Contamination

Parameter	Range	Default	Description
`prob_contamination`	0.0-1.0	0.0	Prior probability that a read comes from contaminating DNA. Higher values raise the bar for calling heterozygous variants, as low-frequency alleles might be attributed to contamination.

Population Genetics

Parameter	Range	Default	Description
`ploidy`	1-10	2	Assumed ploidy. 2 for diploid human calling on autosomes. Changing this fundamentally alters the genotyping model.
`use_best_n_alleles`	0-20	0	Limit evaluation to the N best SNP alleles. 0 means evaluate all observed alleles. Lower values can speed up calling at multi-allelic sites without losing accuracy for typical biallelic variants.
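Both parameters in this table change the size of the genotype space. For ploidy P and K candidate alleles, the number of unordered genotypes is the number of multisets of size P drawn from K alleles, C(K + P - 1, P). The sketch below is plain combinatorics and makes no claim about how FreeBayes enumerates genotypes; the function name is hypothetical.

```python
from math import comb


def genotype_count(ploidy, num_alleles):
    """Number of unordered genotypes for a given ploidy and allele count."""
    return comb(num_alleles + ploidy - 1, ploidy)


for ploidy, alleles in [(2, 2), (2, 3), (2, 6), (4, 2), (4, 6)]:
    print(ploidy, alleles, genotype_count(ploidy, alleles))
```

A diploid biallelic site has 3 genotypes, but a tetraploid site with 6 alleles has 126, which is why limiting the alleles considered can speed up calling at multi-allelic sites.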

Haplotype / Complex Variants

Parameter	Range	Default	Description
`max_complex_gap`	0-100	3	Maximum distance (in bp) between variants that can be grouped into a single complex allele (MNP or complex event). Higher values allow FreeBayes to call complex events spanning more bases. Lower values force variants to be called individually.
`min_repeat_entropy`	0-4	1	Minimum Shannon entropy (in bits) for a repeat to trigger repeat-aware calling. Lower values are more permissive (more regions treated as repeats). Higher values only flag highly repetitive regions.
`min_repeat_size`	1-100	5	Minimum total length (in bp) of a short tandem repeat region to trigger repeat-aware calling. Lower values apply repeat handling to shorter repeats. Higher values only activate for longer repeat tracts. Affects accuracy in homopolymer and STR regions.
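The grouping effect of `max_complex_gap` can be illustrated with a toy clustering of variant positions. This sketch assumes the gap is measured as the difference between adjacent variant positions, which is a simplification; FreeBayes's own definition of the gap may differ in detail, and the function name is hypothetical.

```python
def group_by_gap(positions, max_complex_gap=3):
    """Group variant positions whose adjacent spacing is within the gap."""
    groups = []
    for pos in sorted(positions):
        # Extend the current group if this variant is close enough to the last one.
        if groups and pos - groups[-1][-1] <= max_complex_gap:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups


variants = [100, 102, 110, 111, 120]
print(group_by_gap(variants, max_complex_gap=3))
print(group_by_gap(variants, max_complex_gap=10))
```

With a gap of 3, the five variants fall into three groups (two pairs and a singleton); with a gap of 10 they merge into one group that would be reported as a single complex allele.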

Algorithm

Parameter	Range	Default	Description
`genotyping_max_banddepth`	1-20	7	Maximum depth of the banded genotype likelihood calculation. Controls how many alternative genotypes are evaluated per sample. Higher values allow more thorough exploration of the genotype space at multi-allelic sites. Lower values are faster but may miss the correct genotype at complex sites.
