DeepVariant - Minos

Docker Image: google/deepvariant:1.5.0

DeepVariant is Google’s deep learning-based variant caller. Instead of the statistical models used by tools like GATK or FreeBayes, it treats variant calling as an image classification problem. It converts read pileups into images and uses a convolutional neural network (CNN), specifically an Inception v3 architecture, to classify each candidate site as homozygous reference, heterozygous, or homozygous alternate.


Approach	CNN image classification on pileup tensors
Strengths	SNPs, platform-aware accuracy, simpler tuning
Parameters	17 across 6 categories
Compute	High — GPU optional but recommended

How It Works

DeepVariant operates in three stages:

1. make_examples

Scans the BAM file and identifies candidate variant sites. For each candidate, it constructs a pileup image — a multi-channel tensor encoding read alignments, base qualities, mapping qualities, strand information, and other signals. This is where most quality-affecting parameters live (candidate thresholds, read quality filters, read processing options).
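
As a rough illustration of what a pileup tensor is, the toy sketch below encodes a few reads into three channels (base identity, base quality, strand). The `Read` type, the channel set, and the scaling are assumptions made for this example only; DeepVariant's real encoding uses more channels and a fixed image size.

```python
from dataclasses import dataclass

# Hypothetical read record for illustration only; not a DeepVariant type.
@dataclass
class Read:
    start: int      # 0-based position of the first aligned base
    bases: str      # aligned bases (no indels in this toy example)
    quals: list     # Phred base qualities, one per base
    reverse: bool   # True if the read maps to the reverse strand

BASE_CODE = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}  # arbitrary toy scaling

def toy_pileup(reads, window_start, width):
    """Return tensor[row][col] = [base, quality, strand], one row per read."""
    tensor = []
    for read in reads:
        row = [[0.0, 0.0, 0.0] for _ in range(width)]  # zeros mean no coverage
        for offset, (base, qual) in enumerate(zip(read.bases, read.quals)):
            col = read.start + offset - window_start
            if 0 <= col < width:
                row[col] = [
                    BASE_CODE.get(base, 0.0),
                    min(qual, 40) / 40.0,          # cap, then scale quality to [0, 1]
                    1.0 if read.reverse else 0.5,  # strand channel
                ]
        tensor.append(row)
    return tensor

reads = [
    Read(start=100, bases="ACGT", quals=[30, 30, 20, 40], reverse=False),
    Read(start=102, bases="GTTA", quals=[10, 35, 35, 35], reverse=True),
]
pileup = toy_pileup(reads, window_start=100, width=6)
```

Each row is one read and each column one reference position, so a real variant shows up as a vertical stripe of mismatching bases across many rows.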

2. call_variants

Feeds each pileup image through the pre-trained CNN model. The model outputs probabilities for three genotype classes: 0/0 (hom-ref), 0/1 (het), and 1/1 (hom-alt). The model was trained on truth sets and is specific to the data type (WGS, WES, PacBio, etc.).

3. postprocess_variants

Converts model predictions to a VCF file. Applies quality filters (QUAL thresholds), handles multi-allelic sites, and computes genotype quality (GQ) scores.
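
To make the last two stages concrete, the sketch below turns one set of class probabilities into a genotype and a Phred-scaled quality. It assumes GQ is the Phred-scaled probability that the chosen genotype is wrong, which is the usual VCF convention. The function name and the cap of 99 are illustrative choices, and DeepVariant's own rounding and QUAL computation may differ.

```python
import math

GENOTYPES = ["0/0", "0/1", "1/1"]  # hom-ref, het, hom-alt

def call_genotype(probs, max_gq=99):
    """Pick the most likely genotype and a Phred-scaled GQ from 3 class probabilities."""
    if len(probs) != 3 or abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("expected three probabilities summing to 1")
    best = max(range(3), key=lambda i: probs[i])
    p_error = 1.0 - probs[best]
    # Guard against log10(0) when the model is fully confident.
    gq = max_gq if p_error <= 0 else min(max_gq, int(-10 * math.log10(p_error)))
    return GENOTYPES[best], gq

genotype, gq = call_genotype([0.001, 0.989, 0.010])
```

With these inputs the het class wins, and the remaining 1.1% probability mass on the other two classes determines the GQ.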

Unlike GATK or FreeBayes, where hyperparameters control the statistical model, DeepVariant parameters affect what data the CNN sees and how its outputs are filtered. You cannot change the model itself, but you can still affect accuracy by controlling the inputs and the output filtering.

Hyperparameters

Model Selection (Top-level)

Parameter	Values	Default	Description
`model_type`	WGS / WES / PACBIO / HYBRID_PACBIO_ILLUMINA	WGS	Selects the pre-trained model. Each model is trained on a specific data type (sequencing technology and assay). Using a model that does not match your data will significantly degrade accuracy.
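
For orientation, the sketch below assembles a Docker command for the upstream `run_deepvariant` entry point with a chosen `model_type`. How Minos itself launches the container is not described here, so treat this as an assumption; the file names, the `/data` mount point, and the shard count are placeholders. The command is only built and printed, not executed.

```python
import shlex

IMAGE = "google/deepvariant:1.5.0"
MODEL_TYPES = {"WGS", "WES", "PACBIO", "HYBRID_PACBIO_ILLUMINA"}

def build_command(model_type, ref, reads, output_vcf, data_dir, num_shards=4):
    """Build (but do not run) a docker command for DeepVariant's run_deepvariant."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown model_type: {model_type!r}")
    return [
        "docker", "run",
        "-v", f"{data_dir}:/data",  # mount the host data directory
        IMAGE,
        "/opt/deepvariant/bin/run_deepvariant",
        f"--model_type={model_type}",
        f"--ref=/data/{ref}",
        f"--reads=/data/{reads}",
        f"--output_vcf=/data/{output_vcf}",
        f"--num_shards={num_shards}",
    ]

# Placeholder file names; substitute your own reference, BAM, and output.
cmd = build_command("WGS", "ref.fasta", "sample.bam", "sample.vcf.gz", "/path/to/data")
print(shlex.join(cmd))
```

Validating `model_type` before launching catches a typo early, rather than after the container has started.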

Candidate Variant Thresholds (make_examples)

These control which genomic positions become candidates for the CNN to evaluate. Lowering a threshold produces more candidates, which raises sensitivity at the cost of more computation and potentially more false positives (though the CNN is the final arbiter).

Parameter	Range	Default	Description
`vsc_min_fraction_snps`	0.0-1.0	0.12	Minimum fraction of reads supporting an alternate allele to consider a site as a SNP candidate. Lower values generate more candidates, increasing sensitivity for low-VAF variants. Higher values reduce candidates, potentially missing real het variants with skewed allele balance.
`vsc_min_fraction_indels`	0.0-1.0	0.12	Same as above but for indel candidates. Indels often have lower allele fractions due to alignment artifacts.
`vsc_min_count_snps`	0-50	2	Minimum absolute number of reads supporting an alternate allele for SNP candidates. Works alongside `vsc_min_fraction_snps` — both must be satisfied.
`vsc_min_count_indels`	0-50	2	Same as above but for indels.
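
The way the count and fraction thresholds combine can be sketched as a single predicate. This is a simplified model that assumes the fraction is the alternate-supporting reads divided by all counted reads at the site; `is_candidate` is a hypothetical helper, not a DeepVariant function.

```python
def is_candidate(alt_count, total_count, min_count=2, min_fraction=0.12):
    """Return True if an alternate allele passes both the count and fraction thresholds."""
    if total_count <= 0:
        return False
    return alt_count >= min_count and alt_count / total_count >= min_fraction

# 3 of 30 reads support the alternate allele: count passes, fraction (0.10) does not.
low_fraction = is_candidate(3, 30)
# 4 of 30 reads: fraction is about 0.133, so both thresholds pass.
passes_both = is_candidate(4, 30)
# 1 of 5 reads: fraction (0.20) passes, count does not.
low_count = is_candidate(1, 5)
# Lowering the fraction threshold admits the first site as a candidate.
relaxed = is_candidate(3, 30, min_fraction=0.08)
```

The count threshold matters most at low depth, where a single erroneous read can exceed the fraction threshold on its own.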

Read Quality Filters (make_examples)

Parameter	Range	Default	Description
`min_mapping_quality`	0-60	5	Minimum mapping quality to include a read. Reads with mapQ below this are excluded from pileup images. Higher values ensure confidently mapped reads, reducing noise in the pileup image. Lower values include more reads. The default of 5 is very permissive — the CNN can typically handle some ambiguously mapped reads.
`min_base_quality`	0-50	10	Minimum base quality for counting alternate allele support. Bases below this quality are not counted toward the `vsc_min_count` and `vsc_min_fraction` thresholds (but may still appear in the pileup image). Higher values raise the bar for candidate generation.
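
The difference between the two filters is easier to see in code: a low mapping quality drops the whole read, while a low base quality only stops that base from counting as alternate support. The tuple layout and the `count_alt_support` helper below are assumptions for illustration.

```python
def count_alt_support(reads, alt_base, min_mapping_quality=5, min_base_quality=10):
    """Return (reads kept, reads counted as alt support) for one genomic position.

    Each read is a (mapq, base, base_quality) tuple describing that position.
    """
    kept_reads = [r for r in reads if r[0] >= min_mapping_quality]
    alt_support = sum(
        1 for _, base, bq in kept_reads if base == alt_base and bq >= min_base_quality
    )
    return len(kept_reads), alt_support

site_reads = [
    (60, "T", 35),  # confident read, confident alt base
    (60, "T", 8),   # alt base below min_base_quality: kept, but not counted
    (3, "T", 40),   # mapQ below min_mapping_quality: dropped entirely
    (20, "C", 30),  # reference base
]
depth, alt = count_alt_support(site_reads, alt_base="T")
```

Here three reads remain after the mapping-quality filter, and only one of them counts toward the candidate thresholds.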

Read Processing (make_examples)

Parameter	Values	Default	Description
`realign_reads`	true/false	true	Perform local realignment of reads before constructing pileup images. Realignment corrects misalignments around indels, improving pileup image quality. Disabling it saves compute time but may reduce accuracy, especially for indels. Reads longer than 500 bp are never realigned, regardless of this setting.
`normalize_reads`	true/false	false	Left-align indels in each read within the allele counter. Can improve consistency of indel representation across reads.
`keep_duplicates`	true/false	false	Include reads marked as PCR duplicates. By default, duplicates are excluded. Enabling increases read depth (more data for the CNN) but introduces PCR bias.
`max_reads_per_partition`	100-5000	1500	Maximum number of reads per genomic partition. At high-coverage sites, reads are downsampled to this count. The DP and AD values in the output VCF are capped by this value.
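
The effect of `max_reads_per_partition` on depth can be approximated with uniform downsampling. This is a minimal sketch that assumes random sampling without replacement; DeepVariant's actual sampling strategy may differ, and the fixed seed is only there to keep the example reproducible.

```python
import random

def downsample(reads, max_reads=1500, seed=0):
    """Return at most max_reads reads, sampled uniformly without replacement."""
    if len(reads) <= max_reads:
        return list(reads)
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    return rng.sample(reads, max_reads)

# A partition with 4000 reads is cut down to the default cap of 1500.
deep_partition = [f"read_{i}" for i in range(4000)]
kept = downsample(deep_partition)
```

Because depth is counted after downsampling, reported DP cannot exceed the cap even if the true coverage is much higher.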

Haplotype-Aware Calling (make_examples)

Parameter	Values	Default	Description
`sort_by_haplotypes`	true/false	false	Sort reads by haplotype (HP tag) in the pileup image. When enabled, reads from the same haplotype are grouped together, making it easier for the CNN to see variant phasing patterns. Requires either a pre-phased BAM or `phase_reads=true`.
`phase_reads`	true/false	false	Compute read phases on the fly and assign HP tags. Enables haplotype-aware calling without requiring pre-phased input. The DeepVariant 1.5.0 PacBio model was trained with phasing enabled. This adds compute time.
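
Grouping reads by haplotype amounts to a stable sort on the HP tag. The sketch below represents reads as plain dictionaries and places untagged reads first by treating a missing tag as 0; both choices are assumptions for illustration and may not match DeepVariant's row ordering.

```python
def sort_by_haplotype(reads):
    """Stable sort of reads by HP tag; untagged reads (HP missing or None) count as 0."""
    return sorted(reads, key=lambda read: read.get("HP") or 0)

reads = [
    {"name": "r1", "HP": 2},
    {"name": "r2", "HP": None},
    {"name": "r3", "HP": 1},
    {"name": "r4", "HP": 2},
    {"name": "r5", "HP": 1},
]
ordered = [r["name"] for r in sort_by_haplotype(reads)]
```

Because the sort is stable, reads within a haplotype keep their original relative order.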

Post-Processing (postprocess_variants)

Parameter	Range	Default	Description
`qual_filter`	0.0-50.0	1.0	QUAL score threshold below which variants are marked as FILTERED (not PASS) in the output VCF. Higher values filter out more low-confidence calls, potentially improving precision. Lower values keep more calls, improving recall.
`multi_allelic_qual_filter`	0.0-50.0	1.0	Separate QUAL filter threshold for multi-allelic sites. These sites are inherently harder to call correctly.
`cnn_homref_call_min_gq`	0.0-50.0	20.0	Minimum genotype quality for homozygous-reference calls from the CNN. Calls with GQ below this are given an uncertain `./.` genotype. Affects reference confidence, not variant calls directly.
`use_multiallelic_model`	true/false	false	Use a specialized model for resolving multi-allelic genotypes (sites with 2+ alternate alleles). Enabling may improve accuracy at complex multi-allelic sites.
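
The output filters reduce to two threshold checks, sketched below. The helper names and the `FILTERED` label are illustrative (the label simply follows the wording in the table above), and treating a value equal to the threshold as passing is an assumption.

```python
def filter_label(qual, is_multiallelic=False, qual_filter=1.0, multi_allelic_qual_filter=1.0):
    """Return 'PASS' if QUAL meets the applicable threshold, else 'FILTERED'."""
    threshold = multi_allelic_qual_filter if is_multiallelic else qual_filter
    return "PASS" if qual >= threshold else "FILTERED"

def homref_genotype(gq, cnn_homref_call_min_gq=20.0):
    """Return the genotype reported for a CNN hom-ref call with the given GQ."""
    return "0/0" if gq >= cnn_homref_call_min_gq else "./."

biallelic = filter_label(30.0)
low_qual = filter_label(0.5)
strict_multi = filter_label(5.0, is_multiallelic=True, multi_allelic_qual_filter=10.0)
uncertain_ref = homref_genotype(12)
confident_ref = homref_genotype(25)
```

Raising `multi_allelic_qual_filter` independently of `qual_filter` lets you be stricter at multi-allelic sites without losing biallelic calls.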
