Model Guide

crabML implements 15 codon substitution models for detecting natural selection. All models are fully validated against PAML reference outputs.

Model Categories

Models are organized into three categories based on how omega (dN/dS) varies:

Site-class models: omega varies across sites but not branches
Branch models: omega varies across branches but not sites
Branch-site models: omega varies across both sites and branches

Site-Class Models

These models allow the ratio of non-synonymous to synonymous substitutions (ω = dN/dS) to vary across codon sites.

M0: One-Ratio Model

The simplest model with a single omega for all sites.

from crabml import optimize_model

result = optimize_model("M0", "alignment.fasta", "tree.nwk")
print(f"omega = {result.omega:.4f}")

Use case: Baseline model, average selection across gene

Parameters: - kappa (transition/transversion ratio) - omega (dN/dS)

M1a: Nearly Neutral

Two site classes: purifying selection (ω₀ < 1) and neutral evolution (ω₁ = 1).

result = optimize_model("M1a", "alignment.fasta", "tree.nwk")
print(f"p0 = {result.params['p0']:.3f} (proportion ω < 1)")
print(f"omega0 = {result.params['omega0']:.3f}")

Use case: Null model for M1a vs M2a test

Parameters: - kappa - p0 (proportion of sites with ω₀) - omega0 (ω for purifying selection, 0 < ω₀ < 1)

M2a: Positive Selection

Three site classes: purifying (ω₀ < 1), neutral (ω₁ = 1), and positive selection (ω₂ > 1).

result = optimize_model("M2a", "alignment.fasta", "tree.nwk")
print(f"Omegas: {result.omegas}")  # [omega0, 1.0, omega2]
print(f"Proportions: {result.proportions}")  # [p0, p1, p2]

if result.params['omega2'] > 1:
    print(f"Positive selection detected: ω₂ = {result.params['omega2']:.2f}")

Use case: Detect positive selection (alternative to M1a)

Parameters: - kappa - p0, p1 (proportions, p2 = 1 - p0 - p1) - omega0, omega2

M3: Discrete Model

K discrete omega categories with estimated proportions.

result = optimize_model("M3", "alignment.fasta", "tree.nwk", K=3)
print(f"Omegas: {result.omegas}")
print(f"Proportions: {result.proportions}")

Use case: Exploratory analysis of omega variation

Parameters: - kappa - K omega values - K-1 proportion parameters

M7: Beta Distribution

Beta distribution for omega constrained to (0, 1).

result = optimize_model("M7", "alignment.fasta", "tree.nwk")
print(f"Beta parameters: p={result.params['p']:.2f}, q={result.params['q']:.2f}")

Use case: Null model for M7 vs M8 test (no positive selection)

Parameters: - kappa - p, q (beta distribution shape parameters)

M8: Beta + Omega

Beta distribution (0 < ω < 1) plus an additional class allowing ω > 1.

result = optimize_model("M8", "alignment.fasta", "tree.nwk")

if result.params['omega_s'] > 1:
    p_sel = result.params['p1']
    omega_sel = result.params['omega_s']
    print(f"{p_sel:.1%} of sites with ω = {omega_sel:.2f}")

Use case: Detect positive selection (alternative to M7)

Parameters: - kappa - p, q (beta distribution) - p0 (proportion in beta) - omega_s (omega for selection class)

M8a: Beta + Omega = 1

Like M8 but with the additional class fixed at ω = 1.

result = optimize_model("M8a", "alignment.fasta", "tree.nwk")

Use case: Null model for M8a vs M8 test

Parameters: - kappa - p, q (beta distribution) - p0 (proportion in beta)

Additional Site Models

M4 (Frequencies): Five fixed omegas with variable proportions

M5 (Gamma): Gamma distribution for omega

M6 (2Gamma): Mixture of two gamma distributions

M9 (Beta & Gamma): Mixture of beta and gamma distributions

m4 = optimize_model("M4", align, tree)
m5 = optimize_model("M5", align, tree)
m6 = optimize_model("M6", align, tree)
m9 = optimize_model("M9", align, tree)

Branch Models

Branch models allow omega to vary across phylogenetic lineages.

Free-Ratio Model

Estimates independent omega for each branch in the tree.

from crabml import optimize_branch_model

result = optimize_branch_model("free-ratio", "alignment.fasta", "tree.nwk")

# View all branch-specific omegas
for branch, omega in result.omega_dict.items():
    print(f"{branch}: ω = {omega:.3f}")

Use case: Exploratory analysis of lineage-specific selection

Warning: Highly parameter-rich, prone to overfitting with small datasets

Parameters: One omega per branch (n-1 for n species)

Multi-Ratio Model

Different omega for labeled branch groups.

# Label branches in tree: #0 = background, #1 = foreground
tree_str = "((human,chimp) #1, (mouse,rat) #0);"

result = optimize_branch_model("multi-ratio", align, tree_str)

print(f"Foreground ω (primates): {result.foreground_omega:.3f}")
print(f"Background ω (rodents): {result.background_omega:.3f}")

Use case: Test for lineage-specific selection (recommended over free-ratio)

Parameters: One omega per unique branch label

Branch-Site Models

Branch-site models combine site variation with lineage-specific effects.

Branch-Site Model A

Four site classes with different omega on foreground vs background branches:

Class 0: Conserved (ω₀ < 1) on all branches
Class 1: Neutral (ω = 1) on all branches
Class 2a: Conserved on background, positive selection (ω₂ > 1) on foreground
Class 2b: Neutral on background, positive selection on foreground

from crabml import optimize_branch_site_model

tree_str = "((human,chimp) #1, (mouse,rat) #0);"

# Alternative model (ω₂ free)
alt = optimize_branch_site_model("model-a", align, tree_str)

print(f"ω₀ (conserved): {alt.omega0:.3f}")
print(f"ω₂ (positive selection): {alt.omega2:.3f}")
print(f"Sites under selection: {alt.foreground_positive_proportion:.1%}")
print(f"Site class proportions: {alt.proportions}")

Use case: Detect positive selection on specific lineages

Parameters: - kappa - omega0, omega2 - p0, p1 (proportions)

Branch-Site Model A Null

Same as Model A but with ω₂ fixed to 1 for hypothesis testing.

null = optimize_branch_site_model("model-a", align, tree_str, fix_omega=True)

assert null.omega2 == 1.0  # Fixed at neutral

Use case: Null model for branch-site test

Model Selection

Choosing the Right Model

For detecting positive selection:

Start with standard tests: - M1a vs M2a (site-class test) - M7 vs M8 (more conservative)
If you suspect lineage-specific selection: - Branch-site Model A
For exploratory analysis: - M3 (discrete omegas) - Free-ratio (branch-specific, use with caution)

For estimating average selection:

M0 provides overall omega across gene

For detailed omega distribution:

M7/M8 model beta distribution
M3 estimates discrete categories

Computational Considerations

Speed (fastest to slowest):

M0 (one class)
M1a, M7 (two classes / beta)
M2a, M8, M8a (three classes)
M3, M9 (K classes)
Branch models (many omegas)
Branch-site models (complex structure)

Sample size requirements:

M0, M1a, M2a: Work with small datasets (< 10 sequences)
M7, M8: Benefit from moderate datasets (10-50 sequences)
Branch models: Need sufficient phylogenetic diversity
Branch-site models: Need both sequence diversity and alignment length

Model Comparison

Comparing Results

from crabml import compare_results

m0 = optimize_model("M0", align, tree)
m1a = optimize_model("M1a", align, tree)
m2a = optimize_model("M2a", align, tree)

# Compare models
comparison = compare_results([m0, m1a, m2a])
print(comparison)

This shows log-likelihoods, parameters, and omega estimates side-by-side.

PAML Compatibility

All crabML models are validated against PAML:

Log-likelihoods match within 0.01 units
Parameter estimates match within 1% relative error
Same model parameterizations as PAML
Tested on multiple diverse datasets

You can directly compare crabML results with PAML output files.