Model Guide

crabML implements 15 codon substitution models for detecting natural selection. All models are fully validated against PAML reference outputs.

Model Categories

Models are organized into three categories based on how omega (dN/dS) varies:

  1. Site-class models: omega varies across sites but not branches

  2. Branch models: omega varies across branches but not sites

  3. Branch-site models: omega varies across both sites and branches

Site-Class Models

These models allow the ratio of non-synonymous to synonymous substitutions (ω = dN/dS) to vary across codon sites.

M0: One-Ratio Model

The simplest model with a single omega for all sites.

from crabml import optimize_model

result = optimize_model("M0", "alignment.fasta", "tree.nwk")
print(f"omega = {result.omega:.4f}")

Use case: Baseline model, average selection across gene

Parameters: - kappa (transition/transversion ratio) - omega (dN/dS)

M1a: Nearly Neutral

Two site classes: purifying selection (ω₀ < 1) and neutral evolution (ω₁ = 1).

result = optimize_model("M1a", "alignment.fasta", "tree.nwk")
print(f"p0 = {result.params['p0']:.3f} (proportion ω < 1)")
print(f"omega0 = {result.params['omega0']:.3f}")

Use case: Null model for M1a vs M2a test

Parameters: - kappa - p0 (proportion of sites with ω₀) - omega0 (ω for purifying selection, 0 < ω₀ < 1)

M2a: Positive Selection

Three site classes: purifying (ω₀ < 1), neutral (ω₁ = 1), and positive selection (ω₂ > 1).

result = optimize_model("M2a", "alignment.fasta", "tree.nwk")
print(f"Omegas: {result.omegas}")  # [omega0, 1.0, omega2]
print(f"Proportions: {result.proportions}")  # [p0, p1, p2]

if result.params['omega2'] > 1:
    print(f"Positive selection detected: ω₂ = {result.params['omega2']:.2f}")

Use case: Detect positive selection (alternative to M1a)

Parameters: - kappa - p0, p1 (proportions, p2 = 1 - p0 - p1) - omega0, omega2

M3: Discrete Model

K discrete omega categories with estimated proportions.

result = optimize_model("M3", "alignment.fasta", "tree.nwk", K=3)
print(f"Omegas: {result.omegas}")
print(f"Proportions: {result.proportions}")

Use case: Exploratory analysis of omega variation

Parameters: - kappa - K omega values - K-1 proportion parameters

M7: Beta Distribution

Beta distribution for omega constrained to (0, 1).

result = optimize_model("M7", "alignment.fasta", "tree.nwk")
print(f"Beta parameters: p={result.params['p']:.2f}, q={result.params['q']:.2f}")

Use case: Null model for M7 vs M8 test (no positive selection)

Parameters: - kappa - p, q (beta distribution shape parameters)

M8: Beta + Omega

Beta distribution (0 < ω < 1) plus an additional class allowing ω > 1.

result = optimize_model("M8", "alignment.fasta", "tree.nwk")

if result.params['omega_s'] > 1:
    p_sel = result.params['p1']
    omega_sel = result.params['omega_s']
    print(f"{p_sel:.1%} of sites with ω = {omega_sel:.2f}")

Use case: Detect positive selection (alternative to M7)

Parameters: - kappa - p, q (beta distribution) - p0 (proportion in beta) - omega_s (omega for selection class)

M8a: Beta + Omega = 1

Like M8 but with the additional class fixed at ω = 1.

result = optimize_model("M8a", "alignment.fasta", "tree.nwk")

Use case: Null model for M8a vs M8 test

Parameters: - kappa - p, q (beta distribution) - p0 (proportion in beta)

Additional Site Models

M4 (Frequencies): Five fixed omegas with variable proportions

M5 (Gamma): Gamma distribution for omega

M6 (2Gamma): Mixture of two gamma distributions

M9 (Beta & Gamma): Mixture of beta and gamma distributions

m4 = optimize_model("M4", align, tree)
m5 = optimize_model("M5", align, tree)
m6 = optimize_model("M6", align, tree)
m9 = optimize_model("M9", align, tree)

Branch Models

Branch models allow omega to vary across phylogenetic lineages.

Free-Ratio Model

Estimates independent omega for each branch in the tree.

from crabml import optimize_branch_model

result = optimize_branch_model("free-ratio", "alignment.fasta", "tree.nwk")

# View all branch-specific omegas
for branch, omega in result.omega_dict.items():
    print(f"{branch}: ω = {omega:.3f}")

Use case: Exploratory analysis of lineage-specific selection

Warning: Highly parameter-rich, prone to overfitting with small datasets

Parameters: One omega per branch (n-1 for n species)

Multi-Ratio Model

Different omega for labeled branch groups.

# Label branches in tree: #0 = background, #1 = foreground
tree_str = "((human,chimp) #1, (mouse,rat) #0);"

result = optimize_branch_model("multi-ratio", align, tree_str)

print(f"Foreground ω (primates): {result.foreground_omega:.3f}")
print(f"Background ω (rodents): {result.background_omega:.3f}")

Use case: Test for lineage-specific selection (recommended over free-ratio)

Parameters: One omega per unique branch label

Branch-Site Models

Branch-site models combine site variation with lineage-specific effects.

Branch-Site Model A

Four site classes with different omega on foreground vs background branches:

  • Class 0: Conserved (ω₀ < 1) on all branches

  • Class 1: Neutral (ω = 1) on all branches

  • Class 2a: Conserved on background, positive selection (ω₂ > 1) on foreground

  • Class 2b: Neutral on background, positive selection on foreground

from crabml import optimize_branch_site_model

tree_str = "((human,chimp) #1, (mouse,rat) #0);"

# Alternative model (ω₂ free)
alt = optimize_branch_site_model("model-a", align, tree_str)

print(f"ω₀ (conserved): {alt.omega0:.3f}")
print(f"ω₂ (positive selection): {alt.omega2:.3f}")
print(f"Sites under selection: {alt.foreground_positive_proportion:.1%}")
print(f"Site class proportions: {alt.proportions}")

Use case: Detect positive selection on specific lineages

Parameters: - kappa - omega0, omega2 - p0, p1 (proportions)

Branch-Site Model A Null

Same as Model A but with ω₂ fixed to 1 for hypothesis testing.

null = optimize_branch_site_model("model-a", align, tree_str, fix_omega=True)

assert null.omega2 == 1.0  # Fixed at neutral

Use case: Null model for branch-site test

Model Selection

Choosing the Right Model

For detecting positive selection:

  1. Start with standard tests: - M1a vs M2a (site-class test) - M7 vs M8 (more conservative)

  2. If you suspect lineage-specific selection: - Branch-site Model A

  3. For exploratory analysis: - M3 (discrete omegas) - Free-ratio (branch-specific, use with caution)

For estimating average selection:

  • M0 provides overall omega across gene

For detailed omega distribution:

  • M7/M8 model beta distribution

  • M3 estimates discrete categories

Computational Considerations

Speed (fastest to slowest):

  1. M0 (one class)

  2. M1a, M7 (two classes / beta)

  3. M2a, M8, M8a (three classes)

  4. M3, M9 (K classes)

  5. Branch models (many omegas)

  6. Branch-site models (complex structure)

Sample size requirements:

  • M0, M1a, M2a: Work with small datasets (< 10 sequences)

  • M7, M8: Benefit from moderate datasets (10-50 sequences)

  • Branch models: Need sufficient phylogenetic diversity

  • Branch-site models: Need both sequence diversity and alignment length

Model Comparison

Comparing Results

from crabml import compare_results

m0 = optimize_model("M0", align, tree)
m1a = optimize_model("M1a", align, tree)
m2a = optimize_model("M2a", align, tree)

# Compare models
comparison = compare_results([m0, m1a, m2a])
print(comparison)

This shows log-likelihoods, parameters, and omega estimates side-by-side.

PAML Compatibility

All crabML models are validated against PAML:

  • Log-likelihoods match within 0.01 units

  • Parameter estimates match within 1% relative error

  • Same model parameterizations as PAML

  • Tested on multiple diverse datasets

You can directly compare crabML results with PAML output files.