Model Guide =========== crabML implements 15 codon substitution models for detecting natural selection. All models are fully validated against PAML reference outputs. Model Categories ---------------- Models are organized into three categories based on how omega (dN/dS) varies: 1. **Site-class models**: omega varies across sites but not branches 2. **Branch models**: omega varies across branches but not sites 3. **Branch-site models**: omega varies across both sites and branches Site-Class Models ----------------- These models allow the ratio of non-synonymous to synonymous substitutions (ω = dN/dS) to vary across codon sites. M0: One-Ratio Model ~~~~~~~~~~~~~~~~~~~~ The simplest model with a single omega for all sites. .. code-block:: python from crabml import optimize_model result = optimize_model("M0", "alignment.fasta", "tree.nwk") print(f"omega = {result.omega:.4f}") **Use case**: Baseline model, average selection across gene **Parameters**: - kappa (transition/transversion ratio) - omega (dN/dS) M1a: Nearly Neutral ~~~~~~~~~~~~~~~~~~~ Two site classes: purifying selection (ω₀ < 1) and neutral evolution (ω₁ = 1). .. code-block:: python result = optimize_model("M1a", "alignment.fasta", "tree.nwk") print(f"p0 = {result.params['p0']:.3f} (proportion ω < 1)") print(f"omega0 = {result.params['omega0']:.3f}") **Use case**: Null model for M1a vs M2a test **Parameters**: - kappa - p0 (proportion of sites with ω₀) - omega0 (ω for purifying selection, 0 < ω₀ < 1) M2a: Positive Selection ~~~~~~~~~~~~~~~~~~~~~~~~ Three site classes: purifying (ω₀ < 1), neutral (ω₁ = 1), and positive selection (ω₂ > 1). .. code-block:: python result = optimize_model("M2a", "alignment.fasta", "tree.nwk") print(f"Omegas: {result.omegas}") # [omega0, 1.0, omega2] print(f"Proportions: {result.proportions}") # [p0, p1, p2] if result.params['omega2'] > 1: print(f"Positive selection detected: ω₂ = {result.params['omega2']:.2f}") **Use case**: Detect positive selection (alternative to M1a) **Parameters**: - kappa - p0, p1 (proportions, p2 = 1 - p0 - p1) - omega0, omega2 M3: Discrete Model ~~~~~~~~~~~~~~~~~~ K discrete omega categories with estimated proportions. .. code-block:: python result = optimize_model("M3", "alignment.fasta", "tree.nwk", K=3) print(f"Omegas: {result.omegas}") print(f"Proportions: {result.proportions}") **Use case**: Exploratory analysis of omega variation **Parameters**: - kappa - K omega values - K-1 proportion parameters M7: Beta Distribution ~~~~~~~~~~~~~~~~~~~~~ Beta distribution for omega constrained to (0, 1). .. code-block:: python result = optimize_model("M7", "alignment.fasta", "tree.nwk") print(f"Beta parameters: p={result.params['p']:.2f}, q={result.params['q']:.2f}") **Use case**: Null model for M7 vs M8 test (no positive selection) **Parameters**: - kappa - p, q (beta distribution shape parameters) M8: Beta + Omega ~~~~~~~~~~~~~~~~ Beta distribution (0 < ω < 1) plus an additional class allowing ω > 1. .. code-block:: python result = optimize_model("M8", "alignment.fasta", "tree.nwk") if result.params['omega_s'] > 1: p_sel = result.params['p1'] omega_sel = result.params['omega_s'] print(f"{p_sel:.1%} of sites with ω = {omega_sel:.2f}") **Use case**: Detect positive selection (alternative to M7) **Parameters**: - kappa - p, q (beta distribution) - p0 (proportion in beta) - omega_s (omega for selection class) M8a: Beta + Omega = 1 ~~~~~~~~~~~~~~~~~~~~~ Like M8 but with the additional class fixed at ω = 1. .. code-block:: python result = optimize_model("M8a", "alignment.fasta", "tree.nwk") **Use case**: Null model for M8a vs M8 test **Parameters**: - kappa - p, q (beta distribution) - p0 (proportion in beta) Additional Site Models ~~~~~~~~~~~~~~~~~~~~~~ **M4 (Frequencies)**: Five fixed omegas with variable proportions **M5 (Gamma)**: Gamma distribution for omega **M6 (2Gamma)**: Mixture of two gamma distributions **M9 (Beta & Gamma)**: Mixture of beta and gamma distributions .. code-block:: python m4 = optimize_model("M4", align, tree) m5 = optimize_model("M5", align, tree) m6 = optimize_model("M6", align, tree) m9 = optimize_model("M9", align, tree) Branch Models ------------- Branch models allow omega to vary across phylogenetic lineages. Free-Ratio Model ~~~~~~~~~~~~~~~~ Estimates independent omega for each branch in the tree. .. code-block:: python from crabml import optimize_branch_model result = optimize_branch_model("free-ratio", "alignment.fasta", "tree.nwk") # View all branch-specific omegas for branch, omega in result.omega_dict.items(): print(f"{branch}: ω = {omega:.3f}") **Use case**: Exploratory analysis of lineage-specific selection **Warning**: Highly parameter-rich, prone to overfitting with small datasets **Parameters**: One omega per branch (n-1 for n species) Multi-Ratio Model ~~~~~~~~~~~~~~~~~ Different omega for labeled branch groups. .. code-block:: python # Label branches in tree: #0 = background, #1 = foreground tree_str = "((human,chimp) #1, (mouse,rat) #0);" result = optimize_branch_model("multi-ratio", align, tree_str) print(f"Foreground ω (primates): {result.foreground_omega:.3f}") print(f"Background ω (rodents): {result.background_omega:.3f}") **Use case**: Test for lineage-specific selection (recommended over free-ratio) **Parameters**: One omega per unique branch label Branch-Site Models ------------------ Branch-site models combine site variation with lineage-specific effects. Branch-Site Model A ~~~~~~~~~~~~~~~~~~~ Four site classes with different omega on foreground vs background branches: - **Class 0**: Conserved (ω₀ < 1) on all branches - **Class 1**: Neutral (ω = 1) on all branches - **Class 2a**: Conserved on background, positive selection (ω₂ > 1) on foreground - **Class 2b**: Neutral on background, positive selection on foreground .. code-block:: python from crabml import optimize_branch_site_model tree_str = "((human,chimp) #1, (mouse,rat) #0);" # Alternative model (ω₂ free) alt = optimize_branch_site_model("model-a", align, tree_str) print(f"ω₀ (conserved): {alt.omega0:.3f}") print(f"ω₂ (positive selection): {alt.omega2:.3f}") print(f"Sites under selection: {alt.foreground_positive_proportion:.1%}") print(f"Site class proportions: {alt.proportions}") **Use case**: Detect positive selection on specific lineages **Parameters**: - kappa - omega0, omega2 - p0, p1 (proportions) Branch-Site Model A Null ~~~~~~~~~~~~~~~~~~~~~~~~~ Same as Model A but with ω₂ fixed to 1 for hypothesis testing. .. code-block:: python null = optimize_branch_site_model("model-a", align, tree_str, fix_omega=True) assert null.omega2 == 1.0 # Fixed at neutral **Use case**: Null model for branch-site test Model Selection --------------- Choosing the Right Model ~~~~~~~~~~~~~~~~~~~~~~~~ **For detecting positive selection:** 1. Start with standard tests: - M1a vs M2a (site-class test) - M7 vs M8 (more conservative) 2. If you suspect lineage-specific selection: - Branch-site Model A 3. For exploratory analysis: - M3 (discrete omegas) - Free-ratio (branch-specific, use with caution) **For estimating average selection:** - M0 provides overall omega across gene **For detailed omega distribution:** - M7/M8 model beta distribution - M3 estimates discrete categories Computational Considerations ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **Speed** (fastest to slowest): 1. M0 (one class) 2. M1a, M7 (two classes / beta) 3. M2a, M8, M8a (three classes) 4. M3, M9 (K classes) 5. Branch models (many omegas) 6. Branch-site models (complex structure) **Sample size requirements:** - M0, M1a, M2a: Work with small datasets (< 10 sequences) - M7, M8: Benefit from moderate datasets (10-50 sequences) - Branch models: Need sufficient phylogenetic diversity - Branch-site models: Need both sequence diversity and alignment length Model Comparison ---------------- Comparing Results ~~~~~~~~~~~~~~~~~ .. code-block:: python from crabml import compare_results m0 = optimize_model("M0", align, tree) m1a = optimize_model("M1a", align, tree) m2a = optimize_model("M2a", align, tree) # Compare models comparison = compare_results([m0, m1a, m2a]) print(comparison) This shows log-likelihoods, parameters, and omega estimates side-by-side. PAML Compatibility ------------------ All crabML models are validated against PAML: - Log-likelihoods match within 0.01 units - Parameter estimates match within 1% relative error - Same model parameterizations as PAML - Tested on multiple diverse datasets You can directly compare crabML results with PAML output files.