Machine Learning in Bioinformatics

Machine Learning in Bioinformatics banner

Machine learning helps detect patterns in high-dimensional biological data and build predictive models for diagnosis, prognosis, and discovery.

Learning Goals

By the end of this chapter, you should be able to:

Explain key ML concepts with bioinformatics examples.
Distinguish supervised and unsupervised learning.
Choose suitable evaluation metrics.
Identify overfitting, data leakage, and class imbalance issues.

When to Use Machine Learning

Use ML when:

You have many features (for example, thousands of genes).
Relationships are complex or non-linear.
Prediction is an explicit goal.

Do not use ML if a simple statistical model answers the question more clearly.

End-to-End ML Workflow

Machine Learning Workflow

Define prediction target (for example, disease vs. control).
Split data into train/validation/test sets.
Preprocess features (missing values, scaling, encoding).
Train baseline and advanced models.
Evaluate with appropriate metrics.
Interpret model outputs biologically.
Validate on external data when possible.

Supervised Learning

Used when labels are known.

Tasks:

Classification: tumor subtype, responder/non-responder
Regression: continuous outcomes (for example, survival risk score)

Common algorithms:

Logistic Regression
Random Forest
Support Vector Machine
Gradient Boosting
Neural Networks

Complete Classification Pipeline

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Load data
X = pd.read_csv("expression_matrix.csv", index_col=0)  # genes x samples
y = pd.read_csv("labels.csv")["condition"].values  # 0/1 labels

# CORRECT: Split BEFORE any preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Build pipeline (scaling inside CV loop prevents leakage)
pipelines = {
    "Logistic Regression": Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(penalty="l1", solver="saga", max_iter=1000))
    ]),
    "Random Forest": Pipeline([
        ("clf", RandomForestClassifier(n_estimators=200, random_state=42))
    ]),
    "SVM": Pipeline([
        ("scaler", StandardScaler()),
        ("clf", SVC(kernel="rbf", probability=True))
    ]),
    "Gradient Boosting": Pipeline([
        ("clf", GradientBoostingClassifier(n_estimators=100, random_state=42))
    ])
}

# Cross-validation comparison
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, pipeline in pipelines.items():
    scores = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring="roc_auc")
    results[name] = scores
    print(f"{name}: AUC = {scores.mean():.3f} ± {scores.std():.3f}")

# Train best model on full training set
best_model = pipelines["Random Forest"]
best_model.fit(X_train, y_train)

# Evaluate on held-out test set
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

print("\nTest Set Performance:")
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}")

# Plot ROC curve
fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, y_proba):.3f}")
plt.plot([0, 1], [0, 1], "k--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve - Disease Classification")
plt.legend()
plt.savefig("roc_curve.png", dpi=150)

Regression Example: Predicting Drug Response

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from scipy.stats import spearmanr

# IC50 prediction from gene expression
X_train, X_test, y_train, y_test = train_test_split(
    expression, ic50_values, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")
print(f"R²: {r2_score(y_test, y_pred):.3f}")
print(f"Spearman ρ: {spearmanr(y_test, y_pred)[0]:.3f}")

Unsupervised Learning

Used when labels are unknown.

Common uses in bioinformatics:

Discovering cell populations in single-cell data
Patient stratification by molecular profile
Detecting hidden batch effects

Common methods:

Clustering (k-means, hierarchical, Leiden)
Dimensionality reduction (PCA, UMAP, t-SNE)

Feature Engineering and Selection

Biological datasets often contain noise and redundant variables.

Practical steps:

Remove near-zero variance genes
Select highly variable genes
Use regularization (L1/L2)
Consider pathway-level features for interpretability

Feature Selection Pipeline

from sklearn.feature_selection import (
    VarianceThreshold, SelectKBest, f_classif, RFE
)
from sklearn.preprocessing import StandardScaler
import numpy as np

# Step 1: Remove near-zero variance genes
var_threshold = VarianceThreshold(threshold=0.01)
X_var = var_threshold.fit_transform(X_train)
selected_genes = X_train.columns[var_threshold.get_support()]
print(f"After variance filter: {len(selected_genes)} genes")

# Step 2: Univariate feature selection (ANOVA F-test)
selector = SelectKBest(f_classif, k=1000)
X_selected = selector.fit_transform(X_var, y_train)
top_genes = selected_genes[selector.get_support()]

# Step 3: Recursive Feature Elimination with CV
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier

rfe = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    step=0.1,  # Remove 10% of features each iteration
    cv=5,
    scoring="roc_auc",
    min_features_to_select=10
)
rfe.fit(X_train[top_genes], y_train)

final_genes = top_genes[rfe.support_]
print(f"RFE selected {len(final_genes)} genes")

# Plot feature importance
importances = rfe.estimator_.feature_importances_
indices = np.argsort(importances)[-20:]  # Top 20

plt.figure(figsize=(10, 8))
plt.barh(range(20), importances[indices])
plt.yticks(range(20), final_genes[indices])
plt.xlabel("Feature Importance")
plt.title("Top 20 Predictive Genes")
plt.tight_layout()
plt.savefig("feature_importance.png", dpi=150)

L1 Regularization for Sparse Selection

from sklearn.linear_model import LogisticRegressionCV

# L1 penalty automatically selects features
logreg_l1 = LogisticRegressionCV(
    penalty="l1",
    solver="saga",
    cv=5,
    Cs=np.logspace(-4, 4, 20),
    max_iter=1000,
    random_state=42
)
logreg_l1.fit(X_train_scaled, y_train)

# Non-zero coefficients = selected features
selected_mask = logreg_l1.coef_[0] != 0
l1_genes = X_train.columns[selected_mask]
print(f"L1 selected {len(l1_genes)} genes")

Dealing with Highly Correlated Features

# Remove highly correlated genes (keep one representative)
def remove_correlated(X, threshold=0.95):
    corr_matrix = X.corr().abs()
    upper = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    to_drop = [col for col in upper.columns if any(upper[col] > threshold)]
    return X.drop(columns=to_drop)

X_uncorrelated = remove_correlated(X_train, threshold=0.95)
print(f"After correlation filter: {X_uncorrelated.shape[1]} genes")

Evaluation Metrics That Matter

For classification:

Accuracy
Precision, recall, F1-score
ROC-AUC
PR-AUC (important for class imbalance)

For regression:

MAE
RMSE
R-squared

Critical Pitfalls

Overfitting

Model performs well on training data but poorly on unseen data.

Data Leakage

Information from test data accidentally influences training.

Examples:

Normalizing the full dataset before train-test split
Selecting features using all samples before cross-validation

Class Imbalance

If one class dominates, accuracy can be misleading.

Use:

Stratified split
Class weights
Balanced metrics (F1, PR-AUC)

Handling Class Imbalance

from sklearn.utils.class_weight import compute_class_weight
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.metrics import precision_recall_curve, average_precision_score

# Check class distribution
print(f"Class distribution: {np.bincount(y_train)}")
# Example: [950, 50] - highly imbalanced!

# Method 1: Class weights
class_weights = compute_class_weight("balanced", classes=np.unique(y_train), y=y_train)
weight_dict = dict(zip(np.unique(y_train), class_weights))

model = RandomForestClassifier(
    n_estimators=200,
    class_weight=weight_dict,  # or "balanced"
    random_state=42
)

# Method 2: SMOTE oversampling
pipeline = ImbPipeline([
    ("scaler", StandardScaler()),
    ("smote", SMOTE(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000))
])
pipeline.fit(X_train, y_train)

# Method 3: Threshold tuning
y_proba = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)

# Find threshold that maximizes F1
f1_scores = 2 * (precision * recall) / (precision + recall + 1e-10)
best_threshold = thresholds[np.argmax(f1_scores[:-1])]
print(f"Optimal threshold: {best_threshold:.3f}")

y_pred_tuned = (y_proba >= best_threshold).astype(int)

# Always plot PR curve for imbalanced data
plt.figure(figsize=(6, 6))
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title(f"PR Curve (AP = {average_precision_score(y_test, y_proba):.3f})")
plt.savefig("pr_curve.png", dpi=150)

Model Interpretation in Biology

Interpretability is important for translational use.

Useful tools:

Feature importance (tree-based models)
SHAP values
Coefficient inspection in linear models

Biological sanity check:

Do top features map to known pathways?
Are findings stable across cohorts?

SHAP Values for Explainable ML

import shap

# Train model
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Create SHAP explainer
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Summary plot: global feature importance
shap.summary_plot(shap_values[1], X_test, plot_type="bar", max_display=20)
plt.savefig("shap_importance.png", dpi=150, bbox_inches="tight")

# Beeswarm plot: feature impact direction
shap.summary_plot(shap_values[1], X_test, max_display=20)
plt.savefig("shap_beeswarm.png", dpi=150, bbox_inches="tight")

# Individual prediction explanation
sample_idx = 0
shap.force_plot(
    explainer.expected_value[1],
    shap_values[1][sample_idx],
    X_test.iloc[sample_idx],
    matplotlib=True
)
plt.savefig("shap_individual.png", dpi=150, bbox_inches="tight")

# Dependence plot: one gene's effect
shap.dependence_plot("TP53", shap_values[1], X_test)

Pathway-Level Feature Aggregation

import gseapy as gp

# Get gene importance
importances = pd.Series(model.feature_importances_, index=X_train.columns)
ranked_genes = importances.sort_values(ascending=False)

# Run enrichment on top genes
top_genes = ranked_genes.head(100).index.tolist()

enrichment = gp.enrichr(
    gene_list=top_genes,
    gene_sets=["KEGG_2021_Human", "GO_Biological_Process_2021"],
    organism="human"
)

# View results
enrichment.results.head(20)

Minimal Python Example

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

# X: features (for example gene expression); y: labels
X = pd.read_csv("X_expression.csv")
y = pd.read_csv("y_labels.csv").values.ravel()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=300, random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, pred))
print("ROC-AUC:", roc_auc_score(y_test, proba))

Deep Learning in Bioinformatics

Deep learning is useful when data is large and structure-rich.