Multi-Omics Data Integration

Multi-Omics Data Integration banner

Multi-omics integration combines information from multiple molecular layers to provide a more complete view of biology than any single data type alone.

Learning Goals

By the end of this chapter, you should be able to:

Explain why multi-omics integration is useful in translational research.
Design a clean end-to-end integration workflow from raw data to biological interpretation.
Compare early, intermediate, and late integration strategies.
Apply practical R/Python code patterns for integration-ready datasets.
Identify key pitfalls such as batch effects, feature mismatch, and leakage.

Why Multi-Omics?

Single-omics analysis answers narrow questions. Multi-omics helps answer system-level questions:

Why does a genomic variant change RNA, protein, and metabolite profiles differently?
Which molecular layer is most predictive of phenotype or clinical outcome?
Which pathways show consistent dysregulation across layers?

Common layers:

Genomics (variants, CNVs)
Transcriptomics (gene expression, isoforms)
Proteomics (protein abundance, PTMs)
Metabolomics (small molecules, pathway activity)
Epigenomics (methylation, chromatin state)

Core Integration Strategies

Multi-omics Integration Strategies

1. Early Integration (Feature-Level)

Concatenate features from all layers into one matrix.

Use when:

Sample size is high enough.
Features are harmonized and scaled.

Risk: High dimensionality can overfit quickly.

2. Intermediate Integration (Latent-Level)

Learn compact representations per omics layer, then combine the latent factors.

Use when:

Each omics type has very different feature spaces.
You want biologically meaningful hidden factors.

Common tools: MOFA+, iCluster, DIABLO (mixOmics), autoencoders.

3. Late Integration (Decision-Level)

Build models per omics layer and combine predictions.

Use when:

Layers have missing samples.
Separate models are easier to interpret operationally.

End-to-End Workflow

Step 1. Define the Biological Question

Examples:

Biomarker discovery for disease subtypes.
Mechanistic pathway prioritization.
Outcome prediction (response/non-response).

Step 2. Harmonize Samples and IDs

Create a shared sample map across omics files.

# Example in R: enforce common samples across omics matrices
common_ids <- Reduce(intersect, list(colnames(rna), colnames(protein), colnames(metabolite)))
rna_common <- rna[, common_ids]
protein_common <- protein[, common_ids]
met_common <- metabolite[, common_ids]

Step 3. Preprocess Each Layer Independently

Minimum checklist:

Normalize within each omics type.
Remove low-quality features.
Handle missing values with method-appropriate imputation.
Correct batch effects.

# RNA-seq example pattern: log transform after size normalization
rna_log <- log2(rna_common + 1)

# Proteomics example pattern: median center per sample
sample_medians <- apply(protein_common, 2, median, na.rm = TRUE)
global_median <- median(sample_medians, na.rm = TRUE)
protein_norm <- sweep(protein_common, 2, sample_medians - global_median, FUN = "-")

Step 4. Align Features to Biological Knowledge

Link features to gene symbols, pathways, or protein complexes.

# Python pattern: align by shared gene symbol index
import pandas as pd

rna_df = pd.read_csv("rna_matrix.csv", index_col=0)
protein_df = pd.read_csv("protein_matrix.csv", index_col=0)
shared = rna_df.index.intersection(protein_df.index)
rna_shared = rna_df.loc[shared]
protein_shared = protein_df.loc[shared]

Step 5. Choose an Integration Model

Start with one interpretable baseline and one advanced model:

Baseline: correlation network + pathway enrichment.
Advanced: latent-factor model (for example MOFA+).

Step 6. Validate Robustly

Split train/test at the sample level.
Prevent leakage from normalization/imputation across split boundaries.
Report external validation if possible.

Integration Methods You Should Know

Correlation-Based Integration

Simple and interpretable.
Good for first-pass biological exploration.

Canonical Correlation Analysis (CCA)

Finds linear relationships between two omics blocks.
Useful for paired datasets with enough samples.

Partial Least Squares / DIABLO (mixOmics)

Supervised integration with feature selection.
Strong option for multi-class biomarker tasks.

Multi-Omics Factor Analysis (MOFA+)

Unsupervised latent factors.
Separates shared vs layer-specific variation.

Complete Integration Tutorials

MOFA+ Tutorial (R/Python)

MOFA+ discovers latent factors capturing variance shared or specific to each omics layer.

# Install MOFA+
# devtools::install_github("bioFAM/MOFA2")
library(MOFA2)

# Prepare data as named list
data_list <- list(
  RNA = as.matrix(rna_common),
  Protein = as.matrix(protein_common),
  Methylation = as.matrix(meth_common)
)

# Create MOFA object
mofa_obj <- create_mofa(data_list)

# Set training options
data_opts <- get_default_data_options(mofa_obj)
model_opts <- get_default_model_options(mofa_obj)
model_opts$num_factors <- 15  # Number of latent factors

train_opts <- get_default_training_options(mofa_obj)
train_opts$convergence_mode <- "slow"
train_opts$seed <- 42

# Prepare and train
mofa_obj <- prepare_mofa(
  mofa_obj,
  data_options = data_opts,
  model_options = model_opts,
  training_options = train_opts
)

mofa_obj <- run_mofa(mofa_obj, outfile = "mofa_model.hdf5")

# Explore results
# Variance explained by each factor per view
plot_variance_explained(mofa_obj, max_r2 = 25)

# Factor values (latent representation of samples)
factors <- get_factors(mofa_obj, as.data.frame = TRUE)

# Plot samples in factor space colored by phenotype
plot_factor(mofa_obj, factor = 1, color_by = "condition")

# Get top features for each factor
weights <- get_weights(mofa_obj, views = "all", factors = 1:5, as.data.frame = TRUE)
top_genes <- weights %>%
  filter(view == "RNA") %>%
  group_by(factor) %>%
  slice_max(abs(value), n = 20)

# Pathway enrichment on top factor genes
library(gprofiler2)
enrichment <- gost(query = top_genes$feature, organism = "hsapiens")

mixOmics DIABLO Tutorial (Supervised Multi-Omics)

DIABLO finds sparse multi-omics signatures that discriminate groups.

library(mixOmics)

# Prepare matrices (samples x features)
X_list <- list(
  mRNA = t(rna_common),
  miRNA = t(mirna_common),
  Protein = t(protein_common)
)

# Outcome variable
Y <- as.factor(metadata$condition)

# Design matrix (0 = no correlation, 1 = full correlation between blocks)
design <- matrix(0.1, ncol = length(X_list), nrow = length(X_list),
                 dimnames = list(names(X_list), names(X_list)))
diag(design) <- 0

# Tune number of components and variables
tune_splsda <- tune.block.splsda(
  X = X_list,
  Y = Y,
  design = design,
  ncomp = 5,
  test.keepX = list(
    mRNA = c(10, 30, 50),
    miRNA = c(5, 10, 20),
    Protein = c(10, 30, 50)
  ),
  BPPARAM = BiocParallel::MulticoreParam(workers = 4)
)

# Extract optimal parameters
optimal_keepX <- tune_splsda$choice.keepX

# Train final model
diablo_model <- block.splsda(
  X = X_list,
  Y = Y,
  ncomp = tune_splsda$choice.ncomp$ncomp,
  keepX = optimal_keepX,
  design = design
)

# Visualize results
# Sample plot
plotIndiv(diablo_model, ind.names = FALSE, legend = TRUE,
          title = "DIABLO: Sample Space")

# Arrow plot (connect sample projections across omics)
plotArrow(diablo_model, ind.names = FALSE, legend = TRUE)

# Circos plot (correlations between selected features)
circosPlot(diablo_model, cutoff = 0.7, line = TRUE,
           color.blocks = c("darkorchid", "brown1", "lightgreen"))

# Get selected features
selected_features <- selectVar(diablo_model)
# mRNA: selected_features$mRNA$name
# Protein: selected_features$Protein$name

# Cross-validation performance
perf_diablo <- perf(diablo_model, validation = "Mfold", 
                     folds = 5, nrepeat = 10)
plot(perf_diablo)

Similarity Network Fusion (SNF)

SNF combines patient similarity networks from different omics into a unified network.

library(SNFtool)

# Calculate distance matrices for each omics
dist_rna <- dist2(as.matrix(t(rna_common)), as.matrix(t(rna_common)))
dist_protein <- dist2(as.matrix(t(protein_common)), as.matrix(t(protein_common)))
dist_meth <- dist2(as.matrix(t(meth_common)), as.matrix(t(meth_common)))

# Convert to affinity matrices
K <- 20  # Number of neighbors
alpha <- 0.5  # Hyperparameter for kernel

aff_rna <- affinityMatrix(dist_rna, K, alpha)
aff_protein <- affinityMatrix(dist_protein, K, alpha)
aff_meth <- affinityMatrix(dist_meth, K, alpha)

# Fuse networks
fused_network <- SNF(list(aff_rna, aff_protein, aff_meth), K, 20)

# Cluster patients using spectral clustering
num_clusters <- estimateNumberOfClustersGivenGraph(fused_network, NUMC = 2:5)
clusters <- spectralClustering(fused_network, K = num_clusters$`Eigen-gap best`)

# Visualize fused network
displayClusters(fused_network, clusters)

# Compare with individual omics clustering
cluster_rna <- spectralClustering(aff_rna, K = num_clusters$`Eigen-gap best`)
table(clusters, cluster_rna)  # Agreement between fused and RNA-only

Network-Based Integration with WGCNA

library(WGCNA)
allowWGCNAThreads()

# Transpose matrices (samples x genes)
datExpr_rna <- t(rna_common)
datExpr_protein <- t(protein_common)

# Build RNA co-expression network
softPower_rna <- pickSoftThreshold(datExpr_rna, verbose = 5)$powerEstimate
adjacency_rna <- adjacency(datExpr_rna, power = softPower_rna)
TOM_rna <- TOMsimilarity(adjacency_rna)
geneTree_rna <- hclust(as.dist(1 - TOM_rna), method = "average")

# Detect modules
modules_rna <- cutreeDynamic(dendro = geneTree_rna, distM = 1 - TOM_rna,
                              deepSplit = 2, minClusterSize = 30)

# Calculate module eigengenes
MEList_rna <- moduleEigengenes(datExpr_rna, colors = modules_rna)
MEs_rna <- MEList_rna$eigengenes

# Correlate RNA modules with protein expression
moduleTraitCor <- cor(MEs_rna, datExpr_protein, use = "p")
moduleTraitPvalue <- corPvalueStudent(moduleTraitCor, nrow(datExpr_rna))

# Visualize module-protein correlations
library(pheatmap)
pheatmap(moduleTraitCor[, 1:50],  # Top 50 proteins
         clustering_method = "ward.D2",
         main = "RNA Module - Protein Correlations")

Practical Quality Control for Multi-Omics

Plot missingness by layer and sample.
Check per-layer variance and outlier samples.
Visualize batch structure with PCA/UMAP before and after correction.
Inspect cross-omics concordance for known biology (for example gene-protein pairs).

# Quick missingness check (R)
miss_sample <- colMeans(is.na(protein_norm)) * 100
summary(miss_sample)

Common Pitfalls and How to Avoid Them

Feature mismatch across layers: map IDs early and document conversion steps.
Overfitting in high dimensions: use regularization and nested validation.
Ignoring batch effects: correct per layer, then verify correction worked.
Blind concatenation: scale/transform before combining matrices.
Data leakage: preprocess inside training folds only.

Mini Integration Example (Conceptual)

# Assume rna_log, protein_norm, met_common are feature x sample
# 1) Keep common samples
ids <- Reduce(intersect, list(colnames(rna_log), colnames(protein_norm), colnames(met_common)))

X_rna <- t(scale(t(rna_log[, ids])))
X_pro <- t(scale(t(protein_norm[, ids])))
X_met <- t(scale(t(met_common[, ids])))

# 2) Simple early integration baseline
X <- rbind(X_rna, X_pro, X_met)

# 3) Dimensionality reduction and clustering
pca <- prcomp(t(X), center = TRUE, scale. = FALSE)
plot(pca$x[, 1], pca$x[, 2], pch = 19, xlab = "PC1", ylab = "PC2")

Reproducibility Checklist

Version-control all preprocessing and integration scripts.
Save parameter files for normalization, imputation, and model tuning.
Keep a frozen metadata file with sample-to-omics mapping.
Export QC reports and intermediate matrices for auditability.
Track package versions (R sessionInfo(), Python pip freeze).

Summary

Multi-omics integration turns disconnected molecular measurements into systems-level biological insight. Strong results depend on disciplined preprocessing, careful feature/sample alignment, robust validation, and clear biological interpretation.

💬 Give Feedback

Help us improve! Share what worked, what was unclear, or suggest new topics.

Share Your Feedback

Multi-Omics Data Integration

Learning Goals

Why Multi-Omics?

Core Integration Strategies

1. Early Integration (Feature-Level)

2. Intermediate Integration (Latent-Level)

3. Late Integration (Decision-Level)

End-to-End Workflow

Step 1. Define the Biological Question

Step 2. Harmonize Samples and IDs

Step 3. Preprocess Each Layer Independently

Step 4. Align Features to Biological Knowledge

Step 5. Choose an Integration Model

Step 6. Validate Robustly

Integration Methods You Should Know

Correlation-Based Integration

Canonical Correlation Analysis (CCA)

Partial Least Squares / DIABLO (mixOmics)

Multi-Omics Factor Analysis (MOFA+)

Complete Integration Tutorials

MOFA+ Tutorial (R/Python)

mixOmics DIABLO Tutorial (Supervised Multi-Omics)

Similarity Network Fusion (SNF)

Network-Based Integration with WGCNA

Practical Quality Control for Multi-Omics

Common Pitfalls and How to Avoid Them

Mini Integration Example (Conceptual)

Reproducibility Checklist

Summary

Share This

💬 Give Feedback