Dimensional Reduction of Massive-Scale Perturbation Data Through Network-Based Activity Inference

Community Article Published December 9, 2025

A Technical Report on Multi-Level Data Processing in Atlas 3.0


Abstract

High-throughput perturbation screens generate massive gene expression datasets, often with 60,000+ genes per experimental condition, that are noisy, high-dimensional, and difficult to interpret biologically. We present a systematic pipeline that reduces this dimensionality by ~99.6% through network-based inference of transcription factor (TF) and pathway activities. Applied to three major perturbation datasets (Tahoe 100M: 1.55 million activity scores; LINCS L1000: 720,216 drug signatures; SC-Perturb: 1.64 million activity scores from CRISPR-edited cells), our approach transforms billions of raw gene measurements into biologically interpretable TF and pathway activity scores using the DoRothEA (31,953 TF-target relationships) and PROGENy (252,769 gene-pathway weights) networks. The resulting ~205 million activity scores enable cross-platform validation, drug mechanism-of-action discovery, and pathway-level drug repurposing analyses while maintaining computational tractability. Dataset-specific normalization strategies, including plate-matched DMSO controls for Tahoe 100M and control-cell deltas for single-cell perturbations, help preserve biological validity across experimental platforms.

Keywords: dimensional reduction, transcription factor activity, pathway analysis, drug perturbations, single-cell CRISPR screens, DoRothEA, PROGENy


1. Introduction

1.1 The Challenge of High-Dimensional Perturbation Data

Modern perturbation screening technologies generate gene expression data at unprecedented scales. Single-cell RNA-sequencing of drug-treated or CRISPR-edited cells can measure 60,000+ genes across millions of cells, while platforms like LINCS L1000 have profiled 720,000+ drug treatment conditions. However, this data richness creates a fundamental analysis challenge: raw gene-level data is noisy, high-dimensional, and difficult to interpret mechanistically.

Extracting biological meaning from this scale requires:

  1. Dimensionality reduction to focus on relevant signals
  2. Noise reduction through averaging across gene sets
  3. Biological interpretability via established regulatory networks
  4. Cross-dataset comparability using consistent feature spaces

1.2 Network-Based Activity Inference

Our solution leverages two curated biological networks to transform gene-level data into interpretable regulatory activities:

DoRothEA (Discriminant Regulon Expression Analysis): A comprehensive resource of 31,953 high-confidence TF-target gene relationships across 242 transcription factors, curated from ChIP-seq, perturbation experiments, and literature (confidence levels A, B, C).

PROGENy (Pathway RespOnsive GENes): A collection of 252,769 gene-pathway footprint weights quantifying each gene's contribution to 14 canonical signaling pathways (EGFR, MAPK, PI3K, p53, TGFβ, TNFα, Trail, VEGF, Hypoxia, JAK-STAT, Androgen, Estrogen, WNT, NFκB).

By projecting gene expression signatures onto these networks using the Univariate Linear Model (ULM) from the decoupler framework, we achieve:

  • 99.6% dimensionality reduction: 60,000 genes → 242 TF activities
  • 99.98% dimensionality reduction: 60,000 genes → 14 pathway activities
  • Biological interpretability: "MAPK pathway activated" vs. "gene X upregulated"
  • Noise reduction: Each score aggregates 10 to 300 target genes
  • Cross-platform consistency: Same TFs/pathways across all datasets

1.3 Multi-Level Data Products

The BioAtlas processing pipeline generates two complementary data products optimized for different analytical workflows:

Gene-level signatures: Preserve individual gene responses with directionality and statistical significance. For Tahoe 100M, the top 200 differentially expressed genes per drug are stored in tahoe_drug_signature (74,885 drug-gene pairs). These enable gene-specific reversal scoring in drug discovery workflows.

Activity scores: Aggregate gene-level data to regulatory features (TFs and pathways) via network-based inference, stored in tahoe_activity (1.55 million scores across ~135,000 contexts). These dimensionally reduced scores enable mechanistic interpretation and cross-platform validation.


2. Methods

2.1 Core Algorithm: Decoupler Univariate Linear Model (ULM)

For each TF or pathway, we compute an activity score using:

activity_score = Σ(gene_expression × regulation_sign) / √n_targets

where:

  • gene_expression: log2FC, z-score, or normalized expression
  • regulation_sign: +1 (activation) or -1 (repression) from network
  • n_targets: number of target genes in the regulon/pathway

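The division by √n_targets normalizes scores across features with different numbers of targets, enabling fair comparison between TFs with 10 targets and TFs with 300 targets. This expression is a simplified view of the score: decoupler's ULM itself fits, for each sample and feature, a univariate linear regression of gene expression on the network weights and reports the t-statistic of the slope as the activity.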
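To make the scoring concrete, the following is a minimal sketch rather than the pipeline's actual code. The function names are hypothetical. `signed_sum_score` implements the formula exactly as written above, assuming n_targets counts only the targets that are present in the expression data. `ulm_t_statistic` shows the regression view used by ULM-style scoring as we understand it: expression is regressed on the network weights (0 for non-targets) and the slope's t-statistic is returned.

```python
import math


def signed_sum_score(expression, regulon):
    """Simplified score from the text: sum(expr * sign) / sqrt(n_targets).

    expression: dict gene -> value (log2FC, z-score, or normalized expression)
    regulon:    dict gene -> +1 (activation) or -1 (repression)
    Assumption: n_targets counts only targets found in `expression`.
    """
    targets = [g for g in regulon if g in expression]
    if not targets:
        return float("nan")
    total = sum(expression[g] * regulon[g] for g in targets)
    return total / math.sqrt(len(targets))


def ulm_t_statistic(expression, regulon):
    """t-statistic of the slope when regressing expression on network weights.

    Genes outside the regulon receive weight 0. For simple linear regression
    the slope t-statistic equals r * sqrt((n - 2) / (1 - r^2)), where r is
    the Pearson correlation between weights and expression.
    """
    genes = list(expression)
    n = len(genes)
    if n < 3:
        return float("nan")
    x = [float(regulon.get(g, 0.0)) for g in genes]
    y = [float(expression[g]) for g in genes]
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return float("nan")  # no variation, slope is undefined
    r = max(-1.0, min(1.0, sxy / math.sqrt(sxx * syy)))
    if r * r == 1.0:
        return math.copysign(math.inf, r)  # perfect fit
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Toy values for illustration only, not taken from any dataset.
toy_expression = {"A": 2.0, "B": -1.0, "C": 0.5, "D": 0.0}
toy_regulon = {"A": 1, "B": -1, "C": 1}
simple = signed_sum_score(toy_expression, toy_regulon)
t_value = ulm_t_statistic(toy_expression, toy_regulon)
```

Both functions give a positive score when activated targets go up and repressed targets go down. The regression form also uses the non-target genes as background, which the signed sum ignores.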

2.2 Dataset-Specific Processing Pipelines

2.2.1 Tahoe 100M: Pseudobulk Single-Cell Drug Perturbations

Source: 1,025 parquet files containing differential expression results from 100 million single cells treated with 379 drugs across 50 cell lines.

Processing steps:

  1. ID Mapping (85% gene coverage):

    • Gene symbols → Ensembl IDs via bio-kg gene table
    • Drug names → ChEMBL IDs (379 drugs mapped)
    • Cell lines → DepMap IDs (50 cell lines, 100% coverage)
  2. Network Filtering:

    • Retain only genes present in DoRothEA or PROGENy networks
    • Reduces the working gene set from 62,710 to ~6,000 genes per context, which shrinks the memory footprint
  3. DMSO Normalization:

    • Strategy 1: Exact match (plate + cell + time + feature)
    • Strategy 2: Partial match (plate + feature only)
    • Strategy 3: Raw score if no DMSO (flagged NO_CONTROL)
    • Result: ~60% of scores are DMSO-normalized, removing plate-level batch effects.
  4. ULM Scoring:

    • Apply decoupler ULM for DoRothEA (242 TFs) and PROGENy (12 pathways)

Output: 1,550,866 high-quality activities (242 TFs, 12 pathways).
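The network filtering and the three-tier DMSO fallback described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the function names and the EXACT_MATCH and PARTIAL_MATCH flags are hypothetical (only NO_CONTROL appears in the text), and normalization is assumed to mean subtracting the matched DMSO value.

```python
def filter_to_network_genes(expression, network_genes):
    """Keep only genes that appear in the DoRothEA or PROGENy networks."""
    return {g: v for g, v in expression.items() if g in network_genes}


def dmso_normalize(score, plate, cell, time, feature,
                   exact_controls, partial_controls):
    """Apply the fallback order from the text and report which tier was used.

    exact_controls:   dict (plate, cell, time, feature) -> DMSO value
    partial_controls: dict (plate, feature) -> DMSO value
    Assumption: normalization is subtraction of the DMSO value.
    """
    exact_key = (plate, cell, time, feature)
    if exact_key in exact_controls:          # Strategy 1: exact match
        return score - exact_controls[exact_key], "EXACT_MATCH"
    partial_key = (plate, feature)
    if partial_key in partial_controls:      # Strategy 2: partial match
        return score - partial_controls[partial_key], "PARTIAL_MATCH"
    return score, "NO_CONTROL"               # Strategy 3: raw score, flagged


# Placeholder identifiers and values, for illustration only.
exact = {("plate_1", "cell_A", "24h", "MAPK"): 0.5}
partial = {("plate_1", "MAPK"): 0.25}
normalized, flag = dmso_normalize(1.0, "plate_1", "cell_B", "24h", "MAPK",
                                  exact, partial)
```

Returning the tier alongside the score keeps the provenance of each value, so downstream analyses can restrict themselves to DMSO-normalized scores when needed.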

2.2.2 LINCS L1000: Landmark Gene Drug Signatures

Source: 720,216 drug treatment signatures from the LINCS Consortium.

Processing steps:

  1. Z-Score Input:

    • LINCS Consortium pre-normalizes to z-scores
    • z = (expression - median_vehicle) / MAD_vehicle
  2. ULM Scoring:

    • Apply decoupler ULM for DoRothEA (267 TFs) and PROGENy (14 pathways)
  3. Complete Scoring:

    • Store ALL TF and pathway scores (no filtering)

Output: 202,282,258 activities (192M TF scores + 10M pathway scores).
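The z-score in step 1 is a robust z-score built from the median and the median absolute deviation (MAD). The sketch below implements the formula as written in this report; the function name is hypothetical. The optional `scale` argument is our addition: setting it to 1.4826 makes the MAD comparable to a standard deviation for normally distributed data, and whether a given data release applies that constant should be checked against its documentation.

```python
from statistics import median


def robust_z(value, vehicle_values, scale=1.0):
    """z = (value - median_vehicle) / (scale * MAD_vehicle)."""
    center = median(vehicle_values)
    mad = median([abs(v - center) for v in vehicle_values])
    if mad == 0:
        return float("nan")  # vehicle wells show no spread
    return (value - center) / (scale * mad)


# Toy vehicle measurements, for illustration only.
vehicle = [1, 2, 3, 4, 100]   # median is 3, MAD is 1
z = robust_z(5, vehicle)      # (5 - 3) / 1 = 2.0
```

Because both the center and the spread are medians, the outlying vehicle value of 100 in the toy example has no effect on the result.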

2.2.3 SC-Perturb: Single-Cell CRISPR Screens

Source: 4 datasets totaling 983,954 cells with 207,937 genetic perturbations.

Processing steps:

  1. Control Normalization:

    • delta_expression = mean(perturbed_cells) - mean(control_cells)
    • Control cells: non-targeting guides within same dataset
  2. ULM Scoring:

    • Apply decoupler ULM to delta_expression

Output: 1,640,473 activities across 207,937 perturbations.
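The control normalization in step 1 amounts to a per-gene difference of means. A minimal sketch, assuming each cell is represented as a dict from gene to expression value and that all cells in a group share the same genes (the function names are hypothetical):

```python
from statistics import fmean


def mean_expression(cells):
    """Per-gene mean over a list of cells (each a dict gene -> value)."""
    if not cells:
        raise ValueError("need at least one cell")
    genes = cells[0].keys()
    return {g: fmean(cell[g] for cell in cells) for g in genes}


def delta_expression(perturbed_cells, control_cells):
    """delta = mean(perturbed_cells) - mean(control_cells), gene by gene."""
    perturbed = mean_expression(perturbed_cells)
    control = mean_expression(control_cells)
    return {g: perturbed[g] - control[g] for g in perturbed if g in control}


# Toy cells, for illustration only.
perturbed = [{"G1": 2.0, "G2": 1.0}, {"G1": 4.0, "G2": 1.0}]
control = [{"G1": 1.0, "G2": 1.0}, {"G1": 1.0, "G2": 3.0}]
delta = delta_expression(perturbed, control)  # {"G1": 2.0, "G2": -1.0}
```

The resulting delta vector is what would then be passed to ULM scoring, with control cells taken from non-targeting guides in the same dataset.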

2.3 Additional Data Normalization

2.3.1 pChEMBL Potency Standardization

Problem: Heterogeneous potency measurement types (Ki, IC50, Kd, EC50).
Solution: Unified pChEMBL normalization, pChEMBL = -log10(value in molar), so that 10 nM maps to 8.
Output: 1.35M standardized potency measurements.
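As a worked example of the -log10(molar) conversion, the sketch below converts a potency value with a concentration unit into a pChEMBL-style value. The unit table and function name are illustrative assumptions, and real records would need additional handling (for example censored values such as ">10 uM").

```python
import math

# Conversion factors to molar; the set of units here is an assumption.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def pchembl(value, unit):
    """Return -log10 of the potency expressed in molar."""
    if value <= 0:
        raise ValueError("potency must be positive")
    molar = value * UNIT_TO_MOLAR[unit]
    return -math.log10(molar)


ki_value = pchembl(10, "nM")    # 10 nM = 1e-8 M, so about 8
ic50_value = pchembl(1, "uM")   # 1 uM = 1e-6 M, so about 6
```

On this scale higher values mean stronger potency, and each unit corresponds to a tenfold change in concentration.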

2.3.2 GWAS Variant Harmonization

Pipeline:

  1. QC filters: MAF > 0.01, INFO > 0.8, p < 5×10⁻⁸
  2. Coordinate standardization to GRCh38
  3. Allele harmonization

Output: 160M+ harmonized variants across 443K studies.
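The QC thresholds and the allele harmonization step can be sketched as below. The thresholds come from the list above; everything else is an assumption made for illustration, including the field names, the function names, and the harmonization rule (express the effect relative to the reference panel's alternate allele, flipping the sign when alleles are swapped and trying the opposite strand for single-nucleotide variants). Coordinate liftover to GRCh38 is not shown.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def passes_qc(variant, maf_min=0.01, info_min=0.8, p_max=5e-8):
    """Apply the three filters: MAF > 0.01, INFO > 0.8, p < 5e-8."""
    return (variant["maf"] > maf_min
            and variant["info"] > info_min
            and variant["p"] < p_max)


def harmonize_alleles(effect, other, beta, ref, alt):
    """Return beta relative to `alt`, or None if the alleles cannot be matched.

    Note: strand-ambiguous (palindromic) A/T and C/G variants are matched on
    the given strand only; a fuller implementation would treat them separately.
    """
    if (effect, other) == (alt, ref):
        return beta
    if (effect, other) == (ref, alt):
        return -beta                      # alleles swapped, flip the effect
    if effect not in COMPLEMENT or other not in COMPLEMENT:
        return None                       # indels are not strand-flipped here
    flipped = (COMPLEMENT[effect], COMPLEMENT[other])
    if flipped == (alt, ref):
        return beta
    if flipped == (ref, alt):
        return -beta
    return None


# Toy record, for illustration only.
record = {"maf": 0.05, "info": 0.95, "p": 1e-9}
keep = passes_qc(record)
beta = harmonize_alleles("A", "G", 0.2, ref="A", alt="G")  # swapped, gives -0.2
```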

3. Results

3.1 Dimensionality Reduction Achieved

| Dataset | Input Dimensions | Output TFs | Output Pathways | Reduction Factor |
|---|---|---|---|---|
| Tahoe 100M | 62,710 genes | 242 | 12 | 246× |
| LINCS L1000 | 978 genes | 267 | 14 | 3.5× |
| SC-Perturb | 8K-23K genes | 242 | 14 | 60-95× |
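The reduction factor can be read as the number of input genes divided by the number of output features. The sketch below assumes that the output features are the TFs plus the pathways, an interpretation that reproduces the Tahoe 100M and LINCS L1000 rows; the function name is hypothetical.

```python
def reduction_factor(n_genes, n_tfs, n_pathways):
    """Input genes per output feature (TFs plus pathways)."""
    return n_genes / (n_tfs + n_pathways)


tahoe = reduction_factor(62_710, 242, 12)   # 62,710 / 254, about 246.9
lincs = reduction_factor(978, 267, 14)      # 978 / 281, about 3.5
```

The small factor for LINCS L1000 reflects its input of 978 landmark genes, which is already compact, so there the main gain from activity inference is interpretability rather than size.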

Overall: From billions of gene measurements to roughly 205.5 million interpretable activity scores (1,550,866 + 202,282,258 + 1,640,473 = 205,473,597).

3.2 Coverage Statistics

| Metric | Tahoe 100M | LINCS L1000 | SC-Perturb | Total |
|---|---|---|---|---|
| Experimental conditions | ~135,000 | 720,216 | 207,937 | ~1M |
| Total activities | 1,550,866 | 202,282,258 | 1,640,473 | 205M+ |
| Unique drugs/perturbations | 379 | 33,609 | 5,998 genes | ~40K |
| Cell lines/types | 50 | 230 | 4 | 284 |

4. Applications Enabled

4.1 Drug Mechanism-of-Action Discovery

Example: Paclitaxel activates the Trail pathway (apoptosis) across multiple cell lines in Tahoe 100M, in line with its microtubule-stabilizing mechanism, which arrests cells in mitosis and leads to apoptosis.

4.2 Cross-Platform Validation

Drugs profiled in both Tahoe 100M and LINCS L1000 show the same sign of pathway activity on the two platforms, and this agreement supports higher-confidence biomarker discovery.
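One simple way to quantify this kind of agreement is the fraction of shared drug-pathway pairs whose scores have the same sign on both platforms. The sketch below is an illustration under that assumption, not the analysis actually used; the function names, keys, and values are placeholders.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_concordance(scores_a, scores_b):
    """Fraction of shared (drug, pathway) keys with matching non-zero signs."""
    shared = [k for k in scores_a if k in scores_b]
    informative = [k for k in shared
                   if scores_a[k] != 0 and scores_b[k] != 0]
    if not informative:
        return float("nan")
    agree = sum(1 for k in informative
                if sign(scores_a[k]) == sign(scores_b[k]))
    return agree / len(informative)


# Placeholder scores, for illustration only.
platform_a = {("drug_1", "MAPK"): 1.2, ("drug_1", "p53"): -0.5,
              ("drug_2", "MAPK"): 0.3, ("drug_3", "MAPK"): 2.0}
platform_b = {("drug_1", "MAPK"): 0.8, ("drug_1", "p53"): 0.4,
              ("drug_2", "MAPK"): 0.1}
concordance = sign_concordance(platform_a, platform_b)  # 2 of 3 pairs agree
```

Comparing signs rather than magnitudes sidesteps the different score scales of the two platforms, since one is built from log fold changes and the other from z-scores.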

4.3 TF-Drug-Disease Links

By joining TF activities with disease-gene associations, we can identify drugs that modulate disease-specific transcription factors (e.g., finding drugs that activate TP53 in p53-deficient cancers).
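The join described above can be sketched in a few lines. The data layout, the function name, and the score threshold are all hypothetical choices made for illustration; the actual tables and cutoffs are not specified in this report.

```python
def drugs_modulating_disease_tfs(tf_activity, disease_genes, disease,
                                 min_abs_score=2.0):
    """Join TF activities with disease-gene associations.

    tf_activity:   iterable of (drug, tf, score) tuples
    disease_genes: dict disease -> set of associated gene symbols
    Returns hits for TFs linked to `disease`, strongest effect first.
    The threshold `min_abs_score` is an arbitrary illustrative cutoff.
    """
    genes = disease_genes.get(disease, set())
    hits = [(drug, tf, score) for drug, tf, score in tf_activity
            if tf in genes and abs(score) >= min_abs_score]
    return sorted(hits, key=lambda hit: -abs(hit[2]))


# Placeholder records, for illustration only.
activity = [("drug_1", "TP53", 3.1), ("drug_2", "TP53", 0.4),
            ("drug_3", "MYC", -2.5)]
associations = {"disease_X": {"TP53"}}
candidates = drugs_modulating_disease_tfs(activity, associations, "disease_X")
```

Keeping the sign of the score in the output lets a follow-up step separate drugs that raise a TF's activity from those that lower it.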


5. Conclusion

We present a comprehensive data processing framework that transforms massive-scale biological data into interpretable, analysis-ready resources. The framework encompasses:

  1. Dimensional reduction: 99.6% reduction of gene expression data via network-based activity inference.
  2. Potency standardization: Unified pChEMBL normalization of 1.35M drug-target measurements.
  3. Genetic data harmonization: QC-filtered processing of 160M+ GWAS variants and 414K tissue-specific eQTL leads.
  4. Dataset-specific normalization: Custom QC strategies appropriate to each experimental platform.

The result is BioAtlas: a 490M+ row knowledge base that bridges the gap between raw high-throughput measurements and actionable biological knowledge.
