Dimensional Reduction of Massive-Scale Perturbation Data Through Network-Based Activity Inference
A Technical Report on Multi-Level Data Processing in Atlas 3.0
Abstract
High-throughput perturbation screens generate massive gene expression datasets, often containing 60,000+ genes per experimental condition, that are noisy, high-dimensional, and difficult to interpret biologically. We present a systematic pipeline that reduces this complexity by ~99.6% through network-based inference of transcription factor (TF) and pathway activities. Applied to three major perturbation datasets (Tahoe 100M: 1.55 million activity scores; LINCS L1000: 720,216 drug signatures; SC-Perturb: 1.64 million activity scores from CRISPR-edited cells), our approach transforms billions of raw gene measurements into biologically interpretable TF and pathway activity scores using the DoRothEA (31,953 TF-target relationships) and PROGENy (252,769 gene-pathway weights) networks. The resulting ~205.5 million activity scores enable cross-platform validation, drug mechanism-of-action discovery, and pathway-level drug repurposing analyses while maintaining computational tractability. Dataset-specific normalization strategies, including plate-matched DMSO controls for Tahoe 100M and control cell deltas for single-cell perturbations, ensure biological validity across experimental platforms.
Keywords: dimensional reduction, transcription factor activity, pathway analysis, drug perturbations, single-cell CRISPR screens, DoRothEA, PROGENy
1. Introduction
1.1 The Challenge of High-Dimensional Perturbation Data
Modern perturbation screening technologies generate gene expression data at unprecedented scales. Single-cell RNA-sequencing of drug-treated or CRISPR-edited cells can measure 60,000+ genes across millions of cells, while platforms like LINCS L1000 have profiled 720,000+ drug treatment conditions. However, this data richness creates a fundamental analysis challenge: raw gene-level data is noisy, high-dimensional, and difficult to interpret mechanistically.
Extracting biological meaning from this scale requires:
- Dimensionality reduction to focus on relevant signals
- Noise reduction through averaging across gene sets
- Biological interpretability via established regulatory networks
- Cross-dataset comparability using consistent feature spaces
1.2 Network-Based Activity Inference
Our solution leverages two curated biological networks to transform gene-level data into interpretable regulatory activities:
DoRothEA (Discriminant Regulon Expression Analysis): A comprehensive resource of 31,953 high-confidence TF-target gene relationships across 242 transcription factors, curated from ChIP-seq, perturbation experiments, and literature (confidence levels A, B, C).
PROGENy (Pathway RespOnsive GENes): A collection of 252,769 gene-pathway footprint weights quantifying each gene's contribution to 14 canonical signaling pathways (EGFR, MAPK, PI3K, p53, TGFβ, TNFα, Trail, VEGF, Hypoxia, JAK-STAT, Androgen, Estrogen, WNT, NFκB).
By projecting gene expression signatures onto these networks using the Univariate Linear Model (ULM) from the decoupler framework, we achieve:
- 99.6% dimensionality reduction: 60,000 genes → 242 TF activities
- 99.98% dimensionality reduction: 60,000 genes → 14 pathway activities
- Biological interpretability: "MAPK pathway activated" vs. "gene X upregulated"
- Noise reduction: Each score aggregates 10-300 target genes
- Cross-platform consistency: Same TFs/pathways across all datasets
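The two reduction percentages above follow directly from the ratio of output features to input genes. A quick arithmetic check, assuming the round figure of 60,000 input genes quoted in the text (the function name is illustrative):

```python
def percent_reduction(n_input: int, n_output: int) -> float:
    """Percentage of dimensions removed when going from n_input to n_output features."""
    return 100.0 * (1.0 - n_output / n_input)

tf_reduction = percent_reduction(60_000, 242)      # genes -> TF activities
pathway_reduction = percent_reduction(60_000, 14)  # genes -> pathway activities
print(f"{tf_reduction:.1f}% {pathway_reduction:.2f}%")  # 99.6% 99.98%
```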
1.3 Multi-Level Data Products
The BioAtlas processing pipeline generates two complementary data products optimized for different analytical workflows:
Gene-level signatures: Preserve individual gene responses with directionality and statistical significance. For Tahoe 100M, the top-200 differentially expressed genes per drug are stored in tahoe_drug_signature (74,885 drug-gene pairs). These enable gene-specific reversal scoring in drug discovery workflows.
Activity scores: Aggregate gene-level data to regulatory features (TFs and pathways) via network-based inference, stored in tahoe_activity (1.55 million scores across ~135,000 contexts). These dimensionally reduced scores enable mechanistic interpretation and cross-platform validation.
2. Methods
2.1 Core Algorithm: Decoupler Univariate Linear Model (ULM)
For each TF or pathway, we compute an activity score using:
activity_score = Σ(gene_expression × regulation_sign) / √n_targets
where:
- gene_expression: log2FC, z-score, or normalized expression
- regulation_sign: +1 (activation) or -1 (repression) from the network
- n_targets: number of target genes in the regulon/pathway
The division by √n_targets normalizes scores across features with different numbers of targets, enabling fair comparison between TFs with 10 targets vs. 300 targets.
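The formula can be illustrated with a minimal sketch in plain Python. It implements the expression exactly as written above and does not call the decoupler library, whose own ULM routine may differ in detail. The gene names and values are hypothetical, and treating targets missing from the expression data as zero is an assumption.

```python
import math

def activity_score(expression: dict[str, float], regulon: dict[str, int]) -> float:
    """Signed sum of target-gene values divided by sqrt(number of targets).

    expression: gene -> value (log2FC, z-score, or normalized expression)
    regulon:    target gene -> regulation sign (+1 activation, -1 repression)
    Assumption: targets missing from `expression` contribute 0, while
    n_targets still counts the full regulon.
    """
    n_targets = len(regulon)
    if n_targets == 0:
        raise ValueError("regulon has no targets")
    signed_sum = sum(expression.get(gene, 0.0) * sign for gene, sign in regulon.items())
    return signed_sum / math.sqrt(n_targets)

# Hypothetical toy regulon: three activated targets and one repressed target.
toy_regulon = {"GENE_A": 1, "GENE_B": 1, "GENE_C": 1, "GENE_D": -1}
toy_log2fc = {"GENE_A": 1.0, "GENE_B": 2.0, "GENE_C": 0.5, "GENE_D": -0.5}
score = activity_score(toy_log2fc, toy_regulon)  # (1 + 2 + 0.5 + 0.5) / sqrt(4)
```

In this toy case the repressed target is down-regulated, so it adds to the score rather than subtracting from it.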
2.2 Dataset-Specific Processing Pipelines
2.2.1 Tahoe 100M: Pseudobulk Single-Cell Drug Perturbations
Source: 1,025 parquet files containing differential expression results from 100 million single cells treated with 379 drugs across 50 cell lines.
Processing steps:
ID Mapping (85% gene coverage):
- Gene symbols → Ensembl IDs via bio-kg gene table
- Drug names → ChEMBL IDs (379 drugs mapped)
- Cell lines → DepMap IDs (50 cell lines, 100% coverage)
Network Filtering:
- Retain only genes present in DoRothEA or PROGENy networks
- Reduces the working gene set from 62,710 to ~6,000 genes per context, lowering the memory footprint
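The filtering step amounts to a set-membership test against the union of the two networks. A minimal sketch, with hypothetical gene names and a hypothetical function name:

```python
def filter_to_network_genes(expression, dorothea_targets, progeny_genes):
    """Keep only genes that appear in at least one of the two networks."""
    network_genes = set(dorothea_targets) | set(progeny_genes)
    return {gene: value for gene, value in expression.items() if gene in network_genes}

# Hypothetical toy inputs; G1..G9 are placeholder gene identifiers.
context_expression = {"G1": 0.8, "G2": -1.2, "G3": 0.1, "G4": 2.4}
kept = filter_to_network_genes(context_expression, {"G1", "G9"}, {"G4"})
```

Genes outside both networks (G2 and G3 here) cannot contribute to any activity score, so dropping them early does not change the downstream result.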
DMSO Normalization:
- Strategy 1: Exact match (plate + cell + time + feature)
- Strategy 2: Partial match (plate + feature only)
- Strategy 3: Raw score if no DMSO (flagged NO_CONTROL)
- Result: ~60% of scores are DMSO-normalized, removing plate-level batch effects.
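The three-tier fallback can be sketched as a pair of dictionary lookups. The text does not state how the DMSO score is applied, so subtraction is an assumption here, as are the "EXACT" and "PARTIAL" flag names and all identifiers in the toy data; only NO_CONTROL comes from the description above.

```python
def dmso_normalize(score, plate, cell, time, feature, exact_controls, partial_controls):
    """Return (normalized_score, flag) using the three-tier fallback.

    exact_controls:   (plate, cell, time, feature) -> DMSO score
    partial_controls: (plate, feature) -> DMSO score
    Assumption: normalization subtracts the matched DMSO score.
    """
    exact_key = (plate, cell, time, feature)
    if exact_key in exact_controls:            # Strategy 1: exact match
        return score - exact_controls[exact_key], "EXACT"
    partial_key = (plate, feature)
    if partial_key in partial_controls:        # Strategy 2: partial match
        return score - partial_controls[partial_key], "PARTIAL"
    return score, "NO_CONTROL"                 # Strategy 3: raw score, flagged

# Hypothetical controls; plate, cell, and time labels are placeholders.
exact = {("P1", "CELL_1", "24h", "TF_X"): 0.5}
partial = {("P1", "TF_X"): 0.25, ("P2", "TF_X"): 1.0}
r1 = dmso_normalize(2.0, "P1", "CELL_1", "24h", "TF_X", exact, partial)
r2 = dmso_normalize(2.0, "P2", "CELL_1", "24h", "TF_X", exact, partial)
r3 = dmso_normalize(2.0, "P3", "CELL_1", "24h", "TF_X", exact, partial)
```

Returning the flag alongside the score keeps unnormalized values identifiable downstream.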
ULM Scoring:
- Apply decoupler ULM for DoRothEA (242 TFs) and PROGENy (12 pathways)
Output: 1,550,866 high-quality activities (242 TFs, 12 pathways).
2.2.2 LINCS L1000: Landmark Gene Drug Signatures
Source: 720,216 drug treatment signatures from the LINCS Consortium.
Processing steps:
Z-Score Input:
- LINCS Consortium pre-normalizes to z-scores
- z = (expression - median_vehicle) / MAD_vehicle
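For reference, the z-score formula above can be written out for a single gene as follows. This is only an illustration of the formula as stated, not the LINCS Consortium's code; some robust z-score definitions additionally scale the MAD by a constant (about 1.4826), which is omitted here because the formula in the text does not include it.

```python
import statistics

def vehicle_z_scores(treated, vehicle):
    """z = (expression - median_vehicle) / MAD_vehicle for one gene."""
    median_vehicle = statistics.median(vehicle)
    mad_vehicle = statistics.median([abs(v - median_vehicle) for v in vehicle])
    if mad_vehicle == 0:
        raise ValueError("MAD of vehicle samples is zero; z-score undefined")
    return [(x - median_vehicle) / mad_vehicle for x in treated]

# Hypothetical vehicle values: median 6, absolute deviations 2, 1, 0, 1, 2 -> MAD 1.
vehicle = [4.0, 5.0, 6.0, 7.0, 8.0]
z = vehicle_z_scores([9.0, 6.0], vehicle)
```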
ULM Scoring:
- Apply decoupler ULM for DoRothEA (267 TFs) and PROGENy (14 pathways)
Complete Scoring:
- Store ALL TF and pathway scores (no filtering)
Output: 202,282,258 activities (192M TF scores + 10M pathway scores).
2.2.3 SC-Perturb: Single-Cell CRISPR Screens
Source: 4 datasets totaling 983,954 cells with 207,937 genetic perturbations.
Processing steps:
Control Normalization:
- delta_expression = mean(perturbed_cells) - mean(control_cells)
- Control cells: non-targeting guides within same dataset
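The control normalization can be sketched per gene as a difference of means. Cells are represented as plain dictionaries for illustration; the function names, gene names, and the choice to count a gene absent from a cell as zero are assumptions.

```python
def gene_means(cells):
    """Mean expression per gene over a list of cells (each a dict gene -> value)."""
    genes = {gene for cell in cells for gene in cell}
    # Assumption: a gene absent from a cell counts as zero expression.
    return {gene: sum(cell.get(gene, 0.0) for cell in cells) / len(cells) for gene in genes}

def delta_expression(perturbed_cells, control_cells):
    """Per-gene mean(perturbed_cells) - mean(control_cells)."""
    perturbed = gene_means(perturbed_cells)
    control = gene_means(control_cells)
    genes = set(perturbed) | set(control)
    return {gene: perturbed.get(gene, 0.0) - control.get(gene, 0.0) for gene in genes}

# Hypothetical toy data: two perturbed cells and two non-targeting control cells.
perturbed = [{"G1": 2.0, "G2": 0.0}, {"G1": 4.0, "G2": 2.0}]
control = [{"G1": 1.0, "G2": 1.0}, {"G1": 1.0, "G2": 3.0}]
delta = delta_expression(perturbed, control)
```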
ULM Scoring:
- Apply decoupler ULM to delta_expression
Output: 1,640,473 activities across 207,937 perturbations.
2.3 Additional Data Normalization
2.3.1 pChEMBL Potency Standardization
Problem: Heterogeneous potency measures (Ki, IC50, Kd, EC50) and units.
Solution: Unified pChEMBL normalization, defined as -log10 of the molar concentration.
Output: 1.35M standardized potency measurements.
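As a worked example of the -log10(molar) transform, the sketch below converts a concentration to molar and takes the negative base-10 logarithm. The unit table and function name are illustrative assumptions, not the pipeline's actual code.

```python
import math

# Conversion factors from common concentration units to molar (illustrative subset).
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}

def pchembl(value: float, unit: str) -> float:
    """Return -log10 of the concentration expressed in molar."""
    molar = value * UNIT_TO_MOLAR[unit]
    if molar <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(molar)

p_10nm = pchembl(10, "nM")  # 10 nM = 1e-8 M, so the value is approximately 8
p_1um = pchembl(1, "uM")    # 1 uM = 1e-6 M, so the value is approximately 6
```

Higher values mean higher potency: each unit corresponds to a tenfold lower concentration.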
2.3.2 GWAS Variant Harmonization
Pipeline:
- QC filters: MAF > 0.01, INFO > 0.8, p < 5×10⁻⁸
- Coordinate standardization to GRCh38
- Allele harmonization
Output: 160M+ harmonized variants across 443K studies.
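The QC thresholds listed in this pipeline translate into a simple predicate. A minimal sketch, assuming each variant is a dictionary with hypothetical field names "maf", "info", and "p":

```python
def passes_qc(variant, min_maf=0.01, min_info=0.8, max_p=5e-8):
    """True if the variant clears all three thresholds (strict inequalities)."""
    return (
        variant["maf"] > min_maf
        and variant["info"] > min_info
        and variant["p"] < max_p
    )

# Hypothetical variants; identifiers and values are placeholders.
variants = [
    {"id": "var_1", "maf": 0.05, "info": 0.95, "p": 1e-10},   # passes
    {"id": "var_2", "maf": 0.005, "info": 0.95, "p": 1e-10},  # fails MAF
    {"id": "var_3", "maf": 0.20, "info": 0.70, "p": 1e-10},   # fails INFO
    {"id": "var_4", "maf": 0.20, "info": 0.95, "p": 1e-4},    # fails p-value
]
kept_ids = [v["id"] for v in variants if passes_qc(v)]
```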
3. Results
3.1 Dimensionality Reduction Achieved
| Dataset | Input Dimensions | Output TFs | Output Pathways | Reduction Factor |
|---|---|---|---|---|
| Tahoe 100M | 62,710 genes | 242 | 12 | 247× |
| LINCS L1000 | 978 genes | 267 | 14 | 3.5× |
| SC-Perturb | 8K-23K genes | 242 | 14 | 60-95× |
Overall: From billions of gene measurements to ~205.5 million interpretable activity scores.
3.2 Coverage Statistics
| Metric | Tahoe 100M | LINCS L1000 | SC-Perturb | Total |
|---|---|---|---|---|
| Experimental conditions | ~135,000 | 720,216 | 207,937 | ~1M |
| Total activities | 1,550,866 | 202,282,258 | 1,640,473 | 205M+ |
| Unique drugs/perturbations | 379 | 33,609 | 5,998 genes | ~40K |
| Cell lines/types | 50 | 230 | 4 | 284 |
4. Applications Enabled
4.1 Drug Mechanism-of-Action Discovery
Example: Paclitaxel consistently activates the Trail pathway (apoptosis) across multiple cell lines in Tahoe 100M, consistent with its microtubule-disrupting mechanism.
4.2 Cross-Platform Validation
Shared drugs show consistent sign direction for pathway activities across Tahoe and LINCS L1000, enabling high-confidence biomarker discovery.
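One simple way to quantify this agreement is the fraction of shared (drug, pathway) pairs whose scores have the same sign on both platforms. The sketch below is an illustration under that assumption, not the validation procedure used in the report; the drug names and scores are hypothetical.

```python
def sign(x: float) -> int:
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)

def sign_concordance(scores_a, scores_b):
    """Fraction of shared keys with matching, non-zero signs; None if nothing is shared."""
    shared = [
        key for key in scores_a
        if key in scores_b and scores_a[key] != 0 and scores_b[key] != 0
    ]
    if not shared:
        return None
    agree = sum(sign(scores_a[key]) == sign(scores_b[key]) for key in shared)
    return agree / len(shared)

# Hypothetical pathway scores keyed by (drug, pathway).
tahoe_scores = {("drug_X", "MAPK"): -1.2, ("drug_X", "p53"): 0.8, ("drug_Y", "MAPK"): 0.3}
lincs_scores = {("drug_X", "MAPK"): -0.4, ("drug_X", "p53"): 1.1,
                ("drug_Y", "MAPK"): -0.2, ("drug_Z", "p53"): 0.5}
concordance = sign_concordance(tahoe_scores, lincs_scores)  # 2 of 3 shared pairs agree
```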
4.3 TF-Drug-Disease Links
By joining TF activities with disease-gene associations, we can identify drugs that modulate disease-specific transcription factors (e.g., finding drugs that activate TP53 in p53-deficient cancers).
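Such a join can be sketched with plain Python structures. The record layout, the activation threshold, and every name in the toy data are hypothetical; an actual query would run against the knowledge base tables.

```python
def drugs_for_disease_tfs(tf_activities, disease_tfs, min_score):
    """Join (drug, tf, score) records with disease -> TF sets.

    Returns sorted (disease, tf, drug, score) tuples where the drug raises the
    activity of a disease-associated TF to at least min_score.
    Assumption: min_score is an arbitrary illustrative cutoff.
    """
    hits = []
    for disease, tfs in disease_tfs.items():
        for drug, tf, score in tf_activities:
            if tf in tfs and score >= min_score:
                hits.append((disease, tf, drug, score))
    return sorted(hits)

# Hypothetical inputs.
tf_activities = [
    ("drug_A", "TP53", 2.5),
    ("drug_B", "TP53", -0.7),
    ("drug_C", "TF_Q", 3.0),
]
disease_tfs = {"disease_1": {"TP53"}}
hits = drugs_for_disease_tfs(tf_activities, disease_tfs, min_score=1.0)
```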
5. Conclusion
We present a comprehensive data processing framework that transforms massive-scale biological data into interpretable, analysis-ready resources. The framework encompasses:
- Dimensional reduction: 99.6% reduction of gene expression data via network-based activity inference.
- Potency standardization: Unified pChEMBL normalization of 1.35M drug-target measurements.
- Genetic data harmonization: QC-filtered processing of 160M+ GWAS variants and 414K tissue-specific eQTL leads.
- Dataset-specific normalization: Custom QC strategies appropriate to each experimental platform.
The result is BioAtlas: a 490M+ row knowledge base that bridges the gap between raw high-throughput measurements and actionable biological knowledge.