CellProgramMapper

📖 Documentation: https://zaoqu-liu.github.io/CellProgramMapper/

Overview

CellProgramMapper is a high-performance R package for projecting single-cell transcriptomic data onto reference gene expression programs (GEPs). The package implements non-negative matrix factorization (NMF)-based methods for systematic characterization of cellular transcriptional states.

Methodology

Mathematical Framework

Given a query expression matrix $X \in \mathbb{R}^{n \times p}$ (n cells × p genes) and a reference spectra matrix $H \in \mathbb{R}^{k \times p}$ (k programs × p genes), CellProgramMapper estimates the usage matrix $W \in \mathbb{R}^{n \times k}$ by solving:

$$\min_{W \geq 0} \|X - WH\|_F^2$$

For each cell i, with $x_i$ and $w_i$ denoting the i-th rows of X and W written as column vectors, this decomposes into independent Non-Negative Least Squares (NNLS) subproblems:

$$\min_{w_i \geq 0} \|x_i - H^\top w_i\|_2^2$$
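
The row-wise split can be checked directly: the squared Frobenius norm is the sum of squared row norms, and transposing a row vector does not change its norm:

$$\|X - WH\|_F^2 = \sum_{i=1}^{n} \|x_i^\top - w_i^\top H\|_2^2 = \sum_{i=1}^{n} \|x_i - H^\top w_i\|_2^2$$

Each term depends on a single $w_i$, and the constraint $W \geq 0$ applies row by row, so every cell can be solved on its own (and therefore in parallel).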

Implementation

Two NNLS solvers are provided:

| Method | Algorithm | Reference |
|---|---|---|
| Coordinate Descent | Sequential coordinate-wise optimization | Franc et al. (2005) |
| Active Set | Lawson-Hanson algorithm | Lawson & Hanson (1974) |

The coordinate descent method is generally faster for typical problem sizes, while the active set method provides guaranteed finite convergence.
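
To make the coordinate descent idea concrete, here is a minimal base R sketch of a sequential coordinate-wise NNLS solver for a single cell. It is an illustration only, not the package's C++ implementation; the function name `nnls_cd`, its arguments, and the toy numbers are assumptions made for this example.

```r
# Sequential coordinate-wise NNLS for one cell (illustrative sketch only).
# Solves min_{w >= 0} ||x - t(H) %*% w||^2 for a k x p matrix H and a length-p vector x.
nnls_cd <- function(H, x, max_iter = 1000, tol = 1e-10) {
  G  <- H %*% t(H)               # k x k Gram matrix
  mu <- as.vector(-(H %*% x))    # mu = G w - H x, evaluated at w = 0
  w  <- numeric(nrow(H))
  for (iter in seq_len(max_iter)) {
    max_change <- 0
    for (j in seq_along(w)) {
      if (G[j, j] <= 0) next                     # skip all-zero programs
      w_new <- max(0, w[j] - mu[j] / G[j, j])    # 1-D minimizer, clipped at zero
      delta <- w_new - w[j]
      if (delta != 0) {
        mu   <- mu + delta * G[, j]              # keep mu = G w - H x in sync
        w[j] <- w_new
        max_change <- max(max_change, abs(delta))
      }
    }
    if (max_change < tol) break                  # stop once a full sweep changes nothing
  }
  w
}

# Toy example: 2 programs, 3 genes (made-up numbers)
H <- rbind(c(1, 0, 1),
           c(0, 1, 1))
w_true <- c(2, 0.5)
x <- as.vector(t(H) %*% w_true)
w_hat <- nnls_cd(H, x)   # recovers w_true up to the tolerance
```

Each sweep only needs the k × k Gram matrix, which is the same for every cell and can be computed once for the whole query.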

Preprocessing

Input data undergoes standardization by scaling each gene by its population standard deviation (without centering):

$$x'_j = \frac{x_j}{\sigma_j}, \quad \sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2}$$

This matches the preprocessing in sklearn.preprocessing.scale(X, with_mean=False).
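
The scaling step can be sketched in base R as follows. The helper name `scale_by_pop_sd` is hypothetical, and replacing zero standard deviations with 1 is an assumption made here so that constant genes are left unchanged instead of producing a division by zero. Note that base R's `sd()` uses the n − 1 denominator and `scale(X, center = FALSE)` divides by a root mean square, so neither reproduces this formula directly.

```r
# Scale each gene (column) by its population SD, without centering (sketch).
scale_by_pop_sd <- function(X) {
  X <- as.matrix(X)
  mu <- colMeans(X)
  sigma <- sqrt(colMeans(sweep(X, 2, mu)^2))   # divides by n, not n - 1
  sigma[sigma == 0] <- 1                       # assumption: leave constant genes unscaled
  sweep(X, 2, sigma, "/")
}

# Toy matrix: 2 cells (rows) x 3 genes (columns), made-up values
X <- rbind(c(1, 2, 0),
           c(3, 2, 4))
X_scaled <- scale_by_pop_sd(X)
```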

Key Features

CellProgramMapper provides a complete R-based solution for NMF-based cell annotation, with performance-critical steps implemented in C++:

| Feature | Implementation | Details |
|---|---|---|
| Core NMF usage fitting | C++ NNLS | Coordinate descent and active set algorithms |
| Data preprocessing | C++ implementation | Population standard deviation scaling |
| Score computation | R matrix ops | Weighted sum and threshold-based scores |
| Cell type prediction | Built-in ML models | Multinomial logistic regression |
| Reference management | curl download | Auto-caching with version control |
| NPZ file reading | Native R + reticulate | Full NumPy format support |
| Consensus reference building | R + C++ | Multi-dataset GEP clustering |

Built-in Machine Learning Models: The package includes pre-trained model parameters for cell type prediction, enabling accurate predictions without external dependencies.

# List available built-in models
list_builtin_models()

# Get T-cell lineage predictions (matches Python exactly)
labels <- predict_lineage(usage_norm, "TCAT.V1")

# Get probability distribution for each class
probs <- get_lineage_probabilities(usage_norm, "TCAT.V1")
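
For intuition, multinomial logistic regression turns a usage vector into class probabilities by applying a softmax to linear scores. The sketch below uses made-up coefficients and class names; `predict_multinomial` is a hypothetical helper and does not reflect the parameters or internals behind `predict_lineage()`.

```r
# Multinomial logistic regression prediction (illustrative; NOT the package's stored parameters).
predict_multinomial <- function(usage, coef, intercept) {
  logits <- sweep(usage %*% coef, 2, intercept, "+")   # cells x classes
  logits <- logits - apply(logits, 1, max)             # subtract row maxima for numerical stability
  probs  <- exp(logits)
  probs  <- probs / rowSums(probs)                     # softmax: each row sums to 1
  list(probs  = probs,
       labels = colnames(coef)[max.col(probs, ties.method = "first")])
}

# Toy inputs: 2 cells x 2 programs, 2 hypothetical classes
usage <- rbind(c(0.9, 0.1),
               c(0.2, 0.8))
coef <- matrix(c(2, -2, -2, 2), nrow = 2,
               dimnames = list(NULL, c("classA", "classB")))
intercept <- c(0, 0)
pred <- predict_multinomial(usage, coef, intercept)
```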

Installation

From R-universe (Recommended)

install.packages("CellProgramMapper", 
                 repos = "https://zaoqu-liu.r-universe.dev")

From GitHub

# install.packages("remotes")
remotes::install_github("Zaoqu-Liu/CellProgramMapper")

Dependencies

Required:

  • R (≥ 4.0.0)
  • Rcpp, RcppArmadillo
  • Matrix, data.table
  • curl, yaml, rappdirs
  • future, future.apply

Optional:

  • Seurat/SeuratObject (for Seurat integration)
  • hdf5r, anndata (for h5ad file support)
  • reticulate (for reading NPZ files with object arrays)

Quick Start

library(CellProgramMapper)

# Map cells to reference gene expression programs
result <- CellProgramMapper(
    query = seurat_obj,        # Seurat object, matrix, or file path
    reference = "TCAT.V1",     # Pre-built reference or custom file
    method = "cd",             # "cd" (coordinate descent) or "active_set"
    verbose = TRUE
)

# Access results
usage_matrix <- result$usage_norm   # Normalized usage (rows sum to 1)
scores <- result$scores             # Computed add-on scores

# Integration with Seurat
seurat_obj <- add_results_to_seurat(seurat_obj, result)

Available References

# List pre-built references
available_references()

Building Custom References

Construct consensus GEPs from multiple cNMF analyses:

consensus <- BuildConsensusReference(
    cnmf_paths = c("path/to/cnmf1", "path/to/cnmf2"),
    ks = c(10, 15),
    density_thresholds = c(0.1, 0.1),
    output_dir = "./consensus_output",
    corr_thresh = 0.5
)

Performance

CellProgramMapper is optimized for computational efficiency:

  • C++ Backend: Core NNLS solvers implemented in C++ via RcppArmadillo
  • Sparse Matrix Support: Native handling of sparse matrices
  • Parallel Processing: Optional parallelization via future framework
  • Batch Processing: Memory-efficient processing of large datasets

Output Structure

The CellProgramMapper() function returns a CellProgramMapperResult object containing:

| Field | Description |
|---|---|
| `usage` | Raw usage matrix (cells × programs) |
| `usage_norm` | Normalized usage matrix (rows sum to 1) |
| `scores` | Computed add-on scores |
| `overlap_genes` | Genes used for mapping |
| `n_cells` | Number of cells processed |
| `n_programs` | Number of programs |
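
The relationship between `usage` and `usage_norm` can be illustrated with a short base R sketch. The helper `normalize_usage` is hypothetical, and leaving all-zero rows as zeros is an assumption of this example; the package may handle that case differently.

```r
# Row-normalize a usage matrix so each cell's program usages sum to 1 (sketch).
normalize_usage <- function(usage) {
  totals <- rowSums(usage)
  totals[totals == 0] <- 1   # assumption: all-zero rows stay all-zero
  usage / totals             # vector recycles down columns, so row i is divided by totals[i]
}

# Toy usage matrix: 2 cells x 3 programs (made-up values)
usage <- rbind(cell1 = c(2, 1, 1),
               cell2 = c(0, 3, 1))
usage_norm <- normalize_usage(usage)
```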

Documentation

Detailed documentation and tutorials are available at: https://zaoqu-liu.github.io/CellProgramMapper/

References

  1. Lawson CL, Hanson RJ (1974). Solving Least Squares Problems. Prentice-Hall.
  2. Franc V, Hlavac V, Navara M (2005). Sequential Coordinate-Wise Algorithm for the Non-negative Least Squares Problem. CAIP 2005.
  3. Lee DD, Seung HS (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401:788-791.

Citation

If you use CellProgramMapper in your research, please cite:

@software{CellProgramMapper,
  author = {Liu, Zaoqu},
  title = {CellProgramMapper: Projection of Single-Cell Data onto Reference Gene Expression Programs},
  year = {2026},
  url = {https://github.com/Zaoqu-Liu/CellProgramMapper}
}

License

MIT License © 2026 Zaoqu Liu

Contact
