📖 Documentation: https://zaoqu-liu.github.io/CellProgramMapper/
CellProgramMapper is a high-performance R package for projecting single-cell transcriptomic data onto reference gene expression programs (GEPs). The package implements non-negative matrix factorization (NMF)-based methods for systematic characterization of cellular transcriptional states.
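Given a query expression matrix $X \in \mathbb{R}^{n \times p}$ (n cells × p genes) and a reference spectra matrix $H \in \mathbb{R}^{k \times p}$ (k programs × p genes), CellProgramMapper estimates the usage matrix $W \in \mathbb{R}^{n \times k}$ by solving:

$$\min_{W \geq 0} \; \lVert X - WH \rVert_F^2$$

For each cell i, this decomposes into independent Non-Negative Least Squares (NNLS) subproblems:

$$\min_{w_i \geq 0} \; \lVert x_i - w_i H \rVert_2^2$$

where $x_i$ and $w_i$ denote the i-th rows of $X$ and $W$.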
Two NNLS solvers are provided:
| Method | Algorithm | Reference |
|---|---|---|
| Coordinate Descent | Sequential coordinate-wise optimization | Franc et al. (2005) |
| Active Set | Lawson-Hanson algorithm | Lawson & Hanson (1974) |
The coordinate descent method is generally faster for typical problem sizes, while the active set method provides guaranteed finite convergence.
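To make the coordinate descent idea concrete, the following is a minimal base-R sketch of sequential coordinate-wise NNLS for the per-cell problem of minimising ||x − wH||² subject to w ≥ 0. It is an illustration only, not the package's C++ solver; the function name `nnls_cd` and its `max_iter` and `tol` arguments are assumptions made for this example.

```r
# Sequential coordinate descent for one NNLS subproblem:
#   minimise ||x - w H||^2  subject to  w >= 0
# H: k x p reference spectra, x: length-p expression vector for one cell.
# NOTE: hypothetical helper for illustration, not the package API.
nnls_cd <- function(H, x, max_iter = 5000L, tol = 1e-10) {
  A <- tcrossprod(H)          # k x k Gram matrix H H^T
  b <- as.vector(H %*% x)     # k-vector H x
  w <- numeric(nrow(H))       # start from w = 0
  g <- -b                     # gradient A w - b evaluated at w = 0
  for (iter in seq_len(max_iter)) {
    max_step <- 0
    for (j in seq_along(w)) {
      if (A[j, j] <= 0) next  # skip all-zero programs
      # Exact minimiser along coordinate j, projected onto w_j >= 0
      w_new <- max(0, w[j] - g[j] / A[j, j])
      delta <- w_new - w[j]
      if (delta != 0) {
        g <- g + delta * A[, j]   # keep the gradient in sync
        w[j] <- w_new
        max_step <- max(max_step, abs(delta))
      }
    }
    if (max_step < tol) break     # no coordinate moved appreciably
  }
  w
}

# Toy check on noiseless data: 4 cells, 3 programs, 20 genes
set.seed(1)
H <- matrix(runif(3 * 20), nrow = 3)
W_true <- rbind(c(2, 0, 1), c(0, 0.5, 0), c(1, 1, 1), c(0, 0, 3))
X <- W_true %*% H

# Each cell (row of X) is solved independently
W_hat <- t(apply(X, 1, function(x) nnls_cd(H, x)))
```

Because every row of X is handled by its own call, the loop over cells can be split across workers or batches without changing the result.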
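Input data are standardized by scaling each gene by its population standard deviation (without centering):

$$\tilde{x}_{ij} = \frac{x_{ij}}{\sigma_j}, \qquad \sigma_j = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(x_{ij} - \mu_j\right)^2}$$

where $\mu_j$ is the mean of gene j across the n cells. This matches the preprocessing in `sklearn.preprocessing.scale(X, with_mean=False)`.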
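The scaling step can be sketched in base R as follows. Note that R's `sd()` divides by n − 1, so the population standard deviation (divisor n) is computed explicitly. The helper names `pop_sd` and `scale_by_population_sd` are hypothetical, and leaving zero-variance genes unscaled is an assumption of this sketch rather than documented package behaviour.

```r
# Population standard deviation (divisor n, unlike stats::sd)
pop_sd <- function(v) sqrt(mean((v - mean(v))^2))

# Divide each gene (column) by its population SD, without centering.
# NOTE: hypothetical helper for illustration, not the package API.
scale_by_population_sd <- function(X) {
  sigma <- apply(X, 2, pop_sd)
  sigma[sigma == 0] <- 1      # assumption: constant genes are left as-is
  sweep(X, 2, sigma, "/")
}

# Toy matrix: 4 cells x 3 genes (filled column by column)
X <- matrix(c(1, 2, 3, 4,
              0, 0, 2, 2,
              5, 5, 5, 5), nrow = 4)
X_scaled <- scale_by_population_sd(X)
```

After scaling, every non-constant gene has population standard deviation 1, while its mean is not shifted to zero.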
CellProgramMapper provides a complete R-based solution (with performance-critical parts in C++) for NMF-based cell annotation:
| Feature | Implementation | Details |
|---|---|---|
| Core NMF usage fitting | C++ NNLS | Coordinate descent and active set algorithms |
| Data preprocessing | C++ implementation | Population standard deviation scaling |
| Score computation | R matrix ops | Weighted sum and threshold-based scores |
| Cell type prediction | Built-in ML models | Multinomial logistic regression |
| Reference management | curl download | Auto-caching with version control |
| NPZ file reading | Native R + reticulate | Full NumPy format support |
| Consensus reference building | R + C++ | Multi-dataset GEP clustering |
Built-in Machine Learning Models: The package includes pre-trained model parameters for cell type prediction, enabling accurate predictions without external dependencies.
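As a rough illustration of how a multinomial logistic regression model turns normalized usages into class probabilities, the sketch below applies a softmax to a linear score. The coefficients, intercepts, class names, and the helper `predict_softmax` are all made up for this example; they are not the package's pre-trained parameters or its internal API.

```r
# Softmax over linear scores: P(class c | usage) is proportional to
# exp(usage %*% coef[, c] + intercept[c]).
# NOTE: hypothetical helper and parameters, for illustration only.
predict_softmax <- function(usage, coef, intercept) {
  logits <- sweep(usage %*% coef, 2, intercept, "+")
  logits <- logits - apply(logits, 1, max)   # subtract row max for stability
  p <- exp(logits)
  p / rowSums(p)
}

# Toy inputs: 3 cells x 2 programs, 3 hypothetical classes
usage <- rbind(c(0.9, 0.1),
               c(0.2, 0.8),
               c(0.5, 0.5))
coef <- matrix(c( 4, -4,
                 -4,  4,
                  0,  0),
               nrow = 2,
               dimnames = list(c("GEP1", "GEP2"),
                               c("class_A", "class_B", "class_C")))
intercept <- c(0, 0, 0.5)

probs  <- predict_softmax(usage, coef, intercept)
labels <- colnames(probs)[max.col(probs, ties.method = "first")]
```

Each row of `probs` sums to 1, and the predicted label is simply the class with the highest probability.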
```r
# List available built-in models
list_builtin_models()

# Get T-cell lineage predictions (matches Python exactly)
labels <- predict_lineage(usage_norm, "TCAT.V1")

# Get probability distribution for each class
probs <- get_lineage_probabilities(usage_norm, "TCAT.V1")
```

```r
install.packages("CellProgramMapper",
                 repos = "https://zaoqu-liu.r-universe.dev")
```

```r
# install.packages("remotes")
remotes::install_github("Zaoqu-Liu/CellProgramMapper")
```

Required:
- R (≥ 4.0.0)
- Rcpp, RcppArmadillo
- Matrix, data.table
- curl, yaml, rappdirs
- future, future.apply
Optional:
- Seurat/SeuratObject (for Seurat integration)
- hdf5r, anndata (for h5ad file support)
- reticulate (for reading NPZ files with object arrays)
```r
library(CellProgramMapper)

# Map cells to reference gene expression programs
result <- CellProgramMapper(
  query = seurat_obj,       # Seurat object, matrix, or file path
  reference = "TCAT.V1",    # Pre-built reference or custom file
  method = "cd",            # "cd" (coordinate descent) or "active_set"
  verbose = TRUE
)

# Access results
usage_matrix <- result$usage_norm   # Normalized usage (rows sum to 1)
scores <- result$scores             # Computed add-on scores

# Integration with Seurat
seurat_obj <- add_results_to_seurat(seurat_obj, result)
```

```r
# List pre-built references
available_references()
```

Construct consensus GEPs from multiple cNMF analyses:
```r
consensus <- BuildConsensusReference(
  cnmf_paths = c("path/to/cnmf1", "path/to/cnmf2"),
  ks = c(10, 15),
  density_thresholds = c(0.1, 0.1),
  output_dir = "./consensus_output",
  corr_thresh = 0.5
)
```

CellProgramMapper is optimized for computational efficiency:
- C++ Backend: Core NNLS solvers implemented in C++ via RcppArmadillo
- Sparse Matrix Support: Native handling of sparse matrices
- Parallel Processing: Optional parallelization via future framework
- Batch Processing: Memory-efficient processing of large datasets
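The batch-processing idea relies on the fact that each cell is fitted independently, so rows can be processed in chunks and stacked afterwards. The sketch below shows this pattern in base R. The helper `process_in_batches` and its `fit_batch` argument are hypothetical, and the per-batch function used here is a simple placeholder projection rather than an NNLS fit.

```r
# Apply a row-wise fitting function to X in chunks of `batch_size` rows,
# then stack the per-batch results in the original row order.
# NOTE: hypothetical helper for illustration, not the package API.
process_in_batches <- function(X, batch_size, fit_batch) {
  idx <- seq_len(nrow(X))
  batches <- unname(split(idx, ceiling(idx / batch_size)))
  parts <- lapply(batches, function(rows) {
    fit_batch(X[rows, , drop = FALSE])
  })
  do.call(rbind, parts)
}

# Toy data: 10 cells x 6 genes, 2 programs
set.seed(42)
X <- matrix(runif(10 * 6), nrow = 10)
H <- matrix(runif(2 * 6), nrow = 2)

# Placeholder per-batch computation (stands in for an NNLS fit)
project <- function(Xb) Xb %*% t(H)

W_batched <- process_in_batches(X, batch_size = 4, fit_batch = project)
W_full    <- project(X)
```

Only one batch needs to be held in dense form at a time. Under these assumptions, the `lapply()` call could also be swapped for `future.apply::future_lapply()` after setting a plan with `future::plan()` to process the batches in parallel.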
The CellProgramMapper() function returns a CellProgramMapperResult object containing:
| Field | Description |
|---|---|
| `usage` | Raw usage matrix (cells × programs) |
| `usage_norm` | Normalized usage matrix (rows sum to 1) |
| `scores` | Computed add-on scores |
| `overlap_genes` | Genes used for mapping |
| `n_cells` | Number of cells processed |
| `n_programs` | Number of programs |
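The relationship between the raw and normalized usage matrices can be sketched in base R as below. The toy values and program names are made up, and this is an assumption about how row normalization works in general rather than a copy of the package's code.

```r
# Toy raw usage matrix: 2 cells x 3 programs (hypothetical values)
usage <- rbind(cell1 = c(2, 1, 1),
               cell2 = c(0, 3, 1))
colnames(usage) <- c("GEP1", "GEP2", "GEP3")

# Divide each row by its total so that usages sum to 1 per cell.
# The length-n vector is recycled down the columns, so row i is
# divided by rowSums(usage)[i].
usage_norm <- usage / rowSums(usage)

# Program with the highest normalized usage in each cell
dominant <- colnames(usage_norm)[max.col(usage_norm, ties.method = "first")]
```

Normalized usages are convenient for comparing cells with different total signal, while the raw matrix keeps the original scale of the fit.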
Detailed documentation and tutorials are available on the documentation site (https://zaoqu-liu.github.io/CellProgramMapper/):
- Quick Start Guide
- Mathematical Framework
- NNLS Solver Details
- Visualization Guide
- Building Custom References
- Lawson CL, Hanson RJ (1974). Solving Least Squares Problems. Prentice-Hall.
- Franc V, Hlavac V, Navara M (2005). Sequential Coordinate-Wise Algorithm for the Non-negative Least Squares Problem. CAIP 2005.
- Lee DD, Seung HS (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401:788-791.
If you use CellProgramMapper in your research, please cite:
```bibtex
@software{CellProgramMapper,
  author = {Liu, Zaoqu},
  title = {CellProgramMapper: Projection of Single-Cell Data onto Reference Gene Expression Programs},
  year = {2026},
  url = {https://github.com/Zaoqu-Liu/CellProgramMapper}
}
```

MIT License © 2026 Zaoqu Liu
- Author: Zaoqu Liu
- Email: liuzaoqu@163.com
- GitHub: https://github.com/Zaoqu-Liu/CellProgramMapper