Consensus learning of gene expression programs from single-cell RNA-Seq data

Schematic of matrix factorization for disentangling identity and activity programs in scRNA-Seq data


We developed a computational framework based on matrix factorization for analyzing single-cell RNA-Seq data. As opposed to hard clustering, which assigns each cell to a single category, our approach models each cell as a mixture of gene expression programs. This allows the user to disentangle the contributions of cell type and cellular activities (e.g. cell cycle and response to environmental stimuli) in the data (see the figure above). The gene expression programs and their contributions to each cell are learned from the data. We benchmarked different matrix factorization algorithms on simulated data and showed that an adaptation of non-negative matrix factorization (NMF) that we call consensus NMF (cNMF) performed the best. Our method has already been used in several publications, including a study of bacterial sepsis, an analysis of lung cells in health and disease, and our work on Ebola infection.

Special thanks to Adrian Veres, my co-first author on this project.


Identifying gene expression programs underlying both cell-type identity and cellular activities (e.g. life-cycle processes, responses to environmental cues) is crucial for understanding the organization of cells and tissues. Although single-cell RNA-Seq (scRNA-Seq) can quantify transcripts in individual cells, each cell’s expression profile may be a mixture of both types of programs, making them difficult to disentangle. Here, we benchmark and enhance the use of matrix factorization to solve this problem. We show with simulations that a method we call consensus non-negative matrix factorization (cNMF) accurately infers identity and activity programs, including their relative contributions in each cell. To illustrate the insights this approach enables, we apply it to published brain organoid and visual cortex scRNA-Seq datasets; cNMF refines cell types and identifies both expected (e.g. cell cycle and hypoxia) and novel activity programs, including programs that may underlie a neurosecretory phenotype and synaptogenesis.
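To make the "mixture of programs" idea concrete, here is a minimal sketch of plain NMF on toy data, assuming NumPy is available. It is not the cNMF implementation: it uses the standard Lee-Seung multiplicative updates for a squared-error loss, and the consensus step that distinguishes cNMF is not shown. The function name `nmf_multiplicative`, the toy dimensions, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate a non-negative matrix X (cells x genes) as usage @ programs.

    usage    : cells x k, how much each cell uses each program
    programs : k x genes, the expression profile of each program
    Uses Lee-Seung multiplicative updates (squared-error loss).
    """
    rng = np.random.default_rng(seed)
    n_cells, n_genes = X.shape
    usage = rng.random((n_cells, k)) + eps
    programs = rng.random((k, n_genes)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors non-negative and does not increase the loss.
        programs *= (usage.T @ X) / (usage.T @ usage @ programs + eps)
        usage *= (X @ programs.T) / (usage @ programs @ programs.T + eps)
    return usage, programs

# Toy data (hypothetical): 200 cells, 50 genes, 3 underlying programs.
rng = np.random.default_rng(1)
true_programs = rng.gamma(shape=1.0, scale=1.0, size=(3, 50))
true_usage = rng.dirichlet(alpha=np.full(3, 0.3), size=200)  # each cell is a mixture
X = true_usage @ true_programs  # noise-free expression matrix

usage, programs = nmf_multiplicative(X, k=3)

# Normalize each cell's usages to sum to 1, giving relative program contributions.
usage_fraction = usage / usage.sum(axis=1, keepdims=True)
rel_error = np.linalg.norm(X - usage @ programs) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_error:.3f}")
```

Unlike a hard cluster label, each row of `usage_fraction` can put weight on several programs at once, which is what lets an identity program and an activity program coexist in the same cell.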