Ph.D. student Daniel Baker posted a preprint describing Minicore, a new software library for fast k-means clustering of single-cell RNA sequencing (scRNA-seq) data. Minicore works with sparse count data, the form in which scRNA-seq data usually starts, as well as with dense data produced by dimensionality reduction. It uses a novel vectorized weighted reservoir sampling algorithm to find initial k-means++ centers for a 4-million-cell dataset in 1.5 minutes using 20 threads. Besides Euclidean distance, Minicore supports a wider class of measures, including Jensen-Shannon divergence, Kullback-Leibler divergence, and the Bhattacharyya distance, which can be applied directly to count data and probability distributions.
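To give a feel for how weighted reservoir sampling can drive k-means++ seeding, here is a minimal pure-Python sketch. It is not Minicore's vectorized implementation or its API: the function names (`sq_dist`, `weighted_reservoir_pick`, `kmeanspp_seeds`) are hypothetical, the distance is plain squared Euclidean, and the sampler is the textbook single-pass scheme in which each item receives the key log(u) / w and the largest key wins.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_reservoir_pick(weights, rng):
    """Return one index with probability proportional to its weight.

    Single-pass weighted reservoir sampling with a reservoir of size one:
    each item gets the key log(u) / w, and the largest key is kept.
    Returns None if no weight is positive.
    """
    best_index, best_key = None, -math.inf
    for i, w in enumerate(weights):
        if w <= 0.0:
            continue  # zero-weight items can never be selected
        u = 1.0 - rng.random()  # u lies in (0, 1], so log(u) is finite
        key = math.log(u) / w
        if key > best_key:
            best_index, best_key = i, key
    return best_index


def kmeanspp_seeds(points, k, seed=0):
    """Pick k seed indices, each drawn in proportion to D^2 (k-means++)."""
    rng = random.Random(seed)
    first = rng.randrange(len(points))
    chosen = [first]
    # d2[i] is the squared distance from point i to its nearest chosen center.
    d2 = [sq_dist(p, points[first]) for p in points]
    while len(chosen) < k:
        nxt = weighted_reservoir_pick(d2, rng)
        if nxt is None:  # every point coincides with an existing center
            nxt = rng.randrange(len(points))
        chosen.append(nxt)
        d2 = [min(d, sq_dist(p, points[nxt])) for d, p in zip(d2, points)]
    return chosen


# Toy data: three well-separated pairs of points (illustrative only).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.0, 0.1)]
seeds = kmeanspp_seeds(points, k=3, seed=1)
print(seeds)
```

Because the sampler needs only one pass over the weights and no running prefix sums, every key can be computed independently, which is the property that makes this style of sampling a natural fit for vectorization.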
The preprint reports that Minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, Minicore implements these distance measures with only minor (<2-fold) speed differences among them. The authors show that a Minicore pipeline consisting of k-means++, localsearch++, and minibatch k-means can cluster a 4-million-cell dataset in minutes using less than 10 GiB of RAM. This memory efficiency enables atlas-scale clustering on laptops and other commodity hardware.
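For readers unfamiliar with the divergences mentioned above, the following sketch shows how they can be computed on raw count vectors. It assumes one particular reading of "prior", namely a pseudocount added to every entry before normalizing; the post does not say how Minicore handles priors internally, so treat the `prior` parameter and all function names here as illustrative assumptions rather than the library's API.

```python
import math


def to_distribution(counts, prior=1.0):
    """Add a pseudocount (assumed prior) to each entry, then normalize."""
    smoothed = [c + prior for c in counts]
    total = sum(smoothed)
    return [s / total for s in smoothed]


def kl_divergence(p, q):
    """KL(p || q) in nats. Entries with p_i == 0 contribute zero.

    Requires q_i > 0 wherever p_i > 0; a positive prior guarantees this.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jensen_shannon_divergence(p, q):
    """Symmetrized KL against the midpoint distribution m = (p + q) / 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def bhattacharyya_distance(p, q):
    """Negative log of the Bhattacharyya coefficient sum(sqrt(p_i * q_i))."""
    coefficient = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return -math.log(coefficient)


# Two made-up expression count vectors over five genes (illustrative only).
cell_a = [0, 3, 0, 7, 1]
cell_b = [2, 0, 0, 5, 4]
p, q = to_distribution(cell_a), to_distribution(cell_b)

print("KL(p || q):   ", kl_divergence(p, q))
print("JSD(p, q):    ", jensen_shannon_divergence(p, q))
print("Bhattacharyya:", bhattacharyya_distance(p, q))
```

Without the pseudocount, a gene with zero counts in one cell and nonzero counts in the other would make the KL divergence infinite, which is one reason smoothing matters for sparse count data.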
The open-source library is available at https://github.com/dnbaker/minicore.
Congratulations to the team, including Daniel, Nathan Dyjack, Vladimir Braverman, and Stephanie Hicks!