A new article published in Nature Methods, "Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning", describes a computational framework developed by authors at Stanford University for the analysis and visualization of single-cell RNA-seq (scRNA-seq) data. The approach, called SIMLR (single-cell interpretation via multi-kernel learning), learns an appropriate cell-to-cell similarity metric for dimension reduction and clustering of scRNA-seq data.
The authors report that SIMLR has three main advantages over current scRNA-seq analysis methods.
- SIMLR combines multiple kernels to learn the distance metric that best fits the data, and does not rely on statistical assumptions that may not hold across the diverse statistical characteristics of single-cell data.
- SIMLR addresses the high levels of dropout events (transcripts that are present in a cell but go undetected) typical of scRNA-seq data.
- The similarities learned by SIMLR can be efficiently adapted to multiple downstream steps.
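The idea of combining multiple kernels into one cell-to-cell similarity can be illustrated with a minimal sketch. This is not the authors' implementation: SIMLR learns the kernel weights and the similarity matrix through an optimization, whereas the sketch below simply averages a few Gaussian kernels with fixed, uniform weights. The toy expression values, the bandwidths in `sigmas`, and the helper names (`gaussian_kernel`, `combine_kernels`) are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def gaussian_kernel(cells, sigma):
    """Cell-by-cell Gaussian kernel matrix for one bandwidth (sigma)."""
    n = len(cells)
    return [
        [math.exp(-euclidean(cells[i], cells[j]) ** 2 / (2 * sigma ** 2))
         for j in range(n)]
        for i in range(n)
    ]

def combine_kernels(kernels, weights):
    """Weighted sum of kernel matrices (weights are fixed here, not learned)."""
    n = len(kernels[0])
    return [
        [sum(w * k[i][j] for w, k in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]

# Toy data (hypothetical): 4 cells x 3 genes, two obvious groups.
cells = [
    [5.0, 0.0, 1.0],
    [4.0, 0.0, 2.0],
    [0.0, 6.0, 0.0],
    [0.0, 5.0, 1.0],
]

# Several bandwidths stand in for the "multiple kernels"; values are arbitrary.
sigmas = [1.0, 2.0, 4.0]
kernels = [gaussian_kernel(cells, s) for s in sigmas]

# Uniform weights are an assumption; SIMLR would learn these from the data.
weights = [1.0 / len(kernels)] * len(kernels)
similarity = combine_kernels(kernels, weights)

for row in similarity:
    print(["%.3f" % v for v in row])
```

Using several bandwidths at once avoids committing to a single notion of "close" before seeing the data, which is the motivation for a multi-kernel approach.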
Wang et al. benchmarked their approach against conventional scRNA-seq analysis methods using four published "gold standard" single-cell data sets (Buettner, Pollen, Kolodziejczyk, Usoskin) containing a variety of cell types. Given only the raw gene expression files and the number of cell types, SIMLR learned a matrix of similarities between the cells that corresponded better to the gold standard labels than did standard similarity measures such as correlation or Euclidean distance. SIMLR also outperformed eight other dimension reduction methods, including principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE). In addition, the learned similarities can be used for cell clustering with Affinity Propagation (AP) or K-means.
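Agreement between predicted clusters and gold standard labels can be scored in several ways. The sketch below uses normalized mutual information (NMI), a common choice for this kind of comparison; the summary above does not state which metric the authors used, so treat the choice of NMI, the function name `nmi`, and the toy labels as assumptions.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings (0 = unrelated, 1 = identical up to renaming)."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    # Mutual information from the joint and marginal label frequencies.
    mi = sum(
        (c / n) * math.log((c / n) / ((count_a[a] / n) * (count_b[b] / n)))
        for (a, b), c in joint.items()
    )

    # Entropy of each labeling, used for normalization.
    h_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * math.log(c / n) for c in count_b.values())

    # Degenerate case: a labeling with a single cluster has zero entropy.
    if h_a == 0 or h_b == 0:
        return 1.0 if h_a == h_b else 0.0
    return mi / math.sqrt(h_a * h_b)

# Hypothetical labels for six cells.
gold = ["T", "T", "B", "B", "NK", "NK"]
predicted = [0, 0, 1, 1, 2, 1]  # one NK cell assigned to the B cluster

print("NMI vs gold labels: %.3f" % nmi(gold, predicted))
```

Because NMI ignores the actual cluster names, it can compare arbitrary cluster IDs from K-means or AP against named cell types without any label matching.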
After benchmarking against the gold standard scRNA-seq data sets, the authors tested SIMLR on more challenging data, including a sparse data set of 2,700 peripheral blood mononuclear cells (PBMCs) generated on the GemCode™ platform. SIMLR with K-means clustering identified the major cell types and even a rare population of 12 megakaryocytes. SIMLR can also be applied to large-scale scRNA-seq data sets to uncover similarity structures that would otherwise be concealed by noise or outlier effects.
In summary, the authors introduced SIMLR, an approach to scRNA-seq data analysis that applies machine learning to dimension reduction and clustering. The authors anticipate that the multiple-kernel learning framework may be especially beneficial for data that do not contain clearly identifiable clusters, such as dynamic cell populations that are changing through cell division, growth, or differentiation along a developmental pathway.