Analysis Guides/

Clustering Methods for Single Cell and Spatial Data

Feb 6, 2026
Share via:

Note: 10x Genomics does not provide support for community-developed tools and makes no guarantees regarding their function or performance. Please contact tool developers with any questions. If you have feedback about Analysis Guides, please email analysis-guides@10xgenomics.com.

Clustering and classification are fundamental computational techniques used to identify groups of cells that share similar characteristics, often based on their gene expression profiles. This process helps us uncover the hidden biological stories within a dataset, from identifying rare cell types to characterizing the overall heterogeneity of a cell population.

Standard single cell RNA sequencing (scRNA-seq) datasets, generated with 10x Genomics Chromium assays, rely primarily on gene expression features from individual cells. By contrast, the field of spatial transcriptomics encompasses technologies such as 10x Genomics Visium HD and Xenium In Situ, which capture gene expression along with the spatial context, but differ in their resolution and molecular profiling approaches. Visium HD provides spatially resolved transcriptomic data at 2-µm resolution across tissue sections. Increasingly, downstream clustering for Visium HD is performed on cell-segmented expression profiles, rather than raw bins, to better align results with biological cells. In contrast, Xenium In Situ enables high-resolution, subcellular mapping of gene expression at the single-molecule level directly within tissue sections. These technologies have contributed to a growing number of methods that integrate spatial information into the clustering process.

This article focuses on clustering and classification methods for Chromium, Visium, and Xenium datasets. We first review established techniques for analyzing single cell data, then focus on spatially-informed clustering approaches tailored for the variable resolutions of Visium data (from spots to HD squares) and Xenium data (high-resolution cellular and subcellular). Many methods originally developed for Visium data may potentially be adapted to Xenium data, given its enriched spatial and gene expression information, thus expanding the toolbox available for integrative analysis.

The table below summarizes typical analysis questions for single cell and spatial datasets, the recommended clustering or classification strategies, and the key input sources required. This may help inform decisions on when a single cell analysis is sufficient and when spatially-informed methods add value.

Data TypeTypical Biological QuestionRecommended Clustering Strategy
Chromium (scRNA-seq)Identify major and rare cell types; build a cell-type atlas across samplesStandard graph-based clustering (with optimal sub-clustering for rare populations); occasional supervised classifiers for annotation
Visium (tissue-level domains)Define broad tissue regions (e.g., tumor vs. stroma, cortical layers, fibrotic vs. normal)Neighborhood-aware or morphology-integrated spatial clustering; focus on spatial smoothness and domain boundaries
Visium (fine-grained patterns within mixed spots)Resolve finer spatial structure where each spot contains multiple cell types (e.g., layered retina, tumor-immune interface)Resolution-enhancing spatial clustering to computationally subdivide spots
Visium HD (cell-segmented)Resolve fine-scale tissue architecture and cell states where reliable cell segmentation is available, enabling more biologically interpretable clustering than bin-based representations.Cell-level clustering on segmented expression profiles; apply standard graph-based or neighborhood-aware clustering to segmented cells to achieve cleaner, more interpretable clusters than bin-based approaches; fallback to spatial clustering on fixed-resolution bins when segmentation is not available.
Xenium (cell types in situ)Identify cell types and states directly in tissue, at single cell resolutionStandard single cell clustering approaches can be applied to segmented cells, optionally augmented with local neighborhood information
Xenium (microenvironments and niches)Define cell-cell interaction zones (e.g., immune niches, tumor borders, stem-cell niches)Neighborhood-aware spatial clustering where the local environment is explicitly modeled.
Single cell and spatial (cross-modality integration)Transfer single cell-defined cell types onto spatial data; map cell states into tissue context2-step workflow: a) single cell graph-based clustering b) label transfer, deconvolution and spatial clustering.

When analyzing single cell and spatial data, the choice of clustering approach depends on the biological question, and the nature of the research project. Broadly, clustering and classification algorithms for these types of datasets fall into two categories:

  1. Unsupervised clustering identifies patterns without prior information. These clustering approaches are used to identify novel cell populations and intrinsic structure in the data without prior assumptions. They are useful for discovery and hypothesis generation. For example, graph-based methods can be used to uncover unbiased clusters in single cell data, or incorporate neighborhood information into spatial data.
  2. Supervised methods uses references to guide classification. These approaches are used to assign known identities based on training data, references, or marker sets. These methods are useful for annotation and reproducibility. Examples include label transfer from reference datasets for Xenium In Situ data.
ApproachExample MethodsWhen to UseNotes
UnsupervisedGraph-based (Leiden, Louvain), K-means, Hierarchical, Hidden Markov Models, Deep LearningDiscover new/rare populationsFlexible, parameter-sensitive
SupervisedSupport Vector Machines (SVMs), Decision Trees, Label TransferAssign known cell typesRelies on curated reference data

The following sections will highlight commonly-used algorithms under these two approaches for single cell and spatial datasets.

The following clustering methods are foundational to single cell analysis, where the primary goal is to organize dissociated cells into meaningful groups based on their gene expression profiles. While these techniques are also useful for spatial analysis, their principles are best introduced in the context of single cell data. Below is a summary of these core methods along with their key features and common use cases:

MethodApproachKey FeaturesUse Case
Graph-basedUnsupervisedScales to 100k+ cells; can capture complex relationships between individual cells; resolution tuningStandard for large datasets; widely adopted in the community (e.g., Seurat, Scanpy)
K-meansUnsupervisedRequires k upfrontCoarse clustering, broad groups
HierarchicalUnsupervisedBuilds cluster trees (dendrograms)Small datasets, visualization (cell lineages)
SVMs/Decision TreesSupervisedLeverages references for classificationCell annotation (for well-defined tissue/cell types)

A common strategy for analyzing single cell data is to first perform a coarse clustering to identify major cell types, followed by re-clustering within individual clusters to capture finer subtypes and heterogeneity. This approach is particularly useful for large or diverse cell populations, as it helps distinguish broad cell types before refining sub-populations. How exactly this is executed depends on the underlying mathematical framework of the chosen algorithm. The details of some prominent methods are listed below:

  • Graph-based unsupervised clustering involves building a sparse nearest-neighbor graph to identify highly-connected "modules" or "clusters". A major advantage of this method is its scalability, driven by efficient use of a sparse nearest-neighbor graph allowing for superior memory and computational efficiency. This is crucial in single cell analysis, where datasets can contain millions of cells. As these datasets grow, efficient methods are needed to handle them without excessive computational time or memory usage, making scalability a key benefit. However, the sensitivity to parameter settings can impact the quality and interpretation of the resulting clusters, as illustrated in a study by Zheng et al. 2017, Cell.
  • In K-means unsupervised clustering, a fundamental assumption is that we have a priori knowledge of the expected number of clusters within the dataset, and the algorithm assigns cells according to this predetermined number. The "k" value represents the number of clusters. It is typically chosen based on prior knowledge, exploratory analysis (e.g., elbow plots), or domain expertise, rather than being fixed by the algorithm itself. This approach is particularly useful for large or diverse cell populations to help distinguish broad cell types before refining subpopulations. One example is highlighted in this study, Identifying cell populations with scRNASeq, where k-means clustering was used to identify cell subpopulations in large datasets from tissues including the mouse hypothalamus.
  • Hierarchical unsupervised clustering is a method that builds distances between data points through a series of successive fusions of individual objects, thus building a hierarchy by combining the nearest data points and merging them into clusters. This offers flexibility and a visual representation in understanding relationships among data points. However, this analysis can be computationally intensive when dealing with large datasets. In this study, this method was used to analyze early organ development in human embryos, offering valuable insights into cellular organization and developmental processes.
  • Decision trees are supervised machine learning algorithms used in scRNA-seq to classify cells based on gene expression profiles. They work by recursively splitting the data at decision nodes to create distinct classes, making them useful for cell typing and subtype identification. Decision trees are straightforward to interpret, as they provide clear rules for classifying cells. However, the method can struggle with high-dimensional data and may overfit, especially when the number of features exceeds the number of samples. Pruning and setting depth limits can help mitigate overfitting. Despite these challenges, decision trees are effective for classification tasks and can provide insights into the most important features driving cell-type differentiation and gene expression patterns.

Below are some third party tools and clustering methods that are commonly used in the single cell community.

Toolsprimarily uses prior knowledgeClustering ProcessTutorialUse-casesReferences
CellAssign (R)YesGuided clusteringAutomated, probabilistic assignment of scRNA-seq to cell typesAssign specific cell types using a known marker gene list (Tumor Microenvironemnt)Zhang et al., Nature Methods, 2019
CIDR (R)NoModified K-means clustering (K-PCA)Clustering through Imputation and Dimensionality ReductionDatasets with high dropout dates; clustering without heavy imputation stepsLin et al., Genome Biology, 2017
CytoTRACE (python)YesDecision tree and Hierarchical clusteringCellular (Cyto) Trajectory Reconstruction Analysis using gene Counts and ExpressionPredicting differentiation states; identifying stem-like vs differentiated cells.Gulati et al., Science, 2020
Monocle (R)NoLeiden clusteringClustering and classifying your cellsTrajectory inference; ordering cells along a developmental timelineTrapnell et al., Nature Biotechnology, 2014
RCA (R)YesCorrelation score between reference and the single cell data, in supervised manner.Reference Component AnalysisReference-based annotation; ensuring consistency with a specific reference atlasLi et al., Nature Genetics, 2017
SC3 (R)NoHierarchical clusteringSC3 package manualHigh-accuracy consensus clustering; best for smaller, high-quality datasets (conputationally intensive)Kiselev et al., Nature Methods, 2017
Scanpy (Python)NoMultiple clustering options: Louvain, leiden, K-means, Hierarchical etcPreprocessing and clustering 3k PBMCsProcessing massive datasets (1M+ cells) efficiently; Python-based workflowsWolf et al., Genome Biology, 2018
scmap_cluster (R)NoHierarchical clusteringScmap package vignetteProjecting cells onto a reference datasets for fast annotationVladimir et al., Nature Methods, 2018
Azimuth (R/web)YesReference-based mapping; (Anchor transfer & WNN projection)Azimuth tutorialSeurat has standard, versatile pipeline for discovery, integration, and multimodal analysis; also include automated cell-type annotation using azimuthHao et al., Cell, 2021
Seurat_clustering (R)NoGraph-based (Louvain) & Leiden clusteringSeurat Clustering TutorialSeurat has standard, versatile pipeline for discovery, integration, and multimodal analysis; also include automated cell-type annotation using azimuthStuart et al., Cell, 2019
SHARP (R)NoNon-negative matrix factorization (NMF) and consensus clustering, scalable and very fastSingle-cell RNA-seq Hyper-fast and Accurate processing via ensemble Random ProjectionHyper-fast clustering for very large datasetsShibiao et al., Genome Research, 2020
SingleR (R)YesGuided clusteringAssigning cell types with SingleRAutomated cell type annotation comparing query data to reference datasetsAran et al., Nature Immunology, 2019

If the goal is to detect rare cell types, employing unsupervised methodologies akin to those found in Seurat or monocle3 is advantageous due to their flexibility in exploring data without prior assumptions. Conversely, for classification based on a pre-established reference, tools like RCA or CytoTRACE offer robust options for supervised analysis, ensuring consistency and reliability when comparing new data to known cell types.

Spatial transcriptomics provide a significant advantage over dissociated single cell techniques by preserving the native location of cells within a tissue, linking gene expression profiles to their spatial context. This allows spatial clustering algorithms to group cells, squares, or spots based not only on transcriptional similarity but also on their spatial relationships, revealing tissue architecture and multicellular organization.

Unlike traditional scRNA-seq clustering methods that operate on a gene expression matrix alone, spatial clustering algorithms leverage two primary inputs: the gene expression matrix and the spatial coordinates of each cell or measurement spot. Optionally, additional data modalities, such as H&E images or segmentation details, could be used. By integrating these inputs, spatial clustering algorithms can identify clusters that are both transcriptionally distinct and spatially coherent. Choosing the most appropriate algorithm therefore depends on the data quality, the biological question, the experimental goal of the study, and the available inputs.

The major spatial clustering tools can be conceptually grouped by the primary type of information they integrate alongside gene expression data.

MethodsKey Input (Beyond Expression & Coordinates)Example ToolsUse case
Neighborhood-awareNoneBANKSYDefining functional domains where the local environment and cell-cell signaling are important (e.g., tumor-immune boundary).
Morphology IntegrationH&E Histology ImageSpaGCN, stLearnAligning transcriptional domains with anatomically distinct features visible in the tissue slide (e.g., cortical layers in the brain).
Resolution EnhancementNone (uses statistical modeling)BayesSpaceComputationally increasing the resolution of spot-based data to uncover fine-grained spatial patterns hidden by technical limits.

BANKSY

  • Publication: BANKSY: A Spatial Omics Algorithm that Unifies Cell Type Clustering and Tissue Domain Segmentation
  • Source code: https://github.com/prabhakarlab/Banksy and https://github.com/prabhakarlab/Banksy_py (Python)
  • Tutorial: BANKSY tutorial
  • Algorithm: BANKSY operates on the principle that a cell's identity is defined by both its intrinsic gene expression and that of its surrounding neighborhood. It creates an "augmented" expression matrix that combines a cell's own profile with the aggregated profile of its neighbors. A key parameter, λ (lambda), allows the user to tune the influence of the neighborhood, enabling a flexible analysis that can prioritize either cell-intrinsic states or broader tissue environments.
  • Use case: Use BANKSY when you want to identify tissue regions where the local cellular environment is a key defining feature. For example, in a tumor microenvironment, you could use it to find "hotspots" of immune cell activity that influence nearby cancer cells. Or in developmental biology, to identify organizing centers where a small group of cells patterns the development of its neighbors. The tunable λ parameter controls how much influence a cell's neighborhood has on its clustering; lower values emphasize individual cell identity, while higher values highlight spatial context. In practice, λ is typically chosen empirically by testing a small range (e.g., 0–1) and selecting the value that yields biologically meaningful clusters or consistent domain patterns. BANKSY is actively recommended and used for Visium (including cell-segmented Visium HD) for neighborhood-aware clustering. It is also the recommended choice for very large datasets because it scales efficiently to millions of cells without requiring specialized hardware.

SpaGCN

  • Publication: SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network
  • Source code: https://github.com/jianhuupenn/SpaGCN
  • Tutorial: SpaGCN tutorial
  • Algorithm: SpaGCN integrates three sources of data: gene expression, spatial coordinates, and histology from the H&E image. It constructs a graph where spots are connected based on their spatial proximity and histological similarity. A graph convolutional network is then used to aggregate gene expression information from neighboring spots, effectively smoothing the data in a way that respects the underlying tissue morphology.
  • Use case: SpaGCN is the ideal tool when you expect the transcriptional domains in your tissue to align closely with structures visible in the histology slide. For instance, if you are analyzing a brain slice with distinct cortical layers, or a kidney section with clearly defined glomeruli and tubules, SpaGCN can delineate these structures with high fidelity by ensuring that cluster boundaries respect both the molecular and the morphological data. For massive datasets (Xenium, binned Visium HD), utilizing GPU is recommended to maintain reasonable runtimes during the graph construction phase.

stLearn

  • Publication: stLearn: integrating spatial location, tissue morphology and gene expression to find cell types, cell-cell interactions and spatial trajectories within undissociated tissues
  • Tutorial: stSME clustering tutorial
  • Algorithm: stLearn also integrates H&E image features with gene expression, but through a different mechanism. It first processes the image to extract morphological features for each spot. It then normalizes each spot's gene expression based on the expression of its neighbors, weighted by their morphological similarity. This process helps to correct expression data and align it with the physical tissue structure before clustering.
  • Use case: Use stLearn for similar scenarios as SpaGCN, where tissue morphology is a critical component of the analysis. It is particularly useful when you believe that spots sharing a similar morphological appearance (e.g., part of the same fibrotic region or cell layer) should be grouped together, even if they are not directly adjacent. This helps in defining tissue domains that are consistent with the underlying histology.

BayesSpace

  • Publication: Spatial transcriptomics at subspot resolution with BayesSpace
  • Source code: BayesSpace
  • Tutorial: BayesSpace tutorial
  • Algorithm: BayesSpace employs a fully Bayesian statistical model with a Markov random field to encourage neighboring spots to belong to the same cluster, thereby producing spatially smooth domains. Its standout feature is the ability to perform "resolution enhancement". For spot-based data like Visium, where each spot may contain multiple cells, BayesSpace can computationally partition each spot into smaller sub-spots and infer their gene expression, revealing finer spatial structures that were previously obscured.
  • Use case: The primary use case for BayesSpace is to overcome the resolution limits of spot-based spatial technologies. If you are studying a complex tissue where distinct cell types are intricately mixed within single Visium spots (e.g., fine layers of the retina, kidney glomeruli, or the boundary between a tumor and stroma), BayesSpace is helpful for computationally dissecting these spots and uncovering more granular and biologically meaningful spatial domains. Note that because of its computationally intensive Bayesian framework, BayesSpace may require long runtimes when applied to large-scale datasets; subsampling or running on smaller regions of interest may be necessary.

Emerging spatial clustering tools

A new wave of peer-reviewed spatial clustering tools is emerging, offering improved methodologies for integrating and interpreting complex biological datasets. This table summarizes these tools.

ToolReferenceCore Function/AnalogyKey Technical Advantage
iIMPACTJiang et al., Genome Biology, 2024Integrates imaging and molecular profiles ("Triple-lens camera").Sharpening the definition of tissue domains by combining data modalities.
WESTCai et al., Cell Reports, 2024Weighted ensemble method for stability ("Jury of clustering tools").Provides more stable clustering outcomes by aggregating results from multiple algorithms.
KBCZhang et al., Genome Research, 2024Kernel-bounded clustering adapted for spatial data ("Mathematical compass").Scales well to large datasets, offering a robust and efficient solution for high-throughput data.
SpaGICLiu et al., Briefings in Bioinformatics, 2024Graph-informed clustering leveraging image cues ("Map overlay").Enhanced domain discovery by aligning molecular and histology patterns (successor to SpaGCN).
stMSAShu et al., Genome Research, 2025Multi-Slice Alignment model for context ("Zoom lens").Captures both local detail and broad tissue context simultaneously through multi-slice alignment.

When selecting an algorithm, it is critical to remember the No Free Lunch Theorem (Wolpert & Macready 1997): there is no "one-size-fits-all" solution. The "best" algorithm is simply the one whose assumptions best align with your specific data structure (e.g., dissociated cells vs. spatial spots, continuous bins, or segmented cells) and your biological question (e.g., finding cell types vs. defining anatomical domains).

The decision tree below is provided to help you decide the clustering method that best aligns with your data type and research goals:


Obtaining a set of clusters is just the beginning of the interpretation phase. The initial clusters are computational hypotheses that must be rigorously validated and biologically interpreted. This is an iterative process of refinement, where the analyst cycles between clustering, marker identification, and biological annotation to arrive at a meaningful result.

Interactive tools provide a powerful environment for iterative validation, enabling users to visually explore clusters, marker expression, and spatial context in an integrated and intuitive way (see Loupe Browser tutorials for Chromium and Visium data and Xenium Explorer tutorials for Xenium In Situ data).

Step 1: Marker Gene Analysis

The primary method for assigning biological meaning to clusters is to identify marker genes, i.e., genes that are significantly more highly expressed in one cluster compared to all others.

  • Finding markers: Standard functions like FindAllMarkers() in Seurat or sc.tl.rank_genes_groups() in Scanpy perform differential expression tests to identify these genes for each cluster. It is important to view the top markers as the most trustworthy, as p-values can be inflated in single cell data where each cell is treated as a replicate.
  • Visualizing markers: Data visualization is essential for interpreting marker gene lists:
    • Heat maps provide a global view of the top marker genes for every cluster, showing a clear block-diagonal structure if the clustering is robust.
    • Violin and dot plots are used to examine the expression of specific, canonical marker genes across all clusters. Violin plots show the full expression distribution, while dot plots concisely convey both the average expression level (color) and the percentage of cells in the cluster expressing the gene (dot size).
    • Feature plots overlay the expression of a single gene onto the UMAP or spatial tissue plot. This is arguably the most powerful way to confirm that a marker gene is both specific to and consistently expressed across the cells of a given cluster.

In Loupe Browser, this validation step can be carried out by selecting clusters and overlaying gene expression directly onto UMAP embeddings or spatial tissue images, allowing rapid visual confirmation of cluster specificity and spatial coherence. For more guidance, see the Loupe tutorials on exploring spatial clusters and HD spatial gene expression data. This is often the intuitive way to confirm that a marker gene is both specific to and consistently expressed across the cells of a given cluster.

Step 2: Assigning Cell Identity

With a list of marker genes for each cluster, the next step is annotation. This involves comparing the identified markers to known canonical markers from the literature and public databases (e.g., CellMarker) to assign a biological cell type label to each cluster.

Loupe Browser further supports this process by enabling rapid inspection of canonical marker expression across clusters, which can accelerate biological annotation and validation.

Step 3: The Refinement Loop

The initial annotation often reveals opportunities to refine the clustering.

  • Merging clusters: If two or more clusters share a very similar set of marker genes and are adjacent on the UMAP, they likely represent the same cell type that was over-split by the algorithm. The solution is to re-run the clustering at a lower resolution to merge them.
  • Splitting and subclustering: Conversely, if a single cluster appears to contain multiple cell types (e.g., it expresses markers for both T cells and B cells) or shows bimodal expression for key genes, it may be under-clustered. The solution is to either increase the resolution or to subset that cluster and re-cluster it on its own—a process known as subclustering—to reveal finer-grained subtypes.

Common Pitfalls and Sanity Checks

A crucial part of validation is using clustering as a tool for anomaly detection. The clusters should not only reflect biology but also reveal potential issues with the data.

  • Check for technical confounders: A common pitfall is finding clusters that are defined by technical artifacts rather than biology. A cluster with high expression of only mitochondrial genes likely represents a group of dying cells. A cluster that perfectly separates cells from different experimental batches indicates a strong, uncorrected batch effect. These clusters are not novel cell types and indicate data quality issues that should be addressed.
  • Identify the "junk" cluster: It is common to find one or more clusters composed of low-quality cells (low gene counts) or doublets. These should be identified and considered for removal before final interpretation.
  • Avoid over-interpretation: Be cautious when interpreting very small clusters. While they could represent a rare cell type, they could also be the result of stochastic noise or a few outlier cells. Their existence should be validated with robust marker expression and, if possible, orthogonal methods.

By treating clusters as hypotheses and using them to probe both the biology and the quality of the data, an analyst can move from a simple computational result to a robust and defensible biological conclusion.

No clustering algorithm, however sophisticated, can rescue a dataset plagued by technical artifacts and low-quality measurements. This principle is the most critical concept in single cell and spatial analysis. A robust preprocessing workflow is the non-negotiable foundation upon which all subsequent biological discoveries are built.

Stay connected with latest technical workflow and software updatesSubscribe to newsletter