Secondary Analysis of Xenium In Situ Gene Expression Data

Secondary analysis only includes transcripts that pass the default quality value (Q-Score) threshold of Q20 and are assigned to cells. Before running any secondary analysis steps, the XOA pipeline filters out cells with fewer than five transcripts. These cells are thus not included in the secondary analysis output files (analysis/ and analysis.zarr.zip).

This filter is not applied to other outputs, such as the cell-feature matrix and transcripts files. Filtered cells will be categorized as "Unassigned" when viewing K-means and graph-based cluster results in Xenium Explorer.
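The filtering rule above can be illustrated with a minimal sketch. This is not XOA's code: the function name, the cell IDs, and the dictionary of per-cell counts are hypothetical, and the counts are assumed to already include only transcripts that passed the Q20 threshold and were assigned to cells.

```python
MIN_TRANSCRIPTS = 5  # threshold stated in the text


def split_cells_by_transcript_count(counts_per_cell, min_transcripts=MIN_TRANSCRIPTS):
    """Return (kept, filtered) lists of cell IDs.

    Cells with fewer than `min_transcripts` transcripts are filtered out of
    secondary analysis; all other cells are kept.
    """
    kept = [cell for cell, n in counts_per_cell.items() if n >= min_transcripts]
    filtered = [cell for cell, n in counts_per_cell.items() if n < min_transcripts]
    return kept, filtered


# Hypothetical per-cell transcript counts (illustrative values only).
example_counts = {"cell_a": 12, "cell_b": 4, "cell_c": 5, "cell_d": 0}
kept_cells, filtered_cells = split_cells_by_transcript_count(example_counts)
```

In this sketch, a cell with exactly five transcripts is kept, because the filter removes only cells with fewer than five.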

To reduce the gene expression matrix to its most important features, XOA uses Principal Components Analysis (PCA) to reduce the dimensionality of the dataset from (cells x genes) to (cells x M), where M is the number of principal components retained (10 PCs). The pipeline uses a Python implementation of the IRLBA algorithm (Baglama & Reichel, 2005), which we modified to reduce memory consumption. Note that if the dataset contains protein data, only the gene expression data are used for PCA and subsequent analysis.

XOA also supports visualization with UMAP (Uniform Manifold Approximation and Projection), which estimates a topology of the high-dimensional data and uses this information to estimate a low-dimensional embedding that preserves relationships present in the data. The pipeline uses the Python implementation of this algorithm by McInnes et al. (2018). XOA uses the following default parameter values for UMAP generation:

n_neighbors = 30
min_dist = 0.3
n_components = 2
spread = 1.0
metric = "pearson"
n_epochs = 200 or 500
learning_rate = 1.0
random_seed = 0

XOA uses two methods for clustering cells based on expression similarity, both operating in the PCA space.

The graph-based clustering algorithm consists of building a sparse nearest-neighbor graph (where cells are linked if they are among the k nearest Euclidean neighbors of one another), followed by Louvain Modularity Optimization (LMO; Blondel, Guillaume, Lambiotte, & Lefebvre, 2008), an algorithm that seeks to find highly connected "modules" in the graph. The value of k, the number of nearest neighbors, is set to scale logarithmically with the number of cells.
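As a rough illustration of the graph-building step only, the sketch below links two cells when each is among the other's k nearest Euclidean neighbors. This mutual reading of "of one another" is an assumption, as are the function names and the toy coordinates. The brute-force search is for clarity, k is passed in directly because the text does not give the scaling formula, and the Louvain step is not shown.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k points closest to points[i] (Euclidean), excluding i."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])


def mutual_knn_edges(points, k):
    """Undirected edges (i, j) with i < j where i and j are each other's k-nearest neighbors."""
    neighbors = [k_nearest(points, i, k) for i in range(len(points))]
    edges = set()
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            if i in neighbors[j]:  # assumed: the link must hold in both directions
                edges.add((min(i, j), max(i, j)))
    return edges


# Toy coordinates standing in for cells in PCA space (two well-separated groups).
pcs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
edges = mutual_knn_edges(pcs, k=2)
```

With these toy points, edges appear only within each group, which is the kind of structure a community-detection step such as LMO would then pick out as modules.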

Additionally, the pipeline performs a cluster-merging step using hierarchical clustering of cluster medoids in PCA space. This step merges a pair of sibling clusters if no genes (or protein features, if present) are differentially expressed between them, where a feature counts as differentially expressed if its Benjamini-Hochberg adjusted p-value is below 0.05. The hierarchical clustering and merging are repeated until there are no more cluster pairs to merge. Protein feature information is used only to determine which clusters to merge, not for the graph-based clustering output itself.
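The merge criterion can be sketched as follows, assuming the per-feature p-values for a pair of sibling clusters have already been computed. The Benjamini-Hochberg adjustment is the standard step-up procedure. The function names and example p-values are hypothetical, and this is not XOA's implementation.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def should_merge(p_values, alpha=0.05):
    """Merge a cluster pair when no feature has an adjusted p-value below alpha."""
    return all(q >= alpha for q in benjamini_hochberg(p_values))


# Hypothetical p-values for the features tested between two sibling clusters.
distinct_pair = [0.01, 0.04, 0.03, 0.5]
similar_pair = [0.2, 0.6, 0.9]
merge_distinct = should_merge(distinct_pair)
merge_similar = should_merge(similar_pair)
```

In this example the first pair has one feature that remains significant after adjustment, so it would stay split, while the second pair has none and would be merged.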

The use of LMO to cluster cells was inspired by a similar method in the R package Seurat.

XOA also performs traditional K-means clustering across a range of K values, where K is the preset number of clusters.
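For readers unfamiliar with K-means, here is a minimal sketch of Lloyd's algorithm run over a small range of K values. The initialization scheme, the iteration cap, the range of K, and the toy coordinates are all assumptions made for illustration; the text does not specify how XOA configures these.

```python
import math


def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with a deterministic farthest-point initialization (assumed)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # leave an empty cluster's center in place
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels, centers


# Toy coordinates standing in for cells in PCA space.
pcs = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (4.0, 4.0), (4.2, 3.9), (3.9, 4.1)]
labels_by_k = {k: kmeans(pcs, k)[0] for k in (2, 3)}  # hypothetical range of K
```

Each value of K yields its own set of cluster labels, which matches the idea of producing one clustering result per preset number of clusters.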

To test for differences in mean expression between groups of cells, XOA uses the exact negative binomial test proposed by the authors of the sSeq method (Yu, Huber, & Vitek, 2013). When the counts become large, XOA switches to the fast asymptotic negative binomial test used in edgeR (Robinson & Smyth, 2007).

For each gene and each cluster i, XOA tests whether the mean expression in cluster i differs from the mean expression across all other cells.

The mean expression of a feature for cluster i is calculated as the total number of transcripts from that feature in cluster i divided by the sum of the size factors for cells in cluster i. The size factor for each cell is the total transcript count in that cell divided by the median transcript count per cell (across all cells). The mean expression outside of cluster i is calculated in the same way. The log2 fold-change of expression in cluster i relative to other clusters is the log2 ratio of mean expression within cluster i and outside of cluster i. When computing the log2 fold-change, a pseudocount of 1 is added to both the numerator and denominator of the mean expression.

Note that XOA's implementation differs slightly from the sSeq paper: the size factors are calculated using total transcript count instead of using DESeq's geometric mean-based definition of library size. As with sSeq, normalization is implicit in that the per-cell size factor parameter is incorporated as a factor in the exact and asymptotic probability calculations.
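The size-factor and fold-change calculations described above can be written out as a short sketch. It covers only the mean expression and log2 fold-change, not the negative binomial tests. The function names and the toy count matrix are hypothetical, and the placement of the pseudocount (added to the transcript sum and to the size-factor sum) is one reading of the description rather than a confirmed detail of XOA.

```python
import math
from statistics import median


def size_factors(counts):
    """Per-cell size factor: total transcripts in the cell / median total across all cells."""
    totals = [sum(cell) for cell in counts]
    med = median(totals)
    return [t / med for t in totals]


def log2_fold_change(counts, in_cluster, feature):
    """log2 ratio of mean expression inside vs. outside a cluster for one feature."""
    sf = size_factors(counts)
    sum_in = sum(cell[feature] for cell, m in zip(counts, in_cluster) if m)
    sum_out = sum(cell[feature] for cell, m in zip(counts, in_cluster) if not m)
    sf_in = sum(s for s, m in zip(sf, in_cluster) if m)
    sf_out = sum(s for s, m in zip(sf, in_cluster) if not m)
    # Assumed reading: pseudocount of 1 on both the numerator (transcript sum)
    # and the denominator (size-factor sum) of each mean expression.
    mean_in = (1 + sum_in) / (1 + sf_in)
    mean_out = (1 + sum_out) / (1 + sf_out)
    return math.log2(mean_in / mean_out)


# Toy count matrix: rows are cells, columns are features (illustrative values only).
counts = [[7, 3], [9, 1], [1, 9], [0, 10]]
membership = [True, True, False, False]  # True marks cells in cluster i
lfc = log2_fold_change(counts, membership, feature=0)
```

A positive value indicates higher mean expression of the feature inside cluster i than in all other cells; swapping the membership flags flips the sign.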

Baglama J and Reichel L. Augmented implicitly restarted Lanczos bidiagonalization methods. SIAM Journal on Scientific Computing 27: 19–42, 2005.
Blondel V, et al. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008.
McInnes L, Healy J, and Melville J. UMAP: Uniform Manifold Approximation and Projection for dimension reduction. arXiv, 2018.
Robinson M and Smyth G. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9: 321–332, 2007.
Yu D, Huber W, and Vitek O. Shrinkage estimation of dispersion in Negative Binomial models for RNA-seq experiments with small sample size. Bioinformatics 29: 1275–1282, 2013.
