Space Ranger Algorithms: Read Processing

A full-length cDNA construct is flanked by the 30 bp template switch oligo (TSO) sequence, AAGCAGTGGTATCAACGCAGAGTACATGGG, on the 5' end and polyA on the 3' end. Some fraction of sequencing reads are expected to contain either or both of these sequences, depending on the fragment size distribution of the sequencing library. Reads derived from short RNA molecules are more likely to contain TSO sequence, polyA sequence, or both than reads derived from longer RNA molecules.

Because non-template sequence, in the form of TSO or polyA low-complexity ends, confounds read mapping, the TSO sequence is trimmed from the 5' end of read 2 and polyA is trimmed from the 3' end prior to alignment. Trimming improves the sensitivity of the assay as well as the computational efficiency of the software pipeline. The tags ts:i and pa:i in the output BAM files indicate the number of TSO nucleotides trimmed from the 5' end of read 2 and the number of polyA nucleotides trimmed from the 3' end, respectively. The trimmed bases remain present in the sequence of the BAM record, and the CIGAR string shows the position of these soft-clipped sequences.
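As a rough illustration of what the ts:i and pa:i tags count, the following Python sketch trims an exact TSO prefix and a trailing run of A bases from a read 2 sequence. It is a simplification under stated assumptions: the actual trimming logic in Space Ranger is not described on this page and is likely more tolerant of mismatches and partial matches. The function name trim_read2 and the min_polya parameter are hypothetical.

```python
TSO = "AAGCAGTGGTATCAACGCAGAGTACATGGG"  # 30 bp TSO sequence quoted in the text


def trim_read2(seq, min_polya=10):
    """Return (kept_sequence, ts, pa) for a read 2 sequence.

    ts mimics the ts:i tag (TSO bases trimmed from the 5' end) and
    pa mimics the pa:i tag (polyA bases trimmed from the 3' end).
    Assumption: only an exact, full-length TSO prefix is trimmed, and a
    trailing A run shorter than the hypothetical min_polya is left alone.
    """
    ts = len(TSO) if seq.startswith(TSO) else 0
    body = seq[ts:]
    # Length of the run of A bases at the 3' end.
    pa = len(body) - len(body.rstrip("A"))
    if pa < min_polya:
        pa = 0
    kept = body[: len(body) - pa]
    return kept, ts, pa


if __name__ == "__main__":
    read = TSO + "CGTACGTTGC" + "A" * 12
    print(trim_read2(read))
```

In the real BAM record the trimmed bases are soft-clipped rather than removed, so the full sequence stays in the record while only the kept portion is aligned.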

For polyA-based assays, Space Ranger uses STAR to perform splicing-aware alignment of transcript reads to the genome. This is the default algorithm unless the --probe-set option is invoked for FFPE tissues (see below). After alignment, Space Ranger uses the transcript annotation GTF file to count each read as either exonic, intronic, or intergenic. Space Ranger counts a read as exonic if at least 50% of it intersects an exon, and as intronic if > 50% of its bases map to a gene but not to one of that gene’s exons. Otherwise, the read is counted as intergenic. For reads that align to a single exonic locus but also align to one or more non-exonic loci, the exonic locus is prioritized and the read is considered to be confidently mapped to the exonic locus with a mapping quality (MAPQ) of 255. All uniquely mapping reads have a MAPQ of 255. For multi-mapping reads, the MAPQ score is defined as

$$\mathrm{MAPQ} = \mathrm{int}\left(-10 \log_{10}\left(1 - \frac{1}{N_{map}}\right)\right)$$

where $N_{map}$ is the number of loci a read can map to. The BAM tag NH:i gives the value of $N_{map}$, which is 1 for uniquely mapped reads and > 1 for multi-mapped reads.
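To make the formula concrete, here is a minimal Python sketch that evaluates it for a few values of the NH:i tag. The function name multimap_mapq is hypothetical; the special case for uniquely mapped reads follows the statement above that they receive a MAPQ of 255 (the formula itself is undefined when the number of loci is 1).

```python
import math


def multimap_mapq(n_map):
    """MAPQ for a read that maps to n_map loci (the NH:i tag value)."""
    if n_map < 1:
        raise ValueError("n_map must be at least 1")
    if n_map == 1:
        # Uniquely mapped reads are assigned 255 rather than using the formula.
        return 255
    return int(-10 * math.log10(1 - 1 / n_map))


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(n, multimap_mapq(n))
```

Under this formula a read with two candidate loci gets MAPQ 3, three or four loci give MAPQ 1, and five or more give MAPQ 0.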

Space Ranger further aligns exonic reads to annotated transcripts, looking for compatibility. A read whose bases are 100% compatible with the exons of an annotated transcript, and that is aligned to the same strand, is considered mapped to the transcriptome. Space Ranger ignores antisense reads, defined as any read with alignments to an entire gene on the opposite strand and no sense alignments. If the read is compatible with a single gene annotation, it is considered uniquely (confidently) mapped to the transcriptome. These confidently mapped reads are the only ones considered for UMI counting.

In Visium probe-based assays, whole transcriptome probe panels, consisting of a pair of probes for each targeted gene, are added to the tissue. These probe pairs hybridize to their target transcript and are then ligated together.

To analyze FFPE data it is necessary to use the --probe-set option to specify a probe set reference CSV file. When this option is invoked, Space Ranger will count ligation events using the probe aligner algorithm. Reads are also aligned to the reference transcriptome using STAR, but only to determine their alignment positions and CIGAR strings; STAR alignments are not used to assign reads to genes for FFPE data. Sequencing reads are aligned to the probe set reference and assigned to the genes they target. The probe alignment algorithm is similar to a seed-and-extend aligner, where each half of the read is a seed, and is described in detail below.

  • Build an index of the half-probe sequences in the probe set reference CSV.
  • For each read, look up each read half in this index, allowing up to one mismatch and no indels.
  • If both halves of the read map to the same probe ID, the read is confidently mapped with a MAPQ (mapping quality) of 255. Only reads with a mapping quality value of 255 contribute to UMI counts.
  • If one half maps and the other half does not map, compare the sequence of the unmapped half to the expected sequence for that probe, allowing mismatches but not indels. If the comparison meets the minimum alignment score (30, scoring +1 for a match and -1 for a mismatch), the read is confidently mapped.
  • If the unmapped half of the read does not match the expected sequence for that probe, the read is half-mapped with a MAPQ of 1, and does not contribute to UMI counts.
  • If both halves of the read map but to different probes, the read is ambiguously mapped with a MAPQ of 3, and does not contribute to UMI counts.
  • If neither read half aligns to the probe set reference by the probe aligner algorithm, the read is unmapped and will have MAPQ=0 (column 5). If neither read half aligns to the probe set reference by the probe aligner algorithm or to the reference transcriptome by STAR, the read will be UNMAPPED (0x4) in FLAG (column 2) and also have MAPQ=0.
  • All reads with up to three mismatches are guaranteed to align.
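The decision logic in the steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Space Ranger implementation: the names build_index, lookup, and classify_read are hypothetical, the probe set is represented as a plain dictionary of (LHS, RHS) half-probe sequences rather than the probe set reference CSV, the read is assumed to split evenly into two halves, and the rescue step is assumed to accept a score equal to the minimum. The toy probes are only 8 bases per half, so the example passes a small min_score explicitly.

```python
from collections import defaultdict


def build_index(probes):
    """Index half-probe sequences: sequence -> set of probe IDs, per side."""
    lhs, rhs = defaultdict(set), defaultdict(set)
    for probe_id, (left, right) in probes.items():
        lhs[left].add(probe_id)
        rhs[right].add(probe_id)
    return lhs, rhs


def lookup(half, index):
    """Probe IDs whose half-probe matches with at most one mismatch, no indels."""
    hits = set(index.get(half, ()))
    for i, base in enumerate(half):
        for alt in "ACGT":
            if alt != base:
                hits |= index.get(half[:i] + alt + half[i + 1:], set())
    return hits


def score(observed, expected):
    """+1 per matching base, -1 per mismatching base."""
    return sum(1 if a == b else -1 for a, b in zip(observed, expected))


def classify_read(read, probes, min_score=30):
    """Return (MAPQ, probe IDs) following the bullet list above."""
    lhs_index, rhs_index = build_index(probes)
    mid = len(read) // 2
    left, right = read[:mid], read[mid:]
    left_hits, right_hits = lookup(left, lhs_index), lookup(right, rhs_index)

    shared = left_hits & right_hits
    if shared:                       # both halves map to the same probe
        return 255, sorted(shared)
    if left_hits and right_hits:     # both halves map, but to different probes
        return 3, sorted(left_hits | right_hits)
    if left_hits or right_hits:      # one half maps: try to rescue the other
        hits = left_hits or right_hits
        for probe_id in sorted(hits):
            expected = probes[probe_id][1] if left_hits else probes[probe_id][0]
            observed = right if left_hits else left
            if score(observed, expected) >= min_score:
                return 255, [probe_id]
        return 1, sorted(hits)       # half-mapped
    return 0, []                     # unmapped by the probe aligner


if __name__ == "__main__":
    toy = {"P1": ("ACGTACGT", "TTGGCCAA"), "P2": ("GGGGAAAA", "CCCCTTTT")}
    print(classify_read("ACGTACGT" + "TTGGCCAA", toy, min_score=4))
    print(classify_read("ACGTACGT" + "CCCCTTTT", toy, min_score=4))
```

Only the reads that come back with a MAPQ of 255 would go on to contribute to UMI counts.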

The BAM tag pr:Z reports a semicolon-separated list of probe IDs. See the Space Ranger BAM page.

Reads are confidently mapped only if left-hand side (LHS) and right-hand side (RHS) sequences are correctly paired.

To determine whether a given barcode sequence is correct, Space Ranger compares the observed barcodes to the known barcodes for a given assay chemistry, which are stored in a barcode whitelist file. For example, there are 4,992 barcodes in the whitelist for Visium v1/v2 slides with 6.5 mm capture area, 14,336 barcodes for Visium v2 slides with 11 mm capture area, and over 11 million barcodes for Visium HD slides (6.5 mm capture area).

For Visium v1/v2, Space Ranger uses the following algorithm to correct putative barcode sequences against the whitelist:

  1. Count the observed frequency of every barcode on the whitelist in the dataset.
  2. For every observed barcode in the dataset that is not on the whitelist and is at a Hamming distance of at most one from a whitelist sequence:
  • Compute the posterior probability that the observed barcode originated from the whitelist barcode but has a sequencing error at the differing base (based on the base quality score). This probability is used as the threshold for deciding which barcodes are corrected and which are discarded.
  • The corrected barcodes are used for all downstream analysis and output files.
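The correction steps above can be sketched in Python as shown below. This is a sketch under assumptions, not the pipeline's code: the function name correct_barcode is hypothetical, the prior for each whitelist barcode is taken to be its observed frequency with a pseudocount of one, the error probability comes from the Phred quality of the differing base, and the threshold value is an illustrative parameter because this page does not state the cutoff that is used.

```python
def correct_barcode(observed, quals, whitelist_counts, threshold=0.975):
    """Return the corrected barcode, or None if the read cannot be corrected.

    observed         -- barcode sequence as read (the CR tag)
    quals            -- Phred quality score for each barcode base
    whitelist_counts -- whitelist barcode -> observed frequency in the dataset
    threshold        -- illustrative posterior cutoff (an assumption)
    """
    if observed in whitelist_counts:
        return observed  # already on the whitelist, nothing to correct

    total = sum(whitelist_counts.values()) + len(whitelist_counts)
    likelihoods = {}
    for i, base in enumerate(observed):
        p_error = 10 ** (-quals[i] / 10)  # Phred score -> error probability
        for alt in "ACGT":
            if alt == base:
                continue
            candidate = observed[:i] + alt + observed[i + 1:]
            if candidate in whitelist_counts:
                prior = (whitelist_counts[candidate] + 1) / total
                likelihoods[candidate] = prior * p_error

    if not likelihoods:
        return None  # no whitelist barcode within Hamming distance 1

    best = max(likelihoods, key=likelihoods.get)
    posterior = likelihoods[best] / sum(likelihoods.values())
    return best if posterior >= threshold else None


if __name__ == "__main__":
    counts = {"AAAA": 100, "CCCC": 50, "AATT": 1}
    # The low-quality last base makes AAAA by far the most likely origin.
    print(correct_barcode("AAAT", [30, 30, 30, 10], counts))
```

A return value of None corresponds to a read that keeps its CR tag but receives no CB tag.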

In the output BAM file, the original uncorrected barcode is encoded in the CR tag, and the corrected barcode sequence is encoded in the CB tag. Reads that cannot be assigned a corrected barcode will not have a CB tag.

Visium HD follows a similar algorithm, with the following differences. The first 43 bases of Read 1 include the UMI and the barcode. Instead of Hamming distance, Space Ranger corrects barcodes using the edit distance, which allows for insertions, deletions, and substitutions. Up to four edits are permissible to correct a barcode to the whitelist.
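To illustrate the difference between Hamming distance and edit distance, the following Python sketch computes the standard Levenshtein distance and checks it against the four-edit limit mentioned above. The function names edit_distance and is_correctable are hypothetical, and this is only the distance measure; how Space Ranger searches the Visium HD whitelist and chooses among candidates is not described here.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion from a
                current[j - 1] + 1,                    # insertion into a
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def is_correctable(observed, whitelist_barcode, max_edits=4):
    """True if the observed barcode is within max_edits of the whitelist barcode."""
    return edit_distance(observed, whitelist_barcode) <= max_edits


if __name__ == "__main__":
    # A single deleted base is one edit, even though every later base shifts.
    print(edit_distance("ACGTACGT", "ACTACGT"))
```

A single insertion or deletion shifts every downstream base, so it can look like many mismatches under Hamming distance while counting as one edit here.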

The probe reference is filtered to remove genes/features where one or more of the probes targeting this feature might hybridize and ligate at non-targeted loci. Probes that are predicted to have off-target activity to homologous genes or sequences are excluded from analysis by default. These probes are marked with FALSE in the included column of the probe set reference CSV. Any gene that has at least one probe with predicted off-target activity will be excluded from filtered outputs. Setting the --filter-probes=false command line argument of spaceranger count will result in UMI counts from all non-deprecated probes, including those with predicted off-target activity, being used in the analysis. Probes whose ID is prefixed with DEPRECATED are always excluded from the analysis.

Before counting UMIs, Space Ranger attempts to correct for sequencing errors in the UMI sequences. Reads that were confidently mapped to the transcriptome are placed into groups that share the same barcode, UMI, and gene annotation. If two groups of reads have the same barcode and gene, but their UMIs differ by a single base (i.e., are Hamming distance 1 apart), then one of the UMIs was likely introduced by a substitution error in sequencing. In this case, the UMI of the less-supported read group is corrected to the UMI with higher support.

Space Ranger again groups the reads by barcode, UMI (possibly corrected), and gene annotation. If two or more groups of reads have the same barcode and UMI, but different gene annotations, the gene annotation with the most supporting reads is kept for UMI counting, and the other read groups are discarded. In case of a tie for maximal read support, all read groups are discarded, as the gene cannot be confidently assigned.

After these two filtering steps, each observed barcode, UMI, gene/feature combination is recorded as a UMI count in the unfiltered feature-barcode matrix. The number of reads supporting each counted UMI is also recorded in the molecule info file.
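The two filtering steps and the final count can be sketched in Python as follows. This is a simplified illustration, not the pipeline's code: the function name count_umis is hypothetical, reads are represented as (barcode, UMI, gene) tuples that are assumed to be confidently mapped already, a UMI is only corrected toward a neighbor with strictly higher read support, and chains of corrections are not followed.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def count_umis(reads):
    """Return (matrix, molecule_reads) from (barcode, umi, gene) tuples.

    matrix         -- (barcode, gene) -> UMI count
    molecule_reads -- (barcode, umi, gene) -> number of supporting reads
    """
    support = Counter(reads)

    # Step 1: within each barcode and gene, correct a UMI to a better
    # supported UMI that is Hamming distance 1 away.
    by_bc_gene = defaultdict(dict)
    for (bc, umi, gene), n in support.items():
        by_bc_gene[(bc, gene)][umi] = n
    corrected = Counter()
    for (bc, gene), umis in by_bc_gene.items():
        for umi, n in umis.items():
            better = [(m, other) for other, m in umis.items()
                      if m > n and hamming(umi, other) == 1]
            target = max(better)[1] if better else umi
            corrected[(bc, target, gene)] += n

    # Step 2: for each barcode and UMI, keep the gene with the most reads;
    # discard the UMI entirely if the top support is tied.
    by_bc_umi = defaultdict(dict)
    for (bc, umi, gene), n in corrected.items():
        by_bc_umi[(bc, umi)][gene] = n
    matrix, molecule_reads = Counter(), {}
    for (bc, umi), genes in by_bc_umi.items():
        top = max(genes.values())
        winners = [g for g, n in genes.items() if n == top]
        if len(winners) == 1:
            matrix[(bc, winners[0])] += 1
            molecule_reads[(bc, umi, winners[0])] = top
    return matrix, molecule_reads


if __name__ == "__main__":
    reads = ([("BC1", "AAAA", "G1")] * 5 + [("BC1", "AAAT", "G1")]
             + [("BC1", "CCCC", "G1")] * 3 + [("BC1", "CCCC", "G2")]
             + [("BC1", "GGGG", "G1")] * 2 + [("BC1", "GGGG", "G2")] * 2)
    print(count_umis(reads))
```

In the example, AAAT is merged into AAAA, CCCC is assigned to G1 because it has more supporting reads than G2, and GGGG is dropped because its two genes tie.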

The Modifiable Areal Unit Problem (MAUP) is a principle in geography that highlights the influence of spatial unit delineation on analytical outcomes (Zormpas et al., 2023). This concept underscores that the scale of units used in spatial analysis can affect interpretation and findings. By default, Space Ranger v3.0 outputs Visium HD data at three scales, offering 8 µm and 16 µm bin sizes in addition to the native 2 µm resolution. Bins offer the advantage of containing more transcripts per areal unit on average (an 8 µm bin provides a 16-fold increase on average compared to a 2 µm square), thereby increasing the signal-to-noise ratio while maintaining single cell scale resolution. Space Ranger users can also set a custom bin size in microns (only even integer values between 10 and 100 are allowed), or customize binning with third party tools. Although optimal bin size might vary depending on the research focus, tissue type, and other variables, the 8 µm bin size is anticipated to be an effective starting point for most researchers. Nevertheless, it is recommended to remain aware of the MAUP and other scale effects when working with Visium HD data.
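The 16-fold figure follows from the geometry: an 8 µm bin covers (8 / 2)² = 16 of the 2 µm squares. The Python sketch below aggregates per-square counts into larger bins by integer division of the square coordinates. It is illustrative only; the function name bin_counts and the (row, column) indexing of squares are assumptions, not the layout of the Space Ranger output files.

```python
from collections import Counter


def bin_counts(square_counts, bin_um, square_um=2):
    """Sum counts from 2 µm squares into bin_um x bin_um bins.

    square_counts -- (row, col) index of a 2 µm square -> UMI count
    """
    if bin_um % square_um != 0:
        raise ValueError("bin size must be a multiple of the square size")
    squares_per_side = bin_um // square_um
    binned = Counter()
    for (row, col), count in square_counts.items():
        binned[(row // squares_per_side, col // squares_per_side)] += count
    return binned


if __name__ == "__main__":
    squares = {(0, 0): 1, (3, 3): 2, (4, 0): 5, (0, 4): 1}
    print(bin_counts(squares, bin_um=8))
```

Re-running the same data with different bin_um values is a simple way to see the scale dependence that the MAUP describes.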

Space Ranger automatically detects spots under the tissue section (see Tissue Detection). Only the barcodes or bins associated with these spots are captured in the filtered feature-barcode matrix and are used for downstream analyses. Users can alternatively select tissue manually with manual alignment in Loupe.

For Visium HD, bins are determined to be under tissue if any of the underlying 2 µm barcoded squares are under tissue.

Space Ranger uses the feature barcode matrix and the spatial location data from the image to generate secondary analysis results including clusters, t-SNE and UMAP projections as well as differential gene expression between the clusters. All of the secondary data are recorded in the cloupe.cloupe file(s) which can be visualized in Loupe Browser.

For Visium HD, secondary analysis is only performed at 8 µm / 16 µm / custom bin sizes (not 2 µm).

PCA: In order to reduce the gene expression matrix to its most important features, Space Ranger uses Principal Components Analysis (PCA) to change the dimensionality of the dataset from (spots x genes) to (spots x M), where M is 10. The pipeline uses a Rust implementation of a randomized block Krylov algorithm (Musco & Musco, 2015).

t-SNE: For visualizing data in 2-D space, Space Ranger passes the PCA-reduced data into t-distributed Stochastic Neighbor Embedding (t-SNE), a nonlinear dimensionality reduction method (Van der Maaten, 2014). The C++ reference implementation by Van der Maaten (2014) was modified to take a PRNG seed for determinism. The runtime is also decreased by fixing the number of output dimensions at compile time to two or three. The t-SNE projection is visualized in both web_summary.html and in Loupe Browser.

t-SNE is not run on Visium HD data for performance reasons. Users can alternatively use UMAP.

UMAP: Space Ranger also supports Uniform Manifold Approximation and Projection (UMAP), which estimates a topology of the high dimensional data and uses this information to estimate a low dimensional embedding that preserves relationships between datapoints (McInnes et al., 2018). The pipeline uses a custom Rust implementation of this algorithm. UMAP coordinates are available in the pipeline output, but not displayed in the web_summary.html and in Loupe Browser.

Space Ranger uses two different methods for clustering spots by expression similarity, both of which operate in the PCA representation.

Graph-based clustering: The graph-based clustering algorithm consists of building a sparse nearest-neighbor graph (where spots are linked if they are among the k nearest Euclidean neighbors of one another), followed by Louvain Modularity Optimization (LMO; Blondel et al., 2008), an algorithm which seeks to find highly-connected "modules" in the graph. The value of k, the number of nearest neighbors, is set to scale logarithmically with the number of spots. An additional cluster-merging step is done: Perform hierarchical clustering on the cluster-medoids in PCA space and merge pairs of sibling clusters if there are no genes differentially expressed between them (with B-H adjusted p-value below 0.05). The hierarchical clustering and merging is repeated until there are no more cluster-pairs to merge. The use of LMO to cluster spots was inspired by a similar method in the R package Seurat.

K-means clustering: Space Ranger also performs K-means clustering across a range of K values, where K is the preset number of clusters.

In order to identify genes whose expression is specific to each cluster, Space Ranger tests, for each gene and each cluster, whether the in-cluster mean differs from the out-of-cluster mean. In order to find differentially expressed genes between groups of spots, Space Ranger uses the quick and simple method sSeq (Yu, Huber, & Vitek, 2013), which employs a negative binomial exact test. When the counts become large, Space Ranger switches to the fast asymptotic beta test used in edgeR (Robinson & Smyth, 2007). For each cluster, the algorithm is run on that cluster versus all other spots, yielding a list of genes that are differentially expressed in that cluster relative to the rest of the sample.

Space Ranger's implementation differs slightly from that in the paper. In the sSeq paper, the authors recommend using DESeq's geometric mean-based definition of library size (Love et al., 2014). Space Ranger instead computes relative library size as the total UMI counts for each spot divided by the median UMI counts per spot. As with sSeq, normalization is implicit in that the per-spot library-size parameter is incorporated as a factor in the exact-test probability calculations.
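The relative library size described above is a simple ratio, shown in the Python sketch below. The function name relative_library_sizes and the dictionary input are hypothetical conveniences for the example.

```python
from statistics import median


def relative_library_sizes(spot_totals):
    """Total UMI count per spot divided by the median total across spots."""
    med = median(spot_totals.values())
    return {spot: total / med for spot, total in spot_totals.items()}


if __name__ == "__main__":
    totals = {"spot_a": 100, "spot_b": 200, "spot_c": 400}
    print(relative_library_sizes(totals))
```

A spot with the median number of UMIs gets a size factor of 1, and the factor then enters the exact-test probability calculation rather than being used to rescale the counts directly.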

For Visium v1/v2, Space Ranger quantifies spatial enrichment, measured by Moran's I, as a discovery tool that can be useful to identify features that have distinct patterns of expression (spatial autocorrelation). Moran’s I is not related to differential expression as it is independent from any clustering information. Features with similar Moran's I do not necessarily have similar spatial expression patterns although methods do exist for identifying groups of genes with similar spatial enrichment patterns. The Moran's I metric scale ranges from -1 (perfectly dispersed) to 1 (perfectly enriched). This feature is not supported for Visium HD.
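For reference, the textbook definition of Moran's I can be sketched in Python as below. This uses binary neighbor weights, which is an assumption made for the example: the spatial weights that Space Ranger uses are not described on this page, and the function name morans_i and the adjacency-list input are hypothetical.

```python
def morans_i(values, neighbors):
    """Moran's I with binary weights.

    values    -- expression value of one feature at each spot (list)
    neighbors -- spot index -> list of neighboring spot indices
    """
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    variance_sum = sum(d * d for d in deviations)
    weight_sum = sum(len(nbrs) for nbrs in neighbors.values())
    if variance_sum == 0 or weight_sum == 0:
        raise ValueError("Moran's I is undefined for constant values or no neighbors")
    cross = sum(deviations[i] * deviations[j]
                for i, nbrs in neighbors.items() for j in nbrs)
    return (n / weight_sum) * cross / variance_sum


if __name__ == "__main__":
    line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}  # four spots in a row
    print(morans_i([1, 0, 1, 0], line))  # alternating values: dispersed
    print(morans_i([1, 1, 0, 0], line))  # grouped values: enriched
```

On four spots in a row, alternating values give a score at the dispersed end of the scale, while grouping the high values together gives a positive score.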

Space Ranger v2.1 and later supports reference-free spot deconvolution on every Visium v1/v2 sample that contains a gene expression library. This feature is not supported for Visium HD.

The algorithm is built on Latent Dirichlet Allocation (LDA) and is similar to STdeconvolve (Miller et al., 2022). Given an initial K = N graph-based clusters + 2, LDA is performed using all gene-expression features with at least 10 UMIs in the sample and spots with at least 100 features detected. Each spot is assigned an estimated proportion from K topics, thought of as cell types or cell type mixtures, which add to 1. Next, hierarchical clustering is performed using average linkage and Manhattan distance. For each K until K = 2 is reached, similar branches are collapsed and the sum is taken between each collapsed pair. This allows the user to explore different numbers of topics depending on a priori knowledge of cell types in their sample and the granularity of cell types they want to deconvolve spots to. Dendrograms named dendrogram_k{N}.png and dendrogram_k{N}_distances.png show how each K-1 is collapsed and the distance between topics, respectively. For each K, deconvolution_topic_features_k{N}.csv and deconvolved_spots_k{N}.csv are produced. deconvolution_topic_features_k{N}.csv contains a pseudocount of each feature in each topic as well as a log2 fold change calculated relative to all other topics. This file can be used like a differential expression table to annotate each topic. deconvolved_spots_k{N}.csv shows the proportion of each topic within each spot, representing the estimated mixture of cells within a spot. Space Ranger and Loupe have also streamlined the process of picking the number of topics and annotating topics by designing a GUI-based workflow.

Because of the inherent technical background common to antibody-based assays, 10x Genomics uses isotype controls to normalize the counts of antibody data. For each protein feature in each barcode, the UMI counts are divided by 1 + the sum of isotype control UMI counts. These counts are next multiplied by a scale factor of 10,000 and floored to an integer. These isotype-normalized counts are contained within the filtered_feature_bc_matrix.h5 and the filtered_feature_bc_matrix/matrix.mtx.gz files and are what is shown in Loupe Browser for all isotype-normalized protein data. The raw feature-barcode matrices contain the raw antibody UMI counts for each barcode. The factors in isotype_normalization_factors.csv can be multiplied with the raw counts to obtain the isotype-normalized counts in the filtered feature-barcode matrices.
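The arithmetic for a single barcode can be sketched in Python as follows. The function name isotype_normalize and the feature names in the example are hypothetical; integer floor division is used so that the result equals the floor of count / (1 + isotype sum) × 10,000 without floating-point rounding.

```python
def isotype_normalize(counts, isotype_controls, scale=10_000):
    """Isotype-normalize the antibody UMI counts of one barcode.

    counts           -- protein feature name -> raw UMI count
    isotype_controls -- names of the isotype control features
    """
    denominator = 1 + sum(counts.get(name, 0) for name in isotype_controls)
    # floor(count / denominator * scale), computed with integers only.
    return {name: (count * scale) // denominator
            for name, count in counts.items()}


if __name__ == "__main__":
    raw = {"CD3": 50, "CD4": 7, "IgG1": 3, "IgG2a": 1}
    print(isotype_normalize(raw, ["IgG1", "IgG2a"]))
```

Here the two isotype controls sum to 4, so every count in the barcode is divided by 5 before scaling; the per-barcode normalization factor is therefore 10,000 / 5 = 2,000.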

Protein expression is not supported for Visium HD or Visium v1.

Blondel, V. D., Guillaume, J. L., Lambiotte, R. & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008.

Love, M. I., Huber, W. & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15, 550.

McInnes, L., Healy, J. & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv.

Miller, B.F., Huang, F., Atta, L. et al. (2022). Reference-free cell type deconvolution of multi-cellular pixel-resolution spatially resolved transcriptomics data. Nature Communications 13, 2339.

Musco, C. & Musco, C. (2015). Randomized block Krylov methods for stronger and faster approximate singular value decomposition. Advances in neural information processing systems, 28.

Robinson, M. D. & Smyth, G. K. (2007). Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics 9, 321–332.

Van der Maaten, L. (2014). Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research 15, 3221-3245.

Yu, D., Huber, W. & Vitek, O. (2013). Shrinkage estimation of dispersion in Negative Binomial models for RNA-seq experiments with small sample size. Bioinformatics 29, 1275–1282.

Zormpas, E., Queen, R., Comber, A. & Cockell, S. J. (2023). Mapping the transcriptome: Realizing the full potential of spatial data analysis. Cell 186, 5677–5689.