Space Ranger Secondary Analysis CSV

The spaceranger count pipeline outputs several CSV files that contain automated secondary analysis results. A subset of these results is used to render the Analysis View in the web_summary.html file.

Before clustering, Principal Component Analysis (PCA) is run on the normalized filtered feature-barcode matrix to reduce the number of feature dimensions. Both gene expression and antibody features are used as PCA features; for antibody features, the log-transformed counts are used. PCA produces five output files.


analysis/pca
└── gene_expression_10_components
    ├── components.csv
    ├── dispersion.csv
    ├── features_selected.csv
    ├── projection.csv
    └── variance.csv

The projection.csv file contains the projection of each spot onto the first N principal components. By default N=10.


$ cd /home/jdoe/runs/sample345/outs
$ column -s, -t < analysis/pca/gene_expression_10_components/projection.csv | less -S

Barcode             PC-1                   PC-2                    PC-3                   PC-4                   PC-5                    PC-6                    PC-7                   PC-8                    PC-9                    PC-10
AACACTTGGCAAGGAA-1  4.346822844040275      -9.073988527954281      -3.9348855477667715    4.4143616349096835     0.570902992727801       5.916871998370152       2.480636375841689      -0.06798408872536493    -0.19559617312320177    1.8447556106163412
AACAGGATTCATAGTT-1  -1.615594647200382     -1.4893042055593746     7.5739700779328665     -3.594441916107372     -0.34089358717427726    2.111157673540723       0.7226241085802059     -3.9462479306752436     0.7109160992468775      0.2148672225802757
AACAGGTTATTGCACC-1  11.032392266516446     -8.48766121740853       -3.061209741692746     -1.0179508777455186    -0.3086495689242125     -1.7476955635612388     -4.667269353092443     3.0867661655728873      4.077976646698517       3.4325955564744035
AACAGGTTCACCGAAG-1  0.02261690362615809    -1.1836459670547157     -0.4219683969014265    -0.9704969551004782    0.042818261398003474    0.7016418174052369      0.5984518607384657     0.4370020158231471      5.6108084569945715      -0.5928326084763261
AACAGTCAGGCTCCGC-1  23.551530490594487     1.485566122772231       -4.061849114221165     -3.572810445316029     0.7253401628543874      8.335238428414028       -0.27411229186554853   -1.419600005890016      8.151194312679634       -0.4650714219420635
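As a rough illustration of the file layout, the sketch below parses a projection.csv-style table with Python's standard csv module. The sample holds the first two rows from the listing above, truncated to three components and four decimal places; the helper name read_projection is hypothetical and not part of Space Ranger.

```python
import csv
import io

# First two rows of the listing above, truncated to three components.
SAMPLE = """Barcode,PC-1,PC-2,PC-3
AACACTTGGCAAGGAA-1,4.3468,-9.0740,-3.9349
AACAGGATTCATAGTT-1,-1.6156,-1.4893,7.5740
"""

def read_projection(handle):
    """Return {barcode: [PC-1, PC-2, ...]} from a projection.csv-style file."""
    reader = csv.reader(handle)
    header = next(reader)              # "Barcode", "PC-1", "PC-2", ...
    n_components = len(header) - 1
    projection = {}
    for row in reader:
        if not row:                    # skip blank lines
            continue
        coords = [float(value) for value in row[1:]]
        if len(coords) != n_components:
            raise ValueError(f"unexpected column count for {row[0]}")
        projection[row[0]] = coords
    return projection

projection = read_projection(io.StringIO(SAMPLE))
print(len(projection), "spots,", len(projection["AACACTTGGCAAGGAA-1"]), "components")
```

To read a real run, the same function could be given a file opened with open("analysis/pca/gene_expression_10_components/projection.csv", newline="").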

The components.csv file contains the components matrix, which gives the loading of each feature on each principal component, i.e. how much that feature contributed to the component. Features that were not included in PCA have all of their loading values set to zero.


$ head -2 analysis/pca/gene_expression_10_components/components.csv

PC,ENSG00000228327,ENSG00000237491,ENSG00000177757,ENSG00000225880,...,ENSG00000160310
1,-0.0044,0.0039,-0.0024,-0.0016,...,-0.0104

The features_selected.csv file contains the Ensembl IDs of the features with the highest dispersion that were selected for use in the principal component calculations.


$ column -s, -t < analysis/pca/gene_expression_10_components/features_selected.csv | less -S

Feature
1        ENSMUSG00000114038
2        ENSMUSG00000058063
3        ENSMUSG00000087216
4        ENSMUSG00000085244
5        ENSMUSG00000021604

The variance.csv file records the proportion of total variance explained by each principal component. When choosing how many principal components are significant, it is useful to plot the variance explained as a function of PC rank: once the values start to flatten out, subsequent PCs are unlikely to represent meaningful variation in the data.


$ column -s, -t < analysis/pca/gene_expression_10_components/variance.csv | less -S

PC  Proportion.Variance.Explained
1   0.006020454455283148
2   0.0014744138318528535
3   0.0012400447266735174
4   0.0009462466452900335
5   0.0009012382233475119
6   0.0008795663315577918
7   0.0008772635528060896
8   0.0008770449415125795
9   0.0008671600964701859
10  0.0008598483035027898
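The "flattening out" described above can be approximated numerically. The sketch below is one simple heuristic, not something Space Ranger itself computes: it reports the first PC whose variance is less than 10% larger than the next PC's. The threshold, the function name flattening_point, and the truncation of the sample values are all assumptions made for illustration.

```python
import csv
import io

# First six rows of the listing above, truncated to six decimal places.
SAMPLE = """PC,Proportion.Variance.Explained
1,0.006020
2,0.001474
3,0.001240
4,0.000946
5,0.000901
6,0.000880
"""

def read_variance(handle):
    """Return the variance proportions ordered by PC rank."""
    rows = sorted(csv.DictReader(handle), key=lambda row: int(row["PC"]))
    return [float(row["Proportion.Variance.Explained"]) for row in rows]

def flattening_point(variances, min_relative_drop=0.10):
    """Return the 1-based rank of the first PC whose relative drop to the
    next PC is below min_relative_drop (a hypothetical threshold)."""
    for rank in range(1, len(variances)):
        previous, current = variances[rank - 1], variances[rank]
        if (previous - current) / previous < min_relative_drop:
            return rank
    return len(variances)

variances = read_variance(io.StringIO(SAMPLE))
elbow = flattening_point(variances)
print("curve flattens after PC", elbow)
```

A plot remains the more reliable guide; a fixed threshold like this is sensitive to noise in the tail of the curve.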

The dispersion.csv file lists the normalized dispersion of each feature, computed after binning features by their mean expression across the dataset. This provides a useful measure of the variability of each feature.


$ column -s, -t < analysis/pca/gene_expression_10_components/dispersion.csv | less -S

Feature          Normalized.Dispersion
ENSG00000187634  0.6831683505253648
ENSG00000188976  -0.14721475503619233
ENSG00000187961  2.2333235330589933
ENSG00000187583  -0.1377803092462445
ENSG00000187642  -0.4131854711145404
ENSG00000188290  -0.6689923111662834
ENSG00000187608  -1.0069025521553716
ENSG00000188157  0.1691687357833229
ENSG00000237330  2.0109141055507394
ENSG00000131591  -1.4170406794742954
ENSG00000162571  2.501396789174146
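Since features_selected.csv lists the features with the highest dispersion, a similar ranking can be reproduced from dispersion.csv. The sketch below picks the top n features by normalized dispersion from a six-row sample taken from the listing above (values rounded); the value of n and the helper name top_dispersion are illustrative assumptions, and the number of features Space Ranger actually selects is not stated here.

```python
import csv
import io

# Six rows from the listing above, rounded to four decimal places.
SAMPLE = """Feature,Normalized.Dispersion
ENSG00000187634,0.6832
ENSG00000188976,-0.1472
ENSG00000187961,2.2333
ENSG00000187583,-0.1378
ENSG00000237330,2.0109
ENSG00000162571,2.5014
"""

def top_dispersion(handle, n=3):
    """Return the n feature IDs with the highest normalized dispersion."""
    rows = list(csv.DictReader(handle))
    rows.sort(key=lambda row: float(row["Normalized.Dispersion"]), reverse=True)
    return [row["Feature"] for row in rows[:n]]

selected = top_dispersion(io.StringIO(SAMPLE))
print(selected)
```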

For gene expression, t-distributed Stochastic Neighbor Embedding (t-SNE) is run after PCA to visualize spots in 2-D space. Note that the antibody t-SNE plot is generated using the log-transformed counts.


$ column -s, -t < analysis/tsne/gene_expression_2_components/projection.csv | less -S

Barcode             TSNE-1                 TSNE-2
AACACTTGGCAAGGAA-1  1.2672117740192608     25.047625819665186
AACAGGATTCATAGTT-1  0.04778171834588573    4.016509598383599
AACAGGTTATTGCACC-1  18.80364109918134      18.684080610445474
AACAGGTTCACCGAAG-1  1.99715394789933       -9.208697881938745
AACAGTCAGGCTCCGC-1  38.15012452500775      2.0611329330125514
AACAGTCCACGCGGTG-1  -1.9209290038167077    -32.80566322209981
AACATAGTCTATCTAC-1  24.641739427754395     4.132453609694308
AACATCTTAAGGCTCA-1  22.693280619738776     -4.616978161185022
AACCAATCTGGTTGGC-1  5.883220436323025      -20.80497990643471
AACCACTGCCATAGCC-1  -8.471808255953594     -12.06184466119581
AACCAGAATCAGACGT-1  11.670881660483042     4.385137546311761

For gene expression, Uniform Manifold Approximation and Projection (UMAP) is also run after PCA to visualize spots in 2-D space. Note that the antibody UMAP plot is generated using the log-transformed counts.


$ column -s, -t < analysis/umap/gene_expression_2_components/projection.csv | less -S

Barcode             UMAP-1              UMAP-2
AACACTTGGCAAGGAA-1  10.310660096919259  7.813228392659608
AACAGGATTCATAGTT-1  9.20511225151223    5.568023946107357
AACAGGTTATTGCACC-1  12.291062284438889  6.940462987013961
AACAGGTTCACCGAAG-1  9.032031636927861   7.064727855092599
AACAGTCAGGCTCCGC-1  13.326524472133555  4.742776277209383
AACAGTCCACGCGGTG-1  9.27174981223149    0.7703902647873845
AACATAGTCTATCTAC-1  11.73081758571688   4.510761419587083
AACATCTTAAGGCTCA-1  11.816231548622202  3.0618744238318683
AACCAATCTGGTTGGC-1  8.917922202723922   1.723589437141921
AACCACTGCCATAGCC-1  8.090373623491763   3.0793685741491017
AACCAGAATCAGACGT-1  12.501168063179637  5.342741923918538
AACCGCCAGACTACTT-1  8.042049094337923   2.931341403074622

Spots that have similar expression profiles are clustered together based on their projection into PCA space for gene expression and antibody features.

Graph-based clustering (under graphclust) is run once as it does not require a pre-specified number of clusters. K-means (under kmeans) is run for many values of K=2,...,N, where K corresponds to the number of clusters, and N=10 by default. The corresponding results for each K value are separated into their own directory.


analysis/clustering
├── gene_expression_graphclust
├── gene_expression_kmeans_10_clusters
├── gene_expression_kmeans_2_clusters
├── gene_expression_kmeans_3_clusters
├── gene_expression_kmeans_4_clusters
├── gene_expression_kmeans_5_clusters
├── gene_expression_kmeans_6_clusters
├── gene_expression_kmeans_7_clusters
├── gene_expression_kmeans_8_clusters
└── gene_expression_kmeans_9_clusters
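Because the directory names sort lexicographically (10 before 2 in the listing above), scripts that iterate over K usually need to extract K from the name. A minimal sketch, using the directory names from the listing; the helper kmeans_dirs_by_k is hypothetical:

```python
import re

# Directory names exactly as they appear in the listing above.
NAMES = [
    "gene_expression_graphclust",
    "gene_expression_kmeans_10_clusters",
    "gene_expression_kmeans_2_clusters",
    "gene_expression_kmeans_3_clusters",
    "gene_expression_kmeans_4_clusters",
    "gene_expression_kmeans_5_clusters",
    "gene_expression_kmeans_6_clusters",
    "gene_expression_kmeans_7_clusters",
    "gene_expression_kmeans_8_clusters",
    "gene_expression_kmeans_9_clusters",
]

KMEANS_DIR = re.compile(r"gene_expression_kmeans_(\d+)_clusters")

def kmeans_dirs_by_k(names):
    """Map K to its directory name, ordered by increasing K."""
    found = {}
    for name in names:
        match = KMEANS_DIR.fullmatch(name)
        if match:                      # graphclust and other names are skipped
            found[int(match.group(1))] = name
    return dict(sorted(found.items()))

by_k = kmeans_dirs_by_k(NAMES)
print(list(by_k))
```

On disk, the names could come from os.listdir("analysis/clustering") instead of the hard-coded list.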

For each clustering, spaceranger produces cluster assignments for each spot.


$ column -s, -t < analysis/clustering/gene_expression_kmeans_6_clusters/clusters.csv | less -S

Barcode             Cluster
AACACTTGGCAAGGAA-1  2
AACAGGATTCATAGTT-1  5
AACAGGTTATTGCACC-1  3
AACAGGTTCACCGAAG-1  2
AACAGTCAGGCTCCGC-1  6
AACAGTCCACGCGGTG-1  4
AACATAGTCTATCTAC-1  6
AACATCTTAAGGCTCA-1  4
AACCAATCTGGTTGGC-1  4
AACCACTGCCATAGCC-1  2
AACCAGAATCAGACGT-1  3
AACCGCCAGACTACTT-1  2
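A quick sanity check on any clustering is the number of spots per cluster. The sketch below tallies a clusters.csv-style table using the first six rows of the listing above; the function name cluster_sizes is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# First six rows of the listing above.
SAMPLE = """Barcode,Cluster
AACACTTGGCAAGGAA-1,2
AACAGGATTCATAGTT-1,5
AACAGGTTATTGCACC-1,3
AACAGGTTCACCGAAG-1,2
AACAGTCAGGCTCCGC-1,6
AACAGTCCACGCGGTG-1,4
"""

def cluster_sizes(handle):
    """Return a Counter mapping cluster ID to the number of spots assigned to it."""
    return Counter(int(row["Cluster"]) for row in csv.DictReader(handle))

sizes = cluster_sizes(io.StringIO(SAMPLE))
for cluster, count in sorted(sizes.items()):
    print(f"cluster {cluster}: {count} spots")
```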

spaceranger also produces a table indicating which features are differentially expressed in each cluster relative to all other clusters. For each feature we compute three values per cluster:

The mean UMI counts per spot of this feature in cluster i
The $\log_{2}$ fold-change of this feature's expression in cluster i relative to all other clusters
The p-value denoting significance of this feature's expression in cluster i relative to other clusters, adjusted to account for the number of hypotheses (i.e. number of features) being tested.

This table is located in a different directory from the clustering results (analysis/diffexp rather than analysis/clustering), but follows the same structure, with each clustering separated into its own directory.


$ column -s, -t < analysis/diffexp/gene_expression_kmeans_6_clusters/differential_expression.csv | less -S

Feature ID       Feature Name  Cluster 1 Mean Counts  Cluster 1 Log2 fold change  Cluster 1 Adjusted p value          Cluster 2 Mean Counts  Cluster 2 Log2 fold change  Cluster 2 Adjusted p value                  Cluster 3 Mean Counts  Cluster 3 Log2 fold change  Cluster 3 Adjusted p value  Cluster 4 Mean Counts  Cluster 4 Log2 fold change  Cluster 4 Adjusted p value                            Cluster 5 Mean Counts  Cluster 5 Log2 fold change  Cluster 5 Adjusted p value  Cluster 6 Mean Counts  Cluster 6 Log2 fold change  Cluster 6 Adjusted p value
ENSG00000187634  SAMD11        0.029518663633325348   -0.6095322929009148         0.15858746256099554                 0.03872574484664482    0.22461461249795533         0.9833486026864531                          0.05434796766793789    0.7717892513698335          0.5539781031134247          0.04715646443259539    0.5514286743311745          0.5353925121590467                                    0.06172687981624832    1.220250418944373           0.993462403034926           0                      4.571576338266795           1
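Because the table is wide, with three columns per cluster, it is often easier to regroup each row by cluster before filtering. The sketch below does this for the SAMD11 row shown above, truncated to the first two clusters and four decimal places. The helper per_cluster_stats and the cutoffs (adjusted p-value below 0.05, positive fold-change) are illustrative assumptions, not Space Ranger defaults.

```python
import csv
import io
import re

# The SAMD11 row from the listing above, limited to clusters 1 and 2.
SAMPLE = (
    "Feature ID,Feature Name,"
    "Cluster 1 Mean Counts,Cluster 1 Log2 fold change,Cluster 1 Adjusted p value,"
    "Cluster 2 Mean Counts,Cluster 2 Log2 fold change,Cluster 2 Adjusted p value\n"
    "ENSG00000187634,SAMD11,0.0295,-0.6095,0.1586,0.0387,0.2246,0.9833\n"
)

COLUMN = re.compile(r"Cluster (\d+) (Mean Counts|Log2 fold change|Adjusted p value)")

def per_cluster_stats(row):
    """Regroup one wide row into {cluster: {statistic name: value}}."""
    stats = {}
    for name, value in row.items():
        match = COLUMN.fullmatch(name)
        if match:                      # skips "Feature ID" and "Feature Name"
            cluster, field = int(match.group(1)), match.group(2)
            stats.setdefault(cluster, {})[field] = float(value)
    return stats

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
stats = per_cluster_stats(rows[0])

# Hypothetical cutoffs: adjusted p-value below 0.05 and a positive fold-change.
enriched = [cluster for cluster, s in stats.items()
            if s["Adjusted p value"] < 0.05 and s["Log2 fold change"] > 0]
print(rows[0]["Feature Name"], "enriched in clusters:", enriched)
```

For this row no cluster passes the cutoffs, which matches the large adjusted p-values in the listing.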

Data structures produced by Visium can be analyzed and visualized in R or Python. For suggestions on downstream analysis with third-party R and Python tools, see the 10x Genomics Analysis Guides resource.
