The spaceranger count pipeline outputs several CSV files that contain automated secondary analysis results. A subset of these results is used to render the Analysis View in the web_summary.html file.
Before clustering, Principal Component Analysis (PCA) is run on the normalized filtered feature-barcode matrix to reduce the number of feature dimensions. Gene expression and antibody features are used as PCA features; for antibody features, the log-transformed counts are used. PCA produces five output files.
The projection.csv file contains the projection of each spot onto the first N principal components. By default, N=10.
The components.csv file contains the components matrix, which indicates how much each feature contributed to each principal component (the loadings). Features that were not included in the PCA have all of their loading values set to zero.
The features_selected.csv file contains the Ensembl IDs of the features with the highest dispersion that were selected for use in the principal component calculations.
The variance.csv file records the proportion of total variance explained by each principal component.
When choosing the number of principal components that are significant, it is useful to look at the plot of variance explained as a function of PC rank: when the values start to flatten out, subsequent PCs are unlikely to represent meaningful variation in the data.
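As a rough illustration of reading the variance curve, the sketch below counts the leading PCs whose individual contribution stays above a cutoff and reports the cumulative variance explained. The function names, the `min_gain` cutoff, and the example proportions are all hypothetical; none of them come from spaceranger, and the layout of variance.csv is not assumed here.

```python
def cumulative_variance(proportions):
    """Return the running total of variance explained, PC by PC."""
    total = 0.0
    running = []
    for p in proportions:
        total += p
        running.append(total)
    return running


def count_pcs_above(proportions, min_gain=0.01):
    """Count leading PCs that each explain at least `min_gain` of the variance.

    `min_gain` is an arbitrary illustrative cutoff, not a spaceranger setting.
    """
    kept = 0
    for p in proportions:
        if p < min_gain:
            break  # the curve has flattened below the cutoff
        kept += 1
    return kept


# Made-up proportions for ten PCs, for illustration only.
example = [0.30, 0.15, 0.08, 0.04, 0.02, 0.008, 0.006, 0.005, 0.004, 0.003]
print(count_pcs_above(example))
print(cumulative_variance(example)[-1])
```

A fixed cutoff is only one heuristic; inspecting the plot directly, as described above, remains the more reliable check.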
The dispersion.csv file lists the normalized dispersion of each feature, after binning features by their mean expression across the dataset. This provides a useful measure of the variability of each feature.
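To make the idea of binning by mean expression concrete, here is a minimal sketch of one way a dispersion could be normalized within bins. The exact formula used by spaceranger is not stated here, so everything in this block is an assumption: dispersion is taken as variance divided by mean, bins hold equal numbers of features, and each value is scaled by its bin's median and median absolute deviation.

```python
import statistics


def normalized_dispersion(means, variances, n_bins=2):
    """Normalize per-feature dispersion within bins of similar mean expression.

    Assumptions (not taken from spaceranger): dispersion is variance / mean,
    bins hold equal numbers of features, and normalization is
    (dispersion - bin median) / bin median absolute deviation.
    All means must be greater than zero.
    """
    dispersion = [v / m for m, v in zip(means, variances)]
    # Rank features by mean expression, then cut the ranking into bins.
    order = sorted(range(len(means)), key=lambda i: means[i])
    bin_size = -(-len(order) // n_bins)  # ceiling division
    result = [0.0] * len(means)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        values = [dispersion[i] for i in members]
        median = statistics.median(values)
        mad = statistics.median(abs(v - median) for v in values)
        for i in members:
            # Guard against a zero MAD when a bin has identical dispersions.
            result[i] = (dispersion[i] - median) / mad if mad > 0 else 0.0
    return result


# Toy per-feature means and variances, invented for illustration.
toy_means = [1, 2, 3, 10, 20, 30]
toy_variances = [1, 4, 9, 10, 40, 90]
print(normalized_dispersion(toy_means, toy_variances))
```

Normalizing within bins keeps highly expressed features from dominating simply because their raw dispersion scales with their mean.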
For gene expression, after running PCA, t-distributed Stochastic Neighbor Embedding (t-SNE) is run to visualize spots in a 2-D space. Note that the antibody t-SNE plot is generated using the log-transformed counts.
For gene expression, after running PCA, Uniform Manifold Approximation and Projection (UMAP) is run to visualize spots in a 2-D space. Note that the antibody UMAP plot is generated using the log-transformed counts.
Spots that have similar expression profiles are clustered together based on their projection into PCA space for gene expression and antibody features.
Graph-based clustering (under graphclust) is run once, as it does not require a pre-specified number of clusters. K-means (under kmeans) is run for each value of K=2,...,N, where K is the number of clusters and N=10 by default. The results for each value of K are written to their own directory.
clustering
├── gene_expression_graphclust
├── gene_expression_kmeans_10_clusters
├── gene_expression_kmeans_2_clusters
├── gene_expression_kmeans_3_clusters
├── gene_expression_kmeans_4_clusters
├── gene_expression_kmeans_5_clusters
├── gene_expression_kmeans_6_clusters
├── gene_expression_kmeans_7_clusters
├── gene_expression_kmeans_8_clusters
└── gene_expression_kmeans_9_clusters
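When scripting over these results, the directory names can be generated from K; they do not need to be hard-coded. The helper below is a hypothetical convenience function that only mirrors the naming pattern visible in the listing above; the `prefix` parameter is an assumption for illustration.

```python
def clustering_dirs(max_k=10, prefix="gene_expression"):
    """List per-clustering directory names following the pattern shown above."""
    names = [f"{prefix}_graphclust"]  # graph-based clustering is run once
    # One K-means directory for each K from 2 up to max_k inclusive.
    names += [f"{prefix}_kmeans_{k}_clusters" for k in range(2, max_k + 1)]
    return names


for name in clustering_dirs():
    print(name)
```

Note that this returns the directories in numeric order of K, whereas the listing above is sorted alphabetically.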
For each clustering, spaceranger produces cluster assignments for each spot.
spaceranger also produces a table indicating which features are differentially expressed in each cluster relative to all other clusters. For each feature we compute three values per cluster:
- The mean UMI counts per spot of this feature in cluster i
- The log2 fold-change of this feature's expression in cluster i relative to all other clusters
- The p-value denoting significance of this feature's expression in cluster i relative to other clusters, adjusted to account for the number of hypotheses (i.e. number of features) being tested.
This table is located in a different directory from the clustering results, but follows the same structure, with each clustering separated into its own directory.
$ column -s, -t < analysis/diffexp/gene_expression_kmeans_6_clusters/differential_expression.csv | less -S
Feature ID       Feature Name  Cluster 1 Mean Counts  Cluster 1 Log2 fold change  Cluster 1 Adjusted p value  Cluster 2 Mean Counts  Cluster 2 Log2 fold change  Cluster 2 Adjusted p value  Cluster 3 Mean Counts  Cluster 3 Log2 fold change  Cluster 3 Adjusted p value  Cluster 4 Mean Counts  Cluster 4 Log2 fold change  Cluster 4 Adjusted p value  Cluster 5 Mean Counts  Cluster 5 Log2 fold change  Cluster 5 Adjusted p value  Cluster 6 Mean Counts  Cluster 6 Log2 fold change  Cluster 6 Adjusted p value
ENSG00000187634  SAMD11        0.029518663633325348   -0.6095322929009148         0.15858746256099554         0.03872574484664482    0.22461461249795533         0.9833486026864531          0.05434796766793789    0.7717892513698335          0.5539781031134247          0.04715646443259539    0.5514286743311745          0.5353925121590467          0.06172687981624832    1.220250418944373           0.993462403034926           0                      4.571576338266795           1
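The same table can be filtered programmatically. The sketch below picks the features that are significantly enriched in one cluster, using the column names from the header above. The function name, the 0.05 threshold, and the toy table (with invented feature names and values) are illustrative assumptions, not part of spaceranger.

```python
import csv
import io


def top_features(csv_text, cluster, max_adj_p=0.05):
    """Return (name, log2 fold change) pairs enriched in one cluster."""
    lfc_col = f"Cluster {cluster} Log2 fold change"
    p_col = f"Cluster {cluster} Adjusted p value"
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Keep features that are up-regulated and pass the p-value cutoff.
        if float(row[p_col]) <= max_adj_p and float(row[lfc_col]) > 0:
            hits.append((row["Feature Name"], float(row[lfc_col])))
    # Largest fold change first.
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Toy two-cluster table with invented feature names and values.
toy = """Feature ID,Feature Name,Cluster 1 Mean Counts,Cluster 1 Log2 fold change,Cluster 1 Adjusted p value,Cluster 2 Mean Counts,Cluster 2 Log2 fold change,Cluster 2 Adjusted p value
ID_A,GENE_A,5.0,2.5,0.001,0.5,-2.5,0.001
ID_B,GENE_B,1.0,0.4,0.6,0.9,-0.4,0.6
ID_C,GENE_C,3.0,1.2,0.01,1.0,-1.2,0.01
"""
print(top_features(toy, cluster=1))
```

To run this on real output, read the contents of differential_expression.csv into a string and pass it in place of `toy`.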
Data structures produced by Visium can be analyzed and visualized in R or Python. For suggestions on downstream analysis with third-party R and Python tools, see the 10x Genomics Analysis Guides resource.