The count, multi, aggr, and reanalyze pipelines output several CSV files that contain automated secondary analysis results. A subset of these results is used to render the Analysis view in the count run summary and the Cells and Library views in the multi run summary.
Before clustering the cells, Principal Component Analysis (PCA) is run on the normalized filtered feature-barcode matrix to reduce the number of feature (gene) dimensions. Only gene expression features are used as PCA features. PCA produces five output files. The first is a projection of each cell onto the first N principal components. By default N=10 (N=100 when chemistry batch correction is enabled); when running reanalyze, you can choose to increase it.
cd /home/jdoe/runs/sample345/outs
head -2 analysis/pca/gene_expression_10_components/projection.csv
Barcode,PC-1,PC-2,PC-3,PC-4,PC-5,PC-6,PC-7,PC-8,PC-9,PC-10
AAACAAGCACCATACT-1,18.55496347631502,-8.428877305709332,3.7717969735420835,-0.61215157678172,-1.0987614379684771,2.194733668965279,-2.6595895212967386,-2.8703699622639114,1.867229094193604,0.2658532968798859
The second file is a components matrix that gives the loading of each feature on each principal component, that is, how much the feature contributed to that component. Features that were not included in the PCA have all of their loading values set to zero.
head -2 analysis/pca/gene_expression_10_components/components.csv
PC,ENSG00000228327,ENSG00000237491,ENSG00000177757,ENSG00000225880,...,ENSG00000160310
1,-0.0044,0.0039,-0.0024,-0.0016,...,-0.0104
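To see which features drive a given component, one option is to rank the loadings by absolute value. The Python sketch below is a minimal illustration under stated assumptions: the helper name `top_loadings` is hypothetical, and the inline sample keeps only the five columns visible in the excerpt above (the columns elided by `...` are left out rather than guessed).

```python
import csv
import io

# Only the columns visible in the components.csv excerpt above; the real file
# has one column per feature. In practice, open
# analysis/pca/gene_expression_10_components/components.csv instead.
SAMPLE = """PC,ENSG00000228327,ENSG00000237491,ENSG00000177757,ENSG00000225880,ENSG00000160310
1,-0.0044,0.0039,-0.0024,-0.0016,-0.0104
"""

def top_loadings(handle, pc, n):
    """Return the n (feature_id, loading) pairs with the largest |loading| for one PC."""
    reader = csv.reader(handle)
    header = next(reader)
    feature_ids = header[1:]  # first column is the PC index
    for row in reader:
        if int(row[0]) == pc:
            loadings = [float(value) for value in row[1:]]
            ranked = sorted(zip(feature_ids, loadings),
                            key=lambda pair: abs(pair[1]), reverse=True)
            return ranked[:n]
    raise ValueError(f"PC {pc} not found")

top = top_loadings(io.StringIO(SAMPLE), pc=1, n=3)
for feature_id, loading in top:
    print(feature_id, loading)
```

Ranking by absolute value treats strongly negative and strongly positive loadings alike, since both indicate a large contribution to the component.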
The third file contains the Ensembl IDs of the features with the highest dispersion that were selected for use in the principal component calculations.
head -5 analysis/pca/gene_expression_10_components/features_selected.csv
Feature
1,ENSG00000167723
2,ENSG00000179029
3,ENSG00000196544
4,ENSG00000141499
The fourth file records the proportion of total variance explained by each principal component. When choosing how many principal components are significant, it is useful to plot the variance explained as a function of PC rank: once the values start to flatten out, subsequent PCs are unlikely to represent meaningful variation in the data.
head -5 analysis/pca/gene_expression_10_components/variance.csv
PC,Proportion.Variance.Explained
1,0.0056404970744118104
2,0.0038897311237809061
3,0.0028803714818085419
4,0.0020830581822081206
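One way to check where the variance explained starts to flatten out, without plotting, is to look at the drop between consecutive PCs. The Python sketch below is illustrative only: the function names `load_variance` and `successive_drops` are hypothetical, and the inline sample reuses the rows shown above so that the block runs without a Cell Ranger output directory.

```python
import csv
import io

# Rows copied from the variance.csv excerpt above; in practice, open
# analysis/pca/gene_expression_10_components/variance.csv instead.
SAMPLE = """PC,Proportion.Variance.Explained
1,0.0056404970744118104
2,0.0038897311237809061
3,0.0028803714818085419
4,0.0020830581822081206
"""

def load_variance(handle):
    """Return a list of (pc_rank, proportion) tuples from a variance.csv handle."""
    reader = csv.DictReader(handle)
    return [(int(row["PC"]), float(row["Proportion.Variance.Explained"]))
            for row in reader]

def successive_drops(variance):
    """Return the decrease in variance explained between consecutive PCs."""
    return [prev[1] - cur[1] for prev, cur in zip(variance, variance[1:])]

variance = load_variance(io.StringIO(SAMPLE))
drops = successive_drops(variance)
for (pc, _), drop in zip(variance[1:], drops):
    print(f"PC {pc - 1} -> PC {pc}: drop of {drop:.6f}")
```

When the drops become small and roughly constant, the curve has flattened; where to place the cutoff remains a judgment call for your dataset.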
The final file lists the normalized dispersion of each feature, after binning features by their mean expression across the dataset. This provides a useful measure of variability of each feature.
head -5 analysis/pca/gene_expression_10_components/dispersion.csv
Feature,Normalized.Dispersion
ENSG00000228327,2.0138970131886671
ENSG00000237491,1.3773662040549017
ENSG00000177757,-0.28102027567224191
ENSG00000225880,1.9887312950109921
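As a small illustration of using this file, the Python sketch below sorts features by normalized dispersion and keeps the most variable ones. The helper name `top_dispersed` is hypothetical and the inline sample is copied from the excerpt above; this is not necessarily how cellranger selects features internally.

```python
import csv
import io

# Rows copied from the dispersion.csv excerpt above; in practice, open
# analysis/pca/gene_expression_10_components/dispersion.csv instead.
SAMPLE = """Feature,Normalized.Dispersion
ENSG00000228327,2.0138970131886671
ENSG00000237491,1.3773662040549017
ENSG00000177757,-0.28102027567224191
ENSG00000225880,1.9887312950109921
"""

def top_dispersed(handle, n):
    """Return the n (feature_id, normalized_dispersion) pairs with the highest dispersion."""
    rows = [(row["Feature"], float(row["Normalized.Dispersion"]))
            for row in csv.DictReader(handle)]
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:n]

top = top_dispersed(io.StringIO(SAMPLE), n=2)
for feature_id, dispersion in top:
    print(feature_id, dispersion)
```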
After running PCA, t-distributed Stochastic Neighbor Embedding (t-SNE) is run to visualize cells in a 2-D space.
head -5 analysis/tsne/gene_expression_2_components/projection.csv
Barcode,TSNE-1,TSNE-2
AAACATACAACGAA-1,-13.5494,1.4674
AAACATACTACGCA-1,-2.7325,-10.6347
AAACCGTGTCTCGC-1,12.9590,-1.6369
AAACGCACAACCAC-1,-9.3585,-6.7300
After running PCA, Uniform Manifold Approximation and Projection (UMAP) is run to visualize cells in a 2-D space.
head -5 analysis/umap/gene_expression_2_components/projection.csv
Barcode,UMAP-1,UMAP-2
AAACCTGAGAATAGGG-1,0.5974335,1.320372
AAACCTGAGAGCTGGT-1,2.2277818,-0.52756095
AAACCTGAGCGTTGCC-1,2.675832,1.1010709
AAACCTGCACGGACAA-1,2.7049212,-3.1494563
Clustering is then run to group cells that have similar expression profiles, based on their projection into PCA space. Graph-based clustering (under graphclust) is run once, as it does not require a pre-specified number of clusters. K-means (under kmeans) is run for each K=2,...,N, where K is the number of clusters. By default N=10; when running reanalyze, you can choose to increase it. The results for each K are separated into their own directory.
ls analysis/clustering
gene_expression_graphclust
gene_expression_kmeans_10_clusters
gene_expression_kmeans_2_clusters
gene_expression_kmeans_3_clusters
gene_expression_kmeans_4_clusters
gene_expression_kmeans_5_clusters
gene_expression_kmeans_6_clusters
gene_expression_kmeans_7_clusters
gene_expression_kmeans_8_clusters
gene_expression_kmeans_9_clusters
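Note that the listing sorts lexicographically, so the K=10 directory appears before K=2. If you iterate over the K-means results in a script, a sketch like the following recovers the K values in numeric order. The pattern and the helper name `kmeans_values` are assumptions based only on the directory names shown above.

```python
import re

# Directory names copied from the listing above; in practice these could come
# from os.listdir("analysis/clustering").
NAMES = [
    "gene_expression_graphclust",
    "gene_expression_kmeans_10_clusters",
    "gene_expression_kmeans_2_clusters",
    "gene_expression_kmeans_3_clusters",
    "gene_expression_kmeans_4_clusters",
    "gene_expression_kmeans_5_clusters",
    "gene_expression_kmeans_6_clusters",
    "gene_expression_kmeans_7_clusters",
    "gene_expression_kmeans_8_clusters",
    "gene_expression_kmeans_9_clusters",
]

# Assumed naming pattern, inferred from the listing above.
KMEANS_DIR = re.compile(r"^gene_expression_kmeans_(\d+)_clusters$")

def kmeans_values(names):
    """Return the K values found in K-means directory names, sorted numerically."""
    ks = []
    for name in names:
        match = KMEANS_DIR.match(name)
        if match:  # skips graphclust and any unrelated entries
            ks.append(int(match.group(1)))
    return sorted(ks)

ks = kmeans_values(NAMES)
print(ks)
```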
For each clustering, cellranger produces cluster assignments for each cell.
head -5 analysis/clustering/gene_expression_kmeans_3_clusters/clusters.csv
Barcode,Cluster
AAACAAGCACCATACT-1,1
AAACAAGCACGTAATG-1,1
AAACAAGCATGCAATG-1,1
AAACAAGCATTTGGGA-1,1
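A common first check on a clustering is the number of cells assigned to each cluster. The Python sketch below tallies the Cluster column; the helper name `cluster_sizes` is hypothetical, and the inline sample contains only the four rows shown above, so every cell falls in cluster 1.

```python
import csv
import io
from collections import Counter

# Rows copied from the clusters.csv excerpt above; in practice, open
# analysis/clustering/gene_expression_kmeans_3_clusters/clusters.csv instead.
SAMPLE = """Barcode,Cluster
AAACAAGCACCATACT-1,1
AAACAAGCACGTAATG-1,1
AAACAAGCATGCAATG-1,1
AAACAAGCATTTGGGA-1,1
"""

def cluster_sizes(handle):
    """Return a Counter mapping cluster label (int) to the number of barcodes."""
    return Counter(int(row["Cluster"]) for row in csv.DictReader(handle))

sizes = cluster_sizes(io.StringIO(SAMPLE))
for cluster, n_cells in sorted(sizes.items()):
    print(f"cluster {cluster}: {n_cells} cells")
```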
cellranger also produces a table indicating which features are differentially expressed in each cluster relative to all other clusters. For each feature and each cluster i, we compute three values:
- The mean expression of this feature in cluster i (i.e., across cells assigned to cluster i)
- The log2 fold-change of this feature's mean expression in cluster i relative to all other cells
- A p-value denoting significance of this feature's expression in cluster i relative to cells in other clusters. P-values within each cluster are adjusted for false discovery rate to account for the number of hypotheses (i.e., number of features) being tested.
For details on the mean expression normalization and statistical test, see algorithms.
Differential expression results are located in a different directory from the clustering results but follow the same structure, with each clustering separated into its own directory.
head -5 analysis/diffexp/gene_expression_kmeans_3_clusters/differential_expression.csv
Feature ID,Feature Name,Cluster 1 Mean UMI Counts,Cluster 1 Log2 fold change,Cluster 1 Adjusted p value,Cluster 2 Mean UMI Counts,Cluster 2 Log2 fold change,Cluster 2 Adjusted p value,Cluster 3 Mean UMI Counts,Cluster 3 Log2 fold change,Cluster 3 Adjusted p value
ENSG00000228327,RP11-206L10.2,0.0056858989363338264,2.6207666981569986,0.00052155805898912184,0.0,-0.75299726644507814,0.64066099091888962,0.00071455453829430329,-2.3725403666493312,0.0043023680184636837
ENSG00000237491,RP11-206L10.9,0.00012635330969630726,-0.31783275717885928,0.40959138980118809,0.0,3.8319652342760779,0.11986963938734894,0.0,0.56605908868652577,0.39910771338768203
ENSG00000177757,FAM87B,0.0,-2.9027952579000154,0.0,0.0,3.2470027335549219,0.19129034227967889,0.00071455453829430329,3.1510215894076818,0.0
ENSG00000225880,LINC00115,0.0003790599290889218,-5.71015017995762,8.4751637615375386e-28,0.20790015775229512,7.965820981010868,1.3374521290889345e-46,0.0017863863457357582,-2.2065304152104019,0.00059189960914085744
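Because the table is wide (three columns per cluster), it can help to pull out the features that are upregulated in a single cluster. The Python sketch below does this for the rows shown above. The helper name `upregulated` and the thresholds (adjusted p-value below 0.05, log2 fold-change of at least 1) are illustrative assumptions, not cellranger defaults.

```python
import csv
import io

# Rows copied from the differential_expression.csv excerpt above; in practice, open
# analysis/diffexp/gene_expression_kmeans_3_clusters/differential_expression.csv.
SAMPLE = """Feature ID,Feature Name,Cluster 1 Mean UMI Counts,Cluster 1 Log2 fold change,Cluster 1 Adjusted p value,Cluster 2 Mean UMI Counts,Cluster 2 Log2 fold change,Cluster 2 Adjusted p value,Cluster 3 Mean UMI Counts,Cluster 3 Log2 fold change,Cluster 3 Adjusted p value
ENSG00000228327,RP11-206L10.2,0.0056858989363338264,2.6207666981569986,0.00052155805898912184,0.0,-0.75299726644507814,0.64066099091888962,0.00071455453829430329,-2.3725403666493312,0.0043023680184636837
ENSG00000237491,RP11-206L10.9,0.00012635330969630726,-0.31783275717885928,0.40959138980118809,0.0,3.8319652342760779,0.11986963938734894,0.0,0.56605908868652577,0.39910771338768203
ENSG00000177757,FAM87B,0.0,-2.9027952579000154,0.0,0.0,3.2470027335549219,0.19129034227967889,0.00071455453829430329,3.1510215894076818,0.0
ENSG00000225880,LINC00115,0.0003790599290889218,-5.71015017995762,8.4751637615375386e-28,0.20790015775229512,7.965820981010868,1.3374521290889345e-46,0.0017863863457357582,-2.2065304152104019,0.00059189960914085744
"""

def upregulated(rows, cluster, max_adj_p=0.05, min_log2fc=1.0):
    """Return feature names passing both thresholds for one cluster.

    The default thresholds are illustrative assumptions, not pipeline defaults.
    """
    prefix = f"Cluster {cluster} "
    hits = []
    for row in rows:
        log2fc = float(row[prefix + "Log2 fold change"])
        adj_p = float(row[prefix + "Adjusted p value"])
        if adj_p < max_adj_p and log2fc >= min_log2fc:
            hits.append(row["Feature Name"])
    return hits

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for cluster in (1, 2, 3):
    print(cluster, upregulated(rows, cluster))
```

Filtering on both the adjusted p-value and the fold-change avoids reporting features whose difference is statistically significant but small, or large but poorly supported.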
If you analyzed a multi-species experiment, the analysis output will look different. For example, the human-mouse mixing experiment is run to verify system functionality. It consists of mixing approximately 600 human (HEK293T) cells and 600 mouse (3T3) cells in a 1:1 ratio.
cellranger produces a single analysis CSV file indicating whether each GEM contains only a single human cell (hg19), a single mouse cell (mm10), or multiple mouse and human cells (Multiplet).
cd /home/jdoe/runs/sample345/outs
head -5 analysis/gem_classification.csv
barcode,hg19,mm10,call
AAACATACACCTCC-1,3,815,mm10
AAACATACACCTGA-1,14,780,mm10
AAACATACACGTGT-1,2,439,mm10
AAACATACAGACTC-1,700,776,Multiplet
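To summarize the classification, you can tally the call column and compute the fraction of barcodes called as Multiplet. The Python sketch below is a minimal example; the helper name `call_counts` is hypothetical, and the inline sample holds only the four rows shown above, so the resulting fraction is not representative of a real run.

```python
import csv
import io
from collections import Counter

# Rows copied from the gem_classification.csv excerpt above; in practice, open
# analysis/gem_classification.csv instead.
SAMPLE = """barcode,hg19,mm10,call
AAACATACACCTCC-1,3,815,mm10
AAACATACACCTGA-1,14,780,mm10
AAACATACACGTGT-1,2,439,mm10
AAACATACAGACTC-1,700,776,Multiplet
"""

def call_counts(handle):
    """Return a Counter mapping each value of the call column to its row count."""
    return Counter(row["call"] for row in csv.DictReader(handle))

counts = call_counts(io.StringIO(SAMPLE))
total = sum(counts.values())
multiplet_fraction = counts["Multiplet"] / total
print(dict(counts))
print(f"observed multiplet fraction: {multiplet_fraction:.2f}")
```

This counts only multiplets that contain both species; a GEM holding two cells of the same species cannot be identified from this file.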