The multiomic secondary analysis in Cell Ranger ARC involves the following steps:

 PCA for GEX and LSA for ATAC
 t-SNE
 UMAP

 Clustering
 Differential Enrichment Analysis

 Peak-motif Occurrence Mappings
The output of the secondary analysis resides in the outs/analysis directory with the following structure:
analysis
├── clustering
│   ├── atac
│   │   ├── graphclust
│   │   │   ├── clusters.csv
│   │   │   ├── differential_accessibility.csv
│   │   │   └── differential_expression.csv
│   │   ├── kmeans_2_clusters
│   │   │   ├── clusters.csv
│   │   │   ├── differential_accessibility.csv
│   │   │   └── differential_expression.csv
│   │   ├── kmeans_3_clusters
│   │   │   ├── clusters.csv
│   │   │   ├── differential_accessibility.csv
│   │   │   └── differential_expression.csv
│   │   ├── kmeans_4_clusters
│   │   │   ├── clusters.csv
│   │   │   ├── differential_accessibility.csv
│   │   │   └── differential_expression.csv
│   │   └── kmeans_5_clusters
│   │       ├── clusters.csv
│   │       ├── differential_accessibility.csv
│   │       └── differential_expression.csv
│   └── gex
│       ├── graphclust
│       │   ├── clusters.csv
│       │   ├── differential_accessibility.csv
│       │   └── differential_expression.csv
│       ├── kmeans_2_clusters
│       │   ├── clusters.csv
│       │   ├── differential_accessibility.csv
│       │   └── differential_expression.csv
│       ├── kmeans_3_clusters
│       │   ├── clusters.csv
│       │   ├── differential_accessibility.csv
│       │   └── differential_expression.csv
│       ├── kmeans_4_clusters
│       │   ├── clusters.csv
│       │   ├── differential_accessibility.csv
│       │   └── differential_expression.csv
│       └── kmeans_5_clusters
│           ├── clusters.csv
│           ├── differential_accessibility.csv
│           └── differential_expression.csv
├── dimensionality_reduction
│   ├── atac
│   │   ├── lsa_components.csv
│   │   ├── lsa_dispersion.csv
│   │   ├── lsa_features_selected.csv
│   │   ├── lsa_projection.csv
│   │   ├── lsa_variance.csv
│   │   ├── tsne_projection.csv
│   │   └── umap_projection.csv
│   └── gex
│       ├── pca_components.csv
│       ├── pca_dispersion.csv
│       ├── pca_features_selected.csv
│       ├── pca_projection.csv
│       ├── pca_variance.csv
│       ├── tsne_projection.csv
│       └── umap_projection.csv
├── feature_linkage
│   ├── feature_linkage.bedpe
│   └── feature_linkage_matrix.h5
└── tf_analysis
    ├── filtered_tf_bc_matrix
    │   ├── barcodes.tsv.gz
    │   ├── matrix.mtx.gz
    │   └── motifs.tsv
    ├── filtered_tf_bc_matrix.h5
    └── peak_motif_mapping.bed
PCA/LSA
The primary dimensionality reduction method is Principal Component Analysis (PCA) for GEX and Latent Semantic Analysis (LSA) for ATAC. PCA is run on the normalized filtered gene-barcode matrix to reduce the number of feature (gene) dimensions. Only gene expression features are used as PCA features. Likewise, LSA is run on the normalized filtered peak-barcode matrix. PCA (LSA) produces five output files in the directory analysis/dimensionality_reduction, with the prefix pca_ (lsa_) in the subdirectory gex/ (atac/).
The first is a projection of each cell onto the first N principal components (default GEX: N=10; ATAC: N=15).
cd /home/jdoe/runs/sample345/outs
head -2 analysis/dimensionality_reduction/gex/pca_projection.csv
Barcode,PC-1,PC-2,PC-3,PC-4,PC-5,PC-6,PC-7,PC-8,PC-9,PC-10
AAACAGCCAAGCTAAA-1,17.688234040781857,3.7950159394896508,0.12134779569343124,8.891889169739237,1.6561792607584174,3.3562135574248586,2.1045793835246203,5.304589200171137,0.5285869980603226,2.316716491709393
head -2 analysis/dimensionality_reduction/atac/lsa_projection.csv
Barcode,PC-1,PC-2,PC-3,PC-4,PC-5,PC-6,PC-7,PC-8,PC-9,PC-10,PC-11,PC-12,PC-13,PC-14,PC-15
AAACAGCCAAGCTAAA-1,20.240988995299652,9.98961195205192,3.975713841955313,3.6970519526233816,0.5924742121181492,0.2541630680205914,1.8285930634181444,1.3091645487857684,0.1932357739169616,0.09950491463448573,1.3779137917059847,1.5110824109207137,0.421621592950534,0.0952461164327349,0.20614805513560971
The second file is a components matrix which indicates how much each feature contributed (the loadings) to each principal component. Features that were not included in the PCA analysis have all of their loading values set to zero.
head -2 analysis/dimensionality_reduction/gex/pca_components.csv
PC,ENSG00000228327,ENSG00000237491,ENSG00000177757,ENSG00000225880,...,ENSG00000160310
1,0.0044,0.0039,0.0024,0.0016,...,0.0104
head -2 analysis/dimensionality_reduction/atac/lsa_components.csv
PC,chr1:9695143-9697582,chr1:9698212-9701041,...
1,0.5482991923678618,0.6374211593177428,...
The third file lists the features (Ensembl gene IDs for GEX, peaks for ATAC) with the highest dispersion, which were selected for use in the principal component calculations.
head -5 analysis/dimensionality_reduction/gex/pca_features_selected.csv
Feature
1,ENSG00000160223
2,ENSG00000142178
3,ENSG00000160221
4,ENSG00000272825
head -5 analysis/dimensionality_reduction/atac/lsa_features_selected.csv
Feature
1,chr21:39348843-39349752
2,chr21:33542536-33543357
3,chr21:42009762-42010706
4,chr21:39388418-39389224
The fourth file records the proportion of total variance explained by each principal component. When choosing the number of principal components that are significant, it is useful to look at the plot of variance explained as a function of PC rank: when the numbers start to flatten out, subsequent PCs are unlikely to represent meaningful variation in the data.
head -5 analysis/dimensionality_reduction/gex/pca_variance.csv
PC,Proportion.Variance.Explained
1,0.01009176733941617
2,0.0031696809558130652
3,0.002391878968412864
4,0.0020683529204892654
head -5 analysis/dimensionality_reduction/atac/lsa_variance.csv
PC,Proportion.Variance.Explained
1,0.2210095789548742
2,0.03476394600838236
3,0.005925095778867349
4,0.003582945659343343
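As a rough illustration of reading the variance file and locating where the curve flattens, here is a minimal Python sketch. The function names and the min_drop threshold are assumptions made for this example, not part of Cell Ranger ARC; the sample rows are the GEX values shown above.

```python
import csv
import io

# Rows taken from the gex/pca_variance.csv excerpt above.
SAMPLE = """PC,Proportion.Variance.Explained
1,0.01009176733941617
2,0.0031696809558130652
3,0.002391878968412864
4,0.0020683529204892654
"""


def read_variance(handle):
    """Return the variance proportions in PC order."""
    return [float(row["Proportion.Variance.Explained"])
            for row in csv.DictReader(handle)]


def flatten_rank(variance, min_drop=0.001):
    """Return the rank of the last PC before the curve flattens.

    min_drop is a hypothetical threshold: the curve is treated as flat once
    the drop from one PC to the next falls below it.
    """
    for rank in range(1, len(variance)):
        if variance[rank - 1] - variance[rank] < min_drop:
            return rank
    return len(variance)


variance = read_variance(io.StringIO(SAMPLE))
print(flatten_rank(variance))  # prints 2 for the four sample rows
```

For a real run, the io.StringIO sample can be replaced with open("analysis/dimensionality_reduction/gex/pca_variance.csv", newline=""). A fixed threshold is only a starting point; inspecting the plot is still the more reliable check.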
The final file lists the normalized dispersion of each feature, after binning features by their mean expression across the dataset. This provides a useful measure of variability of each feature.
head -5 analysis/dimensionality_reduction/gex/pca_dispersion.csv
Feature,Normalized.Dispersion
ENSG00000228327,2.0138970131886671
ENSG00000237491,1.3773662040549017
ENSG00000177757,0.28102027567224191
ENSG00000225880,1.9887312950109921
head -5 analysis/dimensionality_reduction/atac/lsa_dispersion.csv
Feature,Normalized.Dispersion
chr1:9695143-9697582,0.02029960904777695
chr1:9698212-9701041,0.10379770925583033
chr1:9825253-9827762,1.0
chr1:9829746-9830116,25.528012093307737
t-SNE
After running PCA or LSA, t-distributed Stochastic Neighbor Embedding (t-SNE) is run to visualize cells in a 2D space.
head -5 analysis/dimensionality_reduction/atac/tsne_projection.csv
Barcode,TSNE-1,TSNE-2
AAACAGCCAAGCTAAA-1,9.2136315704327,5.795182388646322
AAACAGCCAAGGTAAC-1,10.9596148671472,17.742914355441265
AAACAGCCAGTAGGTG-1,2.869065977385947,17.55872285065259
AAACAGCCATAATGTC-1,12.495664530357228,1.9561615760448785
head -5 analysis/dimensionality_reduction/gex/tsne_projection.csv
Barcode,TSNE-1,TSNE-2
AAACAGCCAAGCTAAA-1,11.19783100504234,32.672655215753544
AAACAGCCAAGGTAAC-1,24.09848935339985,0.6469769415490979
AAACAGCCAGTAGGTG-1,4.678939739563926,27.328395716680745
AAACAGCCATAATGTC-1,21.49243070779123,27.122233774496824
UMAP
After running PCA or LSA, Uniform Manifold Approximation and Projection (UMAP) is run to visualize cells in a 2D space.
head -5 analysis/dimensionality_reduction/atac/umap_projection.csv
Barcode,UMAP-1,UMAP-2
AAACAGCCAAGCTAAA-1,7.3394675,5.621648
AAACAGCCAAGGTAAC-1,7.112387,0.9921901
AAACAGCCAGTAGGTG-1,4.2560987,6.852242
AAACAGCCATAATGTC-1,6.77568,5.4857235
head -5 analysis/dimensionality_reduction/gex/umap_projection.csv
Barcode,UMAP-1,UMAP-2
AAACAGCCAAGCTAAA-1,7.312935,7.3619266
AAACAGCCAAGGTAAC-1,8.567425,1.38729
AAACAGCCAGTAGGTG-1,8.221492,6.5541673
AAACAGCCATAATGTC-1,5.5689363,7.2709103
Clustering
The per-barcode ATAC and GEX data are sparse, and clustering the data across the large number of features can help discover different cell populations in the sample. Moreover, clustering helps detect differentially accessible peaks or differentially expressed genes in each population.
Clustering is then run to group together cells that have similar profiles, based on their projection into PCA space (GEX) or LSA space (ATAC). Graph-based clustering (under graphclust) is run once, as it does not require a pre-specified number of clusters. K-means (under kmeans) is run for each value of K = 2,...,N, where K is the number of clusters (default N=5).
ls analysis/clustering/atac
graphclust kmeans_2_clusters kmeans_3_clusters
kmeans_4_clusters kmeans_5_clusters
For each clustering, cellranger-arc produces cluster assignments for each cell.
head -5 analysis/clustering/atac/kmeans_3_clusters/clusters.csv
Barcode,Cluster
AAACATACAACGAA-1,2
AAACATACTACGCA-1,2
AAACCGTGTCTCGC-1,1
AAACGCACAACCAC-1,3
head -5 analysis/clustering/gex/kmeans_3_clusters/clusters.csv
Barcode,Cluster
AAACATACAACGAA-1,2
AAACATACTACGCA-1,2
AAACCGTGTCTCGC-1,1
AAACGCACAACCAC-1,3
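Because each barcode receives one cluster label from the ATAC clustering and one from the GEX clustering, the two assignments can be compared barcode by barcode. The following Python sketch cross-tabulates them; the helper names are hypothetical, and the sample rows are based on the two excerpts above (which happen to be identical).

```python
import csv
import io
from collections import Counter

# Rows based on the atac and gex kmeans_3_clusters/clusters.csv excerpts above.
ATAC_SAMPLE = """Barcode,Cluster
AAACATACAACGAA-1,2
AAACATACTACGCA-1,2
AAACCGTGTCTCGC-1,1
AAACGCACAACCAC-1,3
"""
GEX_SAMPLE = ATAC_SAMPLE  # the two excerpts show the same rows


def read_clusters(handle):
    """Map each barcode to its integer cluster label."""
    return {row["Barcode"]: int(row["Cluster"])
            for row in csv.DictReader(handle)}


def cross_tabulate(atac, gex):
    """Count barcodes for every (ATAC cluster, GEX cluster) pair."""
    shared = atac.keys() & gex.keys()
    return Counter((atac[barcode], gex[barcode]) for barcode in shared)


atac = read_clusters(io.StringIO(ATAC_SAMPLE))
gex = read_clusters(io.StringIO(GEX_SAMPLE))
table = cross_tabulate(atac, gex)
for (atac_cluster, gex_cluster), count in sorted(table.items()):
    print(f"ATAC {atac_cluster} / GEX {gex_cluster}: {count} barcodes")
```

Cluster numbers are assigned independently in each clustering, so in real data ATAC cluster 1 need not correspond to GEX cluster 1; the table is meant to reveal which labels co-occur.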
Differential Enrichment Analysis
For each clustering, generated from either the ATAC or the GEX matrix and by either K-means or graph-based clustering, cellranger-arc then produces a table indicating which genes are differentially expressed (differential_expression.csv) and a table indicating which peaks and transcription factor motifs are differentially accessible (differential_accessibility.csv) in each cluster relative to all other clusters, as described in the algorithm documentation. For each feature, whether it is a gene, peak, or transcription factor motif, we compute these three values per cluster:
 The mean UMI counts per cell of this feature in cluster i
 The log2 fold-change of this feature's expression in cluster i relative to all other clusters
 The p-value denoting the significance of this feature's expression in cluster i relative to other clusters, adjusted to account for the number of hypotheses (i.e. the number of features) being tested
Both differential_expression.csv and differential_accessibility.csv are located in the same directory as the clustering results.
head -5 analysis/clustering/atac/graphclust/differential_expression.csv
Feature ID,Feature Name,Cluster 1 Mean UMI Counts,Cluster 1 Log2 fold change,Cluster 1 Adjusted p value,Cluster 2 Mean UMI Counts,Cluster 2 Log2 fold change,Cluster 2 Adjusted p value,Cluster 3 Mean UMI Counts,Cluster 3 Log2 fold change,Cluster 3 Adjusted p value
ENSG00000228327,RP11-206L10.2,0.0056858989363338264,2.6207666981569986,0.00052155805898912184,0.0,0.75299726644507814,0.64066099091888962,0.00071455453829430329,2.3725403666493312,0.0043023680184636837
ENSG00000237491,RP11-206L10.9,0.00012635330969630726,0.31783275717885928,0.40959138980118809,0.0,3.8319652342760779,0.11986963938734894,0.0,0.56605908868652577,0.39910771338768203
ENSG00000177757,FAM87B,0.0,2.9027952579000154,0.0,0.0,3.2470027335549219,0.19129034227967889,0.00071455453829430329,3.1510215894076818,0.0
ENSG00000225880,LINC00115,0.0003790599290889218,5.71015017995762,8.4751637615375386e-28,0.20790015775229512,7.965820981010868,1.3374521290889345e-46,0.0017863863457357582,2.2065304152104019,0.00059189960914085744
head -5 analysis/clustering/atac/graphclust/differential_accessibility.csv
Feature ID,Feature Name,Cluster 1 Mean Counts,Cluster 1 Log2 fold change,Cluster 1 Adjusted p value,Cluster 2 Mean Counts,Cluster 2 Log2 fold change,Cluster 2 Adjusted p value,Cluster 3 Mean Counts,Cluster 3 Log2 fold change,Cluster 3 Adjusted p value
chr1:9695129-9697582,chr1:9695129-9697582,0.014098403818774368,5.823451487250574,2.2659671842098193e-06,4.185745651762137e-09,1.3874516676069444,0.5918812904596457,1.9512762483589925,7.238430090771634,5.00258305609651e-09
chr1:9698210-9701041,chr1:9698210-9701041,0.013761153212430422,6.1502095503083165,7.855686702156565e-07,0.046489553517204636,3.0232327143356246,0.01647646310191049,2.2844378973176838,6.5025499776936115,4.703658999567952e-13
.
.
.
AHR_HUMAN.H11MO.0.B,AHR_HUMAN.H11MO.0.B,1.5229979744677225e-09,0.558490289359965,1.0,1.5229979744575502e-09,1.41990325445066,1.0,1.5229979744838465e-09,2.5360529002402097,1.0
AIRE_HUMAN.H11MO.0.C,AIRE_HUMAN.H11MO.0.C,382.4895824324451,1.366896997726535,0.007214824200990991,4098.191143669588,0.031632664734601475,1.0,124.22927255017468,2.136369782757689,0.0015585067057439586
Notice that the table differential_accessibility.csv for any specific clustering includes differential analysis results for both peaks and transcription factor motifs.
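Since both tables repeat the same three columns for every cluster, significant features can be pulled out per cluster by building the column names from the cluster number. The Python sketch below does this for two rows of the differential_expression.csv excerpt above; the function name and the two cutoffs (max_padj, min_abs_log2fc) are assumptions chosen for illustration, not Cell Ranger ARC defaults.

```python
import csv
import io
import re

# Header and two rows from the differential_expression.csv excerpt above.
SAMPLE = (
    "Feature ID,Feature Name,"
    "Cluster 1 Mean UMI Counts,Cluster 1 Log2 fold change,Cluster 1 Adjusted p value,"
    "Cluster 2 Mean UMI Counts,Cluster 2 Log2 fold change,Cluster 2 Adjusted p value,"
    "Cluster 3 Mean UMI Counts,Cluster 3 Log2 fold change,Cluster 3 Adjusted p value\n"
    "ENSG00000228327,RP11-206L10.2,0.0056858989363338264,2.6207666981569986,"
    "0.00052155805898912184,0.0,0.75299726644507814,0.64066099091888962,"
    "0.00071455453829430329,2.3725403666493312,0.0043023680184636837\n"
    "ENSG00000225880,LINC00115,0.0003790599290889218,5.71015017995762,"
    "8.4751637615375386e-28,0.20790015775229512,7.965820981010868,"
    "1.3374521290889345e-46,0.0017863863457357582,2.2065304152104019,"
    "0.00059189960914085744\n"
)


def significant_features(handle, max_padj=0.05, min_abs_log2fc=1.0):
    """Return {cluster: [(feature_id, log2fc, padj), ...]} for passing features.

    max_padj and min_abs_log2fc are hypothetical cutoffs for this example.
    """
    reader = csv.DictReader(handle)
    clusters = []
    for name in reader.fieldnames:
        match = re.fullmatch(r"Cluster (\d+) Adjusted p value", name)
        if match:
            clusters.append(int(match.group(1)))
    hits = {cluster: [] for cluster in clusters}
    for row in reader:
        for cluster in clusters:
            log2fc = float(row[f"Cluster {cluster} Log2 fold change"])
            padj = float(row[f"Cluster {cluster} Adjusted p value"])
            if padj <= max_padj and abs(log2fc) >= min_abs_log2fc:
                hits[cluster].append((row["Feature ID"], log2fc, padj))
    return hits


hits = significant_features(io.StringIO(SAMPLE))
for cluster, features in hits.items():
    print(cluster, [feature_id for feature_id, _, _ in features])
```

The same function should work on differential_accessibility.csv, because the fold-change and adjusted p-value column names follow the same pattern there.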
Feature Linkage
The feature_linkage.bedpe file in outs/analysis/feature_linkage is a tab-delimited file describing the feature linkages inferred by the pipeline. It follows the BEDPE specification from bedtools and can be loaded directly into the Integrative Genomics Viewer (IGV). See the Feature Linkage Algorithm page for details on how Cell Ranger ARC produces feature linkages.
head -5 analysis/feature_linkage/feature_linkage.bedpe
chr1 817064 817593 chr1 998050 998051 <FAM87B_promoter><AL645608.7> 0.3074 . . 7.3085 180722 peak-gene
chr1 906622 907202 chr1 998050 998051 <AL645608.6_distal><AL645608.7> 0.3544 . . 6.1586 91138 peak-gene
chr1 817064 817593 chr1 999980 1000172 <FAM87B_promoter><HES4> 0.4095 . . 13.1158 182747 peak-gene
chr1 906622 907202 chr1 999980 1000172 <AL645608.6_distal><HES4> 0.4341 . . 15.8455 93164 peak-gene
The columns are defined as follows:
| Column Number | Name | Description |
|---|---|---|
| 1 | chrom1 | The name of the chromosome on which the first end of the feature exists. |
| 2 | start1 | The zero-based starting position of the first end of the feature on chrom1. |
| 3 | end1 | The zero-based ending position of the first end of the feature on chrom1. |
| 4 | chrom2 | The name of the chromosome on which the second end of the feature exists. |
| 5 | start2 | The zero-based starting position of the second end of the feature on chrom2. |
| 6 | end2 | The zero-based ending position of the second end of the feature on chrom2. |
| 7 | name | Defines the name of the linkage with the format <name1><name2>, in which name1 and name2 are based on gene symbol or peak annotation. |
| 8 | score | Linkage correlation, ranging from -1 to 1. |
| 9 | strand1 | Set to ".". |
| 10 | strand2 | Set to ".". |
| 11 | significance | Linkage significance: -log10 p-value after multiple testing correction (false discovery rate). Capped at 299. |
| 12 | distance | Distance in base pairs from feature 2 to feature 1. |
| 13 | linkage_type | Indicates the types of the linked features, e.g. peak-gene for the correlation between an ATAC feature (a peak) and a GEX feature (a gene). |
The distance between features in a feature linkage is defined as follows:
 For linkages between a gene and a peak: the distance in base pairs between the transcription start site (TSS) and the center of the peak. When a gene has multiple TSSs, the TSS position is defined as the midpoint between the leftmost TSS and the rightmost TSS.
 For linkages between two peaks: the distance in base pairs between the centers of the two peaks.
Note that linkage distance can be positive or negative. A positive distance means the genomic coordinates are larger in feature 2 than in feature 1. Because feature linkage is symmetric, only linkages with positive or zero distances are output to feature_linkage.bedpe.
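The distance definition can be checked against the four example rows with a short Python sketch. Two points here are assumptions rather than documented behavior: the interval in columns 5 and 6 of a gene is taken to span its leftmost and rightmost TSS, and the result is truncated to an integer. Under those assumptions the computed values match the distance column of all four rows.

```python
# Rows from the feature_linkage.bedpe excerpt above (whitespace-separated here;
# the real file is tab-delimited, which split() handles as well).
SAMPLE = """\
chr1 817064 817593 chr1 998050 998051 <FAM87B_promoter><AL645608.7> 0.3074 . . 7.3085 180722 peak-gene
chr1 906622 907202 chr1 998050 998051 <AL645608.6_distal><AL645608.7> 0.3544 . . 6.1586 91138 peak-gene
chr1 817064 817593 chr1 999980 1000172 <FAM87B_promoter><HES4> 0.4095 . . 13.1158 182747 peak-gene
chr1 906622 907202 chr1 999980 1000172 <AL645608.6_distal><HES4> 0.4341 . . 15.8455 93164 peak-gene
"""


def parse_linkage(line):
    """Split one BEDPE row into the 13 columns described above."""
    f = line.split()
    return {
        "start1": int(f[1]), "end1": int(f[2]),
        "start2": int(f[4]), "end2": int(f[5]),
        "name": f[6], "score": float(f[7]),
        "significance": float(f[10]), "distance": int(f[11]),
        "linkage_type": f[12],
    }


def center(start, end):
    """Midpoint of an interval, kept as a float."""
    return (start + end) / 2


def linkage_distance(rec):
    """Center of feature 2 minus center of feature 1, truncated (assumed rounding)."""
    return int(center(rec["start2"], rec["end2"]) - center(rec["start1"], rec["end1"]))


records = [parse_linkage(line) for line in SAMPLE.splitlines()]
for rec in records:
    print(rec["name"], rec["distance"], linkage_distance(rec))
```

Each printed line shows the reported distance next to the recomputed one, which makes it easy to spot rows where the assumed rounding would disagree.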
The feature_linkage_matrix.h5 file is a compressed HDF5 file containing the sparse matrices of feature linkage correlation and significance, as well as the feature references. The file hierarchy is as follows:
(root)
├── score
├── significance
├── indices
├── indptr
└── features [HDF5 group]
    ├── _all_tag_keys
    ├── feature_type
    ├── genome
    ├── id
    ├── interval
    └── name
and the member specifications are as follows:
| Column | Type | Description |
|---|---|---|
| score | float64 | Linkage correlation, ranging from -1 to 1. |
| significance | float64 | Linkage significance: -log10 p-value after multiple testing correction (false discovery rate). Capped at 299. |
| indices | int64 | CSR format index array of the matrix. |
| indptr | int64 | CSR format index pointer array of the matrix. |
| feature_type | string | The type of feature reference to which this feature belongs (Gene Expression or Peaks). |
| genome | string | The genome reference for a given feature (e.g., "GRCh38" or "mm10"). For non-gene expression features, this entry is an empty string. |
| id | string | The unique id corresponding to this feature (Ensembl gene IDs for genes or peak coordinates for peaks). |
| interval | string | Specifies TSS coordinates for genes, or peak coordinates for peaks. |
| name | string | A human-readable name associated with this feature (gene symbol for gene features and peak coordinates for peak features). |
The HDF5 group features contains information regarding the feature reference(s) used for the analysis. The datasets within the features group represent columns in a table containing one row per feature. Row and column indices of the linkage matrices refer to rows of this table.
The linkage correlation and linkage significance matrices are n_feature x n_feature sparse matrices sharing the same sparsity pattern, which is defined by indices and indptr.
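To make the shared CSR layout concrete, the following Python sketch walks indptr and indices to recover (row, column, value) triples. It uses a made-up 3 x 3 toy matrix rather than real pipeline output; in practice the four arrays would first be read from the HDF5 file with a library such as h5py, which is not shown here.

```python
def csr_entries(indptr, indices, values):
    """Yield (row, column, value) for every stored entry of a CSR matrix.

    Row r owns the slice indptr[r]:indptr[r + 1] of indices and values.
    """
    for row in range(len(indptr) - 1):
        for k in range(indptr[row], indptr[row + 1]):
            yield row, indices[k], values[k]


# Toy arrays (made-up numbers, for illustration only): 3 features, 3 stored entries.
indptr = [0, 1, 3, 3]           # row 0 has 1 entry, row 1 has 2, row 2 has none
indices = [1, 0, 2]             # column (feature) index of each stored entry
score = [0.3, 0.3, -0.4]        # stands in for the score dataset
significance = [7.3, 7.3, 5.1]  # stands in for the significance dataset

# Both datasets share indices and indptr, so one pass decodes both.
for (row, col, s), (_, _, sig) in zip(
    csr_entries(indptr, indices, score),
    csr_entries(indptr, indices, significance),
):
    print(f"feature {row} <-> feature {col}: score={s}, significance={sig}")
```

The row and column numbers printed here would then be looked up in the features datasets (id, name, and so on) to identify the linked features.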
Peak-motif Occurrence Mappings
Cell Ranger ARC performs a motif scan on peaks and generates a motif-barcode matrix. The output files are located at analysis/tf_analysis, including:
 The motif-barcode matrix in HDF5 (filtered_tf_bc_matrix.h5) and MEX (filtered_tf_bc_matrix) formats, following the same format as the joint feature-barcode matrix
 Peak-motif occurrence mappings in BED format (peak_motif_mapping.bed)
tf_analysis
├── filtered_tf_bc_matrix
│   ├── barcodes.tsv.gz
│   ├── matrix.mtx.gz
│   └── motifs.tsv
├── filtered_tf_bc_matrix.h5
└── peak_motif_mapping.bed
The peak_motif_mapping.bed file is a BED file containing peak coordinates, with the motif name as the fourth column. Each row represents the occurrence of one motif in one peak as evidenced by the motif scan; a single peak can appear in multiple rows, each associated with a different motif.
head -5 analysis/tf_analysis/peak_motif_mapping.bed
chr1 629732 630166 MAFG::NFE2L1_MA0089.1
chr1 629732 630166 Sox5_MA0087.1
chr1 633796 634260 SHOX_MA0630.1
chr1 633796 634260 VAX2_MA0723.1
chr1 633796 634260 Sox5_MA0087.1
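Because one peak can span several rows, it is often convenient to group the motifs by peak. A minimal Python sketch follows, using the five rows above; the function name is hypothetical, and the parser assumes motif names contain no whitespace.

```python
import io
from collections import defaultdict

# Rows from the peak_motif_mapping.bed excerpt above.
SAMPLE = """\
chr1 629732 630166 MAFG::NFE2L1_MA0089.1
chr1 629732 630166 Sox5_MA0087.1
chr1 633796 634260 SHOX_MA0630.1
chr1 633796 634260 VAX2_MA0723.1
chr1 633796 634260 Sox5_MA0087.1
"""


def motifs_by_peak(handle):
    """Group motif names by (chrom, start, end), preserving file order."""
    mapping = defaultdict(list)
    for line in handle:
        fields = line.split()  # works for tabs or spaces
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, start, end, motif = fields[:4]
        mapping[(chrom, int(start), int(end))].append(motif)
    return dict(mapping)


peaks = motifs_by_peak(io.StringIO(SAMPLE))
for (chrom, start, end), motifs in peaks.items():
    print(f"{chrom}:{start}-{end}", motifs)
```

Inverting the mapping (motif to list of peaks) is the same loop with the key and value swapped, which is useful when starting from a transcription factor of interest.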