Cell type annotation refers to the process of categorizing and assigning cell types to individual cells based on their gene expression profiles. These annotations are needed for understanding the cellular composition and diversity within a sample.
For more details about the required input files and how to run the pipeline, see the Cell Type Annotation page. Learn more about the annotation algorithms here.
When running cellranger-arc count, the following output files are generated:
outs/
├── cell_types
│   ├── Azimuth # for Pan-Human Azimuth annotation model
│   │   ├── cell_annotation_differential_expression.csv
│   │   └── cell_types.csv
│   └── 10x_Cloud # for cloud-based annotation models
│       ├── cell_annotation_differential_expression.csv
│       ├── cell_annotation_results.json.gz
│       └── cell_types.csv
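The layout above can be checked programmatically before downstream analysis. The following is a minimal sketch, assuming only the folder and file names shown in the tree; the function name `find_annotation_outputs` is hypothetical and is not part of Cell Ranger ARC.

```python
from pathlib import Path

# File names copied from the directory tree above, grouped by model folder.
EXPECTED_FILES = {
    "Azimuth": [
        "cell_annotation_differential_expression.csv",
        "cell_types.csv",
    ],
    "10x_Cloud": [
        "cell_annotation_differential_expression.csv",
        "cell_annotation_results.json.gz",
        "cell_types.csv",
    ],
}


def find_annotation_outputs(outs_dir):
    """Map each model folder to the expected files that exist on disk.

    Hypothetical helper: models with no files present are left out.
    """
    cell_types_dir = Path(outs_dir) / "cell_types"
    found = {}
    for model, names in EXPECTED_FILES.items():
        present = [
            name for name in names if (cell_types_dir / model / name).is_file()
        ]
        if present:
            found[model] = present
    return found
```

Calling `find_annotation_outputs("outs")` on a run where only one model was used would return a dictionary with a single key for that model.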
Starting in Cell Ranger ARC v2.2, the web_summary.html file includes a tab per annotation model when it is run with the count pipeline. See descriptions and interpretation guidance here.
| File Name | Description |
|---|---|
| cell_annotation_results.json.gz | Detailed evidence of how each cell has been assigned a cell type by the algorithm, broken down by dataset IDs in the reference database and nearest-neighbors in each. |
| cell_types.csv | A CSV file listing coarse and fine cell types for each cell. |
| cell_annotation_differential_expression.csv | Table listing genes differentially expressed in each detected cell type, along with log2 fold-change and associated p-value. |
File Name: cell_annotation_results.json.gz
Description: This file is a compressed JSON containing a list of dictionaries. Each element in the list represents the annotation results from a single barcode, derived from the cell annotation model.
For each barcode, the corresponding dictionary includes the top 500 matches obtained using an approximate nearest-neighbor (ANN) lookup, summarized as the total number of occurrences of each cell type. More cells supporting a particular annotation can increase your confidence in it. Occasionally, however, the most common nearest-neighbor cell type has a low number of supporting cells because the nearest neighbors are split among several highly similar cell types (e.g., 'Cd16-Negative, Cd56-Bright Natural Killer Cell, Human' and 'Cd16-Negative, Cd56-Dim Natural Killer Cell'). The dataset_id corresponds to the Chan Zuckerberg CELL by GENE (CZ CELLxGENE) study from which the annotation was derived. To view this study, insert the ID into this URL: https://cellxgene.cziscience.com/e/{dataset_id}.cxg/.
A truncated example of one element of the list is shown below:
{
"barcode": "AAACCAAAGAATGCAA-1",
"matches": [
{
"cell_count_in_model": 32,
"cell_type": "monocyte",
"dataset_ids_with_counts": [
{
"count_per_dataset": 30,
"dataset_id": "87ce26ed-e5d1-44b4-81cc-cc5b709a169f"
},
{
"count_per_dataset": 2,
"dataset_id": "b0e547f0-462b-4f81-b31b-5b0a5d96f537"
}
]
},
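The structure shown above can be read with the Python standard library. This is a minimal sketch, assuming the file is a gzip-compressed JSON list whose elements carry the keys shown in the example (`barcode`, `matches`, `cell_count_in_model`, `cell_type`, `dataset_ids_with_counts`, `dataset_id`). The helper names are hypothetical, and picking the match with the largest `cell_count_in_model` is an illustration rather than a statement of how the pipeline selects the final label.

```python
import gzip
import json

# URL pattern quoted in the description of dataset_id above.
CELLXGENE_URL = "https://cellxgene.cziscience.com/e/{dataset_id}.cxg/"


def load_annotation_results(path):
    """Read the gzip-compressed JSON list (one dictionary per barcode)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def best_match(record):
    """Return the match with the most supporting cells for one barcode."""
    return max(record["matches"], key=lambda m: m["cell_count_in_model"])


def dataset_urls(match):
    """Build CZ CELLxGENE URLs for the datasets backing one match."""
    return [
        CELLXGENE_URL.format(dataset_id=entry["dataset_id"])
        for entry in match["dataset_ids_with_counts"]
    ]
```

For a given record, `dataset_urls(best_match(record))` lists the reference studies that contributed neighbors to the best-supported cell type, which makes it easier to inspect where an annotation came from.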
File Name: cell_types.csv
Description: This file contains the cell type annotation for each barcode and can be used to import the fine-scale cell type annotations directly into Loupe Browser.
The Azimuth model CSV file contains these columns:
- barcode: The cell barcode being annotated.
- broad_cell_type: The high-level annotation of the cell type. For cells with low UMI counts (< 100), this field is set to Low UMI Barcode so that annotations with low UMI support can be filtered out.
- coarse_cell_type: The mid-level annotation of the cell type. These coarse cell types are manually curated display nodes. For cells with low UMI counts (< 100), this field is set to Low UMI Barcode so that annotations with low UMI support can be filtered out.
- fine_cell_type: The original annotation derived from the model based on the most common cell type among the 500 nearest neighbors. Note: This may be the same as coarse_cell_type if the original reference was only annotated to that level of detail.
- full_hierarchical_labels: A concatenation of the broad, coarse, and fine cell types, separated by pipes (|).
- final_level_softmax_prob: A probabilistic estimate of how correct the cell annotation is.
- coarse_cell_type_unfiltered: The high-level annotation of the cell type (e.g., T Cell, B Cell, Neutrophil). These coarse cell types are manually curated display nodes. Not subject to filtering by UMI count.
- umi_count: The number of UMIs associated with the cell.
Here is an example, with fields in the column order listed above:
AAACATGCAAGGTGCA-1,Immune cell,Macrophage,Microglia,Immune cell|Myeloid cell|Macrophage|Tissue resident macrophage|Microglia,0.9963322,Macrophage,7722
AAACATGCAATAATGG-1,Neuron,Excitatory neuron,L5/6 excitatory neuron,Neuron|Neuron of brain|Excitatory neuron|Cortical excitatory neuron|Deep-layer excitatory neuron|L5/6 excitatory neuron,0.9998841,Excitatory neuron,85142
AAACATGCACGTGCTG-1,Glial cell,Oligodendrocyte,Oligodendrocyte,Glial cell|Glial cell of brain|Oligodendrocyte,0.9929638,Oligodendrocyte,6131
AAACATGCAGGATGGC-1,Glial cell,Oligodendrocyte,Oligodendrocyte,Glial cell|Glial cell of brain|Oligodendrocyte,0.8551732,Oligodendrocyte,10731
AAACATGCATACCCGG-1,Glial cell,Oligodendrocyte,Oligodendrocyte,Glial cell|Glial cell of brain|Oligodendrocyte,0.9151714,Oligodendrocyte,9645
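Rows like these can be parsed with the standard `csv` module. The sketch below assumes the eight fields appear in the order of the column list above; because the example shows no header row, the code passes the column names explicitly unless `has_header=True` is set. The function name and the added `hierarchy` key are hypothetical. Note that in the example rows the pipe-separated path can contain more than three levels, so the code splits it into a list of any length.

```python
import csv
import io

# Column order assumed from the column list for the Azimuth model file.
AZIMUTH_COLUMNS = [
    "barcode",
    "broad_cell_type",
    "coarse_cell_type",
    "fine_cell_type",
    "full_hierarchical_labels",
    "final_level_softmax_prob",
    "coarse_cell_type_unfiltered",
    "umi_count",
]


def parse_azimuth_rows(csv_text, has_header=False):
    """Parse Azimuth cell_types.csv text into typed dictionaries."""
    fieldnames = None if has_header else AZIMUTH_COLUMNS
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=fieldnames)
    rows = []
    for row in reader:
        # Split the pipe-separated path into one entry per hierarchy level.
        row["hierarchy"] = row["full_hierarchical_labels"].split("|")
        row["final_level_softmax_prob"] = float(row["final_level_softmax_prob"])
        row["umi_count"] = int(row["umi_count"])
        rows.append(row)
    return rows
```

Once the rows are typed, selecting barcodes by `final_level_softmax_prob` or `umi_count` is a one-line list comprehension; any cutoff used for that is the analyst's choice and is not prescribed here.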
The 10x Cloud model CSV file contains these columns:
- barcode: The cell barcode being annotated.
- coarse_cell_type: The high-level annotation of the cell type (e.g., T Cell, B Cell, Neutrophil). These coarse cell types are manually curated display nodes. For barcodes with low UMI counts (< 100), this field is set to Low UMI Barcode so that annotations with low UMI support can be filtered out.
- fine_cell_type: The original annotation derived from the model based on the most common cell type among the 500 nearest neighbors. Note: This may be the same as coarse_cell_type if the original reference was only annotated to that level of detail.
- cell_count_in_model: The number of cells in the model that support the given fine_cell_type annotation, with a maximum of 500 cells.
- coarse_cell_type_unfiltered: The high-level annotation of the cell type (e.g., T Cell, B Cell, Neutrophil). These coarse cell types are manually curated display nodes. Not subject to filtering by UMI count.
- umi_count: The number of UMIs associated with the barcode.
Here is an example:
barcode,coarse_cell_type,fine_cell_type,summary_score,cell_count_in_model,coarse_cell_type_unfiltered,umi_count
AAACAGCCAAACTAAG-1,neuron,neuron,0.556,278,neuron,12486
AAACAGCCACATTGCA-1,glial cell,oligodendrocyte,1.0,500,glial cell,5875
AAACAGCCACCTACGG-1,neuron,neuron,0.712,356,neuron,8740
AAACAGCCACGAATCC-1,glial cell,microglial cell,0.734,367,glial cell,10136
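A common first step with this file is to tally barcodes per coarse cell type while skipping the low-UMI placeholder. The sketch below assumes the header row shown in the example and the placeholder label Low UMI Barcode described in the column list; the function name is hypothetical. The `summary_score` column in the example is not described in the column list, so the code leaves it untouched.

```python
import csv
import io
from collections import Counter

# Placeholder written to coarse_cell_type for barcodes with < 100 UMIs.
LOW_UMI_LABEL = "Low UMI Barcode"


def count_coarse_types(csv_text):
    """Count barcodes per coarse_cell_type, skipping low-UMI placeholders."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter()
    for row in reader:
        if row["coarse_cell_type"] == LOW_UMI_LABEL:
            continue
        counts[row["coarse_cell_type"]] += 1
    return dict(counts)
```

To read from disk instead of a string, the same loop works on `csv.DictReader(open(path, newline=""))`.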
The number shown for cell_count_in_model reflects the level of support for a cell's annotation. The algorithm identifies the 500 most similar cells in the reference set using embeddings from both the query dataset and the reference database. The cell type assigned is the annotation that appears most frequently among these 500 nearest neighbors. For example, if 400 of the nearest neighbors are labeled as T cells and 100 as lymphocytes, the cell will be annotated as a T cell, and the cell_count_in_model will be 400. The maximum possible value for this metric is 500.
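The worked example in this paragraph can be reproduced with a simple majority vote. This is an illustrative sketch of the counting step only, not the pipeline's implementation: the function name is hypothetical, the neighbor labels are supplied directly rather than obtained from an embedding lookup, and tie-breaking (which the text does not describe) simply follows `Counter.most_common`.

```python
from collections import Counter


def annotate_from_neighbors(neighbor_labels):
    """Return the most frequent label and how many neighbors carry it."""
    label, count = Counter(neighbor_labels).most_common(1)[0]
    return label, count


# The example from the text: 400 T cells and 100 lymphocytes among 500 neighbors.
neighbors = ["T cell"] * 400 + ["lymphocyte"] * 100
print(annotate_from_neighbors(neighbors))  # ('T cell', 400)
```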
Interpret this number with caution as an indicator of confidence in the model’s assignment. A high fraction of nearest neighbors supporting a fine-level cell type can suggest greater confidence in the annotation. However, a low value does not necessarily indicate low confidence in the coarse cell type assignment. For example, a T cell coarse-level annotation might be supported by several different T cell subtypes, each represented by relatively few cells, while all 500 nearest neighbors are still classified as T cells based on Cell Ontology terms. This nuance highlights two key points: (1) cell_count_in_model should not be treated as a confidence metric, and (2) there is no threshold that can reliably serve as a confidence cutoff for this number.
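The T cell scenario can be made concrete by summing fine-level counts up to a coarse level. The sketch below is illustrative only: the subtype names, the counts, and the `FINE_TO_COARSE` mapping are hypothetical stand-ins for the Cell Ontology relationships, and taking the coarse label from the top fine label is an assumption, not a description of the pipeline.

```python
from collections import Counter

# Hypothetical fine-to-coarse mapping standing in for Cell Ontology terms.
FINE_TO_COARSE = {
    "CD4-positive T cell": "T cell",
    "CD8-positive T cell": "T cell",
    "regulatory T cell": "T cell",
}


def support_at_both_levels(fine_counts, fine_to_coarse):
    """Compare neighbor support for the top fine label and its coarse parent."""
    fine_label, fine_support = max(fine_counts.items(), key=lambda kv: kv[1])
    coarse_counts = Counter()
    for label, count in fine_counts.items():
        # Labels missing from the mapping are treated as their own coarse type.
        coarse_counts[fine_to_coarse.get(label, label)] += count
    coarse_label = fine_to_coarse.get(fine_label, fine_label)
    return fine_label, fine_support, coarse_label, coarse_counts[coarse_label]


# Hypothetical split of 500 neighbors among three T cell subtypes.
fine_counts = {
    "CD4-positive T cell": 180,
    "CD8-positive T cell": 170,
    "regulatory T cell": 150,
}
print(support_at_both_levels(fine_counts, FINE_TO_COARSE))
# ('CD4-positive T cell', 180, 'T cell', 500)
```

Here the fine-level support is only 180 of 500, yet every neighbor agrees at the coarse level, which is why a low cell_count_in_model alone does not imply an unreliable coarse annotation.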
File Name: cell_annotation_differential_expression.csv
Description: This file contains the results of a differential expression analysis conducted between coarse cell types. These differentially expressed genes can be used to check that each cell type contains the expected marker genes. The pipeline uses the same algorithm employed in Cell Ranger and Loupe Browser to calculate fold changes and p-values, ensuring consistency across these tools.