Overview of Xenium Zarr Output Files

The Xenium Onboard Analysis pipeline generates several output files using the Zarr format. Xenium Explorer reads these files to display cell segmentation, secondary analysis clustering results, and transcript assignment on the nuclei-stained tissue morphology images.

The Zarr format stores large datasets as compressed chunks of N-dimensional arrays. Zarr files can be read and modified with Python; see the documentation and tutorials for the zarr Python library.

In the sections below, we describe the group arrays and attributes associated with each Zarr file in the Xenium output bundle, along with example Python code for viewing these files.

Cells

The cells.zarr.zip output file contains the cell and nucleus segmentation masks used for transcript assignment and the polygon boundaries used for visualization. It is the only file in the output bundle where you can find the cell and nucleus segmentation data.

The masks in the cell Zarr file can be used for downstream cell segmentation or morphology analysis. The polygons are a simplification of the true segmentation masks and are only intended for data visualization (i.e., in Xenium Explorer).

The cells.zarr.zip file has the following hierarchy of array data:


(root)
├── cell_id
├── cell_summary
├── masks
│   ├── 0
│   ├── 1
│   └── homogeneous_transform
├── polygon_num_vertices
├── polygon_vertices
└── seg_mask_value

Arrays

Description of root group arrays:

Path	Type	Description
`/cell_id`	uint32	The first column consists of the `cell_id` prefix, and the second column is the dataset suffix (see Cell ID format mapping for string conversion).
`/cell_summary`	float64	An array containing information about each cell (see attributes below).
`/polygon_num_vertices`	int32	Each element is the number of vertices for a given polygon, including a repeat of the initial vertex. A polygon with no vertices indicates the absence of a polygon for that cell.
`/polygon_vertices`	float32	The XY coordinates in physical space (µm) for each vertex in the polygon. The coordinates for the first vertex are repeated at the end.
`/seg_mask_value`	int32	Each element is an index for each cell in the segmentation mask images (cell and nucleus masks).

Description of segmentation /masks arrays:

Path	Type	Description
`[mask_index]`	uint32	Contains masks for the nucleus and cell segmentations in image space. The `mask_index=0` is the nucleus segmentation mask and the `mask_index=1` is the cell segmentation mask. The arrays at these indices contain the masks and have a 2D shape (rows, columns) of the morphology image that segmentation was performed on. Each value is the cell index for that pixel. Pixels with `value=0` are background, and the cell indices (`seg_mask_value`) start at 1.
`homogeneous_transform`	float32	The 4x4 transform matrix used to convert data from physical space (microns) to stitched-image space (pixels). This is needed to generate polygons from the masks.
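
As an illustration of how the `homogeneous_transform` matrix might be applied, the sketch below maps XY coordinates in microns to pixel coordinates with NumPy. The function name `microns_to_pixels` and the example matrix (a pure scaling with a made-up pixel size of 0.5 µm) are assumptions for demonstration only; in practice the matrix would be read from `masks/homogeneous_transform`.

```python
import numpy as np

def microns_to_pixels(xy_um, transform):
    """Apply a 4x4 homogeneous transform to an (N, 2) array of XY points."""
    xy_um = np.asarray(xy_um, dtype=np.float64)
    n = xy_um.shape[0]
    # Build homogeneous coordinates (x, y, z=0, w=1) for every point.
    homogeneous = np.column_stack([xy_um, np.zeros(n), np.ones(n)])
    # Points are row vectors, so multiply by the transpose of the matrix.
    mapped = homogeneous @ np.asarray(transform, dtype=np.float64).T
    # Divide by w (equal to 1 for affine transforms) and keep X and Y.
    return mapped[:, :2] / mapped[:, 3:4]

# Hypothetical transform: scale by 1 / (0.5 µm per pixel), no rotation or offset.
pixel_size_um = 0.5
example_transform = np.diag([1 / pixel_size_um, 1 / pixel_size_um, 1.0, 1.0])

points_um = np.array([[10.0, 20.0], [0.0, 5.0]])
points_px = microns_to_pixels(points_um, example_transform)
print(points_px)
```

Going the other way (pixels to microns) would use the inverse of the same matrix, for example via `np.linalg.inv`.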

Attributes

Description of root group attributes:

Field	Type	Description
`major_version`	int	Major version for the `cells.zarr.zip` file. This number is increased when breaking changes are made.
`minor_version`	int	Minor version.
`number_cells`	int	The number of cells in the dataset.
`polygon_set_names`	list[str]	Each element is the unique, machine-readable name of a polygon set (e.g., a single polygon associated with nuclei is called "nucleus").
`polygon_set_display_names`	list[str]	Each element is the display name of a polygon set in Xenium Explorer (e.g., "Nucleus boundaries", "Cell boundaries").
`polygon_set_descriptions`	list[str]	Each element is the description of a polygon set in Xenium Explorer (e.g., "DAPI-based nuclei segmentation", "Expansion of nuclei boundaries by XX µm").
`spatial_units`	str	The units of the stitched image space ("microns").

Description of the /cell_summary array columns (type f64):

Field	Description
`cell_centroid_x`	X coordinate of cell centroid in µm.
`cell_centroid_y`	Y coordinate of cell centroid in µm.
`cell_area`	Area of cell in µm².
`nucleus_centroid_x`	X coordinate of nucleus centroid in µm.
`nucleus_centroid_y`	Y coordinate of nucleus centroid in µm.
`nucleus_area`	Area of nucleus in µm².
`z_level`	Z-level in which the cell was found in µm.

Secondary analysis

The analysis.zarr.zip output file contains the automated secondary analysis clustering results. It has the following hierarchy of array data:


(root)
└── cell_groups
    ├── 0
    │   ├── indices
    │   └── indptr
    ├── 1
    │   ├── indices
    │   └── indptr
    ├── [...]
    └── 9
        ├── indices
        └── indptr

This file stores 10 cell clustering results (clustering_index = 0 to 9): the first is the graph-based clustering and the remaining nine are K-means clusterings (K = 2 to 10). Descriptions for /cell_groups/[clustering_index] group arrays:

Path	Type	Description
`/indices`	uint32	An array of the cell indices for all cells assigned to one of the clusters in the secondary analysis. Cluster assignment determines the order of cell indices in each of these `cell_groups/[clustering_index]` arrays.
`/indptr`	uint32	An array of offsets into `/cell_groups/[clustering_index]/indices` that mark where each cluster's cell indices begin. For example, "[0, 218440]" for `cell_groups/1` means that cluster 1 starts at offset 0 of indices (the 1st element) and cluster 2 starts at offset 218440 (the 218,441st element), using 0-based indexing.

Descriptions for the cell_groups group attributes:

Field	Type	Description
`major_version`	int	Major version for the `analysis.zarr.zip` file. This number is increased when breaking changes are made.
`minor_version`	int	Minor version.
`number_groupings`	int	The number of clustering results in the dataset (graph-based and K-means clusters).
`grouping_names`	list[str]	Contains a list of unique clustering method names for all the clustering results (e.g., "gene_expression_graphclust", "gene_expression_kmeans_2_clusters").
`group_names`	list[list[str]]	For each of the clustering result groups (e.g., "gene_expression_kmeans_2_clusters"), there is an inner list of all the clusters in the group (e.g., "['Cluster 1', 'Cluster 2']").
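
The `indices` and `indptr` arrays described above can be unpacked into one cluster label per cell. The following is a minimal sketch on toy data, not the pipeline's own code. It assumes that cell indices are 0-based positions into the cell list and that `indptr` holds one start offset per cluster, as in the "[0, 218440]" example; the helper name `cluster_labels` is hypothetical.

```python
import numpy as np

def cluster_labels(indices, indptr, number_cells):
    """Return a per-cell array of 1-based cluster numbers (0 = unassigned)."""
    indices = np.asarray(indices, dtype=np.int64)
    starts = np.asarray(indptr, dtype=np.int64)
    # Each cluster ends where the next one begins; the last ends at len(indices).
    ends = np.append(starts[1:], len(indices))
    labels = np.zeros(number_cells, dtype=np.int32)
    for cluster_number, (start, end) in enumerate(zip(starts, ends), start=1):
        labels[indices[start:end]] = cluster_number
    return labels

# Toy data: 6 cells, two clusters.
# Cluster 1 holds cells 0, 2, 5; cluster 2 holds cells 1 and 4; cell 3 is unassigned.
toy_indices = np.array([0, 2, 5, 1, 4], dtype=np.uint32)
toy_indptr = np.array([0, 3], dtype=np.uint32)
labels = cluster_labels(toy_indices, toy_indptr, number_cells=6)
print(labels)
```

With a real file, `indices` and `indptr` would come from `root["cell_groups"][clustering_index]`, and the numeric labels could be matched against the corresponding inner list in `group_names`.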

Cell-feature matrix

The cell_feature_matrix.zarr.zip output file contains a matrix of counts per cell and per feature (including gene and non-gene codewords). Only transcripts that passed the default quality value (Q-Score) threshold of Q20 are counted. It has the following hierarchy of array data:


(root)
└── cell_features
    ├── cell_id
    ├── data
    ├── indices
    └── indptr

Description for /cell_features group arrays:

Path	Type	Description
`/cell_id`	uint32	The first column consists of the `cell_id` prefix, and the second column is the dataset suffix (see Cell ID format mapping for string conversion).
`/data`	uint32	An array of counts (Q-Score ≥ 20) for a particular cell and specified feature, stored in a compressed sparse row (CSR) format (array V) that only contains nonzero counts.
`/indices`	uint32	Contains column indices (`column_index` in CSR format) that specify the cell index for each nonzero count value in the `/data` values array.
`/indptr`	uint32	Contains indices (`row_index` in CSR format) where each group of nonzero counts starts for a given feature in `/data`. For example, "[0, 21282, 28505, …]" means the nonzero counts for feature 1 start at 0, the nonzero counts for feature 2 start at 21282, etc. for all features in the dataset.

Description for /cell_features group attributes:

Field	Type	Description
`major_version`	int	Major version for the `cell_feature_matrix.zarr.zip` file. This number is increased when breaking changes are made.
`minor_version`	int	Minor version.
`number_cells`	int	The number of cells in the dataset.
`number_features`	int	The number of features (e.g., genes, controls, unassigned) in the dataset.
`feature_keys`	list[str]	Each element is the name of the feature (e.g., gene name).
`feature_ids`	list[str]	Each element is the ID of the feature (e.g., gene id).
`feature_types`	list[str]	Each element is the type of the feature (e.g., gene, negative_control_codeword).
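
To make the CSR layout concrete, the sketch below expands `data`, `indices`, and `indptr` into a small dense matrix with features as rows and cells as columns, following the table descriptions above. It runs on toy arrays rather than a real file. The helper `csr_to_dense` is hypothetical, and because the text does not say whether `indptr` includes a final end offset, the code accepts both forms; that handling is an assumption.

```python
import numpy as np

def csr_to_dense(data, indices, indptr, n_rows, n_cols):
    """Expand CSR arrays into a dense (n_rows, n_cols) count matrix."""
    data = np.asarray(data)
    indices = np.asarray(indices, dtype=np.int64)
    indptr = np.asarray(indptr, dtype=np.int64)
    # Accept row pointers with or without the trailing end offset.
    if len(indptr) == n_rows:
        indptr = np.append(indptr, len(data))
    dense = np.zeros((n_rows, n_cols), dtype=data.dtype)
    for row in range(n_rows):
        start, end = indptr[row], indptr[row + 1]
        dense[row, indices[start:end]] = data[start:end]
    return dense

# Toy matrix: 3 features (rows) x 4 cells (columns).
toy_data = np.array([5, 1, 2, 7], dtype=np.uint32)
toy_indices = np.array([0, 3, 1, 2], dtype=np.uint32)
toy_indptr = np.array([0, 2, 3, 4], dtype=np.uint32)

features_by_cells = csr_to_dense(toy_data, toy_indices, toy_indptr, n_rows=3, n_cols=4)
cells_by_features = features_by_cells.T  # transpose for a cells x features layout
print(features_by_cells)
```

A dense matrix is only practical for small examples. For a full dataset, a sparse container such as `scipy.sparse.csr_matrix((data, indices, indptr), shape=(number_features, number_cells))` is likely the better fit, provided `indptr` has `number_features + 1` entries.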

Transcripts

The transcripts.zarr.zip output file contains data to evaluate transcript quality and localization. It has the following hierarchy of array data:


(root)
├── codeword_category
├── gene_category
├── density
│   └── gene
│       ├── data
│       ├── indices
│       └── indptr
└── grids
    ├── 0
    │   ├── 0,0
    │   │   ├── codeword_identity
    │   │   ├── gene_identity
    │   │   ├── id
    │   │   ├── location
    │   │   ├── quality_score
    │   │   ├── status
    │   │   ├── uuid
    │   │   └── valid
    │   ├── 0,1
    │   │   ├── codeword_identity
    │   │   ├── gene_identity
    │   │   ├── id
    │   │   ├── location
    │   │   ├── quality_score
    │   │   ├── status
    │   │   ├── uuid
    │   │   └── valid
    │   ├── [X,Y]
[...]

The /density array and associated attributes contain transcript density bin information, which is shown in the analysis_summary.html Region Details panel and in the transcript density view in Xenium Explorer.

The /grids arrays contain a pyramid structure of downsampled transcript levels. The transcript information is stored in this structure as a way to divide it into smaller chunks and for subsampling at zoomed out views. The number of levels corresponds to the selected tissue region size; smaller regions require fewer levels to store subsampled transcript information.

For example, if there are seven levels in total, grids/0 is the most zoomed in level and grids/6 is the most zoomed out level. The most zoomed out level contains a subsample of the transcript information and can fit in a single file (0,0). The most zoomed in level describes where every transcript is located, and consequently the chunks of data need to be stored in more files ((0,0), (0,1), etc.); the arrangement of the files is specified in the file names.

Arrays

Description for root group arrays:

Path	Type	Description
`/codeword_category`	bool	A `num_codewords x 7` boolean table that contains information about the categories that codewords belong to. Column names and descriptions are contained in `codeword_category/.zattrs`.
`/gene_category`	bool	A `num_genes x 7` boolean table that contains information about the categories that genes belong to. Column names and descriptions are contained in `gene_category/.zattrs`.

Description for /density/gene group arrays:

Path	Type	Description
`/data`	uint16	An array of the Q-Score ≥ 20 counts for a particular transcript density grid cell and specified gene (chunked at 50,000 elements), stored in a compressed sparse row (CSR) format (array V) that only contains nonzero counts. Each bin is 10 µm. The rows of this matrix are a collapsed encoding of two quantities: (`gene`, `grid_row`). The columns of this matrix correspond to `grid_col`. Here, `grid_row` and `grid_col` specify the location in the density grid and `gene` specifies the index of the gene.
`/indices`	uint16	Contains the feature indices (`column_index` in CSR format) that correspond to the order of feature counts in the `/data` values array (chunked at 50,000 elements).
`/indptr`	uint32	Contains indices (`row_index` in CSR format) where each group of counts starts for a given density grid cell in `/data`.

Description for /grids/[grid_index]/[grid_position] group arrays:

Path	Type	Description
`/gene_identity`	uint16	The gene index(es) for each transcript. Gene indices are zero-based and reference the `gene_names` attribute attached to the gene parent group (see root attribute table below). Codewords corresponding to no-call (absence of a codeword) are denoted by the value 65535. Columns are: `gene_call`.
`/id`	uint32	The transcript ID (1st column) and FOV index (2nd column). This array is not guaranteed to be sorted in any particular order. The transcript ID is a unique value for each transcript within the FOV.
`/location`	float32	The location of each transcript in physical coordinate space. Columns are: `x_position`, `y_position`, and `z_position` of the transcript.
`/quality_score`	float32	The calibrated Q-Score for each transcript.
`/status`	uint8	The status of a transcript used in the pipeline to indicate that it passed filtering; always 0 if present in final output file.
`/uuid`	uint32	Unique identifier for transcripts; used by the pipeline.
`/valid`	uint8	The status of a transcript used in the pipeline to indicate that it passed filtering; always 1 if present in final output file.
`/codeword_identity`	uint16	The codeword index for each RNA. Codeword indices are zero-based and reference the `codeword_names` attribute attached to the dataset. Unknown codewords are given by max_value (uint16). Currently, the first column indicates the codeword index and the second column is unused.
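
As a rough sketch of how the per-transcript arrays fit together, the example below keeps transcripts with a Q-Score of at least 20 and a valid gene call, then looks up their names in `gene_names`. The arrays and gene names are toy values invented for illustration, and the helper `decode_transcripts` is hypothetical; with a real file the inputs would come from a `/grids/[grid_index]/[grid_position]` group and the root `gene_names` attribute.

```python
import numpy as np

NO_CALL = 65535  # value described above for codewords with no gene call

def decode_transcripts(gene_identity, quality_score, gene_names, min_q=20.0):
    """Return (gene names of kept transcripts, boolean keep mask)."""
    gene_identity = np.asarray(gene_identity)
    # Use the single gene_call column whether the array is 1D or (N, 1).
    gene_index = gene_identity if gene_identity.ndim == 1 else gene_identity[:, 0]
    quality = np.asarray(quality_score).reshape(-1)
    keep = (quality >= min_q) & (gene_index != NO_CALL)
    names = np.asarray(gene_names, dtype=object)
    return names[gene_index[keep]], keep

# Toy inputs (not from a real dataset).
toy_gene_names = ["ACTB", "GAPDH", "TP53"]
toy_gene_identity = np.array([[0], [2], [65535], [1]], dtype=np.uint16)
toy_quality = np.array([35.0, 12.5, 40.0, 20.0], dtype=np.float32)

kept_names, keep_mask = decode_transcripts(toy_gene_identity, toy_quality, toy_gene_names)
print(list(kept_names), keep_mask)
```

The same boolean mask can be used to subset the matching rows of `location` so that coordinates and gene names stay aligned.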

Attributes

Description for root group attributes:

Field	Type	Description
`name`	str	The name of the dataset ("RnaDataset").
`major_version`	int	Major version for the `transcripts.zarr.zip` file. This number is increased when breaking changes are made.
`minor_version`	int	Minor version.
`dataset_uuid`	str	Unique ID for this dataset.
`data_format`	int	A field for internal pipeline use. Always set to 0.
`number_rnas`	int	The total number of transcripts in the dataset.
`spatial_units`	str	The units of the stitched image space ("micron").
`fov_names`	list[str]	Names of the FOVs used in the dataset as referenced by the FOV indices (`/grids/[grid_index]/[grid_position]/id`).
`number_genes`	int	The number of genes in the dataset.
`gene_names`	list[str]	Names of the genes.
`codeword_count`	int	The number of codewords.
`codeword_gene_mapping`	list[int]	The index of the gene in `gene_names` specified by each codeword.
`codeword_gene_names`	list[str]	The name of the gene in `gene_names` specified by each codeword.
`coordinate_space`	str	For internal pipeline use. Should have the value "refined-final_global_micron".

Note: The key root group attributes used by Xenium Explorer are shown above. This is not a comprehensive attribute list from all Xenium Onboard Analysis versions.

Description for /density/gene array attributes:

Field	Type	Description
`grid_size`	list[float]	List of the XY grid spacings in µm (10 µm in current version).
`rows`	int	The number of density grid (bin) rows.
`cols`	int	The number of density grid (bin) columns.
`gene_names`	list[str]	The names of genes.
`origin`	dict[str,float]	Origin of the grid as `{"x": min_x, "y": min_y}`.

Description for /grids array attributes:

Field	Type	Description
`grid_key_names`	list[str]	The names of the grid keys used by the current grid (e.g., "grid_x_loc").
`number_levels`	int	The number of levels in the grid pyramid (must be ≥1).
`grid_size`	list[float]	The size of a grid element for each grid pyramid level.
`grid_keys`	list[list[str]]	The grid keys (e.g., "0,0,0") for each level of the grid pyramid.
`grid_number_objects`	list[list[str]]	The number of transcripts in each grid element, in each level of the grid pyramid.

Cell ID format mapping

The cells.zarr.zip and cell_feature_matrix.zarr.zip files store cell_id arrays in integer format (uint32). The first column holds the cell_id_prefix. The polygon vertices of all cells in the dataset determine these integer values. The second column holds the dataset_suffix, an integer that defaults to 1 and may be changed to designate cells originating from different datasets.

Other files (e.g., H5/MTX, CSV) have cell_id in string format (e.g., cmlbdfdf-1). To map between these formats, here is the conversion process from integer to string:

Convert the cell_id_prefix to its hexadecimal (hex) representation. Pad it with leading zeroes so that it has eight digits (e.g., 3d51 becomes 00003d51).
Shift the characters from the normal hex range [0 - 9, a - f] to the range [a - p]:

Hex code	0	1	2	3	4	5	6	7	8	9	a	b	c	d	e	f
Shifted code	a	b	c	d	e	f	g	h	i	j	k	l	m	n	o	p

Add a dash and append the dataset_suffix as an unpadded integer.

For example: given an integer cell_id_prefix = 1437536272 and dataset_suffix = 1

Hex conversion of prefix = "55af1010"
Shifted code = "ffkpbaba"
Append suffix for the final string cell_id = "ffkpbaba-1"
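
The conversion steps above can be sketched in a few lines of Python. The function names are hypothetical, and the reverse direction (string to integers) is not described in the text; it is simply the inverse of the same lookup table, included here as an assumption.

```python
from typing import Tuple

# Lookup tables for the hex <-> shifted character mapping shown above.
_HEX_TO_SHIFTED = str.maketrans("0123456789abcdef", "abcdefghijklmnop")
_SHIFTED_TO_HEX = str.maketrans("abcdefghijklmnop", "0123456789abcdef")

def cell_id_to_str(cell_id_prefix: int, dataset_suffix: int) -> str:
    """Convert integer (prefix, suffix) to the string cell_id format."""
    hex_prefix = format(int(cell_id_prefix), "08x")  # zero-padded, 8 hex digits
    shifted = hex_prefix.translate(_HEX_TO_SHIFTED)
    return f"{shifted}-{int(dataset_suffix)}"

def str_to_cell_id(cell_id: str) -> Tuple[int, int]:
    """Inverse mapping: string cell_id back to integer (prefix, suffix)."""
    shifted, suffix = cell_id.rsplit("-", 1)
    return int(shifted.translate(_SHIFTED_TO_HEX), 16), int(suffix)

example_id = cell_id_to_str(1437536272, 1)
print(example_id)                  # ffkpbaba-1, matching the worked example
print(str_to_cell_id(example_id))  # (1437536272, 1)
```

Each row of a `cell_id` array could be passed through `cell_id_to_str` to obtain string IDs comparable with those in the CSV or H5/MTX outputs.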

Example Python code

This code snippet shows how to read Zarr arrays into NumPy N-dimensional arrays:


# Import Python libraries
# This script was tested with zarr v2.13.6
import zarr
import numpy as np

# Function to open a Zarr file (zipped archive or unzipped directory)
def open_zarr(path: str) -> zarr.Group:
    store = (
        zarr.ZipStore(path, mode="r")
        if path.endswith(".zip")
        else zarr.DirectoryStore(path)
    )
    return zarr.group(store=store)

# For example, use the above function to open the cells Zarr file, which contains segmentation mask Zarr arrays
root = open_zarr("cells.zarr.zip")

# Look at group array info and structure
root.info
root.tree() # shows structure, array dimensions, data types

# Create cell and nucleus segmentation mask np array objects to read or modify
cellseg_mask = np.array(root["masks"][1])
nucseg_mask = np.array(root["masks"][0])

# Show dimensions of the 2D segmentation mask arrays (also shown in .tree())
# The .ndim attribute gives the number of dimensions
# The shape should match the number of pixels in the morphology image.
cellseg_mask.shape
nucseg_mask.shape

# Show the largest cell index in each mask (value=0 marks background pixels)
# .max() returns the highest label in the mask. Because cell indices start at 1,
# this should equal the total number of cells detected in the dataset (reported,
# e.g., in the analysis_summary.html summary tab) and the length of seg_mask_value.
cellseg_mask.max()
nucseg_mask.max()

# Examples for exploring file contents
# How to show array
root["masks"][0][0:9] # or root["masks/0"]
root["cell_summary"][0:9]
# How to show attribute values
root.attrs["major_version"]
# How to list out attribute names and values
dict(root.attrs.items())
dict(root['cell_summary'].attrs.items())

Using the same Python function as above to read in the file, here are a few example lines to view the analysis.zarr.zip and transcripts.zarr.zip arrays and attributes:


# Read in secondary analysis Zarr arrays
root = open_zarr("analysis.zarr.zip")
# Examples for exploring file contents
# How to show a slice of the clustering_index arrays
root["cell_groups"][0]["indices"][0:9]
# How to show attributes
root["cell_groups"].attrs["group_names"]


# Read in transcripts Zarr arrays
root = open_zarr("transcripts.zarr.zip")
# Examples for exploring file contents
# How to show array info
root['grids'][0]['0,0']['gene_identity'].shape
root['grids'][0]['0,0']['quality_score'][0:9]
root['grids'][0]['0,0']['location'][0:9,]
# How to show array attributes
root.attrs['major_version']
root['density']['gene'].attrs['gene_names'][0:9]
