
Overview of Xenium Zarr Output Files

The Xenium Onboard Analysis pipeline generates several output files using the Zarr format. Xenium Explorer reads these files to display cell segmentation, secondary analysis clustering results, and transcript assignment on the nuclei-stained tissue morphology images.

The Zarr format stores large N-dimensional arrays as compressed chunks, which keeps large datasets compact. Zarr files can be read and modified with Python; see the zarr Python library documentation and tutorials for details.

In the sections below, we describe the group arrays and attributes associated with each Zarr file in the Xenium output bundle, along with example Python code for viewing these files.

The cells.zarr.zip output file contains the cell and nucleus segmentation masks used for transcript assignment and the polygon boundaries used for visualization. It is the only file in the output bundle where you can find the cell and nucleus segmentation data.

The masks in the cell Zarr file can be used for downstream cell segmentation or morphology analysis. The polygons are a simplification of the true segmentation masks and are only intended for data visualization (i.e., in Xenium Explorer).

The cells.zarr.zip file has the following hierarchy of array data:

    (root)
    ├── cell_id
    ├── cell_summary
    ├── masks
    │   ├── 0
    │   ├── 1
    │   └── homogeneous_transform
    ├── polygon_num_vertices
    ├── polygon_vertices
    └── seg_mask_value

Arrays

Description of root group arrays:

Path | Type | Description
--- | --- | ---
/cell_id | uint32 | The first column consists of the cell_id prefix, and the second column is the dataset suffix (see Cell ID format mapping for string conversion).
/cell_summary | float64 | An array containing information about each cell (see attributes below).
/polygon_num_vertices | int32 | Each element is the number of vertices for a given polygon, including a repeat of the initial vertex. A polygon with no vertices indicates the absence of a polygon for that cell.
/polygon_vertices | float32 | The XY coordinates in physical space (µm) for each vertex in the polygon. The coordinates for the first vertex are repeated at the end.
/seg_mask_value | int32 | Each element is an index for each cell in the segmentation mask images (cell and nucleus masks).
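Because each polygon repeats its first vertex at the end, its vertex list already forms a closed ring, which suits the shoelace formula. The sketch below computes a polygon area in µm² from a plain list of (x, y) pairs. The helper name `polygon_area` and the sample square are hypothetical, and the sketch makes no assumption about how vertices are packed inside /polygon_vertices.

```python
def polygon_area(ring):
    """Area of a closed polygon ring via the shoelace formula.

    `ring` is a list of (x, y) pairs in µm whose last vertex repeats the first.
    """
    twice_area = 0.0
    # Walk consecutive vertex pairs; the repeated last vertex closes the ring
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


# Hypothetical 10 µm x 10 µm square: 5 vertices, including the repeated first one
square = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0), (0.0, 0.0)]
num_vertices = len(square)
area_um2 = polygon_area(square)
```

Because the polygons are simplified for visualization, an area computed this way is only an approximation of the segmented region.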

Description of segmentation /masks arrays:

Path | Type | Description
--- | --- | ---
[mask_index] | uint32 | Contains masks for the nucleus and cell segmentations in image space. The mask_index=0 is the nucleus segmentation mask and the mask_index=1 is the cell segmentation mask. The arrays at these indices contain the masks and have a 2D shape (rows, columns) of the morphology image that segmentation was performed on. Each value is the cell index for that pixel. Pixels with value=0 are background, and the cell indices (seg_mask_value) start at 1.
homogeneous_transform | float32 | The 4x4 transform matrix used to convert data from physical space (microns) to stitched-image space (pixels). This is needed to generate polygons from the masks.
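As an illustration of how the transform and the masks relate, the sketch below converts a point from microns to pixel coordinates and reads the mask label at that pixel. It is a minimal sketch under stated assumptions: the matrix is applied to a column vector (x, y, z, 1), x maps to the mask column and y to the mask row, and the transform, pixel size, and toy mask are all hypothetical values rather than data from a real file.

```python
def microns_to_pixel(transform, x_um, y_um, z_um=0.0):
    """Apply a 4x4 homogeneous transform to a physical-space point.

    Assumes the matrix multiplies the column vector (x, y, z, 1).
    """
    point = (x_um, y_um, z_um, 1.0)
    out = [sum(row[k] * point[k] for k in range(4)) for row in transform]
    return out[0] / out[3], out[1] / out[3]


# Hypothetical transform: pure scaling with an assumed pixel size of 0.5 µm
scale = 1.0 / 0.5
transform = [
    [scale, 0.0, 0.0, 0.0],
    [0.0, scale, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Toy label mask (rows, columns): 0 is background, 1 and 2 are cell indices
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 2, 2],
    [0, 0, 2, 2],
]

px, py = microns_to_pixel(transform, 1.0, 1.5)  # pixel (2.0, 3.0)
label = mask[int(py)][int(px)]                  # assumed: row = y, column = x
```

With a real file, the matrix would come from masks/homogeneous_transform and the mask from masks/0 or masks/1; check the axis convention against your data before relying on it.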

Attributes

Description of root group attributes:

Field | Type | Description
--- | --- | ---
major_version | int | Major version for the cells.zarr.zip file. This number is increased when breaking changes are made.
minor_version | int | Minor version.
number_cells | int | The number of cells in the dataset.
polygon_set_names | list[str] | Each element is the unique, machine-readable name of a polygon set (e.g., a single polygon associated with nuclei is called "nucleus").
polygon_set_display_names | list[str] | Each element is the display name of a polygon set in Xenium Explorer (e.g., "Nucleus boundaries", "Cell boundaries").
polygon_set_descriptions | list[str] | Each element is the description of a polygon set in Xenium Explorer (e.g., "DAPI-based nuclei segmentation", "Expansion of nuclei boundaries by XX µm").
spatial_units | str | The units of the stitched image space ("microns").

Description of the /cell_summary array columns (type float64):

Field | Description
--- | ---
cell_centroid_x | X coordinate of cell centroid in µm.
cell_centroid_y | Y coordinate of cell centroid in µm.
cell_area | Area of cell in µm².
nucleus_centroid_x | X coordinate of nucleus centroid in µm.
nucleus_centroid_y | Y coordinate of nucleus centroid in µm.
nucleus_area | Area of nucleus in µm².
z_level | Z-level (in µm) at which the cell was found.

The analysis.zarr.zip output file contains the automated secondary analysis clustering results. It has the following hierarchy of array data:

    (root)
    └── cell_groups
        ├── 0
        │   ├── indices
        │   └── indptr
        ├── 1
        │   ├── indices
        │   └── indptr
        ├── [...]
        └── 9
            ├── indices
            └── indptr

There are 10 cell clustering results (clustering_index = 0 to 9) stored in this file: the first is for graph-based clustering and the remaining nine are for K-means clustering (K = 2 to 10). Descriptions for /cell_groups/[clustering_index] group arrays:

Path | Type | Description
--- | --- | ---
/indices | uint32 | An array of the cell indices for all cells assigned to one of the clusters in the secondary analysis. Cluster assignment determines the order of cell indices in each of these cell_groups/[clustering_index] arrays.
/indptr | uint32 | An array that indicates the cell index value (row) where each new cluster assignment begins in /cell_groups/[clustering_index]/indices. For example, "[0, 218440]" for cell_groups/1 means that cluster 1 starts at the 1st element of indices and cluster 2 starts at the 218,441st element of indices (0-based indexing).
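To make the indices/indptr layout concrete, here is a minimal sketch that splits indices into per-cluster lists of cell indices. Following the "[0, 218440]" example, it assumes indptr holds only the start position of each cluster, so the last cluster runs to the end of indices. The function name `clusters_from_csr` and the toy arrays are hypothetical.

```python
def clusters_from_csr(indices, indptr):
    """Return {cluster_number: [cell indices]} with 1-based cluster numbers.

    Assumes indptr lists only start positions (one per cluster).
    """
    bounds = list(indptr) + [len(indices)]
    return {
        cluster + 1: list(indices[bounds[cluster]:bounds[cluster + 1]])
        for cluster in range(len(indptr))
    }


# Toy data: 5 cells in 2 clusters; cluster 2 starts at position 3 of indices
indices = [4, 0, 2, 1, 3]
indptr = [0, 3]

clusters = clusters_from_csr(indices, indptr)

# Invert the mapping to look up the cluster of any cell index
cell_to_cluster = {cell: c for c, cells in clusters.items() for cell in cells}
```

With a real file, indices and indptr would be read from cell_groups/[clustering_index] in analysis.zarr.zip.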

Descriptions for the cell_groups group attributes:

Field | Type | Description
--- | --- | ---
major_version | int | Major version for the analysis.zarr.zip file. This number is increased when breaking changes are made.
minor_version | int | Minor version.
number_groupings | int | The number of clustering results in the dataset (graph-based and K-means clusters).
grouping_names | list[str] | Contains a list of unique clustering method names for all the clustering results (e.g., "gene_expression_graphclust", "gene_expression_kmeans_2_clusters").
group_names | list[list[str]] | For each of the clustering result groups (e.g., "gene_expression_kmeans_2_clusters"), there is an inner list of all the clusters in the group (e.g., "['Cluster 1', 'Cluster 2']").

The cell_feature_matrix.zarr.zip output file contains a matrix of counts per cell and per feature (features include gene and non-gene codewords). Only transcripts that passed the default quality value (Q-Score) threshold of Q20 are counted. It has the following hierarchy of array data:

    (root)
    └── cell_features
        ├── cell_id
        ├── data
        ├── indices
        └── indptr

Description for /cell_features group arrays:

Path | Type | Description
--- | --- | ---
/cell_id | uint32 | The first column consists of the cell_id prefix, and the second column is the dataset suffix (see Cell ID format mapping for string conversion).
/data | uint32 | An array of counts (Q-Score ≥ 20) for a particular cell and specified feature, stored in a compressed sparse row (CSR) format (array V) that only contains nonzero counts.
/indices | uint32 | Contains column indices (column_index in CSR format) that specify the cell index for each nonzero count value in the /data values array.
/indptr | uint32 | Contains indices (row_index in CSR format) where each group of nonzero counts starts for a given feature in /data. For example, "[0, 21282, 28505, …]" means the nonzero counts for feature 1 start at 0, the nonzero counts for feature 2 start at 21282, etc. for all features in the dataset.
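Read together, the three arrays describe a sparse matrix whose rows are features and whose columns are cells. The sketch below pulls out the nonzero counts for one feature from toy arrays. The helper name `counts_for_feature` is hypothetical, and because this page does not state whether indptr ends with a closing offset, the helper uses only the first number_features entries and closes the final row at len(data).

```python
def counts_for_feature(data, indices, indptr, feature_index, number_features):
    """Return {cell_index: count} for one feature (one CSR row)."""
    # Use only the start offsets; the last row ends where the data ends
    bounds = list(indptr[:number_features]) + [len(data)]
    start, stop = bounds[feature_index], bounds[feature_index + 1]
    return {int(indices[i]): int(data[i]) for i in range(start, stop)}


# Toy matrix with 3 features (rows) and 4 cells (columns):
#   feature 0: [2, 0, 0, 1]
#   feature 1: [0, 0, 0, 0]
#   feature 2: [0, 5, 3, 0]
data = [2, 1, 5, 3]
indices = [0, 3, 1, 2]
indptr = [0, 2, 2, 4]
number_features = 3

feature0 = counts_for_feature(data, indices, indptr, 0, number_features)
feature1 = counts_for_feature(data, indices, indptr, 1, number_features)
feature2 = counts_for_feature(data, indices, indptr, 2, number_features)
```

In a real file, number_features and the feature names (feature_keys) are available as attributes of the cell_features group.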

Description for /cell_features group attributes:

Field | Type | Description
--- | --- | ---
major_version | int | Major version for the cell_feature_matrix.zarr.zip file. This number is increased when breaking changes are made.
minor_version | int | Minor version.
number_cells | int | The number of cells in the dataset.
number_features | int | The number of features (e.g., genes, controls, unassigned) in the dataset.
feature_keys | list[str] | Each element is the name of the feature (e.g., gene name).
feature_ids | list[str] | Each element is the ID of the feature (e.g., gene id).
feature_types | list[str] | Each element is the type of the feature (e.g., gene, negative_control_codeword).

The transcripts.zarr.zip output file contains data to evaluate transcript quality and localization. It has the following hierarchy of array data:

    (root)
    ├── codeword_category
    ├── gene_category
    ├── density
    │   └── gene
    │       ├── data
    │       ├── indices
    │       └── indptr
    └── grids
        ├── 0
        │   ├── 0,0
        │   │   ├── codeword_identity
        │   │   ├── gene_identity
        │   │   ├── id
        │   │   ├── location
        │   │   ├── quality_score
        │   │   ├── status
        │   │   ├── uuid
        │   │   └── valid
        │   ├── 0,1
        │   │   ├── codeword_identity
        │   │   ├── gene_identity
        │   │   ├── id
        │   │   ├── location
        │   │   ├── quality_score
        │   │   ├── status
        │   │   ├── uuid
        │   │   └── valid
        │   ├── [X,Y]
        [...]

The /density array and associated attributes contain transcript density bin information, which is shown in the analysis_summary.html Region Details panel and in the transcript density view in Xenium Explorer.

The /grids arrays contain a pyramid structure of downsampled transcript levels. The transcript information is stored in this structure as a way to divide it into smaller chunks and for subsampling at zoomed out views. The number of levels corresponds to the selected tissue region size; smaller regions require fewer levels to store subsampled transcript information.

For example, if there are seven levels in total, grids/0 is the most zoomed in level and grids/6 is the most zoomed out level. The most zoomed out level contains a subsample of the transcript information and can fit in a single file (0,0). The most zoomed in level describes where every transcript is located, and consequently the chunks of data need to be stored in more files ((0,0), (0,1), etc.); the arrangement of the files is specified in the file names.

Arrays

Description for root group arrays:

Path | Type | Description
--- | --- | ---
/codeword_category | bool | A num_codewords x 7 boolean table that contains information about the categories that codewords belong to. Column names and descriptions are contained in codeword_category/.zattrs.
/gene_category | bool | A num_genes x 7 boolean table that contains information about the categories that genes belong to. Column names and descriptions are contained in gene_category/.zattrs.

Description for /density/gene group arrays:

Path | Type | Description
--- | --- | ---
/data | uint16 | An array of the Q-Score ≥ 20 counts for a particular transcript density grid cell and specified gene (chunked at 50,000 elements), stored in a compressed sparse row (CSR) format (array V) that only contains nonzero counts. Each bin is 10 µm. The rows of this matrix are a collapsed encoding of two quantities: (gene, grid_row). The columns of this matrix correspond to grid_col. Here, grid_row and grid_col specify the location in the density grid, and gene specifies the index of the gene.
/indices | uint16 | Contains the feature indices (column_index in CSR format) that correspond to the order of feature counts in the /data values array (chunked at 50,000 elements).
/indptr | uint32 | Contains indices (row_index in CSR format) where each group of counts starts for a given density grid cell in /data.

Description for /grids/[grid_index]/[grid_position] group arrays:

Path | Type | Description
--- | --- | ---
/gene_identity | uint16 | The gene index(es) for each transcript. Gene indices are zero-based and reference the gene_names attribute attached to the gene parent group (see root attribute table below). Codewords corresponding to no-call (absence of a codeword) are denoted by the value 65535. Columns are: gene_call.
/id | uint32 | The transcript ID (1st column) and FOV index (2nd column). This array is not guaranteed to be sorted in any particular order. The transcript ID is a unique value for each transcript within the FOV.
/location | float32 | The location of each transcript in physical coordinate space. Columns are: x_position, y_position, and z_position of the transcript.
/quality_score | float32 | The calibrated Q-Score for each transcript.
/status | uint8 | The status of a transcript used in the pipeline to indicate that it passed filtering; always 0 if present in final output file.
/uuid | uint32 | Unique identifier for transcripts; used by the pipeline.
/valid | uint8 | The status of a transcript used in the pipeline to indicate that it passed filtering; always 1 if present in final output file.
/codeword_identity | uint16 | The codeword index for each RNA. Codeword indices are zero-based and reference the codeword_names attribute attached to the dataset. Unknown codewords are given by max_value (uint16). Currently, the first column indicates the codeword index and the second column is unused.
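As a small illustration of how these per-transcript arrays line up, the sketch below pairs gene indices with Q-Scores, skips no-call entries (65535), and keeps transcripts at or above a chosen Q-Score. It works on toy Python lists; the function name `named_transcripts`, the gene names, and the values are hypothetical, and the Q20 cutoff simply mirrors the default threshold mentioned for the cell-feature matrix.

```python
NO_CALL = 65535  # uint16 max value, used for no-call entries


def named_transcripts(gene_identity, quality_score, gene_names, min_q=20.0):
    """Yield (gene_name, q_score) for called transcripts with q_score >= min_q."""
    for gene_index, q in zip(gene_identity, quality_score):
        if gene_index == NO_CALL or q < min_q:
            continue
        yield gene_names[gene_index], q


# Hypothetical toy data for one grid chunk
gene_names = ["GeneA", "GeneB", "GeneC"]
gene_identity = [0, 2, 65535, 1, 2]
quality_score = [40.0, 12.5, 30.0, 20.0, 35.0]

kept = list(named_transcripts(gene_identity, quality_score, gene_names))
```

With a real file, gene_identity is described as having a gene_call column, so it may need flattening to one value per transcript before being passed to a helper like this.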

Attributes

Description for root group attributes:

Field | Type | Description
--- | --- | ---
name | str | The name of the dataset ("RnaDataset").
major_version | int | Major version for the transcripts.zarr.zip file. This number is increased when breaking changes are made.
minor_version | int | Minor version.
dataset_uuid | str | Unique ID for this dataset.
data_format | int | A field for internal pipeline use. Always set to 0.
number_rnas | int | The total number of transcripts in the dataset.
spatial_units | str | The units of the stitched image space ("micron").
fov_names | list[str] | Names of the FOVs used in the dataset as referenced by the FOV indices (/grids/[grid_index]/[grid_position]/id).
number_genes | int | The number of genes in the dataset.
gene_names | list[str] | Names of the genes.
codeword_count | int | The number of codewords.
codeword_gene_mapping | list[int] | The index of the gene in gene_names specified by each codeword.
codeword_gene_names | list[str] | The name of the gene in gene_names specified by each codeword.
coordinate_space | str | For internal pipeline use. Should have the value "refined-final_global_micron".

Note: The key root group attributes used by Xenium Explorer are shown above. This is not a comprehensive attribute list from all Xenium Onboard Analysis versions.

Description for /density/gene array attributes:

Field | Type | Description
--- | --- | ---
grid_size | list[float] | List of the XY grid spacings in µm (10 µm in current version).
rows | int | The number of density grid (bin) rows.
cols | int | The number of density grid (bin) columns.
gene_names | list[str] | The names of genes.
origin | dict[str,float] | Origin of the grid as {"x": min_x, "y": min_y}.
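One way to read these attributes is as the definition of a regular grid anchored at origin. The sketch below locates the density bin that would contain a given physical coordinate. It rests on assumptions not confirmed on this page: grid_size is ordered [x_spacing, y_spacing], columns run along x and rows along y, and bins are half-open intervals starting at the origin. The function name `density_bin` and the sample values are hypothetical.

```python
import math


def density_bin(x_um, y_um, origin, grid_size):
    """Return (grid_row, grid_col) of the bin containing a point in µm.

    Assumes grid_size = [x_spacing, y_spacing], rows along y, columns along x.
    """
    col = math.floor((x_um - origin["x"]) / grid_size[0])
    row = math.floor((y_um - origin["y"]) / grid_size[1])
    return row, col


# Hypothetical attribute values shaped like the fields described above
origin = {"x": 100.0, "y": 50.0}
grid_size = [10.0, 10.0]

row, col = density_bin(125.0, 72.0, origin, grid_size)
```

A result outside the range 0 to rows - 1 or 0 to cols - 1 would indicate a point that lies outside the density grid.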

Description for /grids array attributes:

Field | Type | Description
--- | --- | ---
grid_key_names | list[str] | The names of the grid keys used by the current grid (e.g., "grid_x_loc").
number_levels | int | The number of levels in the grid pyramid (must be ≥1).
grid_size | list[float] | The size of a grid element for each grid pyramid level.
grid_keys | list[list[str]] | The grid keys (e.g., "0,0,0") for each level of the grid pyramid.
grid_number_objects | list[list[str]] | The number of transcripts in each grid element, in each level of the grid pyramid.

The cells.zarr.zip and cell_feature_matrix.zarr.zip files store cell_id arrays in integer format (uint32). The first column holds the cell_id_prefix. The polygon vertices of all cells in the dataset determine these integer values. The second column holds the dataset_suffix, an integer that defaults to 1 and may be changed to designate cells originating from different datasets.

Other files (e.g., H5/MTX, CSV) have cell_id in string format (e.g., cmlbdfdf-1). To map between these formats, here is the conversion process from integer to string:

  1. Convert the cell_id_prefix to its hexadecimal (hex) representation.
  2. Shift the characters from the normal hex range [0 - 9, a - f] to the range a - p (where a = 0, b = 1, c = 2, ..., p = 15).
  3. Add a dash and append the dataset_suffix as an unpadded integer.

For example: integer cell_id_prefix = 1437536272 and dataset_suffix = 1

  • Hex conversion of prefix = "55af1010"
  • String cell_id = "ffkpbaba-1"
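The conversion steps above can be sketched in a few lines of Python, together with the reverse mapping from string back to integers. The function names are hypothetical, and padding the hex string to 8 digits is an assumption based on the prefix being a uint32 and on the 8-letter examples shown here.

```python
def cell_id_to_string(prefix: int, suffix: int) -> str:
    """Convert an integer (cell_id_prefix, dataset_suffix) pair to a string cell_id."""
    hex_digits = format(prefix, "08x")  # assumption: zero-pad to 8 hex digits
    # Shift each hex digit 0-f to the letters a-p (a = 0, ..., p = 15)
    shifted = "".join(chr(ord("a") + int(d, 16)) for d in hex_digits)
    return f"{shifted}-{suffix}"


def cell_id_from_string(cell_id: str):
    """Convert a string cell_id back to (cell_id_prefix, dataset_suffix)."""
    shifted, suffix = cell_id.rsplit("-", 1)
    hex_digits = "".join(format(ord(c) - ord("a"), "x") for c in shifted)
    return int(hex_digits, 16), int(suffix)


# Worked example from the text above
string_id = cell_id_to_string(1437536272, 1)
round_trip = cell_id_from_string(string_id)
```

Here string_id evaluates to "ffkpbaba-1", matching the worked example, and round_trip recovers the original integer pair.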

This code snippet shows how to read a Zarr array into numpy N-dimensional arrays:

    # Import Python libraries
    # This script was tested with zarr v2.13.6
    import zarr
    import numpy as np


    # Function to open a Zarr file
    def open_zarr(path: str) -> zarr.Group:
        store = (
            zarr.ZipStore(path, mode="r")
            if path.endswith(".zip")
            else zarr.DirectoryStore(path)
        )
        return zarr.group(store=store)


    # For example, use the above function to open the cells Zarr file,
    # which contains segmentation mask Zarr arrays
    root = open_zarr("cells.zarr.zip")

    # Look at group array info and structure
    root.info
    root.tree()  # shows structure, array dimensions, data types

    # Create cell and nucleus segmentation mask np array objects to read or modify
    cellseg_mask = np.array(root["masks"][1])
    nucseg_mask = np.array(root["masks"][0])

    # Show dimensions of the 2D segmentation mask arrays (also shown in .tree())
    # The .ndim attribute shows the number of dimensions
    # The shape should match the number of pixels in the morphology image.
    cellseg_mask.shape
    nucseg_mask.shape

    # Show max value of cells in the masks (value=0 are background pixels)
    # The .max() method returns the largest cell index in the mask. Because cell
    # indices start at 1, this should equal the total cells detected in the
    # dataset (reported in e.g., analysis_summary.html summary tab metric).
    # (This should also be the same value as length of seg_mask_value)
    cellseg_mask.max()
    nucseg_mask.max()

    # Examples for exploring file contents
    # How to show array
    root["masks"][0][0:9]  # or root["masks/0"]
    root["cell_summary"][0:9]

    # How to show attribute values
    root.attrs["major_version"]

    # How to list out attribute names and values
    dict(root.attrs.items())
    dict(root["cell_summary"].attrs.items())

Using the same Python function as above to read in the file, here are a few example lines to view the analysis.zarr.zip and transcripts.zarr.zip arrays and attributes:

    # Read in secondary analysis Zarr arrays
    root = open_zarr("analysis.zarr.zip")

    # Examples for exploring file contents
    # How to show a slice of the clustering_index arrays
    root["cell_groups"][0]["indices"][0:9]

    # How to show attributes
    root["cell_groups"].attrs["group_names"]

    # Read in transcripts Zarr arrays
    root = open_zarr("transcripts.zarr.zip")

    # Examples for exploring file contents
    # How to show array info
    root["grids"][0]["0,0"]["gene_identity"].shape
    root["grids"][0]["0,0"]["quality_score"][0:9]
    root["grids"][0]["0,0"]["location"][0:9,]

    # How to show array attributes
    root.attrs["major_version"]
    root["density"]["gene"].attrs["gene_names"][0:9]