Creating Single Cell References for Xenium Custom Panel Design from Seurat or AnnData

Dec 11, 2023

Share via:

Note: 10x Genomics does not provide support for community-developed tools and makes no guarantees regarding their function or performance. Please contact tool developers with any questions. If you have feedback about Analysis Guides, please email analysis-guides@10xgenomics.com.

The Xenium In Situ platform measures gene expression with panels of probes that target genes of interest. In addition to using 10x Genomics pre-designed panels, you can request custom panels using the Xenium Panel Designer web app. The panel design process is described in detail here.

The Xenium Panel Designer uses single cell RNA sequencing data to improve the panel design. The data is used to maximize assay performance by prescreening potentially problematic genes and optimizing the assignment of genes to codewords. Although we provide a set of curated single cell atlases through the web app, providing single cell data that is closely matched for the sample you intend to study with your Xenium panel will provide the best results, especially for cancer samples. Furthermore, you may have a specific tissue region or disease that you are interested in that is not well-represented in the data we provide.

The Xenium Panel Designer accept scRNA-seq data in either Cell Ranger h5 format or in Matrix Exchange Format (MEX). If you only have Seurat or AnnData objects, they cannot be used automatically in the Xenium Panel Designer. This guide will demonstrate how to convert your annotated Seurat or AnnData data to a format that the Xenium Panel Designer accepts. We will show how to convert them into MEX format, which is well-supported by software in both R and Python.

To follow along with this guide, start by installing the software and downloading the example dataset.

For converting Seurat data to MEX with R, install the software below (versions used in this guide in parentheses):

R (v4.2.3)
Seurat (v4.1.0)
dplyr (v1.1.3)
DropletUtils (v1.18.1)

For converting AnnData to MEX with Python, install the software below (versions used in this guide in parentheses):

Python (v3.11.6)
h5py (v3.10.0)
scanpy (v1.9.6)

We will be using the Tabula Sapiens - Pancreas data from CELLxGENE for this guide. The dataset can be downloaded from here in two formats:

.h5ad AnnData v0.8 (344 MB): download
.rds Seurat v4 (365 MB): download

After downloading, rename the files to local.rds and local.h5ad, so you can directly copy and paste the code in the tutorial below.

The dataset from Tabula Sapiens is already annotated and filtered for high-quality cells. If you are using your own data, please first annotate cell types and filter for high-quality cells. Resources on cell type annotation and data filtering can be found in other Analysis Guide articles.

The Xenium Panel Designer requires unnormalized counts for all detected genes (i.e., not filtered for protein-coding or non-mitochondrial). These "raw" counts are typically stored in the slot called counts in an "RNA" assay within your Seurat object. To be sure, we can inspect the Seurat object and confirm that the assay contains positive integers.

All of the code examples in this section of the guide are in the R programming language.


# Load libraries
library(Seurat)
library(dplyr)

# Read in the dataset
seurat_obj <- readRDS("local.rds")

In this section, we check the following:

The counts are integers
Gene ID and name are both present
Annotation information is present

Confirm that all the values stored in the matrix are integers. If your counts contain non-integer values, they are potentially normalized and not true UMI counts. Please make sure your data have not been normalized. Otherwise, the panel utilization and gene utilization estimation calculated by the Xenium Panel Designer may be incorrect.


all.equal(GetAssayData(object=seurat_obj, assay="RNA", slot="counts")@x, as.integer(GetAssayData(object=seurat_obj, assay="RNA", slot="counts")@x))

# Output
## [1] TRUE

Take a look at the first set of values, they should all be integer format.


head(GetAssayData(object=seurat_obj, assay="RNA", slot="counts")@x)

# Output
## [1]  1  2 18  1  2  1

You may need to use layer="counts" for later versions of Seurat. See Seurat documentation for guidance.

We also require that both Ensembl IDs and gene symbols are passed to the Xenium Panel Designer. In this case, it seems like the Ensembl IDs are on the rownames of the Seurat object, while the gene symbols are stored within the assay’s meta features in a column called feature_name. This may be different in your case, and you should be careful to ensure that you identify both of these correctly.


# Check row names
head(rownames(seurat_obj))

# Output
## [1] "ENSG00000223972" "ENSG00000227232" "ENSG00000278267" "ENSG00000243485"
## [5] "ENSG00000284332" "ENSG00000237613"

# Check meta features
dplyr::glimpse(GetAssay(seurat_obj)@meta.features)

# Output
## Rows: 58,604
## Columns: 12
## $ feature_type          <fct> Gene Expression, Gene Expression, Gene Expressio…
## $ highly.variable       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ mvp.mean              <dbl> 6.398244e-05, 2.274395e-03, 6.175251e-05, 1.3728…
## $ mvp.dispersion        <dbl> 0.8350443, 2.4422800, 1.2953346, 2.6563521, NaN,…
## $ mvp.dispersion.scaled <dbl> -0.5739472, 0.5332035, -0.2568744, 0.6806679, 0.…
## $ mean                  <dbl> 3.881082e-05, 1.079995e-03, 3.267138e-05, 4.7735…
## $ std                   <dbl> 5.573522e-03, 3.173095e-02, 5.634017e-03, 8.0407…
## $ ensembl_version       <chr> "ENSG00000223972.5", "ENSG00000227232.5", "ENSG0…
## $ feature_is_filtered   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ feature_name          <fct> DDX11L1, WASH7P, MIR6859-1, MIR1302-2HG, MIR1302…
## $ feature_reference     <fct> NCBITaxon:9606, NCBITaxon:9606, NCBITaxon:9606, …
## $ feature_biotype       <fct> gene, gene, gene, gene, gene, gene, gene, gene, …

Finally, we must also identify where the cell type annotations are stored within the object’s metadata. Using dplyr, you can glimpse the metadata and see that the metadata slot cell_type contains the cell types we are interested in.


# Check metadata
dplyr::glimpse(seurat_obj[[]])

# Output
## Rows: 13,497
## Columns: 26
## $ assay_ontology_term_id                   <fct> EFO:0009922, EFO:0009922, EFO…
## $ donor_id                                 <fct> TSP9, TSP9, TSP9, TSP9, TSP9,…
## $ anatomical_information                   <fct> exocrine, exocrine, exocrine,…
## $ nCounts_RNA_UMIs                         <dbl> 26642, 4053, 8439, 7967, 9405…
## $ nFeaturess_RNA                           <dbl> 3833, 1624, 3067, 1484, 3273,…
## $ cell_ontology_class                      <fct> pancreatic acinar cell, t cel…
## $ free_annotation                          <fct> pancreatic acing cell, T cell…
## $ manually_annotated                       <lgl> TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ compartment                              <fct> epithelial, immune, endotheli…
## $ sex_ontology_term_id                     <fct> PATO:0000384, PATO:0000384, P…
## $ disease_ontology_term_id                 <fct> PATO:0000461, PATO:0000461, P…
## $ is_primary_data                          <lgl> FALSE, FALSE, FALSE, FALSE, F…
## $ organism_ontology_term_id                <fct> NCBITaxon:9606, NCBITaxon:960…
## $ suspension_type                          <fct> cell, cell, cell, cell, cell,…
## $ cell_type_ontology_term_id               <fct> CL:0002064, CL:0000084, CL:00…
## $ tissue_ontology_term_id                  <fct> UBERON:0000017, UBERON:000001…
## $ development_stage_ontology_term_id       <fct> HsapDv:0000131, HsapDv:000013…
## $ self_reported_ethnicity_ontology_term_id <fct> HANCESTRO:0014, HANCESTRO:001…
## $ cell_type                                <fct> pancreatic acinar cell, T cel…
## $ assay                                    <fct> 10x 3' v3, 10x 3' v3, 10x 3' …
## $ disease                                  <fct> normal, normal, normal, norma…
## $ organism                                 <fct> Homo sapiens, Homo sapiens, H…
## $ sex                                      <fct> male, male, male, male, male,…
## $ tissue                                   <fct> exocrine pancreas, exocrine p…
## $ self_reported_ethnicity                  <fct> Hispanic or Latin American, H…
## $ development_stage                        <fct> 37-year-old human stage, 37-y…

The Xenium Panel Designer will only accept files that are 2G or smaller. This should be enough to accommodate approximately 50,000 cells (although this may vary). If your output file is larger, you will need to subsample cells.

We recommend only subsetting abundant cell types to ensure all rare cell barcodes are retained.

We will begin by assuming all cells will fit in the final output (subsample_rate = 1), but if you get a warning at the end of this process, you may need to come back and reduce the subsample_rate.


subsample_rate <- 1

subset <- 1:ncol(seurat_obj)
if (subsample_rate < 1) {
  subset <- sample(subset, subsample_rate * length(subset))
}

In the next step, we will save the files in MEX format. Two options are provided - 1) with DropletUtils or 2) with baseR functions. Select one of these methods and then continue to the last step to save cell type annotations.

One way to convert your Seurat object to MEX format is to use DropletUtils.


# Install DropletUtils
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!require("DropletUtils", quietly = TRUE))
  BiocManager::install("DropletUtils")

# Load library
library(DropletUtils)

We will use the write10xCounts function to save our raw counts to a folder called reference_data/, using the location for the raw counts, gene symbols, and Ensembl IDs that we previously identified.


# Run function
write10xCounts(
  "reference_data",
  GetAssayData(seurat_obj, assay="RNA", slot="counts")[, subset],
  gene.id = rownames(seurat_obj),
  gene.symbol = GetAssay(seurat_obj)@meta.features[["feature_name"]],
  barcodes = colnames(seurat_obj)[subset],
  type = "sparse",
  version = "3"
)

# List the files
list.files("reference_data")

# Output
#[1] "barcodes.tsv.gz" "features.tsv.gz" "matrix.mtx.gz"

Alternatively, you can do the equivalent format conversion with either base R or data.table using the following function:


# Define function
writeCounts <- function(out_dir, data, barcodes = colnames(data), gene.id = rownames(data), gene.symbol = rownames(data), feature.type = "Gene Expression", subset = 1:length(barcodes)) {
  require("R.utils")
  require("Matrix")

  if (file.exists(out_dir) || (dir.exists(out_dir) && length(list.files(out_dir)) > 0)) {
    stop("The specified output directory already exists! Not overwriting")
  }
  dir.create(out_dir, recursive = TRUE)

  if (require("data.table")) {
    data.table::fwrite(
      data.table::data.table(barcode = barcodes[subset]),
      file.path(out_dir, "barcodes.tsv.gz"),
      col.names = FALSE
    )

    data.table::fwrite(
      data.table::data.table(
        gene_id = gene.id,
        gene_symbol = gene.symbol,
        feature_type = feature.type
      ),
      file.path(out_dir, "features.tsv.gz"),
      col.names = FALSE
    )
  } else {
    write.table(
      data.frame(barcode = barcodes[subset]),
      gzfile(file.path(out_dir, "barcodes.tsv.gz")),
      sep = "\t", quote = FALSE,
      col.names = FALSE, row.names = FALSE
    )

    write.table(
      data.frame(
        gene_id = gene.id,
        gene_symbol = gene.symbol,
        feature_type = feature.type
      ),
      file.path(out_dir, "features.tsv.gz"),
      sep = "\t", quote = FALSE,
      col.names = FALSE, row.names = FALSE
    )
  }

  Matrix::writeMM(data[, subset], file.path(out_dir, "matrix.mtx"))
  R.utils::gzip(file.path(out_dir, "matrix.mtx"), remove = TRUE)
}

The function can be executed as so, which should generate the same outputs as DropletUtils.


# Run function
writeCounts(
  "reference_data",
  GetAssayData(seurat_obj, assay="RNA", slot="counts"),
  gene.id = rownames(seurat_obj),
  gene.symbol = GetAssay(seurat_obj)@meta.features[["feature_name"]],
  feature.type = GetAssay(seurat_obj)@meta.features[["feature_type"]],
  barcodes = colnames(seurat_obj),
  subset = subset
)

# List files
list.files("reference_data")

# Output
#[1] "barcodes.tsv.gz" "features.tsv.gz" "matrix.mtx.gz"

Finally, we will save cell type information to a file so the Xenium Panel Designer knows what cell type to associate with each barcode. Earlier, we identified the cell type location. We need to write it to the same directory as the MEX data and compress the output directory. The function below is written to output a zip file.


# Define function
bundleOutputs <- function(out_dir, data, barcodes = colnames(data), cell_type = "cell_type", subset = 1:length(barcodes)) {

  if (require("data.table", quietly = TRUE)) {
    data.table::fwrite(
      data.table::data.table(
        barcode = barcodes,
        annotation = unlist(data[[cell_type]])
      )[subset, ],
      file.path(out_dir, "annotations.csv")
    )
  } else {
    write.table(
      data.frame(
        barcode = barcodes,
        annotation = unlist(data[[cell_type]])
      )[subset, ],
      file.path(out_dir, "annotations.csv"),
      sep = ",", row.names = FALSE
    )
  }

  bundle <- file.path(out_dir, paste0(basename(out_dir), ".zip"))

  utils::zip(
    bundle,
    list.files(out_dir, full.names = TRUE),
    zip = "zip"
  )

  if (file.info(bundle)$size / 1e6 > 2000) {
    warning("The output file is more than 2G and will need to be subset further.")
  }
}

The function can be run as shown below. If you get a warning that your file is too large, go back to the subset data step.


# Run function
bundleOutputs(out_dir = "reference_data", data = seurat_obj, subset = subset)

Check your working directory for a reference_data directory, which contains the files listed below (example dataset output should be 280.9 MB).


# List files
list.files("reference_data")

# Output
#[1] "annotations.csv"    "barcodes.tsv.gz"    "features.tsv.gz"    "matrix.mtx.gz"      "reference_data.zip"

If there is no warning, you are ready to upload the .zip file to the Xenium Panel Designer to build your panel. Learn more about the Xenium custom panel design process here.

The first step to converting h5ad to MEX is to identify which keys store the relevant data in the h5ad file. You must identify the observation keys for:

Cell barcodes (cell barcodes are usually the index)
Cell types

You must also identify the variable keys for:

Gene name
Gene ID

First, we will show an example of how to identify these keys in an example AnnData object from CELLxGENE. Then, we provide a function to then use this information to write a zipped file ready for upload to the Xenium Panel Designer.

All of the code examples in this section of the guide are in the Python programming language.

For this example, we have provided a small h5ad file obtained from Tabula Sapiens (download data above). This file has been filtered to only have cells from the pancreas. In the next set of steps, we will identify the required keys in this file that will be used to convert to MEX format.


# Load the h5ad file using h5py
import h5py

# Edit the file path in this command to point to the h5ad file on your computer
obj = h5py.File("local.h5ad")

The Xenium Panel Designer requires unnormalized data. It is common to put normalized data in the X field and retain the raw counts in the raw field.

Taking a look at the attributes of this h5ad file, we can see that there is both X and raw:


# Print keys
obj.keys()

# Output
# <KeysViewHDF5 ['X', 'layers', 'obs', 'obsm', 'obsp', 'raw', 'uns', 'var', 'varm', 'varp']>

Now use scanpy to load the object. If the matrix is large, this could take a few minutes.


# Load the h5ad file using scanpy
import scanpy as sc

# Edit the file path in this command to point to the h5ad file on your computer
ad = sc.read_h5ad("local.h5ad")

Since we have identified that this h5ad file has a raw section, extract it to a new AnnData object.


raw_ad = ad.raw.to_adata()

Next, inspect the obs and var fields. We are looking for columns that contain gene common names and Ensembl gene IDs in var and cell types and barcodes in obs, which will be used to save the data in MEX format.


# Print field contents
raw_ad.obs.columns

# Output
#Index(['assay_ontology_term_id', 'donor_id', 'anatomical_information',
#'n_counts_UMIs', 'n_genes', 'cell_ontology_class', 'free_annotation',
#'manually_annotated', 'compartment', 'sex_ontology_term_id',
#'disease_ontology_term_id', 'is_primary_data',
#'organism_ontology_term_id', 'suspension_type',
#'cell_type_ontology_term_id', 'tissue_ontology_term_id',
#'development_stage_ontology_term_id',
#'self_reported_ethnicity_ontology_term_id', 'cell_type', 'assay',
#'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity',
#'development_stage'],
# dtype='object')

This tells us that the cell_type_key is cell_type:


# Print information for the first row
raw_ad.obs[["cell_type"]].head(1)

# Output
#                                                    cell_type
#cell_id
#AAACCCACAGCTATTG_TSP9_Pancreas_exocrine_10X_1_1...  pancreatic acinar cell

For the cell barcodes, since they are the index of the obs dataframe, we do not need to pull a key name. This tells us that the gene_id_key is ensemblid and the gene_name_key is feature_name.


# Print information for the first row
raw_ad.var.head(1)

# Output
#                 feature_type    ensembl_version feature_name feature_reference feature_biotype
#ensemblid
#ENSG00000223972  Gene Expression  ENSG00000223972.5      DDX11L1    NCBITaxon:9606            gene

Check the data type of your matrix. If it shows float32 or float64, then it is a float and not an integer.


print(raw_ad.X.dtype)

Convert the float values to integers before converting to MEX:


raw_ad.X = raw_ad.X.astype(int)

Now we can go ahead and convert to MEX. The function below is written to output a zip file.


# Import libraries
import pandas as pd
import scipy
import scipy.sparse as sparse
import zipfile
import tempfile
import os
import gzip

# Define function
def h5ad_to_10x(ad,
                gene_id_key="gene_id",
                gene_name_key="gene_name",
                cell_type_key="cell_type",
                output_path="matrix.zip",
                barcode_key=None,
                subsample_rate=None):
    """
    Convert an AnnData h5ad file to 10x format,
    then produce a zipfile of the final matrix.

    Arguments:
        ad: A scanpy AnnData object.
        gene_id_key: The key that the gene IDs are under.
        gene_name_key: The key that the gene names are under.
        cell_type_key: The key that the cell types are under.
        barcode_key: Optional key that the barcodes are under. If not set,
            will assume they are the index of `obs`.
        output_path: Path to write the zipfile to.
        subsample_rate: Optional argument for subsampling.
            If provided, should be between 0 and 1.
    """
    if subsample_rate:
        sc.pp.subsample(ad, subsample_rate)

    genes = ad.var.reset_index()[[gene_id_key, gene_name_key]]
    genes["feature_type"] = ["Gene Expression"] * len(genes)

    if barcode_key:
        barcodes = ad.obs[[barcode_key]]
    else:
        barcodes = pd.DataFrame(ad.obs.index)

    celltypes = ad.obs[[cell_type_key]].reset_index()
    celltypes.columns = ["barcode", "annotation"]

    with tempfile.TemporaryDirectory() as tmp_dir:

        with gzip.open(os.path.join(tmp_dir, "matrix.mtx.gz"), "w") as handle:
            # Write in compressed sparse column matrix format
            scipy.io.mmwrite(handle, sparse.csc_matrix(ad.X.T))

        genes.to_csv(os.path.join(tmp_dir, "features.tsv.gz"), sep="\t", index=False, header=False, compression="gzip")
        barcodes.to_csv(os.path.join(tmp_dir, "barcodes.tsv.gz"), sep="\t", index=False, header=False, compression="gzip")
        celltypes.to_csv(os.path.join(tmp_dir, "celltypes.csv"), index=False)

        with zipfile.ZipFile(output_path, "w") as zip_handle:
            for file in ["matrix.mtx.gz", "features.tsv.gz", "barcodes.tsv.gz", "celltypes.csv"]:
                zip_handle.write(os.path.join(tmp_dir, file), arcname=file)


# Run function to save the matrix as a MEX file using the keys we identified earlier
h5ad_to_10x(raw_ad, gene_id_key="ensemblid", gene_name_key="feature_name", cell_type_key="cell_type")

Check your working directory for a file called matrix.zip (example dataset output should be 152.7 MB).

The resulting zipfile should be loadable by the scanpy read_10x_mtx function.


# List content information
import os
os.system("unzip -l matrix.zip")

# Output
#Archive:  matrix.zip
#  Length      Date    Time    Name
#---------  ---------- -----   ----
#151244754  11-09-2023 12:14   matrix.mtx.gz
#   479701  11-09-2023 12:14   features.tsv.gz
#    72881  11-09-2023 12:14   barcodes.tsv.gz
#   941922  11-09-2023 12:14   celltypes.csv
#---------                     -------
#152739258                     4 files


# Read the matrix file
with tempfile.TemporaryDirectory() as tmp_dir:
    with zipfile.ZipFile("matrix.zip", "r") as zip_handle:
        zip_handle.extractall(tmp_dir)
    mtx = sc.read_10x_mtx(tmp_dir)


# Convert the matrix to a data frame and print the top
mtx.to_df().head()

# Output
#                                                  DDX11L1  WASH7P  MIR6859-1  MIR1302-2HG  ...  MT-TE  MT-CYB  MT-TT  MT-TP
#AAACCCACAGCTATTG_TSP9_Pancreas_exocrine_10X_1_1...      0.0     0.0        0.0          0.0  ...    0.0   251.0    0.0    3.0
#AAACCCAGTCGTATGT_TSP9_Pancreas_exocrine_10X_1_1...      0.0     0.0        0.0          0.0  ...    0.0    53.0    0.0    0.0
#AAACCCAGTCTGGTTA_TSP9_Pancreas_exocrine_10X_1_1...      0.0     0.0        0.0          0.0  ...    0.0    91.0    0.0    0.0
#AAACCCATCAAGCCGC_TSP9_Pancreas_exocrine_10X_1_1...      0.0     0.0        0.0          0.0  ...    0.0   351.0    0.0    0.0
#AAACCCATCGGCATAT_TSP9_Pancreas_exocrine_10X_1_1...      0.0     0.0        0.0          0.0  ...    0.0    67.0    0.0    0.0
#
#[5 rows x 58604 columns]

Now you are ready to upload the .zip file to the Xenium Panel Designer to build your panel. Learn more about the Xenium custom panel design process here.

Stay connected with latest technical workflow and software updatesSubscribe to newsletter

Creating Single Cell References for Xenium Custom Panel Design from Seurat or AnnData

Note: 10x Genomics does not provide support for community-developed tools and makes no guarantees regarding their function or performance. Please contact tool developers with any questions. If you have feedback about Analysis Guides, please email analysis-guides@10xgenomics.com.

Prerequisites

Install software

Download example dataset

Convert Seurat data to 10x MEX format

Check Seurat object

Generate subset

Option 1: Save MEX files with DropletUtils

Option 2: Save MEX files with base R functions

Save cell type annotations

Convert h5ad (AnnData) data to 10x MEX format

Normalized data

Load AnnData

Extract the raw matrix

Identify the required keys

Check matrix data type

Save as MEX files

Check results