Analysis Guides/

Creating Single Cell References for Xenium Custom Panel Design from Seurat or AnnData

Dec 11, 2023
Share via:

Note: 10x Genomics does not provide support for community-developed tools and makes no guarantees regarding their function or performance. Please contact tool developers with any questions. If you have feedback about Analysis Guides, please email [email protected].

The Xenium In Situ platform measures gene expression with panels of probes that target genes of interest. In addition to using 10x Genomics pre-designed panels, you can request custom panels using the Xenium Panel Designer web app. The panel design process is described in detail here.

The Xenium Panel Designer uses single cell RNA sequencing data to improve the panel design. The data is used to maximize assay performance by prescreening potentially problematic genes and optimizing the assignment of genes to codewords. Although we provide a set of curated single cell atlases through the web app, providing single cell data that is closely matched for the sample you intend to study with your Xenium panel will provide the best results, especially for cancer samples. Furthermore, you may have a specific tissue region or disease that you are interested in that is not well-represented in the data we provide.

The Xenium Panel Designer accept scRNA-seq data in either Cell Ranger h5 format or in Matrix Exchange Format (MEX). If you only have Seurat or AnnData objects, they cannot be used automatically in the Xenium Panel Designer. This guide will demonstrate how to convert your annotated Seurat or AnnData data to a format that the Xenium Panel Designer accepts. We will show how to convert them into MEX format, which is well-supported by software in both R and Python.

To follow along with this guide, start by installing the software and downloading the example dataset.

For converting Seurat data to MEX with R, install the software below (versions used in this guide in parentheses):

  • R (v4.2.3)
  • Seurat (v4.1.0)
  • dplyr (v1.1.3)
  • DropletUtils (v1.18.1)

For converting AnnData to MEX with Python, install the software below (versions used in this guide in parentheses):

  • Python (v3.11.6)
  • h5py (v3.10.0)
  • scanpy (v1.9.6)

We will be using the Tabula Sapiens - Pancreas data from CELLxGENE for this guide. The dataset can be downloaded from here in two formats:

  • .h5ad AnnData v0.8 (344 MB): download
  • .rds Seurat v4 (365 MB): download

After downloading, rename the files to local.rds and local.h5ad, so you can directly copy and paste the code in the tutorial below.

The dataset from Tabula Sapiens is already annotated and filtered for high-quality cells. If you are using your own data, please first annotate cell types and filter for high-quality cells. Resources on cell type annotation and data filtering can be found in other Analysis Guide articles.

The Xenium Panel Designer requires unnormalized counts for all detected genes (i.e., not filtered for protein-coding or non-mitochondrial). These "raw" counts are typically stored in the slot called counts in an "RNA" assay within your Seurat object. To be sure, we can inspect the Seurat object and confirm that the assay contains positive integers.

All of the code examples in this section of the guide are in the R programming language.

# Load libraries library(Seurat) library(dplyr) # Read in the dataset seurat_obj <- readRDS("local.rds")

In this section, we check the following:

  1. The counts are integers
  2. Gene ID and name are both present
  3. Annotation information is present

Confirm that all the values stored in the matrix are integers. If your counts contain non-integer values, they are potentially normalized and not true UMI counts. Please make sure your data have not been normalized. Otherwise, the panel utilization and gene utilization estimation calculated by the Xenium Panel Designer may be incorrect.

all.equal(GetAssayData(object=seurat_obj, assay="RNA", slot="counts")@x, as.integer(GetAssayData(object=seurat_obj, assay="RNA", slot="counts")@x)) # Output ## [1] TRUE

Take a look at the first set of values, they should all be integer format.

head(GetAssayData(object=seurat_obj, assay="RNA", slot="counts")@x) # Output ## [1] 1 2 18 1 2 1
You may need to use layer="counts" for later versions of Seurat. See Seurat documentation for guidance.

We also require that both Ensembl IDs and gene symbols are passed to the Xenium Panel Designer. In this case, it seems like the Ensembl IDs are on the rownames of the Seurat object, while the gene symbols are stored within the assay’s meta features in a column called feature_name. This may be different in your case, and you should be careful to ensure that you identify both of these correctly.

# Check row names head(rownames(seurat_obj)) # Output ## [1] "ENSG00000223972" "ENSG00000227232" "ENSG00000278267" "ENSG00000243485" ## [5] "ENSG00000284332" "ENSG00000237613" # Check meta features dplyr::glimpse(GetAssay(seurat_obj)@meta.features) # Output ## Rows: 58,604 ## Columns: 12 ## $ feature_type <fct> Gene Expression, Gene Expression, Gene Expressio… ## $ highly.variable <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,… ## $ mvp.mean <dbl> 6.398244e-05, 2.274395e-03, 6.175251e-05, 1.3728… ## $ mvp.dispersion <dbl> 0.8350443, 2.4422800, 1.2953346, 2.6563521, NaN,… ## $ mvp.dispersion.scaled <dbl> -0.5739472, 0.5332035, -0.2568744, 0.6806679, 0.… ## $ mean <dbl> 3.881082e-05, 1.079995e-03, 3.267138e-05, 4.7735… ## $ std <dbl> 5.573522e-03, 3.173095e-02, 5.634017e-03, 8.0407… ## $ ensembl_version <chr> "ENSG00000223972.5", "ENSG00000227232.5", "ENSG0… ## $ feature_is_filtered <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,… ## $ feature_name <fct> DDX11L1, WASH7P, MIR6859-1, MIR1302-2HG, MIR1302… ## $ feature_reference <fct> NCBITaxon:9606, NCBITaxon:9606, NCBITaxon:9606, … ## $ feature_biotype <fct> gene, gene, gene, gene, gene, gene, gene, gene, …

Finally, we must also identify where the cell type annotations are stored within the object’s metadata. Using dplyr, you can glimpse the metadata and see that the metadata slot cell_type contains the cell types we are interested in.

# Check metadata dplyr::glimpse(seurat_obj[[]]) # Output ## Rows: 13,497 ## Columns: 26 ## $ assay_ontology_term_id <fct> EFO:0009922, EFO:0009922, EFO… ## $ donor_id <fct> TSP9, TSP9, TSP9, TSP9, TSP9,… ## $ anatomical_information <fct> exocrine, exocrine, exocrine,… ## $ nCounts_RNA_UMIs <dbl> 26642, 4053, 8439, 7967, 9405… ## $ nFeaturess_RNA <dbl> 3833, 1624, 3067, 1484, 3273,… ## $ cell_ontology_class <fct> pancreatic acinar cell, t cel… ## $ free_annotation <fct> pancreatic acing cell, T cell… ## $ manually_annotated <lgl> TRUE, TRUE, TRUE, TRUE, TRUE,… ## $ compartment <fct> epithelial, immune, endotheli… ## $ sex_ontology_term_id <fct> PATO:0000384, PATO:0000384, P… ## $ disease_ontology_term_id <fct> PATO:0000461, PATO:0000461, P… ## $ is_primary_data <lgl> FALSE, FALSE, FALSE, FALSE, F… ## $ organism_ontology_term_id <fct> NCBITaxon:9606, NCBITaxon:960… ## $ suspension_type <fct> cell, cell, cell, cell, cell,… ## $ cell_type_ontology_term_id <fct> CL:0002064, CL:0000084, CL:00… ## $ tissue_ontology_term_id <fct> UBERON:0000017, UBERON:000001… ## $ development_stage_ontology_term_id <fct> HsapDv:0000131, HsapDv:000013… ## $ self_reported_ethnicity_ontology_term_id <fct> HANCESTRO:0014, HANCESTRO:001… ## $ cell_type <fct> pancreatic acinar cell, T cel… ## $ assay <fct> 10x 3' v3, 10x 3' v3, 10x 3' … ## $ disease <fct> normal, normal, normal, norma… ## $ organism <fct> Homo sapiens, Homo sapiens, H… ## $ sex <fct> male, male, male, male, male,… ## $ tissue <fct> exocrine pancreas, exocrine p… ## $ self_reported_ethnicity <fct> Hispanic or Latin American, H… ## $ development_stage <fct> 37-year-old human stage, 37-y…

The Xenium Panel Designer will only accept files which are 500 MB or smaller. This should be enough to accommodate approximately 50,000 cells (although this may vary). If your output file is larger, you will need to subsample cells.

We recommend only subsetting abundant cell types to ensure all rare cell barcodes are retained.

We will begin by assuming all cells will fit in the final output (subsample_rate = 1), but if you get a warning at the end of this process, you may need to come back and reduce the subsample_rate.

subsample_rate <- 1 subset <- 1:ncol(seurat_obj) if (subsample_rate < 1) { subset <- sample(subset, subsample_rate * length(subset)) }

In the next step, we will save the files in MEX format. Two options are provided - 1) with DropletUtils or 2) with baseR functions. Select one of these methods and then continue to the last step to save cell type annotations.

One way to convert your Seurat object to MEX format is to use DropletUtils.

# Install DropletUtils if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") if (!require("DropletUtils", quietly = TRUE)) BiocManager::install("DropletUtils") # Load library library(DropletUtils)

We will use the write10xCounts function to save our raw counts to a folder called reference_data/, using the location for the raw counts, gene symbols, and Ensembl IDs that we previously identified.

# Run function write10xCounts( "reference_data", GetAssayData(seurat_obj, assay="RNA", slot="counts")[, subset], gene.id = rownames(seurat_obj), gene.symbol = GetAssay(seurat_obj)@meta.features[["feature_name"]], barcodes = colnames(seurat_obj)[subset], type = "sparse", version = "3" ) # List the files list.files("reference_data") # Output #[1] "barcodes.tsv.gz" "features.tsv.gz" "matrix.mtx.gz"

Alternatively, you can do the equivalent format conversion with either base R or data.table using the following function:

# Define function writeCounts <- function(out_dir, data, barcodes = colnames(data), gene.id = rownames(data), gene.symbol = rownames(data), feature.type = "Gene Expression", subset = 1:length(barcodes)) { require("R.utils") require("Matrix") if (file.exists(out_dir) || (dir.exists(out_dir) && length(list.files(out_dir)) > 0)) { stop("The specified output directory already exists! Not overwriting") } dir.create(out_dir, recursive = TRUE) if (require("data.table")) { data.table::fwrite( data.table::data.table(barcode = barcodes[subset]), file.path(out_dir, "barcodes.tsv.gz"), col.names = FALSE ) data.table::fwrite( data.table::data.table( gene_id = gene.id, gene_symbol = gene.symbol, feature_type = feature.type ), file.path(out_dir, "features.tsv.gz"), col.names = FALSE ) } else { write.table( data.frame(barcode = barcodes[subset]), gzfile(file.path(out_dir, "barcodes.tsv.gz")), sep = "\t", quote = FALSE, col.names = FALSE, row.names = FALSE ) write.table( data.frame( gene_id = gene.id, gene_symbol = gene.symbol, feature_type = feature.type ), file.path(out_dir, "features.tsv.gz"), sep = "\t", quote = FALSE, col.names = FALSE, row.names = FALSE ) } Matrix::writeMM(data[, subset], file.path(out_dir, "matrix.mtx")) R.utils::gzip(file.path(out_dir, "matrix.mtx"), remove = TRUE) }

The function can be executed as so, which should generate the same outputs as DropletUtils.

# Run function writeCounts( "reference_data", GetAssayData(seurat_obj, assay="RNA", slot="counts"), gene.id = rownames(seurat_obj), gene.symbol = GetAssay(seurat_obj)@meta.features[["feature_name"]], feature.type = GetAssay(seurat_obj)@meta.features[["feature_type"]], barcodes = colnames(seurat_obj), subset = subset ) # List files list.files("reference_data") # Output #[1] "barcodes.tsv.gz" "features.tsv.gz" "matrix.mtx.gz"

Finally, we will save cell type information to a file so the Xenium Panel Designer knows what cell type to associate with each barcode. Earlier, we identified the cell type location. We need to write it to the same directory as the MEX data and compress the output directory. The function below is written to output a zip file.

# Define function bundleOutputs <- function(out_dir, data, barcodes = colnames(data), cell_type = "cell_type", subset = 1:length(barcodes)) { if (require("data.table", quietly = TRUE)) { data.table::fwrite( data.table::data.table( barcode = barcodes, annotation = unlist(seurat_obj[[cell_type]]) )[subset, ], file.path(out_dir, "annotations.csv") ) } else { write.table( data.frame( barcode = barcodes, annotation = unlist(seurat_obj[[cell_type]]) )[subset, ], file.path(out_dir, "annotations.csv"), sep = ",", row.names = FALSE ) } bundle <- file.path(out_dir, paste0(basename(out_dir), ".zip")) utils::zip( bundle, list.files(out_dir, full.names = TRUE), zip = "zip" ) if (file.info(bundle)$size / 1e6 > 500) { warning("The output file is more than 500 MB and will need to be subset further.") } }

The function can be run as shown below. If you get a warning that your file is too large, go back to the subset data step.

# Run function bundleOutputs(out_dir = "reference_data", data = seurat_obj, subset = subset)

Check your working directory for a reference_data directory, which contains the files listed below (example dataset output should be 280.9 MB).

# List files list.files("reference_data") # Output #[1] "annotations.csv" "barcodes.tsv.gz" "features.tsv.gz" "matrix.mtx.gz" "reference_data.zip"

If there is no warning, you are ready to upload the .zip file to the Xenium Panel Designer to build your panel. Learn more about the Xenium custom panel design process here.

The first step to converting h5ad to MEX is to identify which keys store the relevant data in the h5ad file. You must identify the observation keys for:

  1. Cell barcodes (cell barcodes are usually the index)
  2. Cell types

You must also identify the variable keys for:

  1. Gene name
  2. Gene ID

First, we will show an example of how to identify these keys in an example AnnData object from CELLxGENE. Then, we provide a function to then use this information to write a zipped file ready for upload to the Xenium Panel Designer.

All of the code examples in this section of the guide are in the Python programming language.

For this example, we have provided a small h5ad file obtained from Tabula Sapiens (download data above). This file has been filtered to only have cells from the pancreas. In the next set of steps, we will identify the required keys in this file that will be used to convert to MEX format.

# Load the h5ad file using h5py import h5py # Edit the file path in this command to point to the h5ad file on your computer obj = h5py.File("local.h5ad")

The Xenium Panel Designer requires unnormalized data. It is common to put normalized data in the X field and retain the raw counts in the raw field.

Taking a look at the attributes of this h5ad file, we can see that there is both X and raw:

# Print keys obj.keys() # Output # <KeysViewHDF5 ['X', 'layers', 'obs', 'obsm', 'obsp', 'raw', 'uns', 'var', 'varm', 'varp']>

Now use scanpy to load the object. If the matrix is large, this could take a few minutes.

# Load the h5ad file using scanpy import scanpy as sc # Edit the file path in this command to point to the h5ad file on your computer ad = sc.read_h5ad("local.h5ad")

Since we have identified that this h5ad file has a raw section, extract it to a new AnnData object.

raw_ad = ad.raw.to_adata()

Next, inspect the obs and var fields. We are looking for columns that contain gene common names and Ensembl gene IDs in var and cell types and barcodes in obs, which will be used to save the data in MEX format.

# Print field contents raw_ad.obs.columns # Output #Index(['assay_ontology_term_id', 'donor_id', 'anatomical_information', #'n_counts_UMIs', 'n_genes', 'cell_ontology_class', 'free_annotation', #'manually_annotated', 'compartment', 'sex_ontology_term_id', #'disease_ontology_term_id', 'is_primary_data', #'organism_ontology_term_id', 'suspension_type', #'cell_type_ontology_term_id', 'tissue_ontology_term_id', #'development_stage_ontology_term_id', #'self_reported_ethnicity_ontology_term_id', 'cell_type', 'assay', #'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', #'development_stage'], # dtype='object')

This tells us that the cell_type_key is cell_type:

# Print information for the first row raw_ad.obs[["cell_type"]].head(1) # Output # cell_type #cell_id #AAACCCACAGCTATTG_TSP9_Pancreas_exocrine_10X_1_1... pancreatic acinar cell

For the cell barcodes, since they are the index of the obs dataframe, we do not need to pull a key name. This tells us that the gene_id_key is ensemblid and the gene_name_key is feature_name.

# Print information for the first row raw_ad.var.head(1) # Output # feature_type ensembl_version feature_name feature_reference feature_biotype #ensemblid #ENSG00000223972 Gene Expression ENSG00000223972.5 DDX11L1 NCBITaxon:9606 gene

Now we can go ahead and convert to MEX. The function below is written to output a zip file.

# Import libraries import pandas as pd import scipy import zipfile import tempfile import os import gzip # Define function def h5ad_to_10x(ad, gene_id_key="gene_id", gene_name_key="gene_name", cell_type_key="cell_type", output_path="matrix.zip", barcode_key=None, subsample_rate=None): """ Convert an AnnData h5ad file to 10x format, then produce a zipfile of the final matrix. Arguments: ad: A scanpy AnnData object. gene_id_key: The key that the gene IDs are under. gene_name_key: The key that the gene names are under. cell_type_key: The key that the cell types are under. barcode_key: Optional key that the barcodes are under. If not set, will assume they are the index of `obs`. output_path: Path to write the zipfile to. subsample_rate: Optional argument for subsampling. If provided, should be between 0 and 1. """ if subsample_rate: sc.pp.subsample(ad, subsample_rate) genes = ad.var.reset_index()[[gene_id_key, gene_name_key]] genes["feature_type"] = ["Gene Expression"] * len(genes) if barcode_key: barcodes = ad.obs[[barcode_key]] else: barcodes = pd.DataFrame(ad.obs.index) celltypes = ad.obs[[cell_type_key]].reset_index() celltypes.columns = ["barcode", "annotation"] with tempfile.TemporaryDirectory() as tmp_dir: with gzip.open(os.path.join(tmp_dir, "matrix.mtx.gz"), "w") as handle: scipy.io.mmwrite(handle, ad.X.T) genes.to_csv(os.path.join(tmp_dir, "features.tsv.gz"), sep="\t", index=False, header=False, compression="gzip") barcodes.to_csv(os.path.join(tmp_dir, "barcodes.tsv.gz"), sep="\t", index=False, header=False, compression="gzip") celltypes.to_csv(os.path.join(tmp_dir, "celltypes.csv"), index=False) with zipfile.ZipFile(output_path, "w") as zip_handle: for file in ["matrix.mtx.gz", "features.tsv.gz", "barcodes.tsv.gz", "celltypes.csv"]: zip_handle.write(os.path.join(tmp_dir, file), arcname=file)
# Run function to save the matrix as a MEX file using the keys we identified earlier h5ad_to_10x(raw_ad, gene_id_key="ensemblid", gene_name_key="feature_name", cell_type_key="cell_type")

Check your working directory for a file called matrix.zip (example dataset output should be 152.7 MB).

The resulting zipfile should be loadable by the scanpy read_10x_mtx function.

# List content information import os os.system("unzip -l matrix.zip") # Output #Archive: matrix.zip # Length Date Time Name #--------- ---------- ----- ---- #151244754 11-09-2023 12:14 matrix.mtx.gz # 479701 11-09-2023 12:14 features.tsv.gz # 72881 11-09-2023 12:14 barcodes.tsv.gz # 941922 11-09-2023 12:14 celltypes.csv #--------- ------- #152739258 4 files
# Read the matrix file with tempfile.TemporaryDirectory() as tmp_dir: with zipfile.ZipFile("matrix.zip", "r") as zip_handle: zip_handle.extractall(tmp_dir) mtx = sc.read_10x_mtx(tmp_dir)
# Convert the matrix to a data frame and print the top mtx.to_df().head() # Output # DDX11L1 WASH7P MIR6859-1 MIR1302-2HG ... MT-TE MT-CYB MT-TT MT-TP #AAACCCACAGCTATTG_TSP9_Pancreas_exocrine_10X_1_1... 0.0 0.0 0.0 0.0 ... 0.0 251.0 0.0 3.0 #AAACCCAGTCGTATGT_TSP9_Pancreas_exocrine_10X_1_1... 0.0 0.0 0.0 0.0 ... 0.0 53.0 0.0 0.0 #AAACCCAGTCTGGTTA_TSP9_Pancreas_exocrine_10X_1_1... 0.0 0.0 0.0 0.0 ... 0.0 91.0 0.0 0.0 #AAACCCATCAAGCCGC_TSP9_Pancreas_exocrine_10X_1_1... 0.0 0.0 0.0 0.0 ... 0.0 351.0 0.0 0.0 #AAACCCATCGGCATAT_TSP9_Pancreas_exocrine_10X_1_1... 0.0 0.0 0.0 0.0 ... 0.0 67.0 0.0 0.0 # #[5 rows x 58604 columns]

Now you are ready to upload the .zip file to the Xenium Panel Designer to build your panel. Learn more about the Xenium custom panel design process here.

Stay connected with latest technical workflow and software updatesSubscribe to newsletter