Performing 3D Nucleus Segmentation With Cellpose and Generating a Feature-cell Matrix

Dec 6, 2022

Share via:

Note: 10x Genomics does not provide support for community-developed tools and makes no guarantees regarding their function or performance. Please contact tool developers with any questions. If you have feedback about Analysis Guides, please email analysis-guides@10xgenomics.com.

In most cases, the use of 10x Cell Segmentation Staining delivers the most accurate cell segmentation results, because in addition to identifying nuclear boundaries using DAPI, it also uses a number of protein stains to accurately deduce cell boundaries. This article presents an alternative approach that attempts 3D nuclear segmentation using only DAPI signal, followed by nuclear expansion heuristic for cell boundaries.

Xenium is a new in situ platform from 10x Genomics that quantifies spatial gene expression at subcellular resolution. In addition to automated biochemistry and image acquisition, the platform also performs onboard data analysis so raw images are distilled into data points for cells and transcripts.

As part of the onboard data analysis, Xenium automatically performs nucleus segmentation using a custom neural network. The analysis pipeline takes DAPI signals from the 3D image stack into account but ultimately produces a flattened 2D nucleus segmentation mask. It then expands the nuclear boundary by 5 microns in all directions to approximate the cell boundary, and assigns transcripts to a cell if it falls within the cell boundary.

Because the Xenium cell segmentation mask is 2D, transcripts are assigned to extruded 2D shapes, based solely on their x and y coordinates. The net result is akin to using cookie cutters on a thin layer of cookie dough.

Customers may wish to perform their own cell segmentation using a third-party tool if they are not satisfied with Xenium's default cell segmentation and transcript mapping. Xenium Ranger allows importing of 2D third-party segmentation results to create a Xenium bundle that can be explored in Xenium Explorer and read by most spatial and single-cell analysis packages. For true 3D segmentation this analysis guide uses Cellpose 2.1 and provides a path to create a feature-cell matrix that can be used for downstream data analysis.

Important

This analysis guide is compatible with data generated by the Xenium Onboard Analysis v1.0 pipeline.

This analysis guide requires the following 2 files from the Xenium output folder:

morphology.ome.tif
transcripts.parquet

Once the requisite libraries are installed, the first section will show how to perform nucleus segmentation with Cellpose 2.1 using Xenium's DAPI TIFF image. The result will be a numpy data dictionary (*.ome_seg.npy) and a nuclei masked image (*.ome_cp_masks.tif).

The second section will show how to combine transcripts' 3D locations (from transcripts.parquet) with Cellpose nucleus segmentation results to assign transcripts to cells using custom code. The final result is a feature-cell matrix in MTX format that is compatible with popular third-party tools such as Seurat and Scanpy.

Miniconda

This article will use a wide variety of third-party Python libraries. We recommend using Miniconda to install them, in addition to creating two conda environments that will be used throughout this guide. Miniconda can be installed by following the directions at: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html

Python Libraries

tifffile library is popular for reading and writing TIFF image files. In particular, it supports the OME-TIFF multi-page format that is produced by Xenium.
pyarrow library will be used to read files in Apache Parquet format.
pandas library has data structures that make it easy to work with mixed type tabular data.
scipy library has support for sparse matrices and it will be used to produce a feature-cell matrix in MTX format.

These libraries can be installed by running the following shell commands


# Update conda
conda update conda

# Create the xenium conda environment
conda create --name xenium python=3.9.12

# Activate the xenium conda environment
conda activate xenium

# Install libraries
conda install tifffile
conda install -c conda-forge pyarrow
conda install pandas
conda install scipy

# Go back to base environment
conda deactivate

After running the commands above, you should now have a xenium conda environment with the following installed:

python 3.9.12
tifffile 2021.7.2
pyarrow 4.0.0
pandas 1.4.3
scipy 1.7.3

Cellpose 2.1

Cellpose is a generalist algorithm for segmentation. The website at https://www.cellpose.org/ has a drag and drop UI, but it is limited for several reasons:

Only runs the older Cellpose 1.0
Input image is limited to <10MB, and up to 512x512 pixels.

Due to the aforementioned limitations, we will install the Python library and create a script to perform nucleus segmentation. Cellpose 2.1 installation steps are documented here, and summarized by the following shell commands:


conda create --name cellpose python=3.8

# Activate the cellpose environment
conda activate cellpose

# Install Cellpose
python -m pip install cellpose

# Go back to base environment
conda deactivate

You should now have a cellpose conda environment with the following installed:

python 3.8.13
cellpose 2.1.0

File Format Overview – morphology.ome.tif

The morphology.ome.tif image produced by Xenium is an image pyramid that contains 7 copies of the image stack at varying resolutions. Each level contains a 3D image stack with approximately 10 z-slices This concept is illustrated in the following figure.

Here's a video game analogy to explain the rationale behind the image pyramid. In games, 3D models often have variable levels of detail (LOD). Up close, a character model may consist of thousands of polygons. However, when that same character is 1/4 mile away in the game world, it can probably be represented by <100 polygons, and players won’t notice the reduction in detail. This makes it more efficient to render the whole scene.

Image generated using OpenSG 2 Vision.

With Xenium Explorer, the browser pulls from different levels of morphology.ome.tif depending on the level of zoom, and displays only what you need to see to remain quick and responsive despite the torrent of data. If you are looking at the whole tissue, you can probably get by with one of the lower resolution images.

Extract a Single Level from morphology.ome.tif

Cellpose can handle a 3D image stack with (x, y, z) info, but it is not compatible with an image pyramid. As such, it is necessary to extract a single level from morphology.ome.tif before proceeding further.

The choice of level will affect the segmentation result. A higher resolution input will result in more detailed segmentation at the cost of longer runtime.

Input Resolution	10 x 1424 x 1550	10 x 5696 x 6200
Segmentation Result
Segmentation Runtime	162 min. 55 sec.	300 min. 34 sec.

Once you decide which level to extract, the following Python script can be used to pull out a specific level, and save it as a separate image. To run the script, please do the following:

Copy and paste the code to a new text file.


#!/usr/bin/env python

import tifffile

# Variable 'LEVEL' determines the level to extract. It ranges from 0 (highest
# resolution) to 6 (lowest resolution) for morphology.ome.tif
LEVEL = 2

with tifffile.TiffFile('morphology.ome.tif') as tif:
    image = tif.series[0].levels[LEVEL].asarray()

tifffile.imwrite(
    'level_'+str(LEVEL)+'_morphology.ome.tif',
    image,
    photometric='minisblack',
    dtype='uint16',
    tile=(1024, 1024),
    compression='JPEG_2000_LOSSY',
    metadata={'axes': 'ZYX'},
)

Modify the LEVEL variable (line 7) to the desired value.
Save the file with a *.py extension (e.g. extract_tiff.py).
Give the script execute permission.


# Give files with .py extension execute permission
chmod 755 *.py

Activate the xenium conda environment before running the script.


# Activate the xenium conda environment
conda activate xenium

# Run Python script
./extract_tiff.py

Running Cellpose 2.1

Command-line options for Cellpose are documented at https://cellpose.readthedocs.io/en/latest/command.html#command-line-examples. In particular, we want to highlight these parameters:

--use_gpu: If you have access to a machine with GPU, there is potential for performance improvement. We did not test this option, and all runtimes provided in this analysis guide come from running Cellpose on a standard server.
--pretrained_model nuclei: During testing, we did not try training our own model; we used the pre-trained nuclei model. Cellpose will automatically download the model the first time you use it.
--do_3D: Cellpose is able to perform 3D segmentation. Xenium outputs an image stack with approximately 10 z-slices that are spaced 3 microns apart, so we enable this option.
--diameter: This parameter specifies target diameter in pixels, and can greatly affect segmentation results. Unfortunately, Cellpose is not able to automatically estimate a value when it is run in 3D mode, so some experimentation is needed to determine the optimal setting. In the table below, you can see how much segmentation results and runtimes can vary, given an identical Level 2 input image.

`--diameter` setting	8	16	24
Segmentation Result
Segmentation Runtime	788 min. 50 sec.	300 min. 34 sec.	300 min. 35 sec.

For mouse brain data, setting the --diameter parameter to 7-10 microns seems to yield good results. The optimal setting will obviously vary with other types of tissue, and it's important to convert the target diameter from microns to pixels, depending on which level was extracted from the image pyramid.

The following python script will print out the image description of morphology.ome.tif. This script should be run using the xenium conda environment.


#!/usr/bin/env python

import tifffile

with tifffile.TiffFile('morphology.ome.tif') as tif:
    for tag in tif.pages[0].tags.values():
        if tag.name == "ImageDescription":
            print(tag.name+":", tag.value)

In particular, the description includes the following line showing that at level 0, each pixel is 0.2125 micron x 0.2125 micron in (x, y).


<Pixels DimensionOrder="XYZCT" ID="Pixels:0" SizeC="1" SizeT="1" SizeX="22784" SizeY="24800" SizeZ="10" Type="uint16" PhysicalSizeX="0.2125" PhysicalSizeY="0.2125" PhysicalSizeZ="3.0">

With that information, here is a table that lists pixel size at various levels of the image pyramid:

Level	Pixel size (micron)
0	0.2125
1	0.425
2	0.85
3	1.7
4	3.4
5	6.8
6	13.6

Putting everything together, the following is a sample Cellpose command that can be run from the shell:


# Activate the cellpose environment
conda activate cellpose

# Cellpose 2.1 command
python -m cellpose --dir <ABS_PATH_OF_IMG> --pretrained_model nuclei --chan 0 --chan2 0 --img_filter _morphology.ome --diameter <DIAMETER_PIXEL> --do_3D --save_tif --verbose

# Deactivate cellpose environment
conda deactivate

Warning: When the resolution of the input image stack is high, Cellpose can crash with the following error: OverflowError: serializing a bytes object larger than 4 GiB requires pickle protocol 4 or higher

Ultimately, this error is caused by a known issue in numpy (https://github.com/numpy/numpy/issues/18784) which hard-codes the use of pickle protocol 3. The newer protocol 4 has been available since Python 3.4, and because we specified Python 3.8 in the cellpose conda environment, it should be possible to use protocol 4 and circumvent the 4GB file size limitation. Let's go ahead and make a minor source code modification to numpy.

First, find where your numpy library is installed by running the following shell command. It will display directories where Python libraries are found:


conda activate cellpose
python -c "import sys; print(sys.path)"

If you use Miniconda to install Python libraries, and set up a cellpose conda environment as suggested above, the file that needs to be edited should be located at <SOME_PATH_PREFIX>/miniconda3/envs/cellpose/lib/python3.8/site-packages/numpy/lib/format.py

Use your favorite text editor, and change line 679 of format.py, located in the write_array() function,

FROM


        pickle.dump(array, fp, protocol=3, **pickle_kwargs)


        pickle.dump(array, fp, protocol=pickle.HIGHEST_PROTOCOL, **pickle_kwargs)

File Format Overview – transcripts.parquet

Xenium outputs transcript data in 3 different formats:

transcripts.csv.gz
transcripts.parquet
transcripts.zarr.zip

Depending on the choice of downstream analysis tools, one format may be easier or faster to process than another.

This article will focus on the Apache Parquet file format. Unlike the row-based CSV format where data is stored line by line (row by row), Parquet is column-based. Because data in each column usually share the same data type, Parquet files can be compressed and decompressed very efficiently. In general, this modern file format is much smaller and faster relative to CSV. For a more in-depth comparison between Parquet and CSV, please take a look at this article.

Use the following python script to quickly determine the column names and their associated data types in transcripts.parquet. The script will also print out the first 10 records. Please run this script using the xenium conda environment.


#!/usr/bin/env python3

import pandas as pd

data_frame = pd.read_parquet('transcripts.parquet')
print(data_frame.dtypes)
print(data_frame[:10])

Running the python script returns the following, showing that transcripts.parquet contains 8 columns.


transcript_id        uint64
cell_id               int32
overlaps_nucleus      uint8
feature_name         object
x_location          float32
y_location          float32
z_location          float32
qv                  float32
dtype: object
     transcript_id  cell_id  overlaps_nucleus feature_name  x_location  y_location  z_location         qv
0  281474976710656      153                 1     b'Chst9'  467.284912  723.035156   10.867549  13.371499
1  281474976710657       85                 1   b'Bhlhe40'  618.211121  717.488708   14.101116  12.122915
2  281474976710658      191                 0      b'Wfs1'  598.623840  782.937134   13.952497  40.000000
3  281474976710659      203                 0      b'Wfs1'  610.280945  798.482544   13.899248  40.000000
4  281474976710660      199                 0      b'Dkk3'  532.640747  800.270020   14.038778  40.000000
5  281474976710661      205                 0      b'Gjb2'  412.885162  813.587158   13.887715  40.000000
6  281474976710662       -1                 0      b'Dkk3'  535.707947  861.141846   14.190361  12.725308
7  281474976710663       -1                 0      b'Lyz2'  340.266418  865.291199   13.883341  40.000000
8  281474976710664       -1                 0      b'Car4'  211.336823  879.179504   13.700431  40.000000
9  281474976710665     1342                 0      b'Lyz2'  367.774048  890.858215   13.799179  40.000000

Here are a few quick notes about some of these columns:

cell_id: If a transcript is not associated with a cell, it will get -1. Cell assignment is based on Xenium's default 2D segmentation mask, and is likely irrelevant if we want to use the 3D segmentation result from Cellpose.
x_location, y_location, z_location: These 3 columns store the (x, y, z) coordinates of the transcript in microns. The origin is found in the upper-left corner of the bottom z-slice, as the following image illustrates.

qv: Phred-scaled quality score of all decoded transcripts. It is up to the customer to determine a suitable Q-Score threshold when analyzing Xenium data. At 10x Genomics, we typically filter out transcripts with Q-Score < 20.

Reading Cellpose Segmentation Mask

Cellpose produces a *.ome_seg.npy output file that contains the nucleus segmentation results. This file is documented on https://cellpose.readthedocs.io/en/latest/outputs.html.

It is also possible to list all the keys, the datatypes, and dimensions using the following python script. Please run this script using the xenium conda environment.


#!/usr/bin/env python3

import numpy as np
from glob import glob

fnames = glob('*morphology.ome_seg.npy')

for f in fnames:
    dat = np.load(f, allow_pickle=True).item()

    for key in dat.keys():
        dtype = type(dat[key])
        print("Key:", key)
        print("Dtype:", dtype)

        if isinstance(dat[key], np.ndarray):
            print("Dimension:", dat[key].shape)
        elif isinstance(dat[key], list):
            print("Length:", len(dat[key]))

        # empty line between keys
        print()

Running the script above would produce an output such as the following:


Key: outlines
Dtype: <class 'numpy.ndarray'>
Dimension: (12, 3125, 2861)

Key: masks
Dtype: <class 'numpy.ndarray'>
Dimension: (12, 3125, 2861)

Key: chan_choose
Dtype: <class 'list'>
Length: 2

Key: img
Dtype: <class 'numpy.ndarray'>
Dimension: (12, 3125, 2861)

Key: ismanual
Dtype: <class 'numpy.ndarray'>
Dimension: (32845,)

Key: filename
Dtype: <class 'str'>

Key: flows
Dtype: <class 'list'>
Length: 5

Key: est_diam
Dtype: <class 'float'>

Of these, the ndarray stored under the "masks" key will be most relevant to us, since every (z-slice, y-pixel, x-pixel) in the original image that is given to Cellpose will be assigned to a unique cell ID. If 0 is assigned, then that location is dark in the original image and is not associated with a cell based on DAPI nucleus segmentation.

Mapping Transcripts

Putting everything together, the Python script below will perform these tasks:

Process all transcripts from transcripts.parquet, and assign them to cells based on Cellpose nucleus segmentation results.
Perform a neighborhood search for unassigned transcripts. Since Cellpose only performs nucleus segmentation from DAPI signals, many transcripts that are located in the cytoplasm will be in unassigned space. The script uses a heuristic that associates those transcripts with the nearest nucleus if the Euclidean distance is within a user-specified limit. The default nuclear expansion distance is 10 microns, and this can be adjusted via the -nuc_exp parameter.
Generate a feature-cell matrix and output that matrix in MTX format that is compatible with Seurat and Scanpy.

Running the script with -h would display the help text that lists the required and optional parameters. This script should be run using the xenium conda environment. It's also important to specify the correct -pix_size argument depending on the level that was extracted from the image pyramid in the nucleus segmentation section.


#!/usr/bin/env python3

import argparse
import csv
import os
import sys
import numpy as np
import pandas as pd
import re
import scipy.sparse as sparse
import scipy.io as sio
import subprocess


# Define constant.
# z-slices are 3 microns apart in morphology.ome.tif
Z_SLICE_MICRON = 3


def main():
    # Parse input arguments.
    args = parse_args()

    # Check for existence of input file.
    if (not os.path.exists(args.cellpose)):
        print("The specified Cellpose output (%s) does not exist!" % args.cellpose)
        sys.exit(0)
    if (not os.path.exists(args.transcript)):
        print("The specified transcripts.parquet file (%s) does not exist!" % args.transcript)
        sys.exit(0)

    # Check if output folder already exist.
    if (os.path.exists(args.out)):
        print("The specified output folder (%s) already exists!" % args.out)
        sys.exit(0)

    # Define additional constants
    NUC_EXP_PIXEL = args.nuc_exp / args.pix_size
    NUC_EXP_SLICE = args.nuc_exp / Z_SLICE_MICRON

    # Read Cellpose segmentation mask
    seg_data = np.load(args.cellpose, allow_pickle=True).item()
    mask_array = seg_data['masks']
    # Use regular expression to extract dimensions from mask_array.shape
    m = re.match("\((?P<z_size>\d+), (?P<y_size>\d+), (?P<x_size>\d+)", str(mask_array.shape))
    mask_dims = { key:int(m.groupdict()[key]) for key in m.groupdict() }

    # Read 5 columns from transcripts Parquet file
    transcripts_df = pd.read_parquet(args.transcript,
                                     columns=["feature_name",
                                              "x_location",
                                              "y_location",
                                              "z_location",
                                              "qv"])


    # Find distinct set of features.
    features = np.unique(transcripts_df["feature_name"])

    # Create lookup dictionary
    feature_to_index = dict()
    for index, val in enumerate(features):
        feature_to_index[str(val, 'utf-8')] = index

    # Find distinct set of cells. Discard the first entry which is 0 (non-cell)
    cells = np.unique(mask_array)[1:]

    # Create a cells x features data frame, initialized with 0
    matrix = pd.DataFrame(0, index=range(len(features)), columns=cells, dtype=np.int32)


    # Iterate through all transcripts
    for index, row in transcripts_df.iterrows():
        if index % args.rep_int == 0:
            print(index, "transcripts processed.")

        feature = str(row['feature_name'], 'utf-8')
        x = row['x_location']
        y = row['y_location']
        z = row['z_location']
        qv = row['qv']

        # Ignore transcript below user-specified cutoff
        if qv < args.qv_cutoff:
            continue

        # Convert transcript locations from physical space to image space
        x_pixel = x / args.pix_size
        y_pixel = y / args.pix_size
        z_slice = z / Z_SLICE_MICRON

        # Add guard rails to make sure lookup falls within image boundaries.
        x_pixel = min(max(0, x_pixel), mask_dims["x_size"] - 1)
        y_pixel = min(max(0, y_pixel), mask_dims["y_size"] - 1)
        z_slice = min(max(0, z_slice), mask_dims["z_size"] - 1)

        # Look up cell_id assigned by Cellpose. Array is in ZYX order.
        cell_id = mask_array[round(z_slice)] [round(y_pixel)] [round(x_pixel)]

        # If cell_id is 0, Cellpose did not assign the pixel to a cell. Need to perform
        # neighborhood search. See if nearest nucleus is within user-specified distance.
        if cell_id == 0:
            # Define neighborhood boundary for 3D ndarray slicing. Take image boundary into
            # consideration to avoid negative index.
            z_neighborhood_min_slice = max(0, round(z_slice-NUC_EXP_SLICE))
            z_neighborhood_max_slice = min(mask_dims["z_size"], round(z_slice+NUC_EXP_SLICE+1))
            y_neighborhood_min_pixel = max(0, round(y_pixel-NUC_EXP_PIXEL))
            y_neighborhood_max_pixel = min(mask_dims["y_size"], round(y_pixel+NUC_EXP_PIXEL+1))
            x_neighborhood_min_pixel = max(0, round(x_pixel-NUC_EXP_PIXEL))
            x_neighborhood_max_pixel = min(mask_dims["x_size"], round(x_pixel+NUC_EXP_PIXEL+1))

            # Call helper function to see if nearest nucleus is within user-specified distance.
            cell_id = nearest_cell(args, x_pixel, y_pixel, z_slice,
                                   x_neighborhood_min_pixel,
                                   y_neighborhood_min_pixel,
                                   z_neighborhood_min_slice,
                                   mask_array[z_neighborhood_min_slice : z_neighborhood_max_slice,
                                              y_neighborhood_min_pixel : y_neighborhood_max_pixel,
                                              x_neighborhood_min_pixel : x_neighborhood_max_pixel])

        # If cell_id is not 0 at this point, it means the transcript is associated with a cell
        if cell_id != 0:
            # Increment count in feature-cell matrix
            matrix.at[feature_to_index[feature], cell_id] += 1

    # Call a helper function to create Seurat and Scanpy compatible MTX output
    write_sparse_mtx(args, matrix, cells, features)



#--------------------------
# Helper functions

def parse_args():
    """Parses command-line options for main()."""
    summary = 'Map Xenium transcripts to Cellpose segmentation result. \
               Generate Seurat/Scanpy-compatible feature-cell matrix.'

    parser = argparse.ArgumentParser(description=summary)
    requiredNamed = parser.add_argument_group('required named arguments')
    requiredNamed.add_argument('-cellpose',
                               required = True,
                               help="The path to the *.ome_seg.npy file produced " +
                                    "by Cellpose.")
    requiredNamed.add_argument('-transcript',
                               required = True,
                               help="The path to the transcripts.parquet file produced " +
                                    "by Xenium.")
    requiredNamed.add_argument('-out',
                               required = True,
                               help="The name of output folder in which feature-cell " +
                                    "matrix is written.")
    requiredNamed.add_argument('-pix_size',
                               required = True,
                               type=float,
                               help="The size of each pixel, in microns, for the " +
                                    "image that is passed to Cellpose for nucleus " +
                                    "segmentation.")
    parser.add_argument('-nuc_exp',
                        default='10.0',
                        type=float,
                        help="The expansion distance from the nuclear boundary, " +
                             "in microns, for cell boundary. (default: 10.0)")
    parser.add_argument('-qv_cutoff',
                        default='20.0',
                        type=float,
                        help="Ignore transcripts with QV score below this " +
                             "threshold. (default: 20.0)")
    parser.add_argument('-rep_int',
                        default='10000',
                        type=int,
                        help="Reporting interval. Will print message to stdout " +
                             "whenever the specified # of transcripts is processed. " +
                             "(default: 10000)")

    try:
        opts = parser.parse_args()
    except:
        sys.exit(0)

    return opts


def nearest_cell(args, x_pixel, y_pixel, z_slice,
                 x_neighborhood_min_pixel, y_neighborhood_min_pixel,
                 z_neighborhood_min_slice, mask_array):
    """Check if nearest nucleus is within user-specified distance.
       If function returns 0, it means no suitable nucleus was found."""

    # Initialize constants
    NUC_EXP_PIXEL = args.nuc_exp / args.pix_size
    # For Euclidean distance, we need to convert z-slice to z-micron
    SLICE_TO_PIXEL = Z_SLICE_MICRON / args.pix_size
    # When we take a neighborhood slice of mask_array, all indices start at (0,0,0).
    # This INDEX_SHIFT is necessary to reconstruct coordinates from original mask_array.
    INDEX_SHIFT = np.array([z_neighborhood_min_slice,
                            y_neighborhood_min_pixel,
                            x_neighborhood_min_pixel])

    min_dist = NUC_EXP_PIXEL
    cell_id = 0

    # Enumerate through all points in the neighborhood
    for index, cell in np.ndenumerate(mask_array):
        # Current point is not assigned to a nucleus.
        if cell == 0:
            continue
        # Current point IS assigned to a nucleus. But is it within NUC_EXP_PIXEL?
        else:
            img_loc = np.asarray(index, dtype=float) + INDEX_SHIFT
            # Convert from z-slice to "z-pixel"
            img_loc[0] *= SLICE_TO_PIXEL

            transcript_loc = np.array([z_slice * SLICE_TO_PIXEL, y_pixel, x_pixel])
            # Calculate Euclidean distance between 2 points
            dist = np.linalg.norm(transcript_loc - img_loc)

            if dist < min_dist:
                min_dist = dist
                cell_id = cell

    return cell_id


def write_sparse_mtx(args, matrix, cells, features):
    """Write feature-cell matrix in Seurat/Scanpy-compatible MTX format"""

    # Create the matrix folder.
    os.mkdir(args.out)

    # Convert matrix to scipy's COO sparse matrix.
    sparse_mat = sparse.coo_matrix(matrix.values)

    # Write matrix in MTX format.
    sio.mmwrite(args.out + "/matrix.mtx", sparse_mat)

    # Write cells as barcodes.tsv. File name is chosen to ensure
    # compatibility with Seurat/Scanpy.
    with open(args.out + "/barcodes.tsv", 'w', newline='') as tsvfile:
        writer = csv.writer(tsvfile, delimiter='\t', lineterminator='\n')
        for cell in cells:
            writer.writerow(["cell_" + str(cell)])

    # Write features as features.tsv. Write 3 columns to ensure
    # compatibility with Seurat/Scanpy.
    with open(args.out + "/features.tsv", 'w', newline='') as tsvfile:
        writer = csv.writer(tsvfile, delimiter='\t', lineterminator='\n')
        for f in features:
            feature = str(f, 'utf-8')
            if feature.startswith("NegControlProbe_") or feature.startswith("antisense_"):
                writer.writerow([feature, feature, "Negative Control Probe"])
            elif feature.startswith("NegControlCodeword_"):
                writer.writerow([feature, feature, "Negative Control Codeword"])
            elif feature.startswith("BLANK_"):
                writer.writerow([feature, feature, "Blank Codeword"])
            else:
                writer.writerow([feature, feature, "Gene Expression"])

    # Seurat expects all 3 files to be gzipped
    subprocess.run("gzip -f " + args.out + "/*", shell=True)



if __name__ == "__main__":
    main()

If the script above is saved to a file called map_transcripts.py, then a sample invocation might look like the following:


# Activate the xenium conda environment
conda activate xenium

# Map transcripts to cells
./map_transcripts.py -cellpose level_3_morphology.ome_seg.npy -transcript transcripts.parquet -out feature_cell_matrix -pix_size 1.7

The script spends the majority of time performing neighborhood searches, so the choice of -nuc_exp will greatly impact overall runtime. Larger nuclear expansion distances will allow more transcripts to be associated with a cell, but we expect a roughly cubic growth in runtime since neighborhoods expand in 3 dimensions.

The table below shows the relationship between nuclear expansion distance, cell-associated Q20 transcripts, and script runtime. These results are based on processing the first million transcripts, 806,508 of which have Q-Score >= 20, from a mouse brain dataset. Depending on the nuclear expansion distance, different numbers of transcripts can be associated with a cell (cell-associated Q20 transcript) as shown in the table below. The whole dataset has almost 20 million decoded transcripts, so we can expect a twenty-fold increase in runtime to map all transcripts in this mouse brain sample.

Nuclear Expansion Distance	Cell-Associated Q20 Transcripts	Script Runtime
0 micron	49,846	2 min. 23 sec.
5 microns	98,031	12 min. 24 sec.
10 microns	125,874	60 min. 14 sec.
15 microns	139,917	165 min. 7 sec.

After running the Python transcript mapping script, the resulting feature-cell matrix can be read by popular third-party tools such as Seurat and Scanpy.

To demonstrate compatibility, we were able to follow this Seurat vignette and use the Read10X() function to load the matrix. We produced the following plots, and it should be straightforward to cluster the cells, generate a UMAP projection, and find differentially expressed features.

Likewise, we were able to follow this Scanpy tutorial and use the read_10x_mtx() function to load the matrix. The following plot shows the 20 genes that have the highest fraction of counts in each cell, across all cells, and it should be straightforward to proceed further with the tutorial and run the standard suite of single cell data analysis with this feature-cell matrix.

If you use Cellpose in your publication, please cite Cellpose based on citation guidelines here.

Stay connected with latest technical workflow and software updatesSubscribe to newsletter