Analysis Guides/

Performing 3D Nucleus Segmentation With Cellpose and Generating a Feature-cell Matrix

Dec 6, 2022
Share via:

Note: 10x Genomics does not provide support for community-developed tools and makes no guarantees regarding their function or performance. Please contact tool developers with any questions. If you have feedback about Analysis Guides, please email [email protected].

Xenium is a new in situ platform from 10x Genomics that quantifies spatial gene expression at subcellular resolution. In addition to automated biochemistry and image acquisition, the platform also performs onboard data analysis so raw images are distilled into data points for cells and transcripts.

As part of the onboard data analysis, Xenium automatically performs nucleus segmentation using a custom neural network. The analysis pipeline takes DAPI signals from the 3D image stack into account but ultimately produces a flattened 2D nucleus segmentation mask. It then expands the nuclear boundary by 15 microns in all directions to approximate the cell boundary, and assigns transcripts to a cell if it falls within the cell boundary.

Because the Xenium cell segmentation mask is 2D, transcripts are assigned to extruded 2D shapes, based solely on their x and y coordinates. The net result is akin to using cookie cutters on a thin layer of cookie dough.

Customers who have some familiarity with Linux may wish to perform their own cell segmentation using a third-party tool if they are not satisfied with Xenium's default cell segmentation and transcript mapping. This analysis guide uses Cellpose 2.1 to perform true 3D segmentation, and provides a path to create a feature-cell matrix that can be used for downstream data analysis.


Important
This analysis guide is compatible with data generated by the Xenium Onboard Analysis v1.0 pipeline.

This analysis guide requires the following 2 files from the Xenium output folder:

  • morphology.ome.tif
  • transcripts.parquet

Once the requisite libraries are installed, the first section will show how to perform nucleus segmentation with Cellpose 2.1 using Xenium's DAPI TIFF image. The result will be a numpy data dictionary (*.ome_seg.npy) and a nuclei masked image (*.ome_cp_masks.tif).

The second section will show how to combine transcripts' 3D locations (from transcripts.parquet) with Cellpose nucleus segmentation results to assign transcripts to cells using custom code. The final result is a feature-cell matrix in MTX format that is compatible with popular third-party tools such as Seurat and Scanpy.

Miniconda

This article will use a wide variety of third-party Python libraries. We recommend using Miniconda to install them, in addition to creating two conda environments that will be used throughout this guide. Miniconda can be installed by following the directions at: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html

Python Libraries

  • tifffile library is popular for reading and writing TIFF image files. In particular, it supports the OME-TIFF multi-page format that is produced by Xenium.
  • pyarrow library will be used to read files in Apache Parquet format.
  • pandas library has data structures that make it easy to work with mixed type tabular data.
  • scipy library has support for sparse matrices and it will be used to produce a feature-cell matrix in MTX format.

These libraries can be installed by running the following shell commands

# Update conda conda update conda # Create the xenium conda environment conda create --name xenium python=3.9.12 # Activate the xenium conda environment conda activate xenium # Install libraries conda install tifffile conda install -c conda-forge pyarrow conda install pandas conda install scipy # Go back to base environment conda deactivate

After running the commands above, you should now have a xenium conda environment with the following installed:

  • python 3.9.12
  • tifffile 2021.7.2
  • pyarrow 4.0.0
  • pandas 1.4.3
  • scipy 1.7.3

Cellpose 2.1

Cellpose is a generalist algorithm for segmentation. The website at https://www.cellpose.org/ has a drag and drop UI, but it is limited for several reasons:

  • Only runs the older Cellpose 1.0
  • Input image is limited to <10MB, and up to 512x512 pixels.

Due to the aforementioned limitations, we will install the Python library and create a script to perform nucleus segmentation. Cellpose 2.1 installation steps are documented here, and summarized by the following shell commands:

conda create --name cellpose python=3.8 # Activate the cellpose environment conda activate cellpose # Install Cellpose python -m pip install cellpose # Go back to base environment conda deactivate

You should now have a cellpose conda environment with the following installed:

  • python 3.8.13
  • cellpose 2.1.0

File Format Overview – morphology.ome.tif

The morphology.ome.tif image produced by Xenium is an image pyramid that contains 7 copies of the image stack at varying resolutions. Each level contains a 3D image stack with approximately 10 z-slices This concept is illustrated in the following figure.

Here's a video game analogy to explain the rationale behind the image pyramid. In games, 3D models often have variable levels of detail (LOD). Up close, a character model may consist of thousands of polygons. However, when that same character is 1/4 mile away in the game world, it can probably be represented by <100 polygons, and players won’t notice the reduction in detail. This makes it more efficient to render the whole scene.

Image generated using OpenSG 2 Vision.

With Xenium Explorer, the browser pulls from different levels of morphology.ome.tif depending on the level of zoom, and displays only what you need to see to remain quick and responsive despite the torrent of data. If you are looking at the whole tissue, you can probably get by with one of the lower resolution images.

Extract a Single Level from morphology.ome.tif

Cellpose can handle a 3D image stack with (x, y, z) info, but it is not compatible with an image pyramid. As such, it is necessary to extract a single level from morphology.ome.tif before proceeding further.

The choice of level will affect the segmentation result. A higher resolution input will result in more detailed segmentation at the cost of longer runtime.

Input Resolution10 x 1424 x 155010 x 5696 x 6200
Segmentation Result
Segmentation Runtime162 min. 55 sec.300 min. 34 sec.

Once you decide which level to extract, the following Python script can be used to pull out a specific level, and save it as a separate image. To run the script, please do the following:

  1. Copy and paste the code to a new text file.
#!/usr/bin/env python import tifffile # Variable 'LEVEL' determines the level to extract. It ranges from 0 (highest # resolution) to 6 (lowest resolution) for morphology.ome.tif LEVEL = 2 with tifffile.TiffFile('morphology.ome.tif') as tif: image = tif.series[0].levels[LEVEL].asarray() tifffile.imwrite( 'level_'+str(LEVEL)+'_morphology.ome.tif', image, photometric='minisblack', dtype='uint16', tile=(1024, 1024), compression='JPEG_2000_LOSSY', metadata={'axes': 'ZYX'}, )
  1. Modify the LEVEL variable (line 7) to the desired value.
  2. Save the file with a *.py extension (e.g. extract_tiff.py).
  3. Give the script execute permission.
# Give files with .py extension execute permission chmod 755 *.py
  1. Activate the xenium conda environment before running the script.
# Activate the xenium conda environment conda activate xenium # Run Python script ./extract_tiff.py

Running Cellpose 2.1

Command-line options for Cellpose are documented at https://cellpose.readthedocs.io/en/latest/command.html#command-line-examples. In particular, we want to highlight these parameters:

  • --use_gpu: If you have access to a machine with GPU, there is potential for performance improvement. We did not test this option, and all runtimes provided in this analysis guide come from running Cellpose on a standard server.
  • --pretrained_model nuclei: During testing, we did not try training our own model; we used the pre-trained nuclei model. Cellpose will automatically download the model the first time you use it.
  • --do_3D: Cellpose is able to perform 3D segmentation. Xenium outputs an image stack with approximately 10 z-slices that are spaced 3 microns apart, so we enable this option.
  • --diameter: This parameter specifies target diameter in pixels, and can greatly affect segmentation results. Unfortunately, Cellpose is not able to automatically estimate a value when it is run in 3D mode, so some experimentation is needed to determine the optimal setting. In the table below, you can see how much segmentation results and runtimes can vary, given an identical Level 2 input image.
--diameter setting81624
Segmentation Result
Segmentation Runtime788 min. 50 sec.300 min. 34 sec.300 min. 35 sec.

For mouse brain data, setting the --diameter parameter to 7-10 microns seems to yield good results. The optimal setting will obviously vary with other types of tissue, and it's important to convert the target diameter from microns to pixels, depending on which level was extracted from the image pyramid.

The following python script will print out the image description of morphology.ome.tif. This script should be run using the xenium conda environment.

#!/usr/bin/env python import tifffile with tifffile.TiffFile('morphology.ome.tif') as tif: for tag in tif.pages[0].tags.values(): if tag.name == "ImageDescription": print(tag.name+":", tag.value)

In particular, the description includes the following line showing that at level 0, each pixel is 0.2125 micron x 0.2125 micron in (x, y).

<Pixels DimensionOrder="XYZCT" ID="Pixels:0" SizeC="1" SizeT="1" SizeX="22784" SizeY="24800" SizeZ="10" Type="uint16" PhysicalSizeX="0.2125" PhysicalSizeY="0.2125" PhysicalSizeZ="3.0">

With that information, here is a table that lists pixel size at various levels of the image pyramid:

LevelPixel size (micron)
00.2125
10.425
20.85
31.7
43.4
56.8
613.6

Putting everything together, the following is a sample Cellpose command that can be run from the shell:

# Activate the cellpose environment conda activate cellpose # Cellpose 2.1 command python -m cellpose --dir <ABS_PATH_OF_IMG> --pretrained_model nuclei --chan 0 --chan2 0 --img_filter _morphology.ome --diameter <DIAMETER_PIXEL> --do_3D --save_tif --verbose # Deactivate cellpose environment conda deactivate

Warning: When the resolution of the input image stack is high, Cellpose can crash with the following error: OverflowError: serializing a bytes object larger than 4 GiB requires pickle protocol 4 or higher

Ultimately, this error is caused by a known issue in numpy (https://github.com/numpy/numpy/issues/18784) which hard-codes the use of pickle protocol 3. The newer protocol 4 has been available since Python 3.4, and because we specified Python 3.8 in the cellpose conda environment, it should be possible to use protocol 4 and circumvent the 4GB file size limitation. Let's go ahead and make a minor source code modification to numpy.

First, find where your numpy library is installed by running the following shell command. It will display directories where Python libraries are found:

conda activate cellpose python -c "import sys; print(sys.path)"

If you use Miniconda to install Python libraries, and set up a cellpose conda environment as suggested above, the file that needs to be edited should be located at <SOME_PATH_PREFIX>/miniconda3/envs/cellpose/lib/python3.8/site-packages/numpy/lib/format.py

Use your favorite text editor, and change line 679 of format.py, located in the write_array() function,

FROM

pickle.dump(array, fp, protocol=3, **pickle_kwargs)

TO

pickle.dump(array, fp, protocol=pickle.HIGHEST_PROTOCOL, **pickle_kwargs)

File Format Overview – transcripts.parquet

Xenium outputs transcript data in 3 different formats:

  1. transcripts.csv.gz
  2. transcripts.parquet
  3. transcripts.zarr.zip

Depending on the choice of downstream analysis tools, one format may be easier or faster to process than another.

This article will focus on the Apache Parquet file format. Unlike the row-based CSV format where data is stored line by line (row by row), Parquet is column-based. Because data in each column usually share the same data type, Parquet files can be compressed and decompressed very efficiently. In general, this modern file format is much smaller and faster relative to CSV. For a more in-depth comparison between Parquet and CSV, please take a look at this article.

Use the following python script to quickly determine the column names and their associated data types in transcripts.parquet. The script will also print out the first 10 records. Please run this script using the xenium conda environment.

#!/usr/bin/env python3 import pandas as pd data_frame = pd.read_parquet('transcripts.parquet') print(data_frame.dtypes) print(data_frame[:10])

Running the python script returns the following, showing that transcripts.parquet contains 8 columns.

transcript_id uint64 cell_id int32 overlaps_nucleus uint8 feature_name object x_location float32 y_location float32 z_location float32 qv float32 dtype: object transcript_id cell_id overlaps_nucleus feature_name x_location y_location z_location qv 0 281474976710656 153 1 b'Chst9' 467.284912 723.035156 10.867549 13.371499 1 281474976710657 85 1 b'Bhlhe40' 618.211121 717.488708 14.101116 12.122915 2 281474976710658 191 0 b'Wfs1' 598.623840 782.937134 13.952497 40.000000 3 281474976710659 203 0 b'Wfs1' 610.280945 798.482544 13.899248 40.000000 4 281474976710660 199 0 b'Dkk3' 532.640747 800.270020 14.038778 40.000000 5 281474976710661 205 0 b'Gjb2' 412.885162 813.587158 13.887715 40.000000 6 281474976710662 -1 0 b'Dkk3' 535.707947 861.141846 14.190361 12.725308 7 281474976710663 -1 0 b'Lyz2' 340.266418 865.291199 13.883341 40.000000 8 281474976710664 -1 0 b'Car4' 211.336823 879.179504 13.700431 40.000000 9 281474976710665 1342 0 b'Lyz2' 367.774048 890.858215 13.799179 40.000000

Here are a few quick notes about some of these columns:

  • cell_id: If a transcript is not associated with a cell, it will get -1. Cell assignment is based on Xenium's default 2D segmentation mask, and is likely irrelevant if we want to use the 3D segmentation result from Cellpose.
  • x_location, y_location, z_location: These 3 columns store the (x, y, z) coordinates of the transcript in microns. The origin is found in the upper-left corner of the bottom z-slice, as the following image illustrates.
  • qv: Phred-scaled quality score of all decoded transcripts. It is up to the customer to determine a suitable Q-Score threshold when analyzing Xenium data. At 10x Genomics, we typically filter out transcripts with Q-Score < 20.

Reading Cellpose Segmentation Mask

Cellpose produces a *.ome_seg.npy output file that contains the nucleus segmentation results. This file is documented on https://cellpose.readthedocs.io/en/latest/outputs.html.

It is also possible to list all the keys, the datatypes, and dimensions using the following python script. Please run this script using the xenium conda environment.

#!/usr/bin/env python3 import numpy as np from glob import glob fnames = glob('*morphology.ome_seg.npy') for f in fnames: dat = np.load(f, allow_pickle=True).item() for key in dat.keys(): dtype = type(dat[key]) print("Key:", key) print("Dtype:", dtype) if isinstance(dat[key], np.ndarray): print("Dimension:", dat[key].shape) elif isinstance(dat[key], list): print("Length:", len(dat[key])) # empty line between keys print()

Running the script above would produce an output such as the following:

Key: outlines Dtype: <class 'numpy.ndarray'> Dimension: (12, 3125, 2861) Key: masks Dtype: <class 'numpy.ndarray'> Dimension: (12, 3125, 2861) Key: chan_choose Dtype: <class 'list'> Length: 2 Key: img Dtype: <class 'numpy.ndarray'> Dimension: (12, 3125, 2861) Key: ismanual Dtype: <class 'numpy.ndarray'> Dimension: (32845,) Key: filename Dtype: <class 'str'> Key: flows Dtype: <class 'list'> Length: 5 Key: est_diam Dtype: <class 'float'>

Of these, the ndarray stored under the "masks" key will be most relevant to us, since every (z-slice, y-pixel, x-pixel) in the original image that is given to Cellpose will be assigned to a unique cell ID. If 0 is assigned, then that location is dark in the original image and is not associated with a cell based on DAPI nucleus segmentation.

Mapping Transcripts

Putting everything together, the Python script below will perform these tasks:

  1. Process all transcripts from transcripts.parquet, and assign them to cells based on Cellpose nucleus segmentation results.
  2. Perform a neighborhood search for unassigned transcripts. Since Cellpose only performs nucleus segmentation from DAPI signals, many transcripts that are located in the cytoplasm will be in unassigned space. The script uses a heuristic that associates those transcripts with the nearest nucleus if the Euclidean distance is within a user-specified limit. The default nuclear expansion distance is 10 microns, and this can be adjusted via the -nuc_exp parameter.
  3. Generate a feature-cell matrix and output that matrix in MTX format that is compatible with Seurat and Scanpy.

Running the script with -h would display the help text that lists the required and optional parameters. This script should be run using the xenium conda environment. It's also important to specify the correct -pix_size argument depending on the level that was extracted from the image pyramid in the nucleus segmentation section.

#!/usr/bin/env python3 import argparse import csv import os import sys import numpy as np import pandas as pd import re import scipy.sparse as sparse import scipy.io as sio import subprocess # Define constant. # z-slices are 3 microns apart in morphology.ome.tif Z_SLICE_MICRON = 3 def main(): # Parse input arguments. args = parse_args() # Check for existence of input file. if (not os.path.exists(args.cellpose)): print("The specified Cellpose output (%s) does not exist!" % args.cellpose) sys.exit(0) if (not os.path.exists(args.transcript)): print("The specified transcripts.parquet file (%s) does not exist!" % args.transcript) sys.exit(0) # Check if output folder already exist. if (os.path.exists(args.out)): print("The specified output folder (%s) already exists!" % args.out) sys.exit(0) # Define additional constants NUC_EXP_PIXEL = args.nuc_exp / args.pix_size NUC_EXP_SLICE = args.nuc_exp / Z_SLICE_MICRON # Read Cellpose segmentation mask seg_data = np.load(args.cellpose, allow_pickle=True).item() mask_array = seg_data['masks'] # Use regular expression to extract dimensions from mask_array.shape m = re.match("\((?P<z_size>\d+), (?P<y_size>\d+), (?P<x_size>\d+)", str(mask_array.shape)) mask_dims = { key:int(m.groupdict()[key]) for key in m.groupdict() } # Read 5 columns from transcripts Parquet file transcripts_df = pd.read_parquet(args.transcript, columns=["feature_name", "x_location", "y_location", "z_location", "qv"]) # Find distinct set of features. features = np.unique(transcripts_df["feature_name"]) # Create lookup dictionary feature_to_index = dict() for index, val in enumerate(features): feature_to_index[str(val, 'utf-8')] = index # Find distinct set of cells. Discard the first entry which is 0 (non-cell) cells = np.unique(mask_array)[1:] # Create a cells x features data frame, initialized with 0 matrix = pd.DataFrame(0, index=range(len(features)), columns=cells, dtype=np.int32) # Iterate through all transcripts for index, row in transcripts_df.iterrows(): if index % args.rep_int == 0: print(index, "transcripts processed.") feature = str(row['feature_name'], 'utf-8') x = row['x_location'] y = row['y_location'] z = row['z_location'] qv = row['qv'] # Ignore transcript below user-specified cutoff if qv < args.qv_cutoff: continue # Convert transcript locations from physical space to image space x_pixel = x / args.pix_size y_pixel = y / args.pix_size z_slice = z / Z_SLICE_MICRON # Add guard rails to make sure lookup falls within image boundaries. x_pixel = min(max(0, x_pixel), mask_dims["x_size"] - 1) y_pixel = min(max(0, y_pixel), mask_dims["y_size"] - 1) z_slice = min(max(0, z_slice), mask_dims["z_size"] - 1) # Look up cell_id assigned by Cellpose. Array is in ZYX order. cell_id = mask_array[round(z_slice)] [round(y_pixel)] [round(x_pixel)] # If cell_id is 0, Cellpose did not assign the pixel to a cell. Need to perform # neighborhood search. See if nearest nucleus is within user-specified distance. if cell_id == 0: # Define neighborhood boundary for 3D ndarray slicing. Take image boundary into # consideration to avoid negative index. z_neighborhood_min_slice = max(0, round(z_slice-NUC_EXP_SLICE)) z_neighborhood_max_slice = min(mask_dims["z_size"], round(z_slice+NUC_EXP_SLICE+1)) y_neighborhood_min_pixel = max(0, round(y_pixel-NUC_EXP_PIXEL)) y_neighborhood_max_pixel = min(mask_dims["y_size"], round(y_pixel+NUC_EXP_PIXEL+1)) x_neighborhood_min_pixel = max(0, round(x_pixel-NUC_EXP_PIXEL)) x_neighborhood_max_pixel = min(mask_dims["x_size"], round(x_pixel+NUC_EXP_PIXEL+1)) # Call helper function to see if nearest nucleus is within user-specified distance. cell_id = nearest_cell(args, x_pixel, y_pixel, z_slice, x_neighborhood_min_pixel, y_neighborhood_min_pixel, z_neighborhood_min_slice, mask_array[z_neighborhood_min_slice : z_neighborhood_max_slice, y_neighborhood_min_pixel : y_neighborhood_max_pixel, x_neighborhood_min_pixel : x_neighborhood_max_pixel]) # If cell_id is not 0 at this point, it means the transcript is associated with a cell if cell_id != 0: # Increment count in feature-cell matrix matrix.at[feature_to_index[feature], cell_id] += 1 # Call a helper function to create Seurat and Scanpy compatible MTX output write_sparse_mtx(args, matrix, cells, features) #-------------------------- # Helper functions def parse_args(): """Parses command-line options for main().""" summary = 'Map Xenium transcripts to Cellpose segmentation result. \ Generate Seurat/Scanpy-compatible feature-cell matrix.' parser = argparse.ArgumentParser(description=summary) requiredNamed = parser.add_argument_group('required named arguments') requiredNamed.add_argument('-cellpose', required = True, help="The path to the *.ome_seg.npy file produced " + "by Cellpose.") requiredNamed.add_argument('-transcript', required = True, help="The path to the transcripts.parquet file produced " + "by Xenium.") requiredNamed.add_argument('-out', required = True, help="The name of output folder in which feature-cell " + "matrix is written.") requiredNamed.add_argument('-pix_size', required = True, type=float, help="The size of each pixel, in microns, for the " + "image that is passed to Cellpose for nucleus " + "segmentation.") parser.add_argument('-nuc_exp', default='10.0', type=float, help="The expansion distance from the nuclear boundary, " + "in microns, for cell boundary. (default: 10.0)") parser.add_argument('-qv_cutoff', default='20.0', type=float, help="Ignore transcripts with QV score below this " + "threshold. (default: 20.0)") parser.add_argument('-rep_int', default='10000', type=int, help="Reporting interval. Will print message to stdout " + "whenever the specified # of transcripts is processed. " + "(default: 10000)") try: opts = parser.parse_args() except: sys.exit(0) return opts def nearest_cell(args, x_pixel, y_pixel, z_slice, x_neighborhood_min_pixel, y_neighborhood_min_pixel, z_neighborhood_min_slice, mask_array): """Check if nearest nucleus is within user-specified distance. If function returns 0, it means no suitable nucleus was found.""" # Initialize constants NUC_EXP_PIXEL = args.nuc_exp / args.pix_size # For Euclidean distance, we need to convert z-slice to z-micron SLICE_TO_PIXEL = Z_SLICE_MICRON / args.pix_size # When we take a neighborhood slice of mask_array, all indices start at (0,0,0). # This INDEX_SHIFT is necessary to reconstruct coordinates from original mask_array. INDEX_SHIFT = np.array([z_neighborhood_min_slice, y_neighborhood_min_pixel, x_neighborhood_min_pixel]) min_dist = NUC_EXP_PIXEL cell_id = 0 # Enumerate through all points in the neighborhood for index, cell in np.ndenumerate(mask_array): # Current point is not assigned to a nucleus. if cell == 0: continue # Current point IS assigned to a nucleus. But is it within NUC_EXP_PIXEL? else: img_loc = np.asarray(index, dtype=float) + INDEX_SHIFT # Convert from z-slice to "z-pixel" img_loc[0] *= SLICE_TO_PIXEL transcript_loc = np.array([z_slice * SLICE_TO_PIXEL, y_pixel, x_pixel]) # Calculate Euclidean distance between 2 points dist = np.linalg.norm(transcript_loc - img_loc) if dist < min_dist: min_dist = dist cell_id = cell return cell_id def write_sparse_mtx(args, matrix, cells, features): """Write feature-cell matrix in Seurat/Scanpy-compatible MTX format""" # Create the matrix folder. os.mkdir(args.out) # Convert matrix to scipy's COO sparse matrix. sparse_mat = sparse.coo_matrix(matrix.values) # Write matrix in MTX format. sio.mmwrite(args.out + "/matrix.mtx", sparse_mat) # Write cells as barcodes.tsv. File name is chosen to ensure # compatibility with Seurat/Scanpy. with open(args.out + "/barcodes.tsv", 'w', newline='') as tsvfile: writer = csv.writer(tsvfile, delimiter='\t', lineterminator='\n') for cell in cells: writer.writerow(["cell_" + str(cell)]) # Write features as features.tsv. Write 3 columns to ensure # compatibility with Seurat/Scanpy. with open(args.out + "/features.tsv", 'w', newline='') as tsvfile: writer = csv.writer(tsvfile, delimiter='\t', lineterminator='\n') for f in features: feature = str(f, 'utf-8') if feature.startswith("NegControlProbe_") or feature.startswith("antisense_"): writer.writerow([feature, feature, "Negative Control Probe"]) elif feature.startswith("NegControlCodeword_"): writer.writerow([feature, feature, "Negative Control Codeword"]) elif feature.startswith("BLANK_"): writer.writerow([feature, feature, "Blank Codeword"]) else: writer.writerow([feature, feature, "Gene Expression"]) # Seurat expects all 3 files to be gzipped subprocess.run("gzip -f " + args.out + "/*", shell=True) if __name__ == "__main__": main()

If the script above is saved to a file called map_transcripts.py, then a sample invocation might look like the following:

# Activate the xenium conda environment conda activate xenium # Map transcripts to cells ./map_transcripts.py -cellpose level_3_morphology.ome_seg.npy -transcript transcripts.parquet -out feature_cell_matrix -pix_size 1.7

The script spends the majority of time performing neighborhood searches, so the choice of -nuc_exp will greatly impact overall runtime. Larger nuclear expansion distances will allow more transcripts to be associated with a cell, but we expect a roughly cubic growth in runtime since neighborhoods expand in 3 dimensions.

The table below shows the relationship between nuclear expansion distance, cell-associated Q20 transcripts, and script runtime. These results are based on processing the first million transcripts, 806,508 of which have Q-Score >= 20, from a mouse brain dataset. Depending on the nuclear expansion distance, different numbers of transcripts can be associated with a cell (cell-associated Q20 transcript) as shown in the table below. The whole dataset has almost 20 million decoded transcripts, so we can expect a twenty-fold increase in runtime to map all transcripts in this mouse brain sample.

Nuclear Expansion DistanceCell-Associated Q20 TranscriptsScript Runtime
0 micron49,8462 min. 23 sec.
5 microns98,03112 min. 24 sec.
10 microns125,87460 min. 14 sec.
15 microns139,917165 min. 7 sec.

After running the Python transcript mapping script, the resulting feature-cell matrix can be read by popular third-party tools such as Seurat and Scanpy.

To demonstrate compatibility, we were able to follow this Seurat vignette and use the Read10X() function to load the matrix. We produced the following plots, and it should be straightforward to cluster the cells, generate a UMAP projection, and find differentially expressed features.

Likewise, we were able to follow this Scanpy tutorial and use the read_10x_mtx() function to load the matrix. The following plot shows the 20 genes that have the highest fraction of counts in each cell, across all cells, and it should be straightforward to proceed further with the tutorial and run the standard suite of single cell data analysis with this feature-cell matrix.

If you use Cellpose in your publication, please cite Cellpose based on citation guidelines here.

Stay connected with latest technical workflow and software updatesSubscribe to newsletter