The cellranger-arc pipeline outputs an HDF5 file containing per-molecule information for all molecules that contain a valid barcode and valid UMI and were assigned with high confidence to a gene. This HDF5 file contains data corresponding to the observed molecules, as well as data about the libraries, features, and barcode lists used for the analysis.
(root)
├─ barcode_idx
├─ barcode_info	[HDF5 group]
│   ├─ genomes
│   └─ pass_filter
├─ barcodes
├─ count
├─ feature_idx
├─ features	[HDF5 group]
│   ├─ _all_tag_keys
│   ├─ feature_type
│   ├─ genome
│   ├─ id
│   └─ name
├─ gem_group
├─ library_idx
├─ library_info
├─ metrics_json
├─ umi
└─ umi_type
The following HDF5 datasets in the molecule information file correspond to columns of a table. Each row of that table corresponds to a unique molecule specified by a (UMI, cell-barcode, feature) tuple. The feature in this tuple is the one best supported by the reads (including PCR duplicates) assigned to that unique pairing of UMI and 10x Barcode.
| Column | Type | Description | 
|---|---|---|
| barcode_idx | uint64 | A zero-based index into the barcodes dataset (see next section), indicating the 10x Barcode sequence assigned to this putative molecule. | 
| count | uint32 | Number of reads associated with this putative molecule that were confidently mapped to the assigned feature. | 
| feature_idx | uint32 | A zero-based index into the features HDF5 group (see next section), indicating the feature to which this putative molecule was assigned. | 
| gem_group | uint16 | Integer label that distinguishes data derived from distinct 10x Genomics GEM reactions (such as different chips or chip channels). | 
| library_idx | uint16 | A zero-based index into the library_info array (see next section) that distinguishes data coming from distinct 10x Genomics libraries. For the Chromium Single Cell Multiome ATAC + Gene Expression assay, only one library can be associated with a single GEM well. | 
| umi | uint32 | 2-bit encoded (see note below) processed (i.e. corrected) UMI sequence. | 
| umi_type | uint32 | A boolean array specifying whether the molecule aligned to an exonic (1) or intronic (0) region of the associated feature. | 
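As a minimal sketch of how these columns fit together, the Python below resolves the index columns of one row against the lookup datasets and tallies molecules per barcode. It works on plain Python sequences so that it runs without an HDF5 library; in practice the arrays would first be read from the file with an HDF5 reader. The function names `resolve_molecule` and `molecules_per_barcode`, and all toy values (including the barcode strings and feature IDs), are hypothetical and not part of the pipeline.

```python
from collections import Counter


def resolve_molecule(row, columns, barcodes, feature_ids):
    """Describe the molecule stored in table row `row`.

    `columns` maps column names (barcode_idx, count, ...) to equal-length
    sequences; `barcodes` and `feature_ids` are the lookup lists that the
    index columns point into.
    """
    return {
        "barcode": barcodes[columns["barcode_idx"][row]],
        "feature_id": feature_ids[columns["feature_idx"][row]],
        "gem_group": columns["gem_group"][row],
        "library_idx": columns["library_idx"][row],
        "reads": columns["count"][row],
        "umi": columns["umi"][row],  # still 2-bit encoded
    }


def molecules_per_barcode(columns, barcodes):
    """Count table rows (unique molecules) for each observed barcode."""
    tallies = Counter(columns["barcode_idx"])
    return {barcodes[idx]: n for idx, n in tallies.items()}


# Toy data standing in for datasets read from the file (made-up values).
toy_barcodes = ["AAACAGCCAAACAACA", "AAACAGCCAAACATAG", "AAACAGCCAAACCCTA"]
toy_feature_ids = ["GENE_A", "GENE_B"]
toy_columns = {
    "barcode_idx": [0, 0, 2],
    "count": [3, 1, 7],
    "feature_idx": [1, 0, 1],
    "gem_group": [1, 1, 1],
    "library_idx": [0, 0, 0],
    "umi": [27, 5, 900],
}

molecule = resolve_molecule(2, toy_columns, toy_barcodes, toy_feature_ids)
per_barcode = molecules_per_barcode(toy_columns, toy_barcodes)
```

Barcodes that appear in the barcodes list but were never observed (the second toy barcode here) simply have no entry in the tally.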
The barcodes and library_info datasets provide information about the experiments contained in this analysis.
| Dataset | Type | Description | 
|---|---|---|
| barcodes | string | A list of all 10x Barcodes associated with this experiment (including those that were not observed). The barcode_idx column described in the previous section contains indices into this list of barcodes. | 
| library_info | string | A JSON-formatted array of objects, where each object contains metadata for a single library. Each library will at a minimum contain the metadata library_id, library_type, and gem_group. | 
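Because library_info is a JSON array, it can be parsed with a standard JSON library once its contents have been read from the file. The sketch below assumes the contents are already available as a single `str` or `bytes` value; how that value is extracted depends on the HDF5 reader used. The function name `parse_library_info` and the toy JSON values are hypothetical.

```python
import json


def parse_library_info(raw):
    """Parse the library_info JSON and key each library by its position.

    The position in the array is what the library_idx column refers to.
    """
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    libraries = json.loads(raw)
    return {idx: library for idx, library in enumerate(libraries)}


# Made-up example containing only the three guaranteed keys.
toy_raw = b'[{"library_id": 0, "library_type": "Gene Expression", "gem_group": 1}]'
toy_libraries = parse_library_info(toy_raw)
```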
The HDF5 group barcode_info provides information regarding the barcodes that were called as cells during the analysis. This HDF5 group contains two datasets.
| Dataset | Type | Description | 
|---|---|---|
| genomes | string | A list of all genome references used in this analysis. In most cases, this will be a single genome. | 
| pass_filter | uint64 | A matrix with three columns that contains one row per cell-barcode. Each row is a tuple (barcode_idx, library_idx, genome_idx), where genome_idx is an index into the genomes dataset. | 
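The pass_filter rows can be turned into a set of cell-calling barcodes per genome by following the two indices in each row. This is a sketch on plain Python sequences with made-up values; the function name `cell_barcodes_by_genome` is hypothetical, and the matrix is assumed to have been read into an iterable of three-element rows.

```python
def cell_barcodes_by_genome(pass_filter, barcodes, genomes):
    """Group cell-associated barcode sequences by genome name.

    Each pass_filter row is (barcode_idx, library_idx, genome_idx).
    """
    cells = {genome: set() for genome in genomes}
    for barcode_idx, _library_idx, genome_idx in pass_filter:
        cells[genomes[genome_idx]].add(barcodes[barcode_idx])
    return cells


# Toy inputs (made-up values).
toy_genomes = ["GRCh38"]
toy_barcode_list = ["AAACAGCCAAACAACA", "AAACAGCCAAACATAG", "AAACAGCCAAACCCTA"]
toy_pass_filter = [(0, 0, 0), (2, 0, 0)]

toy_cells = cell_barcodes_by_genome(toy_pass_filter, toy_barcode_list, toy_genomes)
```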
The HDF5 group features contains information regarding the feature reference used for the analysis. The datasets within the features group represent columns of a table containing one row per feature. Values in the feature_idx column described in the previous section provide indices into the rows of this table.
In addition to the columns described below, _all_tag_keys contains a list of built-in tags (genome).
| Column | Type | Description | 
|---|---|---|
| feature_type | string | The type of feature reference to which this feature belongs (Gene Expression). | 
| genome | string | The genome reference for a given feature (e.g., "GRCh38" or "mm10"). | 
| id | string | The Ensembl gene ID corresponding to this feature. | 
| name | string | The common gene symbol associated with each of the above ids. | 
The UMI sequences are 2-bit encoded as follows:
- Each pair of bits encodes a nucleotide (0="A", 1="C", 2="G", 3="T").
- The least significant byte (LSB) contains the 3'-most nucleotides.
Note that the cell-barcode sequences do not have this encoding. Instead, they are stored as plain strings in the barcodes HDF5 dataset.
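The encoding described above can be reversed with a few bit operations. The sketch below assumes that, within the integer, the lowest two bits hold the 3'-most nucleotide, which is how the rule about the least significant byte is read here. The UMI length is not recoverable from the integer itself (leading "A" nucleotides encode as zero bits), so the caller must supply it. The function name `decode_umi` is hypothetical.

```python
NUCLEOTIDES = "ACGT"  # index equals the 2-bit code: 0=A, 1=C, 2=G, 3=T


def decode_umi(value, length):
    """Decode a 2-bit encoded UMI integer into a 5'-to-3' string."""
    if value < 0 or value >> (2 * length) != 0:
        raise ValueError("value does not fit in a UMI of the given length")
    bases = []
    for _ in range(length):
        bases.append(NUCLEOTIDES[value & 0b11])  # lowest pair = 3'-most base
        value >>= 2
    return "".join(reversed(bases))  # collected 3'-to-5', so flip the order


# 27 is 0b00011011, i.e. the pairs 00 01 10 11 read from the 5' end.
example = decode_umi(27, 4)  # "ACGT"
```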
The metrics_json dataset contains pipeline metrics in JSON format that are used internally by Cell Ranger. Users should view metrics using the Cell Ranger ARC metrics outputs.