Cell Ranger provides pre-built human, mouse, and barnyard (human & mouse) reference packages for read alignment and gene expression quantification in cellranger count. Build notes are available here.
Cell Ranger allows users to create a custom reference package using cellranger mkref. To make a custom reference, you will need a reference genome sequence (FASTA file) and gene annotations (GTF file). A tutorial, Build a Custom Reference (cellranger mkref), is available to walk you through the steps.
Custom references built with previous versions of cellranger mkref can be used with the latest versions of cellranger count or cellranger multi. However, references built with the latest cellranger mkref may not be compatible with all older versions of the pipelines.
Cell Ranger supports the use of customer-generated references under the following conditions:
- Your reference should have only a small number of overlapping gene annotations. Reads aligning non-uniquely to multiple genes cause the pipeline to detect fewer molecules.
- Your FASTA and GTF files must be compatible with STAR, the open-source, splicing-aware RNA-seq aligner. To be considered for transcriptome alignment, genes must have annotations with feature type 'exon' (column 3) in the GTF file.
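One way to check the second condition before building a reference is to tally column 3 of the GTF and confirm that exon rows are present. The sketch below uses plain awk and is not part of Cell Ranger; the function name count_features and the tiny demo GTF are made up for illustration.

```shell
# Hypothetical helper (not a Cell Ranger command): tally the feature types
# found in column 3 of a GTF file, ignoring '#' comment lines.
count_features() {
    awk -F '\t' '!/^#/ && NF >= 3 { n[$3]++ }
                 END { for (f in n) print f "\t" n[f] }' "$1" | sort
}

# Build a tiny made-up GTF so the sketch runs on its own.
demo_dir=$(mktemp -d)
{
    printf '#!genome-build demo\n'
    printf 'chr1\tdemo\tgene\t1\t100\t.\t+\t.\tgene_id "G1";\n'
    printf 'chr1\tdemo\texon\t1\t50\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chr1\tdemo\texon\t60\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
} > "$demo_dir/demo.gtf"

# One line per feature type: name, tab, count.
feature_counts=$(count_features "$demo_dir/demo.gtf")
printf '%s\n' "$feature_counts"
rm -rf "$demo_dir"
```

If the tally shows no exon rows, the annotation would need to be fixed before it can be used for transcriptome alignment.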
GTF files downloaded from sites like Ensembl and UCSC often contain transcripts and genes that need to be filtered out of your final annotation. Cell Ranger provides mkgtf, a simple utility to filter genes based on their key-value pairs in the GTF attribute column. The command syntax requires input and output GTF file names and --attribute values specifying the gene biotypes to keep in the output GTF file:
cellranger mkgtf input.gtf output.gtf --attribute=key:allowable_value
In the command above, the allowable_value can be any of the accepted biotypes listed below:
protein_coding
lncRNA
antisense
IG_C_gene
IG_D_gene
IG_J_gene
IG_LV_gene
IG_V_gene
IG_V_pseudogene
IG_J_pseudogene
IG_C_pseudogene
TR_C_gene
TR_D_gene
TR_J_gene
TR_V_gene
TR_V_pseudogene
TR_J_pseudogene
For example, the following filtering was applied to generate the GTF file for the GRCh38 Cell Ranger reference package:
cellranger mkgtf Homo_sapiens.GRCh38.ensembl.gtf Homo_sapiens.GRCh38.ensembl.filtered.gtf \
--attribute=gene_biotype:protein_coding \
--attribute=gene_biotype:lncRNA \
--attribute=gene_biotype:antisense \
--attribute=gene_biotype:IG_LV_gene \
--attribute=gene_biotype:IG_V_gene \
--attribute=gene_biotype:IG_V_pseudogene \
--attribute=gene_biotype:IG_D_gene \
--attribute=gene_biotype:IG_J_gene \
--attribute=gene_biotype:IG_J_pseudogene \
--attribute=gene_biotype:IG_C_gene \
--attribute=gene_biotype:IG_C_pseudogene \
--attribute=gene_biotype:TR_V_gene \
--attribute=gene_biotype:TR_V_pseudogene \
--attribute=gene_biotype:TR_D_gene \
--attribute=gene_biotype:TR_J_gene \
--attribute=gene_biotype:TR_J_pseudogene \
--attribute=gene_biotype:TR_C_gene
This generated a filtered GTF file Homo_sapiens.GRCh38.ensembl.filtered.gtf from the original unfiltered GTF file Homo_sapiens.GRCh38.ensembl.gtf. In the output file, other biotypes such as gene_biotype:pseudogene are excluded from the GTF annotation.
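Before choosing --attribute values, it can help to see which biotypes an unfiltered GTF actually contains. The following is a minimal sketch using awk, independent of mkgtf; the function name tally_attribute and the demo records are hypothetical. It assumes attributes are written as key "value" pairs, and the key is a parameter because some annotation sources use a different key (for example, GENCODE files use gene_type rather than gene_biotype).

```shell
# Hypothetical helper: count the values of one attribute key (e.g. gene_biotype)
# across the 'gene' rows of a GTF. $1 = GTF path, $2 = attribute key.
tally_attribute() {
    awk -F '\t' -v key="$2" '
        !/^#/ && $3 == "gene" {
            # Look for: key "value" inside the attributes column (column 9).
            if (match($9, key " \"[^\"]+\"")) {
                v = substr($9, RSTART + length(key) + 2, RLENGTH - length(key) - 3)
                n[v]++
            }
        }
        END { for (v in n) print v "\t" n[v] }' "$1" | sort
}

# Tiny made-up GTF with three gene records.
demo_dir=$(mktemp -d)
{
    printf 'chr1\tdemo\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n'
    printf 'chr1\tdemo\tgene\t200\t300\t.\t-\t.\tgene_id "G2"; gene_biotype "pseudogene";\n'
    printf 'chr2\tdemo\tgene\t1\t500\t.\t+\t.\tgene_id "G3"; gene_biotype "protein_coding";\n'
} > "$demo_dir/demo.gtf"

biotype_counts=$(tally_attribute "$demo_dir/demo.gtf" gene_biotype)
printf '%s\n' "$biotype_counts"
rm -rf "$demo_dir"
```

Any biotype in this tally that is not passed to mkgtf through --attribute would be dropped from the filtered file.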
To create custom references, use the cellranger mkref command, passing it one or more matching sets of FASTA and GTF files. This utility copies your FASTA and GTF, indexes these in several formats, and outputs a folder with the name you pass to --genome. Input GTF files are typically filtered with mkgtf prior to mkref.
Argument | Description |
---|---|
--genome | Required. Unique genome name(s), used to name output folder. Should contain only alphanumeric characters and optionally period, hyphen, and underscore characters [a-zA-Z0-9_-]+. Specify multiple genomes by specifying the --genome argument multiple times. |
--fasta | Required. Path(s) to FASTA file containing your genome reference. Specify multiple genomes by specifying the --fasta argument multiple times. |
--genes | Required. Path(s) to GTF file(s) containing annotated genes for your genome reference. Specify multiple genomes with the --genes argument for each genome. |
--memgb | Optional. Maximum memory (GB) used during STAR genome index generation. Defaults to 16. Please note, the amount of memory specified must be greater than the number of gigabases in the input reference FASTA file. |
--ref-version | Optional. Reference version string to include with reference. |
--nthreads | Optional. Number of threads used during STAR genome index generation. Defaults to 1. |
--help or -h | Optional. Show list of all arguments and options. |
--version | Optional. Show version. |
Basic usage
cellranger mkref --genome=output_genome --fasta=input.fa --genes=input.gtf
System requirements
Indexing a typical human 3Gb FASTA file often takes up to 8 core hours and requires 32 GB of memory.
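The --memgb rule in the table above (memory must exceed the number of gigabases in the FASTA) can be checked with a quick base count. This is a rough sketch, not a Cell Ranger feature: fasta_total_bases is a made-up helper, the demo FASTA is invented, and the result is only the lower bound implied by that rule, not a recommended setting.

```shell
# Hypothetical helper: total number of sequence characters in a FASTA file
# (header lines starting with '>' are skipped).
fasta_total_bases() {
    awk '!/^>/ { total += length($0) } END { printf "%.0f\n", total }' "$1"
}

# Tiny made-up FASTA: 8 + 4 = 12 bases.
demo_dir=$(mktemp -d)
printf '>chr1 demo\nACGTACGT\n>chr2 demo\nACGT\n' > "$demo_dir/demo.fa"

total_bases=$(fasta_total_bases "$demo_dir/demo.fa")

# Smallest whole number of GB that is strictly greater than the size in gigabases.
min_memgb=$(( total_bases / 1000000000 + 1 ))

echo "total bases: $total_bases"
echo "--memgb must be at least: $min_memgb"
rm -rf "$demo_dir"
```

For a 3.1 Gb genome this arithmetic gives 4, which is far below the 32 GB quoted above for a typical human index, so treat it as a sanity check rather than a sizing guide.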
Outputs
A successful mkref run should conclude with a message similar to this:
Creating new reference folder at output_genome
...done
Writing genome FASTA file into reference folder...
...done
Computing hash of genome FASTA file...
...done
Writing genes GTF file into reference folder...
WARNING: The following transcripts appear on multiple chromosomes in the GTF:
This can indicate a problem with the reference or annotations. Only the first chromosome will be counted.
...done
Computing hash of genes GTF file..
...done
Writing genes index file into reference folder (may take over 10 minutes for a 3Gb genome)..
...done
Writing genome metadata JSON file into reference folder...
...done
Generating STAR genome index (may take over 8 core hours for a 3Gb genome)...
...done.
>>> Reference successfully created! <<<
Output listing:
output_genome
├── fasta
│   ├── genome.fa
│   └── genome.fa.fai
├── genes
│   └── genes.gtf.gz
├── reference.json
└── star
    ├── chrLength.txt
    ├── chrNameLength.txt
    ├── chrName.txt
    ├── chrStart.txt
    ├── exonGeTrInfo.tab
    ├── exonInfo.tab
    ├── geneInfo.tab
    ├── Genome
    ├── genomeParameters.txt
    ├── SA
    ├── SAindex
    ├── sjdbInfo.txt
    ├── sjdbList.fromGTF.out.tab
    ├── sjdbList.out.tab
    └── transcriptInfo.tab
The most common use case is to create a reference for only one species. In this case, there is one set of matched FASTA and GTF files typically obtained from Ensembl, NCBI, or UCSC.
cellranger mkref --genome=GRCh38 --fasta=GRCh38.fa --genes=GRCh38-filtered-ensembl.gtf
When possible, please obtain genome sequence (FASTA) and gene annotations (GTF) from the same source: Use Ensembl FASTA files with Ensembl GTF files. Chromosome or sequence names in the FASTA file must match the chromosome or sequence names in the GTF file.
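A mismatch in sequence names (for example, 1 in the FASTA but chr1 in the GTF) can be caught before running mkref. The sketch below is one possible check using awk; the function name names_missing_from_fasta and the demo files are hypothetical. It assumes the FASTA sequence name is the text between '>' and the first whitespace.

```shell
# Hypothetical helper: print each GTF sequence name (column 1) that has no
# matching FASTA header. $1 = FASTA path, $2 = GTF path.
names_missing_from_fasta() {
    awk 'NR == FNR { if (/^>/) { sub(/^>/, "", $1); seen[$1] = 1 } next }
         !/^#/ && NF > 0 && !($1 in seen) && !reported[$1]++ { print $1 }' "$1" "$2"
}

# Made-up inputs: the FASTA names its chromosome "1", the GTF calls it "chr1".
demo_dir=$(mktemp -d)
printf '>1 dna:chromosome\nACGTACGT\n>MT dna:chromosome\nACGT\n' > "$demo_dir/demo.fa"
{
    printf 'chr1\tdemo\texon\t1\t4\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf 'chr1\tdemo\texon\t5\t8\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
    printf 'MT\tdemo\texon\t1\t4\t.\t+\t.\tgene_id "G2"; transcript_id "T2";\n'
} > "$demo_dir/demo.gtf"

missing=$(names_missing_from_fasta "$demo_dir/demo.fa" "$demo_dir/demo.gtf")
echo "GTF sequence names not found in FASTA: ${missing:-none}"
rm -rf "$demo_dir"
```

An empty result means every sequence name used in the GTF also appears in the FASTA.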
As noted in the STAR manual, the most comprehensive genome sequence and annotations are recommended:
- For the genome sequence, include all major chromosomes, unplaced and unlocalized scaffolds, but do not include patches and alternative haplotypes.
  - In Ensembl, the recommended genome file to download is annotated as "primary assembly."
  - In NCBI, it is "no alternative - analysis set."
- For the GTF file, genes must be annotated with feature type "exon" (column 3).
  - Prior to mkref, GTF annotation files from Ensembl and NCBI are typically filtered with mkgtf to include only a subset of the annotated gene biotypes.
To create a reference for multiple species, run the mkref command with multiple FASTA and GTF files. This is similar to the single species case above, but note that the order of the arguments matters. The arguments are grouped by the order they appear; for instance, the first --genome option listed corresponds to the first --fasta and --genes options listed. Please use or create this type of reference when analyzing barnyard validation experiments for estimating multiplet rates.
cellranger mkref --genome=GRCh38 --fasta=GRCh38.fa --genes=GRCh38-filtered-ensembl.gtf \
--genome=mm10 --fasta=mm10.fa --genes=mm10-filtered-ensembl.gtf
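Because the grouping depends on argument order, one way to avoid mismatched sets is to keep each genome name, FASTA, and GTF on a single line and generate the flags from that list. The sketch below only prints the resulting command for review and does not run Cell Ranger; the variable names are illustrative and the file names are the ones from the example above.

```shell
# Build the mkref arguments from "name fasta gtf" triplets, one species per line,
# so each --genome stays next to its own --fasta and --genes.
mkref_args=""
while read -r name fasta gtf; do
    mkref_args="$mkref_args --genome=$name --fasta=$fasta --genes=$gtf"
done <<'EOF'
GRCh38 GRCh38.fa GRCh38-filtered-ensembl.gtf
mm10 mm10.fa mm10-filtered-ensembl.gtf
EOF

# Print the command instead of executing it.
mkref_cmd="cellranger mkref$mkref_args"
echo "$mkref_cmd"
```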
Provided that you follow the format described above, it is fairly simple to add custom gene definitions to an existing reference. First, add the additional FASTA sequence records to the fasta/genome.fa file. Next, update the GTF file, genes/genes.gtf, with the gene annotation record(s). An example is described in the cellranger mkref tutorial for adding a marker gene to the FASTA and GTF files.
When setting up your experiment, consider incorporating the UTR sequence, especially the 3' UTR, into the marker gene. The 10x Genomics Gene Expression assays target transcripts through their poly-A tails, and the 3' Gene Expression assays focus on the 3' ends of transcripts to create sequencing library inserts. Consequently, reads are expected to align towards the 3' end of a transcript, extending into the UTRs. If the UTR sequence is not unique (i.e., recapitulates a sequence in another transcript), it is crucial to include this UTR in your analysis to discount all reads aligning to it and the other locus. Failing to do so can lead to artificially inflated counts at these other loci, while underrepresenting counts for the intended marker gene. For more details, refer to the 10x Genomics Knowledge Base article.
The GTF file format is essentially a list of records, one per line, each comprising nine tab-delimited non-empty fields.
Column | Name | Description |
---|---|---|
1 | Chromosome | Must refer to a chromosome/contig in the genome fasta. |
2 | Source | Unused. |
3 | Feature | cellranger count only uses rows where this field is exon. |
4 | Start | Start position on the reference (1-based inclusive). |
5 | End | End position on the reference (1-based inclusive). |
6 | Score | Unused. |
7 | Strand | Strandedness of this feature on the reference: + or -. |
8 | Frame | Unused. |
9 | Attributes | A semicolon-delimited list of key-value pairs of the form key "value". The attribute keys transcript_id and gene_id are required; gene_name is optional and may be non-unique, but if present will be preferentially displayed in reports. |
After adding the necessary records to your FASTA file and the additional lines to your GTF file, run cellranger mkref.
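As an illustration of the record format in the table above, the sketch below writes a FASTA record and a matching single-exon GTF line for a marker gene, then checks the field count. Everything here is hypothetical: the name MARKER, the 20-base sequence, the "custom" source label, and the temporary file names. It assumes the marker is added as its own sequence record with one exon spanning its full length.

```shell
work=$(mktemp -d)

# Made-up 20-base sequence standing in for a real marker construct.
printf '>MARKER\nACGTACGTACGTACGTACGT\n' > "$work/marker.fa"

# Sequence length, used as the End coordinate of an exon covering the whole record.
marker_len=$(grep -v '^>' "$work/marker.fa" | tr -d '\n' | wc -c | tr -d ' ')

# One exon row with nine tab-delimited fields; gene_id and transcript_id are required.
printf 'MARKER\tcustom\texon\t1\t%s\t.\t+\t.\tgene_id "MARKER"; transcript_id "MARKER"; gene_name "MARKER";\n' \
    "$marker_len" > "$work/marker.gtf"

# Sanity checks: nine fields, and both required attribute keys present.
n_fields=$(awk -F '\t' '{ print NF }' "$work/marker.gtf")
has_ids=no
if grep -q 'gene_id "' "$work/marker.gtf" && grep -q 'transcript_id "' "$work/marker.gtf"; then
    has_ids=yes
fi

echo "length=$marker_len fields=$n_fields required_ids=$has_ids"
cat "$work/marker.gtf"
rm -rf "$work"
```

In a real workflow these two records would be appended to copies of the genome FASTA and the GTF before rerunning mkref.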
The single-nuclei RNA-seq assay captures unspliced pre-mRNA as well as mature mRNA. However, after alignment, cellranger count only counts reads aligned to exons. Since the pre-mRNA will generate intronic reads, it may be useful to count these reads as well. Previously, it was recommended to create a custom "pre-mRNA" reference package, listing each gene transcript locus as an exon, in order to count intronic reads. Cell Ranger v5.0 introduced an include-introns option for counting intronic reads; use it instead, because pre-mRNA references are deprecated.
A read may align to multiple transcripts and genes, but Cell Ranger only considers a read confidently mapped to the transcriptome if it is mapped to a single gene. In the BAM file this is recorded in the xf tag: after converting the xf value to binary, the bit with value 1 being set means the read is confidently mapped to the transcriptome.
To assess whether reads mapped to multiple genes, examine the GX or GN tags in the output BAM file, which are generated by Cell Ranger after alignment with STAR. Uniquely mapped reads will have one gene ID for GX and one gene name for GN, while multi-mapped reads will list multiple gene IDs and names.
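The tag inspection described above can be sketched with awk on SAM-formatted text. This is an illustration rather than a Cell Ranger tool: classify_tags is a made-up helper, the three SAM records are invented, and it assumes that multiple gene IDs in GX are separated by semicolons. On real data, the input would typically come from something like samtools view on the output BAM.

```shell
# Hypothetical helper: read SAM records on stdin and print, per read,
# the GX gene-count class and whether the lowest xf bit (value 1) is set.
classify_tags() {
    awk -F '\t' '!/^@/ {
        gx = ""; xf = 0
        # Optional tags start at field 12 in SAM.
        for (i = 12; i <= NF; i++) {
            if ($i ~ /^GX:Z:/) gx = substr($i, 6)
            if ($i ~ /^xf:i:/) xf = substr($i, 6) + 0
        }
        if (gx == "") label = "no_gene"
        else if (index(gx, ";") > 0) label = "multi_gene"
        else label = "single_gene"
        print $1 "\t" label "\t" (xf % 2 == 1 ? "yes" : "no")
    }'
}

# Three made-up SAM records: one gene, two genes, and no gene annotation.
tag_report=$(
    {
        printf 'r1\t0\tchr1\t100\t255\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\tGX:Z:ENSG01\tGN:Z:A\txf:i:25\n'
        printf 'r2\t0\tchr1\t200\t3\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\tGX:Z:ENSG01;ENSG02\tGN:Z:A;B\txf:i:0\n'
        printf 'r3\t0\tchr2\t300\t255\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\txf:i:0\n'
    } | classify_tags
)
printf '%s\n' "$tag_report"
```

The third column reports whether the read would count as confidently mapped to the transcriptome under the xf rule described above.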
If you encounter a crash while running cellranger mkref, upload the tarball (with the extension .mri.tgz) found in your output directory. Customize the command below with your email:
cellranger upload [email protected] genome_id.mri.tgz
Where genome_id is what you input into the --genome option of mkref. This tarball contains numerous diagnostic logs that 10x Genomics support can use for debugging. You will receive an automated email from 10x Genomics. If not, email [email protected]. For the fastest service, respond with the following:
- The exact cellranger command you used
- The sample sheet that you used
- The RunInfo.xml and runParameters.xml files from your BCL directory
- The kind of libraries you are demultiplexing (including chemistry)