Space Ranger provides pre-built human (GRCh38) and mouse (mm10) reference packages for read alignment and gene expression quantification.
To create and use a custom reference package, Space Ranger requires a reference genome sequence (FASTA file) and gene annotations (GTF file).
Space Ranger supports the use of customer-generated references under the following conditions:
- Your reference should have only a small number of overlapping gene annotations.
  - Reads aligning non-uniquely to multiple genes cause the pipeline to detect fewer molecules.
- Your FASTA and GTF files must be compatible with the open source splicing-aware RNA-seq aligner, STAR.
  - To be considered for transcriptome alignment, genes must have annotations with feature type 'exon' (column 3) in the GTF file.
Space Ranger does not support the use of customer-generated references in combination with Visium for FFPE probe sets.
To create a custom reference:
- Filter the GTF file with mkgtf to contain only genes of interest.
- Index the FASTA and GTF files with mkref.
Example use cases:
GTF files downloaded from sites like Ensembl and UCSC often contain transcripts and genes that need to be filtered from your final annotation. Space Ranger provides spaceranger mkgtf, a simple utility to filter genes based on their key-value pairs in the GTF attribute column. The command syntax requires input and output GTF file names and --attribute values specifying the gene biotypes to keep in the filtered GTF file (replace values in bold):
In the command above, the
allowable_value can be any of the accepted biotypes listed below:
For example, the following filtering was applied to generate the GTF file for the GRCh38 Space Ranger reference package:
$ spaceranger mkgtf Homo_sapiens.GRCh38.ensembl.gtf Homo_sapiens.GRCh38.ensembl.filtered.gtf \
    --attribute=gene_biotype:protein_coding \
    --attribute=gene_biotype:lincRNA \
    --attribute=gene_biotype:antisense \
    --attribute=gene_biotype:IG_LV_gene \
    --attribute=gene_biotype:IG_V_gene \
    --attribute=gene_biotype:IG_V_pseudogene \
    --attribute=gene_biotype:IG_D_gene \
    --attribute=gene_biotype:IG_J_gene \
    --attribute=gene_biotype:IG_J_pseudogene \
    --attribute=gene_biotype:IG_C_gene \
    --attribute=gene_biotype:IG_C_pseudogene \
    --attribute=gene_biotype:TR_V_gene \
    --attribute=gene_biotype:TR_V_pseudogene \
    --attribute=gene_biotype:TR_D_gene \
    --attribute=gene_biotype:TR_J_gene \
    --attribute=gene_biotype:TR_J_pseudogene \
    --attribute=gene_biotype:TR_C_gene
This generated the filtered GTF file Homo_sapiens.GRCh38.ensembl.filtered.gtf from the original unfiltered GTF file Homo_sapiens.GRCh38.ensembl.gtf. In the output file, other biotypes such as gene_biotype:pseudogene are excluded from the GTF annotation.
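Before choosing --attribute values, it can help to see which biotypes an annotation actually contains. The following is a minimal sketch and not part of Space Ranger: count_biotypes is a hypothetical helper name, and it assumes an uncompressed, Ensembl-style GTF whose attribute column carries a gene_biotype key (annotations from other sources may use a different key).

```shell
# Hypothetical helper (not a Space Ranger command): tally gene_biotype values
# for records whose feature type (column 3) is "gene".
# Assumes an uncompressed GTF with a gene_biotype "..." attribute in column 9.
count_biotypes() {
    awk -F'\t' '
        $0 !~ /^#/ && $3 == "gene" {
            if (match($9, /gene_biotype "[^"]*"/)) {
                # Strip the key and the quotes: gene_biotype "X" -> X
                value = substr($9, RSTART + 14, RLENGTH - 15)
                counts[value]++
            }
        }
        END { for (v in counts) printf "%d\t%s\n", counts[v], v }
    ' "$1" | sort -k1,1nr -k2,2
}

# Usage sketch:
#   count_biotypes Homo_sapiens.GRCh38.ensembl.gtf
```

The output is one line per biotype, most frequent first, which gives a quick view of what a given set of --attribute values would leave out.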
To create custom references, use the spaceranger mkref command, passing it one or more matching sets of FASTA and GTF files. This utility copies your FASTA and GTF, indexes these in several formats, and outputs a folder with the name you pass to --genome. Input GTF files are typically filtered with spaceranger mkgtf prior to running spaceranger mkref.
| --genome | Required. Unique genome name(s), used to name output folder. Should contain only alphanumeric characters and optionally period, hyphen, and underscore characters [a-zA-Z0-9_-]+. Specify multiple genomes by specifying the --genome argument multiple times. |
| --fasta | Required. Path(s) to FASTA file containing your genome reference. Specify multiple genomes by specifying the --fasta argument multiple times. |
| --genes | Required. Path(s) to genes GTF file(s) containing annotated genes for your genome reference. Specify multiple genomes by specifying the --genes argument multiple times. |
| --nthreads | Optional. Number of threads used during STAR genome index generation. Defaults to 1. |
| --memgb | Optional. Maximum memory (GB) used during STAR genome index generation. Defaults to 16. Please note, the amount of memory specified must be greater than the number of gigabases in the input reference FASTA file. |
| --ref-version | Optional. Reference version string to include with reference. |
The command syntax requires defining input FASTA file and the GTF file along with a unique name for the genome output folder (replace values in bold):
Indexing a typical human 3Gb FASTA file often takes up to 8 core hours and requires 32 GB of memory. 10x Genomics recommends running the spaceranger mkref command with --nthreads equal to the number of cores available on your system.
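Because the memory setting must exceed the number of gigabases in the input FASTA, it can be useful to measure the FASTA before indexing. The following is a minimal sketch under stated assumptions: fasta_total_bases is a hypothetical helper name, the FASTA is uncompressed, and nproc (GNU coreutils) is available for the core count.

```shell
# Hypothetical helper: total sequence length of a FASTA file, in bases.
# Header lines (starting with ">") are skipped; a trailing carriage return,
# if present, is not counted.
fasta_total_bases() {
    awk '!/^>/ { sub(/\r$/, ""); total += length($0) }
         END   { printf "%.0f\n", total }' "$1"
}

# Usage sketch (genome.fa is a placeholder):
#   bases=$(fasta_total_bases genome.fa)
#   echo "Gigabases, rounded up: $(( (bases + 999999999) / 1000000000 ))"
#   echo "Cores available: $(nproc)"
```

The rounded-up gigabase count is only a lower bound for the memory setting, not a recommendation.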
A spaceranger mkref run should conclude with a message similar to this:
Creating new reference folder at output_genome
...done

Writing genome FASTA file into reference folder...
...done

Computing hash of genome FASTA file...
...done

Writing genes GTF file into reference folder...
WARNING: The following transcripts appear on multiple chromosomes in the GTF:

This can indicate a problem with the reference or annotations. Only the first chromosome will be counted.
...done

Computing hash of genes GTF file...
...done

Writing genes index file into reference folder (may take over 10 minutes for a 3Gb genome)...
...done

Writing genome metadata JSON file into reference folder...
...done

Generating STAR genome index (may take over 8 core hours for a 3Gb genome)...
...done.

>>> Reference successfully created! <<<
output_genome
├── fasta
│   ├── genome.fa
│   └── genome.fa.fai
├── genes
│   └── genes.gtf.gz
├── reference.json
└── star
    ├── chrLength.txt
    ├── chrNameLength.txt
    ├── chrName.txt
    ├── chrStart.txt
    ├── exonGeTrInfo.tab
    ├── exonInfo.tab
    ├── geneInfo.tab
    ├── Genome
    ├── genomeParameters.txt
    ├── SA
    ├── SAindex
    ├── sjdbInfo.txt
    ├── sjdbList.fromGTF.out.tab
    ├── sjdbList.out.tab
    └── transcriptInfo.tab
The most common use case is to create a reference for only one species. In this case, there is one set of matched FASTA and GTF files typically obtained from Ensembl, NCBI, or UCSC.
$ spaceranger mkref --genome=hg19 --fasta=hg19.fa --genes=hg19-filtered-ensembl.gtf
When possible, obtain genome sequence (FASTA) and gene annotations (GTF) from the same source: Use Ensembl FASTA files with Ensembl GTF files. Chromosome or sequence names in the FASTA file must match the chromosome or sequence names in the GTF file.
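A quick way to catch mismatched naming (for example, "chr1" in one file and "1" in the other) is to list the sequence names used by the GTF that do not occur in the FASTA. The following is a minimal sketch, not a Space Ranger feature: gtf_names_missing_from_fasta is a hypothetical helper name, and both files are assumed to be uncompressed, with a non-empty FASTA.

```shell
# Hypothetical helper: print each sequence name used in column 1 of the GTF
# that has no matching header in the FASTA.
# Usage: gtf_names_missing_from_fasta genome.fa genes.gtf
gtf_names_missing_from_fasta() {
    awk -F'\t' '
        # First file (FASTA): remember the name after ">" up to the first
        # whitespace character.
        NR == FNR {
            if (/^>/) {
                split(substr($0, 2), parts, /[ \t\r]/)
                seen[parts[1]] = 1
            }
            next
        }
        # Second file (GTF): report each unknown name once, skipping comments.
        !/^#/ && NF && !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$1" "$2"
}
```

Empty output means every name in the GTF was found in the FASTA.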
As noted in the STAR manual, the most comprehensive genome sequence and annotations are recommended:
- For the genome sequence, include all major chromosomes, unplaced and unlocalized scaffolds, but do not include patches and alternative haplotypes.
  - In Ensembl, the recommended genome file to download is annotated as "primary assembly."
  - In NCBI, it is "no alternative - analysis set."
- For the GTF file, genes must be annotated with feature type 'exon' (column 3).
  - Prior to mkref, GTF annotation files from Ensembl and NCBI are typically filtered with mkgtf to include only a subset of the annotated gene biotypes.
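The 'exon' requirement can be checked directly on column 3 of the GTF. The following is a minimal sketch, assuming an uncompressed GTF. Both function names are hypothetical and are not part of Space Ranger.

```shell
# Hypothetical helper: count GTF records per feature type (column 3).
count_feature_types() {
    awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ }
                END { for (t in n) printf "%s\t%d\n", t, n[t] }' "$1" | sort
}

# Hypothetical helper: succeed (exit status 0) only if the GTF contains at
# least one record with feature type "exon".
has_exon_records() {
    awk -F'\t' '!/^#/ && $3 == "exon" { found = 1; exit }
                END { exit !found }' "$1"
}

# Usage sketch (genes.gtf is a placeholder):
#   count_feature_types genes.gtf
#   has_exon_records genes.gtf || echo "no exon records found" >&2
```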
To create a reference for multiple species, run the spaceranger mkref command with multiple FASTA and GTF files. This is similar to the single species case above, but note that the order of the arguments matters. The arguments are grouped by the order in which they appear. For instance, the first --genome option listed corresponds to the first --fasta and --genes options listed. Multiple species references could be useful in the Visium Spatial Gene Expression Solution when the tissue slide contains cells from multiple organisms. For instance, this can happen in a mouse that has been engrafted with human cells.
$ spaceranger mkref --genome=hg19 --fasta=hg19.fa --genes=hg19-filtered-ensembl.gtf \
    --genome=mm10 --fasta=mm10.fa --genes=mm10-filtered-ensembl.gtf
Provided that you follow the format described above, it is fairly simple to add custom gene definitions to an existing reference. First, add the additional FASTA sequence records to the
fasta/genome.fa file. Next, update the GTF file,
genes/genes.gtf, with the gene annotation record(s).
The GTF file format is essentially a list of records, one per line, each comprising nine tab-delimited non-empty fields.
| 1 | Chromosome | Must refer to a chromosome/contig in the genome fasta. |
| 4 | Start | Start position on the reference (1-based inclusive). |
| 5 | End | End position on the reference (1-based inclusive). |
| 7 | Strand | Strandedness of this feature on the reference: + or -. |
| 9 | Attributes | A semicolon-delimited list of key-value pairs of the form key "value". |
After adding the necessary records to your FASTA file and the additional lines to your GTF file, run spaceranger mkref as normal.
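As an illustration of the record layout, the sketch below prints one nine-field, tab-delimited exon line that spans an entire added sequence. It is a hedged example only: make_exon_record is a hypothetical helper name, and the source label "custom", the single-exon layout, and the attribute keys (gene_id, transcript_id, gene_name) are assumptions chosen for illustration rather than requirements stated above.

```shell
# Hypothetical helper: print one GTF exon record covering a whole added
# sequence. Arguments: sequence name, length in bases, strand (default "+").
# Columns 6 and 8 are filled with "." so that no field is empty.
make_exon_record() {
    printf '%s\tcustom\texon\t1\t%d\t.\t%s\t.\tgene_id "%s"; transcript_id "%s"; gene_name "%s";\n' \
        "$1" "$2" "${3:-+}" "$1" "$1" "$1"
}

# Usage sketch (GFP and 922 are placeholders for your own sequence):
#   make_exon_record GFP 922 >> genes.gtf
```

The sequence name passed as the first argument must match the header of the record added to the FASTA file.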
A read may align to multiple transcripts and genes, but Space Ranger only considers a read confidently mapped to the transcriptome if it is mapped to a single gene. This is recorded in the xf tag: when the xf value is written in binary, a set 1-bit means the read is confidently mapped to the transcriptome.
To assess whether reads mapped to multiple genes, examine the GX and GN tags in the output BAM file, which are generated by Space Ranger after alignment with STAR. Uniquely mapped reads will have one gene ID for GX and one gene name for GN, while multi-mapped reads will list multiple gene IDs and names.
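These checks can be sketched on SAM-format text. The example below is illustrative only: both function names are hypothetical, the input is assumed to be SAM text on standard input (for instance from samtools view, if installed), and multiple gene IDs within a GX value are assumed to be separated by semicolons.

```shell
# Succeeds when the lowest bit of an xf value is set, which the text above
# describes as "confidently mapped to the transcriptome".
xf_is_confident() {
    [ $(( $1 & 1 )) -eq 1 ]
}

# Reads SAM-format text on stdin and prints the read name (column 1) and the
# GX value for records whose GX tag lists more than one gene ID.
# Assumption: gene IDs within GX are separated by ";".
multi_gene_reads() {
    awk -F'\t' '{
        # Optional SAM tags start at column 12.
        for (i = 12; i <= NF; i++) {
            if ($i ~ /^GX:Z:/) {
                gx = substr($i, 6)
                if (index(gx, ";") > 0) print $1 "\t" gx
            }
        }
    }'
}

# Usage sketch (your.bam is a placeholder):
#   samtools view your.bam | multi_gene_reads | head
#   xf_is_confident 25 && echo "confidently mapped"
```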