
Build a Custom Reference for Cell Ranger (mkref)

10x Genomics provides pre-built references for human and mouse genomes to use with Cell Ranger. Researchers can make custom reference genomes for additional species or add custom marker genes of interest to the reference, e.g. GFP. The following tutorial outlines the steps to build a custom reference using the cellranger mkref pipeline.

This tutorial follows the same steps used to create the 10x Genomics pre-built references for human and mouse. These steps can be found on this page: Build Notes for Reference Packages.

First, locate the reference genome FASTA and GTF files for your species. If the species is available from the Ensembl database, we recommend using the files from there. The GTF files from Ensembl contain optional tags that make filtering easy. If your species of interest is not available from Ensembl, GTF and FASTA files from other sources can also work. Note that a GTF file is required, while a GFF file is not supported. (See GFF/GTF File Format - Definition and supported options)

This tutorial generates a custom reference for the zebrafish, Danio rerio.

The files needed are located on Ensembl (check this page for any reference updates).

Navigate to the Gene annotation section of the Ensembl website and click on the Download GTF link. This takes you to an FTP site with a list of GTF files available. Select the file called Danio_rerio.GRCz11.105.gtf.gz. This is the GTF annotation file for this species. All species in Ensembl have similar files available to download. For more information on the GTF files in Ensembl, read the README file at the FTP site.

Right-click the link to copy the address, paste the URL into the command line, and download the file with the wget command. The file is approximately 20 MB and takes less than a minute to download, depending on your system.

wget http://ftp.ensembl.org/pub/release-105/gtf/danio_rerio/Danio_rerio.GRCz11.105.gtf.gz

Decompress the file with the gunzip command:

gunzip Danio_rerio.GRCz11.105.gtf.gz

Next, navigate back to the Ensembl page for Danio rerio and click on 'Download FASTA' to access the FTP site containing several types of FASTA files. Select the dna/ directory to access the directory with genome files. Download the FASTA file containing all the chromosomes together in the genome, which has primary_assembly in the filename. Right-click on the link to copy the address, paste the URL into the command line, and download the file with the wget command. The file is approximately 400 MB and takes several minutes to download, depending on your system.

wget http://ftp.ensembl.org/pub/release-105/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz

Decompress the file with the gunzip command:

gunzip Danio_rerio.GRCz11.dna.primary_assembly.fa.gz
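A download that was cut short can be detected with gzip -t before running gunzip. The following is a minimal sketch of that check; it builds a small toy archive (toy.fa.gz, a hypothetical stand-in for the Ensembl files) and a deliberately truncated copy so that it can run without network access. The check_gz helper is illustrative and not part of Cell Ranger.

```shell
# Toy archive standing in for a downloaded .gz file (hypothetical names).
workdir=$(mktemp -d)
printf '>toy\nACGTACGTACGT\n' > "$workdir/toy.fa"
gzip -c "$workdir/toy.fa" > "$workdir/toy.fa.gz"

# Simulate an interrupted download by keeping only the first 20 bytes.
head -c 20 "$workdir/toy.fa.gz" > "$workdir/truncated.fa.gz"

# gzip -t tests archive integrity without writing the decompressed data.
check_gz() {
    if gzip -t "$1" 2>/dev/null; then echo OK; else echo CORRUPT; fi
}

good_result=$(check_gz "$workdir/toy.fa.gz")
bad_result=$(check_gz "$workdir/truncated.fa.gz")
echo "toy.fa.gz: $good_result"
echo "truncated.fa.gz: $bad_result"
```

For the real files, the same test would be run as gzip -t Danio_rerio.GRCz11.105.gtf.gz (and likewise for the FASTA) right after wget finishes.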

GTF files can contain entries for non-polyA transcripts that overlap with protein-coding gene models. Because of the overlapping annotations, reads from these regions can be flagged as mapped to multiple genes (multi-mapped), and multi-mapped reads are not counted.

To remove these entries from the GTF, add this filter argument to the mkgtf command: --attribute=gene_biotype:protein_coding (see list of accepted biotypes). All of the filters used to build references are listed on the support site. If you are using a GTF file that does not contain gene_biotype attributes or is missing other entries, don't worry too much; there may still be enough information to build a reference. A minimal GTF file only needs to contain exon features for protein coding genes.

Set up the command:

cellranger mkgtf \
  Danio_rerio.GRCz11.105.gtf \
  Danio_rerio.GRCz11.105.filtered.gtf \
  --attribute=gene_biotype:protein_coding

This will output the file Danio_rerio.GRCz11.105.filtered.gtf, which will be used in the next step.

Reference transcriptomes available for many non-traditional model organisms may not have the same level of annotation seen for traditional models (e.g. human and mouse). Therefore, the filtering options used to build the 10x human and mouse (pre-built) references may differ from those shown here. Visit the reference build notes page to see what additional filters were used.
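To judge which gene_biotype filters make sense for a given annotation, it can help to tabulate the biotype values the GTF actually contains. The following is a minimal sketch using awk; the toy GTF it writes is a hypothetical stand-in for a real annotation file, and records lacking the attribute are reported as "(none)".

```shell
# Toy GTF standing in for a real annotation file (hypothetical content).
workdir=$(mktemp -d)
gtf="$workdir/toy.gtf"
printf '1\tensembl\texon\t100\t200\t.\t+\t.\tgene_id "g1"; gene_biotype "protein_coding";\n' > "$gtf"
printf '1\tensembl\texon\t300\t400\t.\t+\t.\tgene_id "g2"; gene_biotype "lncRNA";\n' >> "$gtf"
printf '2\tensembl\texon\t100\t200\t.\t-\t.\tgene_id "g3"; gene_biotype "protein_coding";\n' >> "$gtf"
printf '2\tensembl\texon\t500\t600\t.\t-\t.\tgene_id "g4";\n' >> "$gtf"

# Count exon records per gene_biotype value found in column 9.
biotype_counts=$(awk -F'\t' '$3 == "exon" {
    if (match($9, /gene_biotype "[^"]+"/)) {
        # Skip the 14-character prefix (gene_biotype ") and the closing quote.
        b = substr($9, RSTART + 14, RLENGTH - 15)
    } else {
        b = "(none)"
    }
    n[b]++
} END { for (b in n) print n[b] "\t" b }' "$gtf" | sort -k2)
echo "$biotype_counts"
```

If the table shows only "(none)", the gene_biotype filter has nothing to match on, and the unfiltered GTF may be the better input.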

Now that you have the genome FASTA and filtered GTF files needed, set up the command to run the cellranger mkref pipeline.

The following is the command:

cellranger mkref \
  --genome=Danio.rerio_genome \
  --fasta=Danio_rerio.GRCz11.dna.primary_assembly.fa \
  --genes=Danio_rerio.GRCz11.105.filtered.gtf

Run the command. This can take several hours, depending on your system. If you are working on a shared computing environment such as an HPC cluster, submit this as a job to prevent competing with other users for resources.

A successful mkref run should conclude with this message:

>>> Reference successfully created! <<<

You can now specify this reference on the command line:
cellranger --transcriptome=/Danio.rerio_genome ...

The reference is created in the directory specified by the --genome flag (Danio.rerio_genome in this case). If you do not see the success message, an error likely occurred; copy the error message and email it to 10x Genomics support. The outputs are organized like this:

├── fasta
│   ├── genome.fa
│   └── genome.fa.fai
├── genes
│   └── genes.gtf.gz
├── reference.json
└── star
    ├── chrLength.txt
    ├── chrNameLength.txt
    ├── chrName.txt
    ├── chrStart.txt
    ├── exonGeTrInfo.tab
    ├── exonInfo.tab
    ├── geneInfo.tab
    ├── Genome
    ├── genomeParameters.txt
    ├── SA
    ├── SAindex
    ├── sjdbInfo.txt
    ├── sjdbList.fromGTF.out.tab
    ├── sjdbList.out.tab
    └── transcriptInfo.tab

There are cases where the publicly available GTF and FASTA files will not contain information for some of the genes expressed in a given sample. A transgenic sample is a good example of when you would not expect a gene of interest to be in the reference. In this example, the common marker gene, Green Fluorescent Protein (GFP) (used as an in-vivo fluorescent reporter for gene expression) is added to the reference. This method of adding genes to a reference has been reported to work for detecting genes from viral infections provided the detected transcripts are poly-adenylated.

Note: This is only one of many available mRNA sequences encoding GFP. Make sure to use the sequence specific to your assay. Depending on your experimental setup, you may want to include a 3' UTR sequence; see the advanced reference section for more information.

For this example, we use a full GFP sequence from GenBank. The sequence below runs 5' to 3' and the sequence highlighted in blue is the untranslated region (UTR):

>L29345.1 Aequorea victoria green-fluorescent protein (GFP) mRNA, complete cds
TACACACGAATAAAAGATAACAAAGATGAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTT
GTTGAATTAGATGGCGATGTTAATGGGCAAAAATTCTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACAT
ACGGAAAACTTACCCTTAAATTTATTTGCACTACTGGGAAGCTACCTGTTCCATGGCCAACACTTGTCAC
TACTTTCTCTTATGGTGTTCAATGCTTTTCAAGATACCCAGATCATATGAAACAGCATGACTTTTTCAAG
AGTGCCATGCCCGAAGGTTATGTACAGGAAAGAACTATATTTTACAAAGATGACGGGAACTACAAGACAC
GTGCTGAAGTCAAGTTTGAAGGTGATACCCTTGTTAATAGAATCGAGTTAAAAGGTATTGATTTTAAAGA
AGATGGAAACATTCTTGGACACAAAATGGAATACAACTATAACTCACATAATGTATACATCATGGCAGAC
AAACCAAAGAATGGAATCAAAGTTAACTTCAAAATTAGACACAACATTAAAGATGGAAGCGTTCAATTAG
CAGACCATTATCAACAAAATACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTACCTGTC
CACACAATCTGCCCTTTCCAAAGATCCCAACGAAAAGAGAGATCACATGATCCTTCTTGAGTTTGTAACA
GCTGCTGGGATTACACATGGCATGGATGAACTATACAAATAAATGTCCAGACTTCCAATTGACACTAAAG
TGTCCGAACAATTACTAAATTCTCAGGGTTCCTGGTTAAATTCAGGCTGAGACTTTATTTATATATTTAT
AGATTCATTAAAATTTTATGAATAATTTATTGATGTTATTAATAGGGGCTATTTTCTTATTAAATAGGCT
ACTGGAGTGTAT

Copy and paste this sequence and save as a text file called GFP_orig.fa. The header of this file looks like the following:

>L29345.1 Aequorea victoria green-fluorescent protein (GFP) mRNA, complete cds

There are special characters such as spaces in the header (all text after the >) of this FASTA sequence. These can be problematic for downstream applications. It can be helpful to change the header to be more informative and also to remove these characters. The following command opens the file and uses the stream editor (sed) function to search for a pattern (the original header), replace it with new text ("GFP"), then directs the output to a new output file, GFP.fa.

cat GFP_orig.fa | sed 's/L29345\.1 Aequorea victoria green-fluorescent protein (GFP) mRNA, complete cds/GFP/' > GFP.fa

Note: Another option is to open the GFP_orig.fa file with a text editor, such as nano, then manually edit the header and save the file as GFP.fa. Choose whichever method of changing the header you feel most comfortable with.

Now the FASTA file GFP.fa looks like the following:

>GFP
TACACACGAATAAAAGATAACAAAGATGAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTT
GTTGAATTAGATGGCGATGTTAATGGGCAAAAATTCTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACAT
ACGGAAAACTTACCCTTAAATTTATTTGCACTACTGGGAAGCTACCTGTTCCATGGCCAACACTTGTCAC
TACTTTCTCTTATGGTGTTCAATGCTTTTCAAGATACCCAGATCATATGAAACAGCATGACTTTTTCAAG
AGTGCCATGCCCGAAGGTTATGTACAGGAAAGAACTATATTTTACAAAGATGACGGGAACTACAAGACAC
GTGCTGAAGTCAAGTTTGAAGGTGATACCCTTGTTAATAGAATCGAGTTAAAAGGTATTGATTTTAAAGA
AGATGGAAACATTCTTGGACACAAAATGGAATACAACTATAACTCACATAATGTATACATCATGGCAGAC
AAACCAAAGAATGGAATCAAAGTTAACTTCAAAATTAGACACAACATTAAAGATGGAAGCGTTCAATTAG
CAGACCATTATCAACAAAATACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTACCTGTC
CACACAATCTGCCCTTTCCAAAGATCCCAACGAAAAGAGAGATCACATGATCCTTCTTGAGTTTGTAACA
GCTGCTGGGATTACACATGGCATGGATGAACTATACAAATAAATGTCCAGACTTCCAATTGACACTAAAG
TGTCCGAACAATTACTAAATTCTCAGGGTTCCTGGTTAAATTCAGGCTGAGACTTTATTTATATATTTAT
AGATTCATTAAAATTTTATGAATAATTTATTGATGTTATTAATAGGGGCTATTTTCTTATTAAATAGGCT
ACTGGAGTGTAT

To find the number of bases in this sequence, use grep -v "^>" to keep only the lines that do not start with the > character, remove the line breaks with tr -d "\n" so they are not counted, and then count the remaining characters with wc -c. The pipe character | sends the output of each command to the next.

cat GFP.fa | grep -v "^>" | tr -d "\n" | wc -c

The result shows that the sequence contains 922 bases. This is important to know for the next step.
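The pipeline above gives one total, which is all that is needed for a single-record FASTA. When several marker sequences are kept in one file, a per-record length is more useful. The following is a minimal sketch using awk; the toy file and the second record name (mCherry) are hypothetical and only illustrate the multi-record case.

```shell
# Toy multi-record FASTA (hypothetical content).
workdir=$(mktemp -d)
fasta="$workdir/toy.fa"
printf '>GFP\nACGTACGTAC\nGT\n>mCherry\nAAAA\nCC\n' > "$fasta"

# Print "name<TAB>length" for each record; the name is the first
# whitespace-delimited token of the header, without the leading ">".
seq_lengths=$(awk '
    /^>/ { if (name != "") print name "\t" len
           name = substr($1, 2); len = 0; next }
    { len += length($0) }
    END { if (name != "") print name "\t" len }
' "$fasta")
echo "$seq_lengths"
```

Run against the real GFP.fa, this should report the same length as the wc -c pipeline.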

Now, make a custom GTF for GFP with the following command. The echo -e command prints the quoted text, and the -e option enables interpretation of backslash escapes such as \t. Use \t to insert the tabs that separate the nine columns required by the GTF format.

echo -e 'GFP\tunknown\texon\t1\t922\t.\t+\t.\tgene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";' > GFP.gtf

This is what the GFP.gtf file looks like with the cat GFP.gtf command:

GFP unknown exon 1 922 . + . gene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";
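Because the exon end coordinate must match the sequence length, one option is to compute the length and write the GTF line in a single step rather than typing the number by hand. The following is a minimal sketch that uses printf instead of echo -e, since echo's handling of backslash escapes differs between shells. The 12-base toy sequence is a hypothetical stand-in for the real GFP.fa, so the end coordinate here is 12, not 922.

```shell
# Toy single-record FASTA standing in for GFP.fa (hypothetical content).
workdir=$(mktemp -d)
printf '>GFP\nACGTACGTAC\nGT\n' > "$workdir/GFP.fa"

# Sequence length: characters on non-header lines, with newlines removed.
gfp_len=$(grep -v '^>' "$workdir/GFP.fa" | tr -d '\n' | wc -c | tr -d ' ')

# Write the nine tab-separated GTF columns, using the computed length
# as the exon end coordinate.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    GFP unknown exon 1 "$gfp_len" . + . \
    'gene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";' \
    > "$workdir/GFP.gtf"

# Report the number of tab-separated columns and the end coordinate.
gtf_check=$(awk -F'\t' '{ print NF, $5 }' "$workdir/GFP.gtf")
echo "$gtf_check"
```

The final awk line doubles as a quick check that the record really has nine columns.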

Next, add the GFP.fa to the end of the D. rerio genome FASTA. But first, make a copy so that the original is unchanged.

cp Danio_rerio.GRCz11.dna.primary_assembly.fa Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa

Then, append the GFP.fa to the end of the Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa file. The >> means append. Note: Do not use >, which overwrites the original file.

cat GFP.fa >> Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa

To confirm that the GFP entry was added to the FASTA file, use the grep ">" command to search for lines with the > character:

grep ">" Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa

The output looks similar to the following:

>1 dna:chromosome chromosome:GRCz11:1:1:59578282:1 REF
>10 dna:chromosome chromosome:GRCz11:10:1:45420867:1 REF
>11 dna:chromosome chromosome:GRCz11:11:1:45484837:1 REF
>12 dna:chromosome chromosome:GRCz11:12:1:49182954:1 REF
>13 dna:chromosome chromosome:GRCz11:13:1:52186027:1 REF
>14 dna:chromosome chromosome:GRCz11:14:1:52660232:1 REF
>15 dna:chromosome chromosome:GRCz11:15:1:48040578:1 REF
>16 dna:chromosome chromosome:GRCz11:16:1:55266484:1 REF
>17 dna:chromosome chromosome:GRCz11:17:1:53461100:1 REF
>18 dna:chromosome chromosome:GRCz11:18:1:51023478:1 REF
>19 dna:chromosome chromosome:GRCz11:19:1:48449771:1 REF
>2 dna:chromosome chromosome:GRCz11:2:1:59640629:1 REF
>20 dna:chromosome chromosome:GRCz11:20:1:55201332:1 REF
>21 dna:chromosome chromosome:GRCz11:21:1:45934066:1 REF
>22 dna:chromosome chromosome:GRCz11:22:1:39133080:1 REF
>23 dna:chromosome chromosome:GRCz11:23:1:46223584:1 REF
>24 dna:chromosome chromosome:GRCz11:24:1:42172926:1 REF
>25 dna:chromosome chromosome:GRCz11:25:1:37502051:1 REF
>3 dna:chromosome chromosome:GRCz11:3:1:62628489:1 REF
>4 dna:chromosome chromosome:GRCz11:4:1:78093715:1 REF
>5 dna:chromosome chromosome:GRCz11:5:1:72500376:1 REF
>6 dna:chromosome chromosome:GRCz11:6:1:60270059:1 REF
>7 dna:chromosome chromosome:GRCz11:7:1:74282399:1 REF
>8 dna:chromosome chromosome:GRCz11:8:1:54304671:1 REF
>9 dna:chromosome chromosome:GRCz11:9:1:56459846:1 REF
>MT dna:chromosome chromosome:GRCz11:MT:1:16596:1 REF
>KN149696.2 dna:scaffold scaffold:GRCz11:KN149696.2:1:368252:1 REF
>KN147651.2 dna:scaffold scaffold:GRCz11:KN147651.2:1:351968:1 REF
>KN149690.1 dna:scaffold scaffold:GRCz11:KN149690.1:1:343018:1 REF
>KN149686.1 dna:scaffold scaffold:GRCz11:KN149686.1:1:260365:1 REF
>KN147652.2 dna:scaffold scaffold:GRCz11:KN147652.2:1:252640:1 REF
>KN149688.2 dna:scaffold scaffold:GRCz11:KN149688.2:1:252035:1 REF
>KN149691.1 dna:scaffold scaffold:GRCz11:KN149691.1:1:233193:1 REF
...
>GFP

You can also count the number of contigs in the FASTA. There should now be 994 contigs including the extra GFP.

grep -c "^>" Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa
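If the append step is accidentally run twice, the contig count goes up by one and the FASTA contains two records named GFP. The following is a minimal sketch of a duplicate-name check; the toy FASTA is a hypothetical stand-in for the combined genome file, and the second append is simulated on purpose.

```shell
# Toy genome FASTA with GFP appended (hypothetical content).
workdir=$(mktemp -d)
fasta="$workdir/toy_GFP.fa"
printf '>1 dna:chromosome\nACGT\n>MT dna:chromosome\nGGCC\n>GFP\nACGT\n' > "$fasta"

# Simulate an accidental second append of the GFP record.
printf '>GFP\nACGT\n' >> "$fasta"

# Take the first whitespace-delimited token of each header and
# list any name that occurs more than once.
dup_names=$(grep '^>' "$fasta" | cut -d' ' -f1 | sed 's/^>//' | sort | uniq -d)
if [ -n "$dup_names" ]; then
    echo "Duplicate sequence names: $dup_names"
else
    echo "No duplicate sequence names"
fi
```

An empty result on the real file suggests every sequence name is unique, which is what downstream indexing generally expects.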

Use the cp command to make a copy of the original GTF and modify the name to contain GFP. Then use the cat command to append the contents of GFP.gtf to the end of the renamed copy of the filtered D. rerio GTF.

cp Danio_rerio.GRCz11.105.filtered.gtf Danio_rerio.GRCz11.105.filtered.GFP.gtf
cat GFP.gtf >> Danio_rerio.GRCz11.105.filtered.GFP.gtf

Check the file with the following command:

tail Danio_rerio.GRCz11.105.filtered.GFP.gtf

The output looks similar to the following with the GTF entry as the last line of the file:

MT RefSeq start_codon 15308 15310 . + 0 gene_id "ENSDARG00000063924"; gene_version "3"; transcript_id "ENSDART00000093625"; transcript_version "3"; exon_number "1"; gene_name "mt-cyb"; gene_source "RefSeq"; gene_biotype "protein_coding"; transcript_name "mt-cyb-201"; transcript_source "RefSeq"; transcript_biotype "protein_coding";
GFP unknown exon 1 922 . + . gene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";
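The first column of every GTF record must name a sequence that exists in the FASTA, so a mismatch such as GFP versus gfp is worth catching before a long mkref run. The following is a minimal sketch of that cross-check using comm; the toy files are hypothetical stand-ins for the combined FASTA and GTF, and the lowercase gfp record is a deliberate typo.

```shell
# Toy FASTA and GTF (hypothetical content); the "gfp" record is a typo.
workdir=$(mktemp -d)
printf '>1 dna:chromosome\nACGT\n>GFP\nACGT\n' > "$workdir/toy.fa"
printf '1\tensembl\texon\t1\t4\t.\t+\t.\tgene_id "g1";\n' > "$workdir/toy.gtf"
printf 'GFP\tunknown\texon\t1\t4\t.\t+\t.\tgene_id "GFP";\n' >> "$workdir/toy.gtf"
printf 'gfp\tunknown\texon\t1\t4\t.\t+\t.\tgene_id "typo";\n' >> "$workdir/toy.gtf"

# Sequence names from the FASTA headers (first token, ">" removed).
grep '^>' "$workdir/toy.fa" | cut -d' ' -f1 | sed 's/^>//' | sort -u > "$workdir/fasta_names.txt"

# Sequence names from GTF column 1, skipping comment lines.
grep -v '^#' "$workdir/toy.gtf" | cut -f1 | sort -u > "$workdir/gtf_names.txt"

# comm -13 prints lines that appear only in the second file.
missing=$(comm -13 "$workdir/fasta_names.txt" "$workdir/gtf_names.txt")
echo "Sequence names in GTF but not in FASTA: ${missing:-none}"
```

On the real files, substitute the GFP-containing FASTA and GTF for the toy paths; an output of "none" means every annotated sequence is present in the genome file.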

Now use the Danio_rerio.GRCz11.105.filtered.GFP.gtf and Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa files as inputs to the cellranger mkref pipeline:

cellranger mkref \
  --genome=Danio.rerio_genome_GFP \
  --fasta=Danio_rerio.GRCz11.dna.primary_assembly_GFP.fa \
  --genes=Danio_rerio.GRCz11.105.filtered.GFP.gtf

This outputs a custom reference directory called Danio.rerio_genome_GFP/.

10x Genomics provides public datasets for the rhesus macaque, Macaca mulatta. Although the reference is not offered for download, you can build it by following these instructions. The FASTA and GTF files are downloaded from Ensembl in the "Gene annotation" section (check this page for any reference updates). Please note that we have chosen a "toplevel" or primary assembly FASTA file because it contains primary contigs and no non-chromosomal or haplotype contigs.

# Download FASTA
wget http://ftp.ensembl.org/pub/release-105/fasta/macaca_mulatta/dna/Macaca_mulatta.Mmul_10.dna.toplevel.fa.gz
gunzip Macaca_mulatta.Mmul_10.dna.toplevel.fa.gz

# Download GTF
wget http://ftp.ensembl.org/pub/release-105/gtf/macaca_mulatta/Macaca_mulatta.Mmul_10.105.gtf.gz
gunzip Macaca_mulatta.Mmul_10.105.gtf.gz

# Filter GTF
cellranger mkgtf \
  Macaca_mulatta.Mmul_10.105.gtf Macaca_mulatta.Mmul_10.105.filtered.gtf \
  --attribute=gene_biotype:protein_coding \
  --attribute=gene_biotype:lncRNA \
  --attribute=gene_biotype:antisense \
  --attribute=gene_biotype:IG_LV_gene \
  --attribute=gene_biotype:IG_V_gene \
  --attribute=gene_biotype:IG_V_pseudogene \
  --attribute=gene_biotype:IG_D_gene \
  --attribute=gene_biotype:IG_J_gene \
  --attribute=gene_biotype:IG_J_pseudogene \
  --attribute=gene_biotype:IG_C_gene \
  --attribute=gene_biotype:IG_C_pseudogene \
  --attribute=gene_biotype:TR_V_gene \
  --attribute=gene_biotype:TR_V_pseudogene \
  --attribute=gene_biotype:TR_D_gene \
  --attribute=gene_biotype:TR_J_gene \
  --attribute=gene_biotype:TR_J_pseudogene \
  --attribute=gene_biotype:TR_C_gene

# Run mkref
cellranger mkref \
  --genome=Mmul_10 \
  --fasta=Macaca_mulatta.Mmul_10.dna.toplevel.fa \
  --genes=Macaca_mulatta.Mmul_10.105.filtered.gtf \
  --ref-version=1.0.0

10x Genomics provides public datasets for the Norwegian rat, Rattus norvegicus. Although the reference is not offered for download, you can build it by following these instructions. The FASTA and GTF files are downloaded from Ensembl in the "Gene annotation" section (check this page for any reference updates).

# Download FASTA
wget http://ftp.ensembl.org/pub/release-105/fasta/rattus_norvegicus/dna/Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz
gunzip Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz

# Download GTF
wget http://ftp.ensembl.org/pub/release-105/gtf/rattus_norvegicus/Rattus_norvegicus.mRatBN7.2.105.gtf.gz
gunzip Rattus_norvegicus.mRatBN7.2.105.gtf.gz

# Filter GTF
cellranger mkgtf \
  Rattus_norvegicus.mRatBN7.2.105.gtf Rattus_norvegicus.mRatBN7.2.105.filtered.gtf \
  --attribute=gene_biotype:protein_coding \
  --attribute=gene_biotype:lncRNA \
  --attribute=gene_biotype:antisense \
  --attribute=gene_biotype:IG_LV_gene \
  --attribute=gene_biotype:IG_V_gene \
  --attribute=gene_biotype:IG_V_pseudogene \
  --attribute=gene_biotype:IG_D_gene \
  --attribute=gene_biotype:IG_J_gene \
  --attribute=gene_biotype:IG_J_pseudogene \
  --attribute=gene_biotype:IG_C_gene \
  --attribute=gene_biotype:IG_C_pseudogene \
  --attribute=gene_biotype:TR_V_gene \
  --attribute=gene_biotype:TR_V_pseudogene \
  --attribute=gene_biotype:TR_D_gene \
  --attribute=gene_biotype:TR_J_gene \
  --attribute=gene_biotype:TR_J_pseudogene \
  --attribute=gene_biotype:TR_C_gene

# Run mkref
cellranger mkref \
  --genome=mRatBN7 \
  --fasta=Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa \
  --genes=Rattus_norvegicus.mRatBN7.2.105.filtered.gtf \
  --ref-version=1.0.0
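The macaque and rat builds use the same 17 biotype filters, so the --attribute flags can be generated from one list instead of being retyped for each species. The following is a minimal sketch in POSIX shell; it is a dry run that only prints the assembled command, and input.gtf and output.filtered.gtf are hypothetical placeholder names.

```shell
# Biotypes copied from the mkgtf commands above.
biotypes="protein_coding lncRNA antisense IG_LV_gene IG_V_gene IG_V_pseudogene IG_D_gene IG_J_gene IG_J_pseudogene IG_C_gene IG_C_pseudogene TR_V_gene TR_V_pseudogene TR_D_gene TR_J_gene TR_J_pseudogene TR_C_gene"

# Build one --attribute flag per biotype in the positional parameters.
set --
for b in $biotypes; do
    set -- "$@" "--attribute=gene_biotype:$b"
done
n_filters=$#
echo "Number of filters: $n_filters"

# Dry run: print the command instead of executing it.
# input.gtf and output.filtered.gtf are placeholder names.
echo cellranger mkgtf input.gtf output.filtered.gtf "$@"
```

Removing the leading echo on the last line would run the command for real, assuming cellranger is on the PATH and the placeholder names are replaced with actual files.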