Prebuilt V(D)J References

This page describes the prebuilt human and mouse V(D)J reference sequences available for download from 10x Genomics. These reference sequences were generated from the V(D)J genes in Ensembl build 94, using both the GTF and GFF3 files, each of which carries details absent from the other. The compilation process combines automated steps with manual edits, all of which are described on this page. In two instances, we have introduced unofficial gene names; these are placeholders that should eventually be replaced by their official counterparts.

As a result of the manual edits and mechanical processing, all V gene sequences now begin with a start codon ATG. The start codon lies in the leader sequence coding for a signal peptide that is cleaved off.
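The start-codon property can be checked mechanically on any FASTA file of V gene sequences. The following Python sketch is illustrative only and is not part of Cell Ranger: the function names are hypothetical, and it assumes that V gene records can be recognized by a `V-REGION` substring in the FASTA header, which may need to be adjusted to the actual header layout of the file being checked.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def v_genes_missing_start_codon(records, v_marker="V-REGION"):
    """Return headers of V gene records whose sequence does not begin with ATG.

    Assumption: a record is a V gene if `v_marker` occurs in its header.
    """
    return [
        header
        for header, seq in records
        if v_marker in header and not seq.upper().startswith("ATG")
    ]


if __name__ == "__main__":
    # Toy records with made-up names and sequences, not real reference data.
    toy = (
        ">1|GENE_A|L-REGION+V-REGION\nATGGCC\nTTT\n"
        ">2|GENE_B|L-REGION+V-REGION\nGTGAAA\n"
        ">3|GENE_C|J-REGION\nTTTGGG\n"
    )
    print(v_genes_missing_start_codon(parse_fasta(toy)))
```

An empty result means that every record identified as a V gene begins with ATG. J and C records are ignored because they are not expected to start with a start codon.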

Pseudogenes have been generally excluded from the prebuilt references. However, exceptions are made for cases where we contest the accuracy of pseudogene classification. These exceptions are detailed further below.

  • Deleted TRBV6-3 because it is not in the Ensembl GTF and is nearly identical to TRBV6-2.

  • Deleted IGHV1/OR15-9 because it is labeled non-functional by NCBI.

  • Allowed TRAJ8, TRAV35 and TRBV21-1 even though they are labeled pseudogenes, because they are observed in productive pairs.

  • Added TRAJ15, TRBD2, TRBV11-2 and TRGV11 because they are observed in productive pairs.

  • Added IGHV1-8, IGKV2-18 and IGLV6-57 because they are observed in productive pairs.

  • Added 1 base to the right of TRAJ36 because otherwise annotations fail the in-frame requirement for a productive contig, which all other observed human and mouse J genes satisfy.

  • Trimmed 3 bases from the right of TRAJ37 because otherwise a three-base indel appears at the J/C junction when observed data are annotated.

  • Trimmed 89 bases from the left of IGLJ1, because these bases are not part of the actual J segment.

  • Trimmed 104 bases from the left of IGLJ2, because these bases are not part of the actual J segment.

  • Trimmed 113 bases from the left of IGLJ3, because these bases are not part of the actual J segment.

  • Trimmed 57 bases from the left of TRBV20/OR9-2, because these bases are not part of the actual V segment.

  • Added an alternate form of TRBV20-1, differing from the reference by a 3-base insertion, because we observe this form (here and below, we use the same name for the alternate inserted form).

  • Added an alternate form of TRBV7-7, differing from the reference by a 15-base insertion, because we observe this form.

  • Added an allele of the gene IGHJ6.

  • Removed the first base of the C region in cases where we observe that, in most transcripts, the J region and C region overlap by exactly one base.

  • Replaced IGKV2D-40, whose leader sequence appears truncated.

  • Deleted IGKV2-18, although we had previously added it. It is probably a pseudogene.

  • Deleted IGLV5-48, which is truncated on the right.

  • Deleted TRBV21-1, which has multiple frameshifts.

  • Added IGHV4-30-4, which was missing.

  • Added IGKV1-NL1, which was missing.

  • Added IGHV4-38-2, which was missing.

  • Deleted IGHV1-67, because it is labeled a pseudogene by NCBI.

  • Added IGHV12-1, because we observe this gene in data.

  • Added a V gene that is observed in two BALB/c datasets and aligns to an unplaced sequence in the BALB/c whole genome assembly. We unofficially named this gene IGHV1-unknown1.

  • Added a form of TRAV4-4-DV10 seen in BALB/c.

  • Added a form of either TRAV13-1 or TRAV13D-1 seen in BALB/c, which we arbitrarily labeled TRAV13-1.

  • Added a very common alternate splicing between the first exon of TRBV12-2 and the second exon of TRBV13-2, which we unofficially named TRBV12-2+TRBV13-2.

  • Added an alternate form of TRAV16N, differing from the reference by a 3-base insertion, because we observe this in data.

  • Added an alternate form of TRAV6N-5, differing from the reference by a 3-base insertion, because we observe this in data.

  • Added an alternate form of TRAV13N-4, differing from the reference by a 15-base insertion, because we observe this in data.

  • Added an alternate form of TRBV13-2, differing from the reference by a 21-base insertion, because we observe this in data.

  • Removed the first base of the C region in cases where we observe that, in most transcripts, the J region and C region overlap by exactly one base.

  • Deleted TRAV23, which is frameshifted.

  • Deleted the first base of the constant region gene IGHG2B.

  • Made a six-base insertion in IGKV12-89, based on empirical data.

  • Corrected IGHV8-9, whose amino acid sequence showed S in place of the canonical C at the end of FWR3. The correction is consistent with observations from 10x data.

  • Added a missing allele of IGKV2-109.

  • Added missing gene IGKV4-56.

  • Added missing gene IGHV1-2.
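
The kinds of manual edits listed above (deleting a gene, adding a gene, trimming bases from the left or right of a segment, adding bases to the right) can be pictured as simple operations on a mapping from gene name to sequence. The following Python sketch is illustrative only: it is not the tooling used to build the prebuilt references, the operation names are hypothetical, and the gene names and sequences in the usage example are toy placeholders rather than real reference data.

```python
def trim_left(seq, n):
    """Return seq without its first n bases."""
    if not 0 <= n <= len(seq):
        raise ValueError(f"cannot trim {n} bases from a sequence of length {len(seq)}")
    return seq[n:]


def trim_right(seq, n):
    """Return seq without its last n bases."""
    if not 0 <= n <= len(seq):
        raise ValueError(f"cannot trim {n} bases from a sequence of length {len(seq)}")
    return seq[: len(seq) - n]


def apply_edits(reference, edits):
    """Apply (operation, gene, argument) edits in order and return a new dict.

    `reference` maps gene name to sequence and is left unmodified.
    Supported operations (hypothetical names): "delete", "add",
    "trim_left", "trim_right", "append_right".
    """
    result = dict(reference)
    for op, gene, arg in edits:
        if op == "delete":
            del result[gene]  # raises KeyError if the gene is absent
        elif op == "add":
            result[gene] = arg
        elif op == "trim_left":
            result[gene] = trim_left(result[gene], arg)
        elif op == "trim_right":
            result[gene] = trim_right(result[gene], arg)
        elif op == "append_right":
            result[gene] = result[gene] + arg
        else:
            raise ValueError(f"unknown operation: {op}")
    return result


if __name__ == "__main__":
    # Toy placeholder names and sequences, not real reference data.
    reference = {"J_A": "NNNNNTTCGGA", "J_B": "ACGTACGTA", "V_X": "ATGAAA"}
    edits = [
        ("trim_left", "J_A", 5),      # drop bases that are not part of the segment
        ("trim_right", "J_B", 3),     # drop trailing bases
        ("append_right", "J_B", "G"),  # extend the segment by one base
        ("delete", "V_X", None),      # remove a gene entirely
    ]
    print(apply_edits(reference, edits))
```

In this picture, removing the first base of a C region corresponds to a `trim_left` edit with an argument of 1. Keeping the edits as an ordered list of small, named operations makes each change easy to trace back to its stated reason.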