Note: 10x Genomics does not provide support for community-developed tools and makes no guarantees regarding their function or performance. Please contact tool developers with any questions. If you have feedback about Analysis Guides, please email analysis-guides@10xgenomics.com.
Annotating cells in single-cell gene expression data is a challenge that researchers are actively tackling. Defining a cell type is difficult for two fundamental reasons: first, gene expression levels are not discrete and mostly fall along a continuum; second, differences in gene expression do not always translate to differences in cellular function (Pasquini et al.). The field is changing rapidly with the advent of new annotation tools and the creation of multiple scRNA-seq databases (Wang et al.). In this article, we share some popular web-based resources that you may find useful for annotating your own datasets.
Here we present two types of web resources for cell type annotation. They are relatively easy to use and do not require advanced scripting or programming skills. Researchers sometimes try both approaches, with multiple tools, when annotating cell types. We also recommend validating your annotation through experiments, statistical analysis, or consultation with experts.
- Resources for automated cell type annotation (reference-based): There are tools developed by the community for automatically annotating cells by comparing new data with existing references. In this article, we will highlight a few web tools that do not require any programming skills. A major limitation of these tools is that the quality of the results heavily depends on the quality of the pre-annotated reference datasets.
- Databases with marker genes for manual annotation: These databases contain marker genes for various cell types (mostly human and mouse). Using these marker genes, you can annotate the cells in your datasets. For example, if you are using 10x Genomics Loupe Browser, you can find the top differentially expressed genes for each cluster, then search for those genes in the database to see whether they are marker genes for specific cell types. You may need to search several top genes per cluster to be confident about the cell type. Depending on the complexity of the dataset and your prior knowledge of the cell types, this process can be laborious and time-consuming.
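The manual lookup described above can be sketched in a few lines of Python. This is only an illustration: the `MARKERS` table, the `rank_cell_types` helper, and the cluster gene lists are hypothetical stand-ins for what you would copy out of a marker database and out of your own differential expression results.

```python
# Hypothetical marker table (cell type -> marker genes), standing in for
# entries you would look up in a marker gene database.
MARKERS = {
    "T cell": {"CD3D", "CD3E", "IL7R"},
    "B cell": {"MS4A1", "CD79A", "CD79B"},
    "Monocyte": {"CD14", "LYZ", "S100A8"},
}


def rank_cell_types(top_genes, markers=MARKERS):
    """Return (cell_type, matched_genes) pairs, most matches first."""
    top = set(top_genes)
    hits = []
    for cell_type, genes in markers.items():
        matched = sorted(genes & top)
        if matched:
            hits.append((cell_type, matched))
    # Candidates supported by more marker genes come first.
    hits.sort(key=lambda hit: len(hit[1]), reverse=True)
    return hits


# Hypothetical top differentially expressed genes per cluster.
cluster_top_genes = {
    "cluster_1": ["CD3D", "IL7R", "CCR7", "CD3E"],
    "cluster_2": ["LYZ", "CD14", "MS4A1"],
}

for cluster, genes in cluster_top_genes.items():
    print(cluster, rank_cell_types(genes))
```

A cluster that matches markers from more than one cell type, like `cluster_2` here, is a cue to check more genes before settling on a label.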
10x Genomics Automated Cell Annotation
Cell Ranger software can automatically annotate cells in single-cell gene expression assays. This feature annotates human and mouse gene expression datasets from the Universal 3' and 5' assays and the Flex assay (Cell Ranger v9+), and from the Epi Multiome ATAC + Gene Expression assay (Cell Ranger ARC v2.1+). Briefly, each cell barcode's gene expression profile is converted to a lower-dimensional embedding. The 500 nearest neighbors are then selected from the Chan Zuckerberg CELLxGENE census, and a consensus vote among them identifies the most similar cell type, for both human and mouse. To learn more about the algorithm and the models, visit Cell Ranger's Cell Annotation Algorithm page.
Annotations for already processed datasets can be obtained with the standalone annotate pipeline, either on 10x Cloud or through the command line interface. Details on the output files can be found on the Cell Ranger annotation outputs page.
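To make the nearest-neighbor voting idea concrete, here is a toy sketch. It is not Cell Ranger's implementation: the Euclidean distance, the simple majority vote, the two-dimensional embeddings, and the `knn_consensus` helper are all assumptions chosen for illustration, and `k` is set to 3 rather than 500 so the example stays readable.

```python
import math
from collections import Counter


def knn_consensus(query, reference, k=3):
    """Label a query embedding by majority vote among its k nearest
    reference cells.

    reference is a list of (embedding, label) pairs.
    Returns (winning_label, fraction_of_neighbors_that_voted_for_it).
    """
    # Sort reference cells by Euclidean distance to the query (an assumed
    # metric) and keep the k closest.
    nearest = sorted(reference, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)


# Hypothetical pre-annotated reference cells in a 2-D embedding.
reference = [
    ((0.0, 0.1), "T cell"),
    ((0.2, 0.0), "T cell"),
    ((0.1, 0.3), "T cell"),
    ((5.0, 5.1), "B cell"),
    ((5.2, 4.9), "B cell"),
]

label, support = knn_consensus((0.1, 0.1), reference, k=3)
print(label, support)
```

The vote fraction gives a rough sense of how unanimous the neighbors are. A query that falls between reference populations gets a split vote, which is one reason unexpected labels deserve a closer look.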
The annotation works best for sample types that are present in the CELLxGENE database. For samples that are not present in the database or are expected to have unusual expression profiles, such as cancer samples, we suggest that users review the annotation results carefully, particularly if unexpected cell types are seen.
Azimuth
Azimuth is a web application whose reference-based pipeline performs normalization, visualization, cell annotation, and differential expression. The input file can be the feature-barcode matrix output from Cell Ranger. It currently supports various human tissues, such as motor cortex, PBMCs, fetal development, bone marrow, and kidney. For mouse, it offers a motor cortex reference as well as an organismal aging atlas. References are also available for scATAC-seq queries for human PBMCs and human bone marrow. An advantage of Azimuth is that it uses the popular Seurat algorithm and requires no installation or programming experience.
Tabula Sapiens
Tabula Sapiens provides an easy-to-use, reference-based pipeline that users can apply to annotate their own data. Users can launch a web-based application for the full dataset in the repository, or for specific functional compartments, such as neuronal cells or immune cells. This human cell atlas contains transcriptomic data of 28 organs from 24 normal human subjects.
Even with high-quality tools and databases available, annotating clusters or cells from single-cell assays remains challenging. This is due to multiple factors, such as differing experimental designs and differences between normal and diseased tissues. In addition, different sources may give conflicting information or unusually long marker gene lists. In these cases, researchers can focus on marker genes that overlap between sources and clarify further through a literature search. Satisfactory annotation may sometimes require an extensive literature search, as well as functional assays at the bench.
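Finding the markers that overlap between sources is simple to automate. The sketch below assumes you have already collected one marker list per source for the same cell type. The source names, the gene lists, and the `consensus_markers` helper are hypothetical.

```python
from collections import Counter


def consensus_markers(sources, min_sources=2):
    """Return genes listed by at least min_sources of the given sources."""
    # Count each gene once per source, even if a source repeats it.
    counts = Counter(gene for genes in sources.values() for gene in set(genes))
    return sorted(gene for gene, n in counts.items() if n >= min_sources)


# Hypothetical marker lists for one cell type from three sources.
sources = {
    "database_a": ["MS4A1", "CD79A", "CD19", "IGHM"],
    "database_b": ["MS4A1", "CD79A", "CD74"],
    "paper_c": ["MS4A1", "CD19", "BANK1"],
}

print(consensus_markers(sources))                 # genes in two or more sources
print(consensus_markers(sources, min_sources=3))  # genes in every source
```

Raising `min_sources` shortens a long list to the best-supported genes, at the cost of dropping markers that only one source reports.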
Datasets from MSigDB
MSigDB offers two gene set collections curated from single-cell sequencing studies: C8 for human tissue and M8 for mouse tissue. Tissues curated in the database include, but are not limited to, heart, immune system, brain, retina, and pancreas. All MSigDB datasets are regularly updated by funded curators. The gene sets can be used to determine cell types through the GSEA platform, which can be run via a lightweight desktop application or in R. Using the desktop application (documentation) is relatively easy, but it requires reformatting the gene expression data for each cluster that needs to be annotated. Carefully evaluate the outputs by studying the GSEA reports and statistics before deciding on a particular annotation.
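The exact reformatting depends on which GSEA mode you run. As one possibility, the sketch below assumes the preranked mode, which takes a ranked list (.rnk) file: two tab-delimited columns holding a gene identifier and a ranking score. The `write_rnk` helper and the per-cluster scores (log2 fold changes, for instance) are hypothetical.

```python
import csv
import io


def write_rnk(gene_scores, handle):
    """Write gene/score pairs as a two-column, tab-delimited ranked list,
    highest score first."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    ranked = sorted(gene_scores.items(), key=lambda kv: kv[1], reverse=True)
    for gene, score in ranked:
        writer.writerow([gene, f"{score:.4f}"])


# Hypothetical per-gene scores for one cluster versus all other cells.
cluster_scores = {"CD79A": 3.2, "MS4A1": 4.1, "LYZ": -2.5}

# An in-memory buffer keeps the example self-contained. To produce a real
# file, pass open("cluster_1.rnk", "w", newline="") instead.
buffer = io.StringIO()
write_rnk(cluster_scores, buffer)
print(buffer.getvalue())
```

You would write one such file per cluster and run each against the C8 or M8 collection.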
Tabula Muris
Tabula Muris is a repository of single-cell RNA-seq transcriptome data from mouse. A highly cited web-based database, it contains data from 20 mouse organs and tissues. To annotate your own datasets, select the relevant tissue type and enter the top differentially expressed genes from your clusters to find which cell types the clusters are most likely composed of. One caveat is that genes have to be entered one at a time, rather than as an entire list.
CellMarker 2.0
CellMarker 2.0 is a manually curated resource of cell type markers in human and mouse, drawn from more than 100,000 publications. It has a user-friendly interface and was last updated in September 2022. The updated database has several new features, including 36,300 tissue-cell type-marker entries and 29 additional types of cell markers, such as processed pseudogenes and lncRNAs.