Annotating cells in single-cell gene expression data is a challenge that researchers are actively tackling. Defining a cell type is difficult for two fundamental reasons: first, gene expression levels are not discrete and mostly lie on a continuum; second, differences in gene expression do not always translate to differences in cellular function (Pasquini et al.). The field is growing and changing rapidly with the advent of new annotation tools and the creation of multiple scRNA-seq databases (Wang et al.). In this article, we share some popular web-based resources that you may find useful for annotating your own datasets.
Here we present two types of web resources for cell type annotation. They are relatively easy to use and do not require advanced scripting or programming skills. Researchers sometimes try both approaches, with multiple tools, when annotating cell types. It is also recommended to validate your annotation through experiments, statistical analysis, or consultation with experts.
1. Databases with marker genes for manual annotation: These databases contain marker genes for various cell types (mostly human and mouse). Using the marker genes in these databases, you can annotate the cells in your datasets. For example, if you are using 10x Genomics Loupe Browser, you can find the top differentially expressed genes for each cluster and then search for those genes in the database to see whether they are marker genes for specific cell types. You may need to search several top genes in each cluster to be confident about the cell type. Depending on the complexity of the dataset and your prior knowledge of the cell types, this process can be laborious and time-consuming.
2. Resources for automated cell type annotation (reference-based): There are tools developed by the community for automatically annotating cells by comparing new data with existing references. In this article, we will highlight a few web tools that do not require any programming skills. A major limitation of these tools is that the quality of the results heavily depends on the quality of the pre-annotated reference datasets.
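The manual lookup described in item 1 can be approximated in a few lines of Python. This is only a minimal sketch, not the workflow of any particular database: the `MARKERS` dictionary is a small hand-written stand-in for a marker database export, the function name `rank_cell_types` is hypothetical, and the score (fraction of a cell type's markers found among a cluster's top genes) is an assumption chosen for simplicity.

```python
# Illustrative marker sets (assumption): in practice these would be
# exported from a marker database rather than typed by hand.
MARKERS = {
    "T cell": {"CD3D", "CD3E", "CD2", "IL7R"},
    "B cell": {"CD79A", "MS4A1", "CD19", "CD79B"},
    "Monocyte": {"CD14", "LYZ", "FCGR3A", "S100A8"},
}


def rank_cell_types(top_genes, markers):
    """Rank cell types by the fraction of their markers found in top_genes."""
    # Upper-case the query so "cd79a" and "CD79A" match (human-style symbols).
    query = {gene.upper() for gene in top_genes}
    hits = []
    for cell_type, genes in markers.items():
        shared = sorted(query & genes)
        if shared:
            hits.append((cell_type, len(shared) / len(genes), shared))
    # Highest marker coverage first.
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


# Hypothetical top differentially expressed genes for one cluster.
cluster_top_genes = ["CD79A", "MS4A1", "CD74", "CD19", "LYZ"]
ranked = rank_cell_types(cluster_top_genes, MARKERS)

for cell_type, fraction, shared in ranked:
    print(f"{cell_type}: {fraction:.2f} of markers matched ({', '.join(shared)})")
```

A cluster that matches several markers of one cell type and only a single marker of another is a more convincing call than a single-gene match, which is why checking multiple top genes per cluster matters.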
PanglaoDB is a database of single-cell datasets and lists of cell-type markers for various tissue types in both human and mouse. It has easy-to-follow usage examples, and it is possible to explore cell types with a list of marker genes. Cell markers can be voted on by the community, so you can prioritize markers that have been upvoted.
As of June 2023, MSigDB includes two collections curated from single-cell sequencing studies: C8 for human tissue and M8 for mouse tissue. Tissues curated in the database include, but are not limited to, heart, immune system, brain, retina, and pancreas. All MSigDB datasets are regularly updated by funded curators. The gene sets can be used to determine cell types through the GSEA platform, which can be run via a lightweight desktop application or R. Using the desktop application (see the documentation) is relatively easy, but it requires reformatting the gene expression data from each cluster that needs to be annotated. Before deciding on a particular annotation, carefully evaluate the outputs by studying the GSEA reports and statistics.
Tabula Muris is a repository of single-cell RNA-seq transcriptome data from mouse. A highly cited web-based database, it contains data on 20 different mouse organs and tissues. To annotate your own datasets, you can select the relevant tissue type and enter the top differentially expressed genes in your clusters to find which cell types the clusters are most likely composed of. One caveat is that genes have to be entered one at a time rather than as an entire list.
CellMarker 2.0 is a manually curated resource of cell type markers in human and mouse, drawn from more than 100,000 publications. It has a very user-friendly interface and was last updated in September 2022. The updated database has multiple new features, such as the addition of 36,300 tissue-cell type-marker entries and 29 types of cell markers, including processed pseudogenes and lncRNAs.
Additional gene marker databases for specific tissue types:
Azimuth is a web application with a reference-based pipeline that performs normalization, visualization, cell annotation, and differential expression. The input file can be the feature-barcode matrix output from Cell Ranger. Currently, it supports seven types of human tissue (including PBMC, motor cortex, pancreas, fetal development, and lung) and one type of mouse tissue (motor cortex). An advantage of Azimuth is that it uses the popular Seurat algorithm and requires no installation or programming experience.
Tabula Sapiens provides an easy-to-use, reference-based pipeline that users can apply to annotate their own data. The PopV “Launch collab session” link opens a Python Jupyter Notebook on Google Colaboratory, with very detailed annotations and instructions. This human cell atlas contains transcriptomic data from 24 organs of 15 normal human subjects. From the consortium's homepage, users can also further investigate the gene signatures of each organ.
SciBet is a reference-based web application that users can leverage to annotate their datasets. This online tool requires a normalized gene expression file in CSV format as input. The developers also provide a series of trained references (human and mouse) that you can use directly to annotate your own data.
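As a rough illustration of what preparing a normalized CSV file can involve, the sketch below log-normalizes raw counts and writes them out with Python's standard library. The layout (one row per cell, one column per gene), the scale factor of 10,000, and the helper names `log_normalize` and `write_csv` are all assumptions made for this example; the exact orientation and normalization a given tool expects should be checked against its own documentation.

```python
import csv
import io
import math


def log_normalize(counts, scale=10_000):
    """Convert raw counts per cell to log1p(count / cell_total * scale)."""
    normalized = {}
    for cell, gene_counts in counts.items():
        total = sum(gene_counts.values())
        normalized[cell] = {
            # Cells with no counts at all are left as zeros.
            gene: math.log1p(count / total * scale) if total else 0.0
            for gene, count in gene_counts.items()
        }
    return normalized


def write_csv(normalized, genes, handle):
    """Write one row per cell and one column per gene (assumed layout)."""
    writer = csv.writer(handle)
    writer.writerow(["cell"] + genes)
    for cell, values in normalized.items():
        writer.writerow([cell] + [f"{values.get(gene, 0.0):.4f}" for gene in genes])


# Toy raw counts (made up for illustration).
raw_counts = {
    "cell_1": {"CD3D": 5, "CD79A": 0, "LYZ": 5},
    "cell_2": {"CD3D": 0, "CD79A": 8, "LYZ": 2},
}
genes = ["CD3D", "CD79A", "LYZ"]

normalized = log_normalize(raw_counts)
buffer = io.StringIO()  # swap for open("expression.csv", "w", newline="") to write a file
write_csv(normalized, genes, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```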
Even with high-quality tools and databases available, annotating clusters or cells from single-cell assays remains challenging. This is due to multiple factors, such as different experimental designs and differences between normal and diseased tissues. In addition, there may be conflicting information from multiple sources, or longer-than-usual marker gene lists. In these cases, researchers can look at marker genes that overlap between different sources and clarify further through a literature search. Satisfactory annotation may sometimes require an extensive literature search as well as functional assays at the bench.
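Finding marker genes that overlap between sources is straightforward to script. The following is a minimal sketch under stated assumptions: the three source names and their gene lists are placeholders rather than real database exports, and `consensus_markers` is a hypothetical helper that keeps genes reported by at least `min_sources` of the lists.

```python
from collections import Counter


def consensus_markers(sources, min_sources=2):
    """Return genes that appear in at least min_sources of the marker lists."""
    counts = Counter()
    for genes in sources.values():
        # Use a set so a gene repeated within one source is counted once.
        counts.update({gene.upper() for gene in genes})
    return sorted(gene for gene, n in counts.items() if n >= min_sources)


# Placeholder marker lists for one cell type from three hypothetical sources.
b_cell_sources = {
    "source_a": ["CD79A", "MS4A1", "CD19", "CD74"],
    "source_b": ["MS4A1", "CD79A", "CD79B"],
    "source_c": ["CD19", "MS4A1", "IGHM"],
}

shared_by_two = consensus_markers(b_cell_sources, min_sources=2)
shared_by_all = consensus_markers(b_cell_sources, min_sources=3)
print("In at least two sources:", shared_by_two)
print("In all three sources:", shared_by_all)
```

Raising `min_sources` shortens the list to the most consistently reported markers, which can then be prioritized for literature review.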