The case of the missing T cells: why integrated multimodal data holds the key
“We want to be able to move beyond a descriptive analysis. We want to move beyond just saying, how many cell types do we have in our sample? And instead, we want to start to identify the key transcription factors and genetic regulators that influence the system. And that's what paired RNA and ATAC-seq data can help us start to do.”
Dr. Rahul Satija, Core Faculty Member, New York Genome Center
Dr. Rahul Satija, head of the Satija Lab and Core Faculty Member at the New York Genome Center, shared these insights in a recent webinar introducing the capabilities of Seurat v4, the latest version of his lab's R toolkit for the analysis and integration of single cell datasets. Review his talk to learn how integrating multimodal single cell datasets, including gene expression and chromatin accessibility data, can unmask not only previously hidden cellular heterogeneity in your samples, but also key epigenetic regulators of cellular identity and function.
Why a one-sided view of biological complexity won’t cut it
What defines a cell? The answer is perhaps as dynamic as a cell itself, and recent studies suggest that a single perspective, whether a view of the cell’s whole transcriptome or of a few select surface proteins, is not enough to capture the full picture of cellular heterogeneity. We know, for example, that a sample of human peripheral blood mononuclear cells (PBMC) not only contains diverse cell types—T cells, B cells, natural killer cells, monocytes, and dendritic cells, each with a unique catalog of canonical cell surface proteins—but also dynamic cell states. These are determined by biological features that go deeper than the surface of the cell. As researchers seek to more clearly define cellular identity and characterize dynamic cell states in their samples, single cell multiomics—the analysis and integration of datasets from different omic groups—will be essential for discovery. But how can researchers balance the perspective provided by the proteome with that provided by the transcriptome or the epigenome? Does one ‘ome’ have more weight in defining a cell’s identity than another?
Dr. Rahul Satija and his team have developed statistical models that address these very questions through the integration of multimodal data types, or single cell multiomics. In a recent webinar, now available on-demand, Dr. Satija described a study of the cellular contents of a human bone marrow sample via CITE-seq, a method to annotate cell surface proteins using pre-selected antibody panels alongside whole transcriptome single cell gene expression from the same cells. The bone marrow includes a mix of differentiated and undifferentiated cells. While the antibody panel his team used would identify specific surface proteins for differentiated cells, including various T-cell markers, it excluded any markers for undifferentiated cells present in the bone marrow, including progenitor populations like hematopoietic stem cells and myelocytic progenitors. Because of this, they found that the separate protein and RNA readouts had different strengths regarding specific cell type annotation. As Dr. Satija described, “we saw lots of beautiful T-cell clusters in the protein data, but they all kind of form a blob in the RNA analysis. On the flip side, we have lots of nice progenitor clusters in the RNA data, but they're not nearly as distinct in the protein data.”
How can these two datasets be harmonized? Dr. Satija described a statistical workflow called weighted nearest neighbor analysis, implemented in Seurat v4. With this tool, each cell in a sample is assigned modality weights that reflect how useful each data type, in this case RNA and protein, is for defining that cell’s identity. “If we take a weighted combination of the modalities, we end up with a single representation of the dataset, which we call a weighted nearest neighbor graph that we hope encompasses the richness of both data types together. And we can use that single representation to draw a single UMAP plot or to derive a single clustering.” In their study of bone marrow, weighted nearest neighbor integrated analysis enabled clear separation of both progenitor subtypes and T-cell subsets, leveraging “the best of both worlds” from the two data types to clearly define cell type and state.
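To make the idea of a weighted combination concrete, here is a minimal Python sketch. It is not Seurat's implementation (Seurat is an R toolkit, and its algorithm derives the per-cell weights from the data). In this toy version the weights are set by hand, the coordinates are invented, and the function names `weighted_neighbors` and `same_label_fraction` are hypothetical. It only illustrates how per-cell modality weights can blend two sets of distances into one neighbor graph.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def weighted_neighbors(rna, protein, rna_weight, k=1):
    """For each cell, return its k nearest neighbors under a per-cell
    weighted sum of RNA-space and protein-space distances.

    rna_weight[i] is the (assumed, hand-set) weight cell i gives to RNA;
    the protein weight is 1 - rna_weight[i].
    """
    n = len(rna)
    neighbors = []
    for i in range(n):
        w = rna_weight[i]
        scored = []
        for j in range(n):
            if j == i:
                continue
            d = (w * euclidean(rna[i], rna[j])
                 + (1.0 - w) * euclidean(protein[i], protein[j]))
            scored.append((d, j))
        scored.sort()
        neighbors.append([j for _, j in scored[:k]])
    return neighbors

def same_label_fraction(neighbors, labels):
    """Fraction of cells whose neighbors all share the cell's own label."""
    hits = sum(
        all(labels[j] == labels[i] for j in nbrs)
        for i, nbrs in enumerate(neighbors)
    )
    return hits / len(labels)

# Invented toy coordinates: T-cell subsets overlap in RNA space but are
# distinct in protein space; progenitor types show the opposite pattern.
labels = ["T-A", "T-A", "T-B", "T-B", "Prog-X", "Prog-X", "Prog-Y", "Prog-Y"]
rna = [(0.0, 0.0), (0.3, 0.0), (0.1, 0.0), (0.4, 0.0),
       (5.0, 0.0), (5.2, 0.0), (5.0, 4.0), (5.2, 4.0)]
protein = [(0.0, 0.0), (0.2, 0.0), (3.0, 0.0), (3.2, 0.0),
           (8.0, 0.0), (8.3, 0.0), (8.1, 0.0), (8.4, 0.0)]

# Hand-set weights (an assumption): T cells lean on protein,
# progenitors lean on RNA.
per_cell_weights = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]

rna_only = same_label_fraction(
    weighted_neighbors(rna, protein, [1.0] * 8), labels)
protein_only = same_label_fraction(
    weighted_neighbors(rna, protein, [0.0] * 8), labels)
combined = same_label_fraction(
    weighted_neighbors(rna, protein, per_cell_weights), labels)

print("RNA only:", rna_only)
print("Protein only:", protein_only)
print("Weighted combination:", combined)
```

In this toy setup, either modality alone pairs only half of the cells with a neighbor of the same type, while the weighted combination does so for all eight. The design point is that the weight is chosen per cell rather than once for the whole dataset, so each population can rely on whichever modality separates it best.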
For more information about weighted nearest neighbor analysis, review the vignette from the Satija lab.
Pinpointing the regulators in a sea of genes
Dr. Satija described the application of weighted nearest neighbor analysis to the integration of single cell gene expression and chromatin accessibility data as well. With the new Chromium Single Cell Multiome ATAC + Gene Expression product from 10x Genomics, researchers can simultaneously measure whole transcriptome gene expression and regions of chromatin accessibility in the same single cells. In his words, this “provides a window into the cell's regulatory landscape. And understanding how a cell's regulatory landscape affects gene expression is really a fundamental challenge in biology.”
Once again, the addition of another data type on top of single cell gene expression data supports refined cellular annotation. In a study of paired single cell RNA-seq and ATAC-seq data from PBMCs, Dr. Satija and his team noted that single cell ATAC-seq data helped to clearly separate T-cell subtypes. Examining RNA and ATAC data across a cluster of cells, they plotted the chromatin accessibility profile for the CD8A locus and compared it to the RNA expression of the CD8A gene.
“We expected the RNA to be the most informative modality most of the time, just because in the ATAC-seq data there's only two copies of DNA, and so it's usually even more sparse than single cell RNA data. But we actually find that the ATAC-seq data injects some very useful information into the analysis. [...] It's really obvious from the ATAC-seq data which clusters represent CD8 T cells, and which clusters don't. In fact, it's actually even more obvious in the ATAC data than it is in the RNA data, which is remarkable.”
In addition to refining cell state annotation, ATAC-seq data can help to uncover the epigenetic regulators that control dynamic cell states. Dr. Satija offered a potentially familiar scenario, describing his team’s efforts to look for regulator genes with only single cell gene expression data:
“You don't know which genes are actually important regulators, which are the pioneers that regulate cellular identity, and which ones don't. [We have] tried to pick a few genes to follow-up on for functional analysis [...] and we basically just have to guess at random. [...] That's where the ATAC-seq data is exciting because you can look for motifs that are enriched for peaks that are accessible for a cell type. And that suggests that a transcription factor is binding them. So that's a hypothesis for a functional experiment.”
Moreover, with the ability to directly link gene expression and chromatin accessibility in the same cells, researchers can not only confirm the functional role of known transcription factors, but also discover new regulators with cell type specificity. Dr. Satija offered the transcription factor PU.1 as an example. “We know that transcription factor is essential for the development and maintenance of monocytes. Now, if we look at the expression of PU.1, we can see that it's specifically and most highly expressed in monocytes. So that makes sense. But what's remarkable is that since we have paired RNA and ATAC data, we can also look at the accessibility for the PU.1 motifs. [...] What we see is that the same cells that express PU.1 at the RNA level, they also have enriched accessibility for the PU.1 motif at the DNA level. So those are two completely independent and complementary lines of evidence that PU.1 really is playing an important role in monocytes.” This analytical approach in turn points to an “effective heuristic for nominating what we think are the most important regulators of any cell type from paired RNA and ATAC-seq data.”
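The heuristic Dr. Satija describes, requiring that a transcription factor be both expressed and have its motif accessible in the same cell type, can be sketched in a few lines of Python. This is an illustrative assumption rather than the Satija lab's method: the enrichment score (target value minus the mean of the other cell types), the `min_gap` threshold, the function names, and the placeholder factors `TF_A`, `TF_B`, and `TF_C` with their numbers are all invented for the example.

```python
def enrichment(values, target):
    """Value in the target cell type minus the mean of all other types."""
    others = [v for name, v in values.items() if name != target]
    return values[target] - sum(others) / len(others)

def nominate_regulators(expression, motif_access, target, min_gap=0.5):
    """Return factors enriched in BOTH RNA expression and motif
    accessibility for the target cell type, strongest first.

    min_gap is an arbitrary illustrative threshold, not a recommended value.
    """
    nominated = []
    for tf in expression:
        rna_gap = enrichment(expression[tf], target)
        atac_gap = enrichment(motif_access[tf], target)
        if rna_gap >= min_gap and atac_gap >= min_gap:
            # Rank by the weaker of the two lines of evidence.
            nominated.append((min(rna_gap, atac_gap), tf))
    nominated.sort(reverse=True)
    return [tf for _, tf in nominated]

# Invented per-cell-type averages for three placeholder factors.
expression = {
    "TF_A": {"monocyte": 3.0, "T cell": 0.2, "B cell": 0.4},
    "TF_B": {"monocyte": 2.5, "T cell": 0.5, "B cell": 0.5},
    "TF_C": {"monocyte": 0.3, "T cell": 0.2, "B cell": 0.4},
}
motif_access = {
    "TF_A": {"monocyte": 2.0, "T cell": -0.5, "B cell": -0.5},
    "TF_B": {"monocyte": 0.1, "T cell": 0.0, "B cell": 0.2},
    "TF_C": {"monocyte": 1.8, "T cell": -0.2, "B cell": 0.0},
}

candidates = nominate_regulators(expression, motif_access, "monocyte")
print(candidates)
```

Here only `TF_A` passes, because `TF_B` is expressed without motif enrichment and `TF_C` shows motif enrichment without expression. A real analysis would replace these simple differences with proper statistical tests on both modalities.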
Expanding regulator discovery potential with paired RNA and ATAC data
The datasets Dr. Satija’s team analyzed reinforced the functional role of canonical transcription factors in the development, maturation, and maintenance of particular immune cell states. But the possibilities don’t end there. He expressed excitement for potential discoveries across different cell populations, systems, and organs. “If you generate paired single cell RNA and ATAC data in your system of interest, the brain, the kidney, the heart, the lung, the liver, the spleen, you may be able to make new discoveries there.” Indeed, whatever systems researchers are examining, from brain tissue to tumors, there is huge potential to unlock the regulators that determine dynamic cellular subtypes and states, in both health and disease.
Thank you to Dr. Satija for his inspiring insights into the power of integrating multimodal data and its application to fundamental biological questions. To learn more about weighted nearest neighbor analysis, pioneered in the latest update to Seurat v4, read the preprint from the Satija lab. If you’d like to get started with your own multimodal analysis, walk through the Seurat v4 vignette leveraging data derived from a study of PBMCs with the Chromium Single Cell Multiome ATAC + Gene Expression product from 10x Genomics.
And finally, review Dr. Satija’s webinar at your convenience with this on-demand recording.