Guest Author: Jennifer MacArthur
10x Genomics is making single cell RNA sequencing more accessible, and optimizing high-throughput methods to measure other single cell modalities, like cell-surface protein, chromatin accessibility, copy number variation, and more. The best way to realize the power of that multiomic information is to turn those disparate single cell data points into one unified dataset.
In a recent 10x webinar, Dr. Rahul Satija explained how you can harmonize single cell sequencing data across technologies. His lab at New York Genome Center and NYU developed Seurat v3 as an R package for single cell genomics. This toolkit helps users identify and interpret sources of heterogeneity from single cell transcriptomic measurements, and integrate diverse types of single cell data.
We know integration of single cell datasets, across technologies or data modalities, is of great interest to the single cell community and that our customers will find tremendous value in Seurat v3. Here are some answers to common questions, based on the Q&A with Dr. Satija, and his recent publication, "Comprehensive Integration of Single-Cell Data" (Stuart T et al. Cell. 2019).
How does dataset integration with Seurat v3 work?
The strategy for integration starts with identifying matching cell pairs across datasets. These "anchors" represent a similar biological state, and each one is weighted based on the overlap in the nearest neighbors of its two cells. In the context of these anchor cell relationships, Seurat v3 transforms the datasets into a shared space. This creates a reference to transfer data and metadata from one experiment to another. The method only requires that there be some overlap in cell populations between datasets. It performs robustly when datasets are produced with different technologies, and even when they are generated from different biological contexts.
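To make the idea of matching cell pairs concrete, here is a minimal Python sketch that pairs cells that are each other's nearest neighbors across two datasets. It is an illustration only, not Seurat's implementation: the function names, the toy two-dimensional coordinates, and the choice of Euclidean distance are all assumptions, and the neighbor-overlap weighting described above is left out.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two cells given as coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(queries, targets, k):
    """For each query cell, return the set of indices of its k nearest target cells."""
    result = []
    for q in queries:
        ranked = sorted(range(len(targets)), key=lambda j: euclidean(q, targets[j]))
        result.append(set(ranked[:k]))
    return result


def find_anchor_pairs(dataset_a, dataset_b, k=1):
    """Return (i, j) pairs where cell i in A and cell j in B are mutual nearest neighbors."""
    a_to_b = nearest_neighbors(dataset_a, dataset_b, k)
    b_to_a = nearest_neighbors(dataset_b, dataset_a, k)
    return [
        (i, j)
        for i in range(len(dataset_a))
        for j in sorted(a_to_b[i])
        if i in b_to_a[j]
    ]


# Hypothetical cells already placed in a shared low-dimensional space.
dataset_a = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0)]  # third cell has no counterpart in B
dataset_b = [(0.2, 0.1), (5.1, 4.8)]

anchors = find_anchor_pairs(dataset_a, dataset_b, k=1)
print(anchors)
```

In this toy example the third cell in dataset A has no counterpart in dataset B, so it receives no anchor, which mirrors the point that only some overlap between datasets is required.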
How do you integrate single cell ATAC-seq data with single cell gene expression?
To find anchors for integrating single cell ATAC-seq and RNA-seq data, the user starts with the assumption that chromatin accessibility is positively correlated with gene expression. Seurat determines "gene activity" based on open chromatin reads in gene regulatory regions and identifies matching cells in the single cell RNA-seq dataset. Even if only a subset of genes exhibit coordinated behavior across RNA and chromatin modalities, Seurat v3 can still perform effective integration.
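As a rough illustration of the "gene activity" idea, the Python sketch below sums ATAC-seq read counts from peaks that overlap a gene plus an upstream flank. The 2 kb flank, the forward-strand assumption, the data layout, and the gene names are all hypothetical choices made for this example; it should not be read as the exact procedure Seurat uses.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def gene_activity(peak_counts, genes, upstream=2000):
    """Sum peak read counts per gene for a single cell.

    peak_counts: dict mapping (chrom, start, end) -> read count in this cell
    genes: dict mapping gene name -> (chrom, start, end), assumed forward strand
    upstream: assumed size of the regulatory flank added before the gene start
    """
    activity = {}
    for name, (chrom, start, end) in genes.items():
        window_start = max(0, start - upstream)
        activity[name] = sum(
            count
            for (p_chrom, p_start, p_end), count in peak_counts.items()
            if p_chrom == chrom and overlaps(p_start, p_end, window_start, end)
        )
    return activity


# Hypothetical gene coordinates and per-cell peak counts.
genes = {
    "GENE_A": ("chr1", 10000, 15000),
    "GENE_B": ("chr1", 50000, 52000),
}
peak_counts = {
    ("chr1", 8500, 9000): 3,    # falls in the flank upstream of GENE_A
    ("chr1", 12000, 12500): 2,  # falls inside the GENE_A body
    ("chr1", 30000, 30500): 4,  # intergenic, not assigned to either gene
    ("chr2", 10000, 10500): 5,  # different chromosome
}

activity = gene_activity(peak_counts, genes)
print(activity)
```

Running this per cell gives a gene-by-cell matrix that can be compared with RNA-seq expression values, under the stated assumption that accessibility and expression are positively correlated.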
How do you really know if the integration analysis worked?
The integration and classification are based on probabilities. Each individual cell classification and each anchor has a prediction score for how confident Seurat is in making the call. Those numbers can give you an estimate for how well the integration analysis worked. For example, if you were to integrate two datasets that had no shared biological populations between them, you would very likely get some anchors, just because of experimental noise, but all of them would have very low confidence scores.
Exploring the underlying molecular data in each dataset independently is an important step in interpreting the results of an integration analysis. Even for high confidence calls, you need external validation to be certain that the integration is biologically meaningful.
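One way to picture a per-cell prediction score is as a weighted vote among the anchors linked to that cell. The Python sketch below is a simplified illustration under that assumption, not Seurat's actual scoring; the function name, the anchor weights, the cell type labels, and the 0.5 cutoff are all hypothetical.

```python
def transfer_label(anchor_weights, anchor_labels):
    """Return (predicted label, score) for one query cell.

    anchor_weights: non-negative weights linking the query cell to anchors
    anchor_labels: the reference label carried by each anchor
    The score is the share of total weight that supports the winning label.
    """
    totals = {}
    for weight, label in zip(anchor_weights, anchor_labels):
        totals[label] = totals.get(label, 0.0) + weight
    total = sum(totals.values())
    if total == 0:
        return None, 0.0
    best = max(totals, key=totals.get)
    return best, totals[best] / total


MIN_SCORE = 0.5  # hypothetical cutoff for flagging low-confidence calls

# Hypothetical query cells with their anchor weights and anchor labels.
cells = {
    "cell_1": ([0.5, 0.25, 0.25], ["T cell", "T cell", "B cell"]),
    "cell_2": ([0.25, 0.25, 0.25, 0.25], ["T cell", "B cell", "NK cell", "Monocyte"]),
}

results = {name: transfer_label(w, labels) for name, (w, labels) in cells.items()}
low_confidence = [name for name, (_, score) in results.items() if score < MIN_SCORE]

print(results)
print(low_confidence)
```

Cells whose score falls below the cutoff are good candidates for the closer, dataset-by-dataset inspection described above, and even the confident calls still need external validation.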
When can you get away with just merging datasets without integration?
You don’t necessarily need to run integration analysis every time you have multiple datasets. For example, if you are doing different runs of the same experiment, it may be faster to normalize and merge the data directly. Even in these cases, significant batch effects often make direct analysis difficult. Batch effects can originate from a number of different sources, including sequencing depth. The Seurat v3 integration procedure effectively removes technical distinctions between datasets while ensuring that biological variation is kept intact.
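For the simple case of merging repeat runs, the Python sketch below log-normalizes each cell by its total counts and then pools the cells under run-prefixed names. It is a minimal sketch of the general normalize-then-merge idea; the function names, the scale factor of 10,000, and the toy gene counts are assumptions for illustration. Note that it only adjusts for sequencing depth and does nothing about other sources of batch effects.

```python
import math


def log_normalize(cell_counts, scale_factor=10_000):
    """Scale one cell's counts to a common total, then apply log(1 + x)."""
    total = sum(cell_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in cell_counts}
    return {
        gene: math.log1p(count / total * scale_factor)
        for gene, count in cell_counts.items()
    }


def normalize_and_merge(runs):
    """Normalize every cell and pool all runs, prefixing cell names with the run name."""
    merged = {}
    for run_name, cells in runs.items():
        for cell_name, counts in cells.items():
            merged[f"{run_name}_{cell_name}"] = log_normalize(counts)
    return merged


# Hypothetical runs: the same expression profile sequenced at 10x different depth.
runs = {
    "run1": {"cellA": {"CD3E": 2, "MS4A1": 0, "LYZ": 8}},
    "run2": {"cellA": {"CD3E": 20, "MS4A1": 0, "LYZ": 80}},
}

merged = normalize_and_merge(runs)
print(merged)
```

After normalization the two toy cells have identical values even though one was sequenced ten times deeper, which is the situation where a direct merge is most likely to be enough.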
There are lots of reasons why you may need help to match cell populations across multiple datasets. You might want to compare how the same cell type responds to different experimental perturbations, or across different experimental samples. You may want to do cross species analysis or cross modality analysis. Ultimately, integrating diverse modalities associated with single cell sequencing datasets can power a more holistic understanding of cellular identity and function.