Quality control (QC) of single cell RNA-seq data is an important step before moving on to a variety of downstream analyses and making biological conclusions. The major goals of performing QC and filtering include:
- Generating metrics that help assess the sample quality and decide whether to proceed to downstream analyses.
- Removing poor quality data and noise that may confound analysis and interpretation.
Here, we discuss the most common metrics and methods used for QC and filtering of cell barcodes in single cell RNA-seq data for gene expression analysis. Most of the methods listed here impact the cell barcodes included for downstream analysis, which may in turn change the clustering results and visualization.
If you are analyzing single cell RNA-seq data generated using 10x Genomics technologies, the following two steps should be completed before filtering the data.
- Run Cell Ranger to process the raw data. Cell Ranger is a set of analysis pipelines that process Chromium single cell data to align reads, generate feature-barcode matrices, perform clustering and other secondary analysis, and more. See the Cell Ranger documentation for more information. The output feature-barcode matrix contains the data used for QC and filtering of cell barcodes.
- Plot the distribution of potential filtering metrics, if possible. Before deciding which metrics and thresholds to use for filtering, it is good practice to visualize the distribution of the data to gauge the overall data quality and check for unexpected phenotypes. Some examples of these metrics are listed in the next section. Plot types that work well for visualizing quality metrics include violin plots, box plots, and density plots.
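As a minimal sketch of how such metrics could be derived before plotting, the Python snippet below computes the total UMI count, the number of detected features, and the percent of mitochondrial counts for each barcode. The nested-dictionary input, the function name `qc_metrics`, and the `MT-` gene-name prefix are assumptions made for illustration only; in practice, packages such as Seurat or Scanpy compute these values from the feature-barcode matrix.

```python
from typing import Dict


def qc_metrics(
    counts: Dict[str, Dict[str, int]], mt_prefix: str = "MT-"
) -> Dict[str, Dict[str, float]]:
    """Compute per-barcode QC metrics from {barcode: {gene: umi_count}}.

    Assumption: mitochondrial genes are recognized by a name prefix
    ("MT-" for human, "mt-" for mouse); the match is case-insensitive.
    """
    metrics = {}
    for barcode, genes in counts.items():
        total_umis = sum(genes.values())
        # A feature counts as detected if at least one UMI was observed.
        n_features = sum(1 for c in genes.values() if c > 0)
        mt_umis = sum(
            c for g, c in genes.items() if g.upper().startswith(mt_prefix.upper())
        )
        pct_mt = 100.0 * mt_umis / total_umis if total_umis > 0 else 0.0
        metrics[barcode] = {
            "total_umis": total_umis,
            "n_features": n_features,
            "pct_mt": pct_mt,
        }
    return metrics


# Toy data (hypothetical barcodes and counts).
toy_counts = {
    "AAAC": {"CD3E": 5, "MS4A1": 0, "MT-CO1": 5},
    "AAAG": {"CD3E": 1, "MS4A1": 2, "MT-CO1": 1},
}
toy_metrics = qc_metrics(toy_counts)
```

The resulting per-barcode values are what would typically be passed to a violin, box, or density plot.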
Here we discuss the common metrics and methods used in publications and popular online tutorials for filtering 10x Genomics single cell data. For each filter, we briefly explain the rationale behind using the metric and the potential caveat of the filter when possible. For the filters discussed in points 1 to 3 below, the filtering can be performed using various community developed packages (including Seurat and Scanpy) or the Recluster functionality in the Loupe Browser. For other filters and methods, we provide a few examples of community-developed tools for researchers to explore.
1. Filtering cell barcodes by UMI counts: The total UMI count associated with a cell barcode represents the absolute number of observed transcripts in the droplet. Barcodes associated with unusually high UMI counts might be multiplets (i.e. one droplet containing multiple cells), whereas barcodes with low UMI counts might be droplets containing ambient RNA but no real cell. Therefore, filtering cell barcodes by UMI counts may help eliminate barcodes that do not represent a single cell. UMI count thresholds in the published literature range from arbitrary cutoffs to data-driven thresholds, e.g. three to five standard deviations or median absolute deviations from the median (You et al., 2021; Ocasio et al., 2019). In Cell Ranger, a minimum of 500 UMI counts is required in the second step of cell calling: barcodes with fewer than 500 UMI counts are not regarded as cells. However, when the sample is highly heterogeneous, a single UMI count threshold for the whole sample may not be suitable because it may eliminate real single cells with very high or very low RNA content, for example neutrophils (Document CG000444).
2. Filtering cells by number of features: As with UMI counts, barcodes associated with an unusually high number of features might be multiplets, whereas barcodes with a low number of features might be droplets containing ambient RNA but no real cell. Some publications use arbitrary cutoffs, while others use a data-driven threshold, e.g. two to five standard deviations or median absolute deviations from the median (You et al., 2021; Ocasio et al., 2019). However, when the sample is highly heterogeneous, a single threshold on the number of features may not be suitable because it may eliminate real single cells that express an unusually high or low number of genes.
3. Filtering cells by percent of mitochondrial (mt) reads: An increased level of transcripts from mt DNA has been associated with unhealthy cell states (Osorio & Cai, 2021). It can also result from broken cells, where cytoplasmic RNA has leaked out of the cell while the mt RNA retained in the mitochondria is still captured in the single cell assay (Ilicic et al., 2016). For thresholds on the percent of mt reads, some publications use arbitrary cutoffs, while others use a data-driven threshold, e.g. three to five standard deviations or median absolute deviations from the median (You et al., 2021; Ocasio et al., 2019). It is worth noting that the expression level of mt genes can vary among samples (Osorio & Cai, 2021). For some cell types (e.g. cardiomyocytes), high expression of mt genes may be biologically meaningful, and filtering cell barcodes on this metric may bias the analysis.
4. Filtering cells by doublet detection using community tools: The presence of doublets or multiplets may confound the analysis because a multiplet can contain cells of different cell types. In addition to filtering on UMI counts and number of features per cell, a number of community-developed software tools can identify multiplets in single cell data. Many of the algorithms used in these tools generate artificial doublets and calculate a doublet score by comparing the gene expression profile of each barcode in the data with those of the artificial doublets. Examples include DoubletFinder, Scrublet, and Solo. Setting the threshold on the doublet score can be subjective and data dependent, so it is recommended to check the doublet score distribution, as mentioned in the best practices for using Scrublet. The benchmarking study of computational doublet-detection methods for single cell RNA sequencing data (Xi & Li, 2021) is also a good resource for understanding the performance of various doublet detection tools.
5. Identifying and removing empty droplets based on the expression profile: The 10x Genomics Chromium technology is designed such that most droplets, or Gel Beads-In-EMulsions (GEMs), contain at most one cell; droplets without a cell may be empty or contain only ambient RNA from the solution. As a result, a substantial number of droplets do not contain an intact cell. To distinguish cell-containing droplets from empty droplets, Lun et al. developed a method called emptyDrops (Lun et al., 2019), which is adopted in Cell Ranger for cell calling. The emptyDrops method first derives an "ambient profile" from the gene expression profiles of droplets with a small UMI count. Barcodes whose profile differs significantly from the ambient profile are regarded as cells. Other software tools for detecting empty droplets or cell debris based on expression profile include EmptyNN, CellBender, and DIEM.
6. Removing ambient RNAs associated with barcodes: In droplet-based methods, ambient RNAs can be enclosed in a droplet together with an intact cell. Ambient RNAs may come from contaminants or from RNAs released by unhealthy cells. Contamination from ambient RNAs can therefore distort UMI counting and the downstream analysis of gene expression (Caglayan et al., 2022). Several software tools have been developed to remove ambient RNA signal from single cell RNA-seq data, including SoupX, DecontX, and CellBender.
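As a rough illustration of the data-driven thresholds mentioned for filters 1 to 3, the Python sketch below flags barcodes whose metric lies more than a chosen number of median absolute deviations (MADs) from the median. The function names, the default of three MADs, and the toy UMI counts are assumptions for illustration; the sketch uses the raw, unscaled MAD and does not reproduce the exact procedure of any cited publication or package.

```python
from statistics import median
from typing import List, Sequence


def mad(values: Sequence[float]) -> float:
    """Median absolute deviation from the median (unscaled)."""
    med = median(values)
    return median([abs(v - med) for v in values])


def mad_outliers(values: Sequence[float], n_mads: float = 3.0) -> List[bool]:
    """Return True for values more than n_mads MADs away from the median."""
    med = median(values)
    spread = mad(values)
    lower = med - n_mads * spread
    upper = med + n_mads * spread
    return [v < lower or v > upper for v in values]


# Toy UMI counts per barcode (hypothetical): one very high, one very low.
umi_counts = [900, 1000, 1100, 950, 1050, 5000, 30]
flags = mad_outliers(umi_counts, n_mads=3.0)

# Keep only barcodes that were not flagged as outliers.
kept = [c for c, is_outlier in zip(umi_counts, flags) if not is_outlier]
```

The same function could be applied to the number of features or the percent of mt reads. Whether to filter on both tails or only one (for example, only high mt percentages) depends on the metric and the sample.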
Things to watch out for:
- The impact of filtering and the resulting data quality can only be judged by the performance of downstream analyses, so filtering can be an iterative process. It may be helpful to begin with permissive filtering and then revisit the filtering parameters if the downstream results cannot be interpreted (Luecken & Theis, 2019; Germain et al., 2020). In some cases, it may also be beneficial to perform rough cell type annotation before filtering to avoid filtering out biologically meaningful cells.
- No single set of thresholds for the metrics listed above applies to all datasets. Tutorials mention thresholds as starting points. For example, the Seurat Guided Clustering Tutorial uses arbitrary filters for the demonstrated dataset, e.g. filtering cells with unique feature counts over 2,500 or fewer than 200 and filtering cells with >5% mitochondrial counts. Publications often list the thresholds that were chosen for their data. However, this does not imply that the same filters should be applied to every dataset. The type and number of filters applicable to a dataset depend heavily on the sample and cell type as well as the biological question. Different tissues or cell types may display distinct characteristics (e.g. heterogeneity in RNA content and mitochondrial gene expression), so the same set of filters may not be appropriate for all cell types. In some cases, cluster-specific QC and filtering within the same dataset may be beneficial (Schmidt et al., 2021). If available, literature on single cell experiments with similar samples or cell types can help gauge which filtering parameters may be needed.
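To illustrate the idea of cluster-specific filtering, the Python sketch below computes an upper threshold on a metric (here the percent of mt reads) separately within each cluster, using the median plus a chosen number of median absolute deviations. The function names, the cluster labels, and the toy values are hypothetical, and this is not the method of Schmidt et al.; it only shows how a cluster with naturally high mt content can be retained while an outlier in another cluster is still flagged.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Hashable, List, Sequence


def upper_mad_threshold(values: Sequence[float], n_mads: float = 3.0) -> float:
    """Median plus n_mads times the (unscaled) median absolute deviation."""
    med = median(values)
    spread = median([abs(v - med) for v in values])
    return med + n_mads * spread


def flag_high_per_cluster(
    values: Sequence[float], clusters: Sequence[Hashable], n_mads: float = 3.0
) -> List[bool]:
    """Flag values above a threshold computed within each value's own cluster."""
    by_cluster: Dict[Hashable, List[float]] = defaultdict(list)
    for value, cluster in zip(values, clusters):
        by_cluster[cluster].append(value)
    thresholds = {
        cluster: upper_mad_threshold(vals, n_mads)
        for cluster, vals in by_cluster.items()
    }
    return [value > thresholds[cluster] for value, cluster in zip(values, clusters)]


# Toy data (hypothetical): cluster "A" has low mt content with one outlier,
# cluster "B" has uniformly high mt content (e.g. a cardiomyocyte-like cluster).
pct_mt = [2.0, 3.0, 2.5, 3.5, 20.0, 30.0, 32.0, 28.0, 31.0, 29.0]
cluster_labels = ["A"] * 5 + ["B"] * 5
high_mt_flags = flag_high_per_cluster(pct_mt, cluster_labels, n_mads=3.0)
```

In this toy example only the 20% barcode in cluster "A" is flagged, while every barcode in cluster "B" is kept even though its mt percentages are higher.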
- Technical Note: Neutrophil Analysis in 10x Genomics Single Cell Gene Expression Assays (CG000444)
- Caglayan, Emre, Yuxiang Liu, and Genevieve Konopka. "Ambient RNA analysis reveals misinterpreted and masked cell types in brain single-nuclei datasets." bioRxiv (2022).
- Germain, Pierre-Luc, Anthony Sonrel, and Mark D. Robinson. "pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single cell RNA-seq preprocessing tools." Genome Biology 21.1 (2020): 1-28.
- Ilicic, Tomislav, et al. "Classification of low quality cells from single-cell RNA-seq data." Genome Biology 17.1 (2016): 1-15.
- Luecken, Malte D., and Fabian J. Theis. "Current best practices in single-cell RNA-seq analysis: a tutorial." Molecular Systems Biology 15.6 (2019): e8746.
- Lun, Aaron TL, et al. "EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data." Genome Biology 20.1 (2019): 1-9.
- Ocasio, Jennifer Karin, et al. "scRNA-seq in medulloblastoma shows cellular heterogeneity and lineage expansion support resistance to SHH inhibitor therapy." Nature Communications 10.1 (2019): 1-17.
- Osorio, Daniel, and James J. Cai. "Systematic determination of the mitochondrial proportion in human and mice tissues for single-cell RNA-sequencing data quality control." Bioinformatics 37.7 (2021): 963-967.
- Schmidt, Florian, et al. "RCA2: a scalable supervised clustering algorithm that reduces batch effects in scRNA-seq data." Nucleic Acids Research 49.15 (2021): 8505-8519.
- Xi, Nan Miles, and Jingyi Jessica Li. "Benchmarking computational doublet-detection methods for single-cell RNA sequencing data." Cell Systems 12.2 (2021): 176-194.
- You, Yue, et al. "Benchmarking UMI-based single-cell RNA-seq preprocessing workflows." Genome Biology 22.1 (2021): 1-32.