Analysis Guides
Oct 13, 2021

Navigating 10x Genomics Barcoded BAM Files

Share via:

Note: 10x Genomics does not provide support for community-developed tools and makes no guarantees regarding their function or performance. Please contact tool developers with any questions. If you have feedback about Analysis Guides, please email [email protected].

Introduction

Cell Ranger's gene expression algorithm aligns and filters reads to determine which UMIs are included in the filtered feature-barcode matrix to be used in downstream analyses. This tutorial walks through one method for obtaining the counts from the filtered feature barcode matrix starting with the 10x Genomics BAM file (i.e. possorted_genome_bam.bam). Cell Ranger generates two matrices as output from the pipeline. This tutorial will focus on the filtered version.

TypeDescription
Unfiltered feature-barcode matrixContains every barcode from the fixed list of known-good barcode sequences that has at least 1 read.
Filtered feature-barcode matrixContains only detected cellular barcodes.

The filtered feature-barcode matrix has the cell barcode as the column header, feature name as the row header and the feature count per barcode as the body of the matrix.

This tutorial will not rebuild the matrix but instead it will provide the information needed to do so if desired.

Prerequisite Skills

  • Basic knowledge of Universal Molecular Identifiers (UMIs)
  • Familiarity with the Linux command line
  • Familiarity with the BAM/SAM standard
  • Capacity to run Cell Ranger, see System Requirements

Software and Packages Used

  • samtools (included in Cell Ranger install)

Learning Goals/Objectives

  • Identify differences between the filtered and unfiltered barcode matrixes
  • Understand the structure of Filtered Feature Barcode Matrix
  • Use the xf flag to extract BAM records that are used for UMI counting
  • Determine the number of times a feature is counted for a cell associated barcode

Data

The data used in this tutorial came from the 10x Genomics public datasets repository and they represent 1000 Peripheral Blood Mononuclear Cells (PBMCs) from a Healthy Donor. Libraries were generated with 10x Genomics 3' Single Cell Gene Expression v3 chemistry and data were analyzed with Cell Ranger version 3.0. These data were selected for demonstration purposes. Larger datasets may take longer to analyze.

1k PBMCs from a Healthy Donor (v3 chemistry)

File TypeFile Sizemd5sum
Genome-aligned BAM4.79 GBb47f986a532d4c4ecd351511b345dae8
Feature / cell matrix (filtered)9.61 MB0bb2f675b59c3d208200aea57a31044b

Please find data download links and instructions below in the 'Full tutorial'.

Format

The commands in this tutorial are meant to be copy/pasted.

Some of the commands listed in this tutorial contain pipes "|", which sends the output of the first command to the input of the second command without writing the output to disk. This is especially useful in bioinformatics when the datafiles that we work with are large. For example, this will list the files in the current directory, then count them.

ls -l | wc -l

For some of the long conmands containing pipes in this tutorial, we dissect each piece of the command for you using a comment, noted with a "#". The lines preceded by a # are not meant to be run. But if you accidentally run them, nothing will be executed because the # at the beginning makes everything after it silent. On these comment lines, the piece of the command is listed on the left followed by the description of what it does on the right following a tab, like this:

# samtools view pbmc_1k_v3_possorted_genome_bam_filterd.bam     View and convert BAM to SAM

Sometimes the line before a command will be commented to explain what the following command does. For example:

# Extract the BAM header and write to header_filted_sam
samtools view -H pbmc_1k_v3_possorted_genome_bam.bam > header_filted_sam 

Quickstart

Determine Which Barcodes are Cell-Associated

samtools view pbmc_1k_v3_possorted_genome_bam_filterd.bam | cut -f 12- | tr "\t" "\n" | grep "CB:Z:" | uniq > cell_barcodes.txt 
# samtools view pbmc_1k_v3_possorted_genome_bam_filterd.bam     View and convert BAM to SAM
# grep 'xf:i:25'                                                Find counted UMIs
# cut -f 12-                                                    Remove first 12 columns SAM file
# tr "\t" "\n"                                                  Replace all tabs with a new line 
# grep 'CB:Z:'                                                  Find lines that start with CB:Z

Use Cell-Associated Barcodes to Extract Feature Counts per Barcode

samtools view pbmc_1k_v3_possorted_genome_bam.bam | grep "xf:i:25" | cut -f 12- | grep "CB:Z:CGAGAAGTCAAGTTGC" | tr "\t" "\n" | grep "GX:Z" |awk "{count[$1]++} END {for(j in count) print count[j], j}" | sort -nr > feature_counts_per_barcode.txt
# samtools view pbmc_1k_v3_possorted_genome_bam_filterd.bam     View and convert BAM to SAM
# grep "xf:i:25"                                                Grep counted reads
# cut -f 12-                                                    Remove first 12 columns SAM file
# grep "CB:Z:CGAGAAGTCAAGTTGC"                                  Print lines with "CB:Z:CGAGAAGTCAAGTTGC". ***Replace this with your barcode of interest***
# tr "\t" "\n"                                                  Replace all tabs with a new line 
# grep "CB:Z:""                                                 Grep lines that start with CB:Z
# awk "{count[$1]++} END {for(j in count) print count[j], j}"   Count each occurrence of a barcode 
# sort -nr                                                      Sort numerically and in reverse order

Full Tutorial

For this tutorial we will use one of the 10x Genomics public datasets. You can download the BAM file and count matrix directly by clicking the file name or using a command line downloading utility (e.g. wget or curl) as demonstrated below.

wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_possorted_genome_bam.bam
wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_filtered_feature_bc_matrix.tar.gz

# Unpack the pbmc_1k_v3_analysis.tar.gz files 
tar -xzvf pbmc_1k_v3_filtered_feature_bc_matrix.tar.gz

Let's start by parsing our BAM files to obtain our count matrix.

Going from the BAM file to the filtered feature barcode matrix requires the following steps:

  1. Identify cell associated barcodes
  2. Determine the number of times a feature is counted for each barcode

Let's get started.

1. Identify Cell-Associated Barcodes

For a UMI to be counted it must meet specific criteria:

  • Have a MAPQ score of 255 (see the STAR manual, section 4.2).
  • Maps to exactly one gene (as shown in the GX tag of the BAM file)
  • Overlaps an exon by at least 50% in a way consistent with annotated splice junctions and strand annotation. Records that align to exons will have an RE:A:E tag.
  • Remove any records with matching UMI and Barcode values that map to different genes.

In Cell Ranger 3, 10x Genomics introduced, xf, a bitwise alignment flag and the bits are as follows:

NameValueDescription
CONF_MAPPED1The read is confidently mapped to the transcriptome.
LOW_SUPPORT_UMI2This read's (BC, UMI, feature) combination was discarded in favor of a different feature with higher read support.
GENE_DISCORDANT4This read pair maps to a discordant pair of genes, and is not treated as a UMI count
UMI_COUNT8This read is representative for the molecule and can be treated as a UMI count
CONF_FEATURE16Confidently assigned feature barcode
FILTERED_TARGET_UMI32This read was removed by targeted UMI filtering.

BAM records that contribute to UMI counting have an xf:i:25 tag (i.e. CONF_MAPPED + UMI_COUNT + CONF_FEATURE).

In this tutorial we will use the version of samtools that is bundled with Cell Ranger. To use this samtools you can run the following command:

source /PATH/TO/CellRanger/sourceme.bash

replacing /PATH/TO/CellRanger/ with the path to your Cell Ranger install.

Next, filter the BAM file so that it only includes records with xf:i:25. This will reduce the size of our BAM file and the time required to sort it.

# Filter BAM file for records with 'xf:i:25' and write that to body_filtered_sam. body_filtered_sam is on the sam format and will not include the BAM header. 
samtools view pbmc_1k_v3_possorted_genome_bam.bam | LC_ALL=C grep "xf:i:25" > body_filtered_sam

# Extract the BAM header and write to header_filted_sam
samtools view -H pbmc_1k_v3_possorted_genome_bam.bam > header_filted_sam 

# Combine the header and extracted records
cat header_filted_sam body_filtered_sam > pbmc_1k_v3_possorted_genome_bam_filterd.sam

# Convert SAM to BAM 
samtools view -b pbmc_1k_v3_possorted_genome_bam_filterd.sam > pbmc_1k_v3_possorted_genome_bam_filterd.bam

The xf flag takes care of:

  • Collapsing UMIs
  • Eliminating chimeric reads
  • Identifying records with a MAPQ score of 255
  • Eliminating reads that map to more than one gene
  • Aligning to an exon by at least 50% in a way consistent with annotated splice junctions and strand annotation.

so we don't have to worry about them. Without the xf tag, several sorting and filtering steps would need to be performed.

Now that we've sorted our BAM such that it only includes records that contribute to UMI counting, let's extract the cell associated barcodes to use in the next step.

samtools view pbmc_1k_v3_possorted_genome_bam_filterd.bam | cut -f 12- | tr "\t" "\n" | grep "CB:Z:" | uniq > cell_barcodes.txt
# cut -f 12-                                                    Remove first 12 columns SAM file
# tr "\t" "\n"                                                  Replace all tabs with a new line 
# grep 'CB:Z:'                                                  Grep lines that start with CB:Z
# uniq                                                          Omit repeated lines

Now we can take a quick look at this file:

# Print first 10 lines of cell_barcodes.txt
head cell_barcodes.txt

Your output should look like this:

CB:Z:GCTGCAGGTTGACGGA-1
CB:Z:CTGCCTATCTTGATTC-1
CB:Z:TCACACCCAGCAGAAC-1
CB:Z:GCGAGAAAGTTGTAGA-1
CB:Z:GGTCTGGTCAGACATC-1
CB:Z:TGAGCGCCACTATGTG-1
CB:Z:GCTGAATCACACGGAA-1
CB:Z:CAACAACTCGATACTG-1
CB:Z:GTAATCGGTTGCGAAG-1
CB:Z:TTACTGTGTTAAGGGC-1

2. Determine the Number of Times a Feature Appears for Each Barcode

Now that we have identified our cell-associated barcodes we can look up the genes in each cell and how many times they are seen in our data. Below we do that using a single barcode from our list of cell associated barcodes. You can replace "CB:Z:CGAGAAGTCAAGTTGC" with your cell barcode of interest.

samtools view pbmc_1k_v3_possorted_genome_bam_filterd.bam | cut -f 12- | grep "CB:Z:CGAGAAGTCAAGTTGC" | tr "\t" "\n" | grep "GX:Z" |awk '{count[$1]++} END {for(j in count) print count[j], j}' | sort -nr > feature_counts_extracted_bam.txt
# samtools view pbmc_1k_v3_possorted_genome_bam.bam              # 
# grep "xf:i:25"                                                 Print lines with xf:i:25
# cut -f 12-                                                     Remove first 12 columns SAM file
# grep "CB:Z:""                                                  Grep lines that start with CB:Z at the beginning of the line
# tr "\t" "\n"                                                   Replace all tabs with a new line 
# grep "GX:Z"                                                    Print lines with GX:Z at the beginning of the line
# awk "{count[$1]++} END {for(j in count) print count[j], j}"    Count occurence of each barcode
# sort -nr`                                                      Sort numerically and in reverse order

Now let's take a look at the file we just created:

# Print first 10 lines of feature_counts_per_barcode.txt
head feature_counts_extracted_bam.txt

The output will look like this:

10763 GX:Z:ENSG00000211592
1018 GX:Z:ENSG00000198938
964 GX:Z:ENSG00000211896
850 GX:Z:ENSG00000198712
780 GX:Z:ENSG00000198804
671 GX:Z:ENSG00000251562
651 GX:Z:ENSG00000156508
615 GX:Z:ENSG00000211897
554 GX:Z:ENSG00000229117
546 GX:Z:ENSG00000198899

Column 1 is counts and column two is the Ensembl Gene ID proceeded by the GX BAM tag and Z tag type.

TagTypeDescription
GXZSemicolon-separated list of gene IDs that are compatible with this alignment. Gene IDs are specified with the gene_id key in the reference GTF attribute column.

Validation

Now let's compare our counts to those found in our CSV file. This can be a bit of a cumbersome task because the filtered feature-barcode matrix is large and in a format that's not easy to manipulate so let's convert this to a CSV that's a bit easier to work with.

# Print line number along with contents of barcodes.tsv.gz and genes.tsv.gz 
zcat filtered_feature_bc_matrix/barcodes.tsv.gz | awk -F "\t" 'BEGIN { OFS = "," }; {print NR,$1}' | sort -t, -k 1b,1 > numbered_barcodes.csv
zcat filtered_feature_bc_matrix/features.tsv.gz | awk -F "\t" 'BEGIN { OFS = "," }; {print NR,$1,$2,$3}' | sort -t, -k 1b,1 > numbered_features.csv

# Skip the header lines and sort matrix.mtx.gz
zcat filtered_feature_bc_matrix/matrix.mtx.gz | tail -n +4 | awk -F " " 'BEGIN { OFS = "," }; {print $1,$2,$3}' | sort -t, -k 1b,1 > feature_sorted_matrix.csv
zcat filtered_feature_bc_matrix/matrix.mtx.gz | tail -n +4 | awk -F " " 'BEGIN { OFS = "," }; {print $1,$2,$3}' | sort -t, -k 2b,2 > barcode_sorted_matrix.csv

# Use join to replace line number with barcodes and genes
join -t, -1 1 -2 1 numbered_features.csv feature_sorted_matrix.csv | cut -d, -f 2,3,4,5,6 | sort -t, -k 4b,4 | join -t, -1 1 -2 4 numbered_barcodes.csv - | cut -d, -f 2,3,4,5,6 > matrix.csv

# Remove temp files
rm -f barcode_sorted_matrix.csv feature_sorted_matrix.csv numbered_barcodes.csv numbered_features.csv

Perfect, now let's take a look at the CSV file:

# Print first 10 lines of final_matrix.csv
head matrix.csv 

The output will look like this:

AAACCCAAGGAGAGTA-1,ENSG00000000938,FGR,Gene Expression,5
AAACCCAAGGAGAGTA-1,ENSG00000001631,KRIT1,Gene Expression,2
AAACCCAAGGAGAGTA-1,ENSG00000002834,LASP1,Gene Expression,1
AAACCCAAGGAGAGTA-1,ENSG00000002933,TMEM176A,Gene Expression,1
AAACCCAAGGAGAGTA-1,ENSG00000003056,M6PR,Gene Expression,1
AAACCCAAGGAGAGTA-1,ENSG00000003402,CFLAR,Gene Expression,4
AAACCCAAGGAGAGTA-1,ENSG00000003756,RBM5,Gene Expression,2
AAACCCAAGGAGAGTA-1,ENSG00000004059,ARF5,Gene Expression,1
AAACCCAAGGAGAGTA-1,ENSG00000004700,RECQL,Gene Expression,1
AAACCCAAGGAGAGTA-1,ENSG00000004961,HCCS,Gene Expression,1
AAACCCAAGGAGAGTA-1,ENSG00000004975,DVL2,Gene Expression,1
AAACCCAAGGAGAGTA-1,ENSG00000005020,SKAP2,Gene Expression,1
AAACCCAAGGAGAGTA-1,ENSG00000005022,SLC25A5,Gene Expression,2
AAACCCAAGGAGAGTA-1,ENSG00000005075,POLR2J,Gene Expression,2
AAACCCAAGGAGAGTA-1,ENSG00000005243,COPZ2,Gene Expression,1

Great! Now we have a simpler format to work with. Now let's pick some genes to look at. Here are some of my favorites:

Gene nameEnsembl ID
NF-kappa-B essential modulator (NEMO)ENSG00000269335
HLA-BENSG00000234745
VEGFAENSG00000112715

To look at these genes, we will create a CSV file with this loop:

# For a string in list perform unix command
for i in ENSG00000269335 ENSG00000234745 ENSG00000112715 
do 
# Format and print in list, merge lines one at a time, and write to gene_ids.csv
    printf '%s\n' $i | paste -sd " " >> gene_ids.csv
done

Then we will loop through the CSV file:

# For a string in gene_ids.csv perfom unix command
for i in $(cat gene_ids.csv)
do
# Open final_matrix.csv and search for lines that include CGAGAAGTCAAGTTGC followed by string in gene_ids.csv
    cat matrix.csv | grep CGAGAAGTCAAGTTGC.*${i}
done

The output will look like this:

CGAGAAGTCAAGTTGC-1,ENSG00000269335,IKBKG,Gene Expression,2
CGAGAAGTCAAGTTGC-1,ENSG00000234745,HLA-B,Gene Expression,108
CGAGAAGTCAAGTTGC-1,ENSG00000112715,VEGFA,Gene Expression,2

Now let's compare this to the counts we extracted in the steps above (i.e counts in feature_counts_per_barcode.txt):

# For a string in gene_ids.csv perfom unix command
for i in $(cat gene_ids.csv)
do
# Open feature_counts_per_barcode.txt and search for string in gene_ids.csv
    cat feature_counts_extracted_bam.txt | grep ${i}
done

The output will look like this:

2 GX:Z:ENSG00000269335
108 GX:Z:ENSG00000234745
2 GX:Z:ENSG00000112715

This gives us the number of times these features are counted in a specific cell-associated barcode.

Conclusion

In this tutorial, we have identified differences between the filtered and unfiltered barcode matrixes, explored the structure of the Filtered Feature Barcode Matrix, and used the xf flag to extract BAM records used for UMI counting. We hope that you have found this useful in exploring the data in your own BAM files.

Related Content:

How to convert 10x BAM files to FASTQ files while preserving the barcode information?

What is the AN tag in the BAM file from cellranger count?

How do I get the read counts for each barcode?

References

PBMCs from a Health Donor (v3), Single Cell Gene Expression Dataset by Cell Ranger 3.0.0, 10x Genomics, (2019, November 19).

Zheng, Grace X.Y., Terry, Jessica M., [...] Bielas, Jason H. (2017). Massively parallel digital transcriptional profiling of single cells. Nature Communications. 8: 1-12, doi: 10.1038, ncomms14049