Oct 6, 2016

De novo Assembly of the Sitka Spruce Chloroplast Genome using Linked-Reads

Shauna Clark

The use of standard next generation sequencing (NGS) short reads has generally not been sufficient for de novo assembly applications.  However, the introduction of 10x Genomics’ GemCode technology now enables researchers to obtain long range information using short read NGS instruments.  In a paper recently published in PLOS ONE, entitled "Assembly of the Complete Sitka Spruce Chloroplast Genome Using 10X Genomics’ GemCode Sequencing Data," researchers used GemCode’s novel Linked-Read data type for de novo assembly of a complete chloroplast genome.

Researchers saw the potential in the Linked-Read approach, citing that "the GemCode platform improves on existing short read sequencing technology, and enables pooling of paired-end sequences by index, thus grouping together sequences that arise from the same original piece of DNA."  Specifically, the GemCode instrument partitions long DNA molecules (50kb or more) and tags each resulting short read sequencing library fragment derived from those molecules with the same barcode.  After sequencing, reads with the same 10x barcode can be grouped back together and identified as originating from the same long DNA molecule.  The resulting Linked-Read data type can be used to obtain long range information from short read sequencers.
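The barcode grouping described above can be illustrated with a minimal Python sketch. It assumes each read has already been paired with its 10x barcode as a simple (barcode, sequence) tuple; the function name and the toy barcodes are hypothetical and are not part of any 10x Genomics software.

```python
from collections import defaultdict


def group_reads_by_barcode(reads):
    """Collect read sequences under their shared barcode.

    `reads` is an iterable of (barcode, sequence) tuples (an assumed,
    simplified input format for illustration only).
    """
    groups = defaultdict(list)
    for barcode, sequence in reads:
        groups[barcode].append(sequence)
    return dict(groups)


# Toy data: two reads share the barcode "AAACCC".
reads = [
    ("AAACCC", "ACGTACGT"),
    ("GGGTTT", "TTGACCAA"),
    ("AAACCC", "CGTACGTA"),
]
linked = group_reads_by_barcode(reads)
```

Reads that land in the same group are candidates for having come from the same long DNA molecule, which is the long range information the Linked-Read data type provides.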

To begin, total DNA was isolated from Sitka spruce needles and used for library preparation on the GemCode instrument.  The GemCode library was then sequenced using paired-end (2x125bp) reads on Illumina's HiSeq instrument.  Before starting the assembly, the researchers tested their hypothesis that, because each cell carries many more copies of the chloroplast genome than of the nuclear genome, chloroplast reads would be associated with higher-abundance 10x barcodes.  To do so, they subsampled the sequencing reads and binned them by 10x barcode frequency into 1,000 (top 33.8% of reads), 3,000 (top 2.4%) and 5,000 (top 0.8%) reads, or 1k, 3k and 5k bins.  The reads in the 5k bin were then aligned to the closely related white spruce chloroplast genome, resulting in 99.9% coverage of the reference and confirming their hypothesis.
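As a rough illustration of the binning step, the sketch below counts how often each barcode occurs and keeps the reads whose barcode meets a frequency threshold. It assumes that a bin contains every read whose barcode appears at least that many times in the subsample; this interpretation, the function name and the tiny thresholds are assumptions made for the example, not details taken from the paper's pipeline.

```python
from collections import Counter


def bin_reads_by_barcode_frequency(reads, thresholds):
    """Return {threshold: reads whose barcode occurs >= threshold times}.

    `reads` is a list of (barcode, sequence) tuples (assumed format).
    """
    counts = Counter(barcode for barcode, _ in reads)
    return {
        threshold: [
            (barcode, seq) for barcode, seq in reads
            if counts[barcode] >= threshold
        ]
        for threshold in thresholds
    }


# Toy data: barcode A occurs 3 times, B twice, C once.
reads = [
    ("A", "ACGT"), ("A", "CGTA"), ("A", "GTAC"),
    ("B", "TTGA"), ("B", "TGAC"),
    ("C", "GGCC"),
]
# The study used thresholds of 1,000, 3,000 and 5,000; tiny values are used here.
bins = bin_reads_by_barcode_frequency(reads, thresholds=[1, 2, 3])
```

Higher thresholds give smaller bins that are enriched for reads from high-copy sequences, which is why the 5k bin was a natural starting point for the chloroplast assembly.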

The researchers then moved forward with the assembly, starting by assembling the 5k bin reads using ABySS.  The resulting contigs were then scaffolded using LINKS into a single 122,544 bp scaffold with 74 gaps.  Using the reads from the 3k bin and Sealer software, researchers were able to close 68 of the 74 gaps.  To close the remaining gaps, they created a Bloom filter (BioBloomTools utility) based on sequences from local BLAST alignments between the scaffold gaps plus 500bp flanking sequence and the closely related white spruce genome.  The identified unaligned regions plus 24bp flanking sequence were then used for filtering reads in the 3k bin that could span the gap region on the Sitka spruce chloroplast scaffold.  Using these filtered reads in Sealer simplified the de Bruijn graph to include k-mers that were from only a local area of the genome and resulted in closing 4 of the 6 remaining gaps.  A similar approach was used to close the remaining 2 gaps by filtering reads in the 1k bin and the entire read set.
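The idea behind the read filtering can be sketched in a few lines of Python. This is not BioBloomTools: a plain Python set stands in for the Bloom filter (so there are no false positives), reverse complements are ignored, and the function names, the k-mer size and the `min_hits` cutoff are all assumptions chosen for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_kmer_filter(target_sequences, k):
    """Collect the k-mers of the target regions (set used in place of a Bloom filter)."""
    kmer_filter = set()
    for seq in target_sequences:
        kmer_filter |= kmers(seq, k)
    return kmer_filter


def filter_reads(reads, kmer_filter, k, min_hits=1):
    """Keep reads that share at least `min_hits` k-mers with the target regions."""
    kept = []
    for read in reads:
        hits = sum(1 for kmer in kmers(read, k) if kmer in kmer_filter)
        if hits >= min_hits:
            kept.append(read)
    return kept


# Toy example: one target region and two reads, only one of which overlaps it.
target_regions = ["ACGTACGTTAGC"]
reads = ["CGTACGTT", "GGGGGGGG"]
kmer_filter = build_kmer_filter(target_regions, k=5)
local_reads = filter_reads(reads, kmer_filter, k=5)
```

Passing only such locally relevant reads to a gap closer keeps k-mers from elsewhere in the genome out of the de Bruijn graph, which is the simplification the paragraph above describes.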

Further BLAST alignments of the ends of the Sitka spruce chloroplast scaffold and the white spruce chloroplast reference revealed missing sequence at the 5’ and 3’ ends, 40bp and 2kb respectively.  Using a similar approach to the gap closing, read filters were created and the filtered reads were used by Konnector to fill in the missing sequences at the ends of the scaffold. The draft genome was then polished using the Genome Analysis Toolkit, which confirmed all the existing bases.  The final, complete Sitka spruce chloroplast genome is 124,049 bp long, with 38.7% GC content.
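For readers who want to check a figure such as the GC content on their own sequence, a minimal sketch follows. It assumes the genome is available as a plain string and that ambiguous bases such as N are excluded from the denominator; the function name is hypothetical.

```python
def gc_content(sequence):
    """Return the percentage of G and C among the A, C, G and T bases."""
    sequence = sequence.upper()
    gc = sum(1 for base in sequence if base in "GC")
    acgt = sum(1 for base in sequence if base in "ACGT")
    return 100.0 * gc / acgt if acgt else 0.0


# Toy sequence: 4 of the 8 bases are G or C.
example_gc = gc_content("GGCCAATT")
```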

Comparison of the sequence alignments of the de novo Sitka spruce chloroplast genome with related species showed 99.0% sequence identity and nucleotide frequencies consistent with the Norway and white spruces, while phylogenetic analysis showed that the Sitka, Norway and white spruces branched together, as expected.  Gene annotation of the Sitka spruce chloroplast genome reveals that all 114 genes are found in the same copy number and in the same order as observed in white and Norway spruce, including 74 coding genes, 4 ribosomal RNA (rRNA) genes and 36 transfer RNA (tRNA) genes.  One point of interest is that an inverted repeat in the Sitka spruce chloroplast genome has 3 nucleotide mismatches, which is unusual compared to the white spruce (1 mismatch) and Norway spruce (0 mismatches).
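Counting mismatches between the two copies of an inverted repeat amounts to comparing one copy with the reverse complement of the other. The sketch below assumes both copies have already been extracted as equal-length strings with no insertions or deletions; the function names and toy sequences are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return sequence.upper().translate(_COMPLEMENT)[::-1]


def inverted_repeat_mismatches(copy_a, copy_b):
    """Count positions where copy_a differs from the reverse complement of copy_b."""
    expected = reverse_complement(copy_b)
    if len(copy_a) != len(expected):
        raise ValueError("repeat copies must be the same length")
    return sum(1 for x, y in zip(copy_a.upper(), expected) if x != y)


# "ACCGT" is the exact reverse complement of "ACGGT"; "ACCGA" differs at one base.
perfect = inverted_repeat_mismatches("ACCGT", "ACGGT")
one_off = inverted_repeat_mismatches("ACCGA", "ACGGT")
```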

By exploiting the abundant copy number of chloroplast genomes per cell and the corresponding increase in the 10x GemCode barcodes associated with these reads, and using the related white spruce admix (PG29 genotype) chloroplast genome for scaffolding and read filtering, researchers were able to assemble the complete chloroplast genome of the Sitka spruce tree (Picea sitchensis).

The authors concluded, "By using the GemCode index sequences associated with each sequencing pair, we were able to take smaller samples of the entire read set, and the resulting targeted assemblies resulted in better contiguity and a less complicated de Bruijn graph structure, when compared to both the full read set and the lower frequency index bins. Because many of the steps in our assembly pipeline used graphs, reducing the number of reads while still maintaining high base coverage of the chloroplast genome decreased the complexity and number of paths represented in those k-mer graphs. Therefore, taking advantage of the information available to us from the index sequences led to better results in the initial assembly, gap filling and end extension steps of our Sitka spruce chloroplast genome assembly."

Since these data were generated using the original GemCode instrument and chemistry (~700K barcodes), the researchers also suggested that using the updated Chromium Controller and chemistry (~4M barcodes) could make it easier to characterize a specific locus or target sequence, such as that of plasmid and organelle genomes, by reducing the possible barcode redundancy and increasing the read resolution.

The image used in this blog is from:

Lauren Coombe, René L. Warren, et al. "Assembly of the Complete Sitka Spruce Chloroplast Genome Using 10X Genomics’ GemCode Sequencing Data." PLoS One. 2016;11(9):e0163059. doi: 10.1371/journal.pone.0163059