Apr 11, 2017

From Pine Cones to Read Clouds: Re-scaffolding the Mega-genome of Sugar Pine

Kariena Dill

Conifers are the cone-bearing evergreens that dominate temperate forests around the globe and are the source of most of the world’s lumber. Often referred to as ‘megagenomes’, conifer genomes are generally several times larger than the human genome. Their global impact on both economy and ecology has driven significant investment in conifer genomics projects, yielding the largest genome assemblies accomplished to date.

The colossal scale of conifer genomes is largely due to insertions of transposable elements, making them highly repetitive and particularly challenging to assemble (Nystedt et al. 2013; Stevens et al. 2016). The problem of assembling conifer genomes can be greatly simplified by the use of a gymnosperm megagametophyte, a maternally derived tissue within each seed that contains a haploid genome. A de novo assembly for sugar pine (v1.0) was published in 2016 (Stevens et al.). For this assembly, researchers used genomic DNA derived from a single megagametophyte to construct a set of Illumina paired-end sequencing libraries. The resulting assembly is represented in 202,322 scaffolds with a scaffold NG50 of 247 Kbp. This v1.0 sugar pine assembly is the largest and most contiguous conifer de novo assembly ever published.
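For readers unfamiliar with the NG50 statistic quoted above, here is a minimal Python sketch of how it is commonly computed: sort scaffolds from longest to shortest and report the length of the scaffold at which the running total first reaches half of the estimated genome size. The function name and the toy lengths are illustrative assumptions, not data or code from the sugar pine project.

```python
def ng50(scaffold_lengths, genome_size):
    """Return the scaffold NG50 for a list of scaffold lengths.

    NG50 is the length L such that scaffolds of length >= L together
    cover at least half of the *estimated genome size* (N50 uses half
    of the total assembly length instead).
    """
    half = genome_size / 2
    running_total = 0
    for length in sorted(scaffold_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0  # the assembly covers less than half of the genome


# Toy example with made-up lengths (hypothetical, not sugar pine data).
toy_lengths = [80, 70, 50, 40, 30, 20]
toy_genome_size = 400

print(ng50(toy_lengths, toy_genome_size))   # NG50 against the genome size
print(ng50(toy_lengths, sum(toy_lengths)))  # N50: same idea, assembly length
```

Because NG50 is measured against the genome size rather than the assembly length, it allows assemblies of the same genome to be compared even when they differ in total assembled bases.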

In a recent article published in G3, Crepeau et al. sought to create a more contiguous sugar pine reference assembly. They started by isolating genomic DNA from a megagametophyte produced by the same mother tree used for the v1.0 assembly. This sibling haplotype is expected to be ~50% identical to the genome represented in the original assembly. Using the GemCode platform from 10x Genomics, they constructed 5 barcoded sequencing libraries (1.2 ng input gDNA per library) and sequenced them to a depth of 4.25x raw coverage on standard Illumina platforms (2 libraries on a HiSeq2500 and 3 on a HiSeq4000).

Genome scaffolding was performed with the FragScaff software program (Adey et al. 2014) using a BAM file of the barcoded sequencing reads described above and the v1.0 draft assembly. The resulting final assembly contains 84,884 scaffolds covering 25.5 billion base pairs and has a scaffold NG50 of 1.94 Mbp. This v1.5 sugar pine assembly represents a roughly 8-fold improvement in scaffold NG50 over the original assembly and was achieved with relatively light sequencing coverage and low amounts of input DNA. Importantly, the authors highlight that only 1% of the high molecular weight DNA extracted for this project was used in preparing the 5 GemCode libraries, leaving 99% of the DNA for subsequent studies.
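As a quick sanity check on the improvement, the arithmetic can be reproduced from the figures quoted in this post. The snippet below is only a back-of-the-envelope sketch; the variable names are our own, and the values are the rounded numbers given above rather than exact assembly statistics.

```python
# Figures as quoted in this post (rounded values).
v1_0_ng50_bp = 247_000      # v1.0 scaffold NG50, 247 Kbp
v1_5_ng50_bp = 1_940_000    # v1.5 scaffold NG50, 1.94 Mbp
v1_0_scaffolds = 202_322    # v1.0 scaffold count
v1_5_scaffolds = 84_884     # v1.5 scaffold count

# Fold change in contiguity and reduction in the number of scaffolds.
ng50_fold = v1_5_ng50_bp / v1_0_ng50_bp
scaffold_reduction = v1_0_scaffolds / v1_5_scaffolds

print(f"NG50 fold change: {ng50_fold:.2f}")
print(f"Scaffold count reduction: {scaffold_reduction:.2f}x")
```

The NG50 ratio comes out just under 8, and the number of scaffolds drops by a little more than half.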

In California forests, sugar pine survival has been threatened by both fungal pathogens and insect pests (read more here). The improved v1.5 sugar pine genome assembly will help researchers understand the genetic basis of pathogen resistance and inform breeding strategies for reforestation. More generally, the protocol described in this article demonstrates the utility and scalability of 10x barcoded libraries for assembling the immense and largely repetitive genomes of conifers.

Read the full article here.

Additional Resources:

  • de novo Application Note
  • Linked-Read Technical Note

Additional References:

Nystedt, B., N. R. Street, A. Wetterbom, A. Zuccolo, Y.-C. Lin, et al., 2013 "The Norway spruce genome sequence and conifer genome evolution." Nature 497.

Stevens, K. A., J. L. Wegrzyn, A. Zimin, D. Puiu, M. Crepeau, et al., 2016 "Sequence of the sugar pine megagenome." Genetics 204.