Completing the incomplete human genome: a 20-year quest

Missing pieces of the human genome puzzle

The first draft of the reference human genome was released more than 20 years ago (Lander et al., 2001; Venter et al., 2001). Since then, the reference has undergone major rounds of amendments to incorporate newly assembled sequences and correct alignment errors. The latest patch, GRCh38.p13/hg38, is the most up-to-date human genome assembly.

        However, GRCh38.p13/hg38 is far from complete: roughly 8% of the genome remains unresolved, including 151 megabase pairs of unknown sequence (Nurk et al., 2022). These missing sequences, known as gaps, are highly repetitive and distributed throughout the genome. Because short reads are computationally difficult to map to repetitive sequences (Goerner-Potvin and Bourque, 2018), gaps are conventionally omitted from downstream analyses.

        Gaps are predominantly tandem arrays in centromeric and pericentromeric regions, complex repeats in telomeres and subtelomeres, and the short arms of acrocentric chromosomes. They can span multiple megabases. For example, the entire p-arms (short arms) of the acrocentric chromosomes 13, 14, 15, 21 and 22 are missing and are represented by stretches of unknown bases (‘N’s) (Nurk et al., 2022). Because centromeres and telomeres participate in fundamental biological processes, the inability to resolve the sequences underlying gaps precludes a comprehensive picture of the regulatory landscape of the human genome.
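
Because gaps appear in an assembly as runs of ‘N’, they can be located with a simple scan. The sketch below is illustrative only: the function name `find_gaps` and the toy sequence are assumptions made for this example, not part of any published pipeline, and a real analysis would stream a FASTA file chromosome by chromosome.

```python
import re
from typing import List, Tuple


def find_gaps(sequence: str, min_length: int = 1) -> List[Tuple[int, int]]:
    """Return 0-based, half-open (start, end) intervals for runs of 'N'.

    `find_gaps` is a hypothetical helper written for this illustration.
    Runs shorter than `min_length` are ignored.
    """
    gaps = []
    for match in re.finditer(r"[Nn]+", sequence):
        if match.end() - match.start() >= min_length:
            gaps.append((match.start(), match.end()))
    return gaps


# Toy sequence (made up): three runs of unknown bases around known sequence.
toy_sequence = "NNNNNACGTACGTNNNACGTN"

gaps = find_gaps(toy_sequence)
gap_bases = sum(end - start for start, end in gaps)

print(gaps)                            # [(0, 5), (13, 16), (20, 21)]
print(gap_bases, len(toy_sequence))    # 9 21
```

Raising `min_length` keeps only the longer runs, which is one way to separate multi-megabase gaps from isolated ambiguous bases.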

Long-read sequencing comes to the rescue

Recent advances in long-read sequencing technologies have offered the scientific community exciting opportunities to close the gaps in the reference human genome.

        Pacific Biosciences (PacBio) single-molecule real-time (SMRT) sequencing and Oxford Nanopore Technologies (ONT) sequencing are two emerging platforms in the field (Logsdon et al., 2020). Whereas PacBio SMRT sequencing reads out a sequence from the fluorescent signals emitted as nucleotides are newly incorporated, ONT sequencing identifies individual bases from the changes in electric current as a DNA strand traverses a nanopore (Figure 1). With read lengths that can exceed 100 kilobases, both platforms have significantly increased the length of mappable contiguous sequences (contigs) generated per run, which in turn has fostered community efforts to resolve the gaps.
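
Why read length matters for repetitive regions can be illustrated with a toy example. The sketch below uses exact string matching on made-up sequences; real aligners tolerate mismatches and use indexed search, so this is only a simplified picture. The helper `exact_match_positions` and all sequences are assumptions introduced for illustration.

```python
from typing import List


def exact_match_positions(reference: str, read: str) -> List[int]:
    """Return every 0-based start where `read` matches `reference` exactly.

    Hypothetical helper for illustration; overlapping matches are allowed.
    """
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


# Toy reference (made up): a tandem array flanked by unique sequence.
left_flank = "GATTACAGGC"
repeat_unit = "ACGTT"
right_flank = "CCTGAGTCAA"
reference = left_flank + repeat_unit * 6 + right_flank

# A short read lying entirely inside the array fits at several positions.
short_read = repeat_unit * 2
# A long read spanning the array and part of both flanks fits at only one.
long_read = "AGGC" + repeat_unit * 6 + "CCTG"

print(exact_match_positions(reference, short_read))  # [10, 15, 20, 25, 30]
print(exact_match_positions(reference, long_read))   # [6]
```

The short read cannot be placed unambiguously, whereas the long read is anchored by the unique sequence on either side of the array.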

        PacBio SMRT sequencing and ONT sequencing have already resolved some of the gaps in the human genome. In 2015, Chaisson and colleagues employed PacBio SMRT sequencing to resolve the complete sequence of numerous euchromatic gaps (Chaisson et al., 2015). In 2018, Jain et al. performed de novo assembly of a human genome using ONT sequencing reads (Jain et al., 2018). These seminal studies provided a proof of concept for the feasibility and utility of long-read sequencing technologies in completing the reference human genome.

Figure 1. PacBio single-molecule real-time (SMRT) sequencing and Oxford Nanopore Technologies (ONT) sequencing use distinct sequencing chemistries and sequence detection approaches. A: adenine; C: cytosine; G: guanine; T: thymine; dNTP: deoxynucleoside triphosphate. Adapted from Logsdon et al., 2020.

Charting the uncharted

Aiming to create a seamless human genome, the Telomere-to-Telomere (T2T) Consortium has combined the merits of PacBio SMRT sequencing and ONT sequencing to chart the entire genetic blueprint, using CHM13hTERT as a model cell line. CHM13hTERT is derived from a complete hydatidiform mole (CHM) and transformed with the human telomerase reverse transcriptase gene. With a duplicated paternal genome (46, XX) and no observable chromosomal abnormalities, the cell line serves as an ideal single-haplotype model for genome assembly.

        Since the launch of the project, the T2T Consortium has successfully filled in the gaps of chromosome X (Miga et al., 2020), chromosome 8 (Logsdon et al., 2021) and, more recently, the entire haplotype of CHM13hTERT cells (Nurk et al., 2022). Beyond closing the gaps, the assembly introduces about 200 million new bases, roughly the length of chromosome 3, of which 75-90% are repetitive sequences (Nurk et al., 2022). Using the gapless T2T-CHM13 genome assembly, researchers have already started investigating the epigenetic signatures and genetic variation of repetitive elements within gaps (Altemose et al., 2022; Gershman et al., 2022; Hoyt et al., 2022).

        Moving forward, the T2T Consortium has also set out to sequence chromosome Y from HG002 (GM24385), a diploid male lymphoblastoid cell line with a normal karyotype (46, XY). Capitalizing on the success of CHM13hTERT, the consortium is also working actively with the Human Pangenome Reference Consortium (HPRC) to generate haplotype-phased genomes from more than 350 individuals representing different ancestries (Figure 2). A reference pangenome that captures the diversity of human genetic variation is expected to enable less biased and better-informed genomic and epigenomic studies (Miga and Wang, 2021).

Figure 2. The Human Pangenome Reference Consortium (HPRC) is a team-oriented initiative that solicits multi-disciplinary collaborations. Adapted from Miga and Wang, 2021.

More questions than answers

Work by the T2T Consortium has enabled interrogation of repetitive sequences in previously difficult-to-probe regions at unprecedented resolution, opening up new avenues of research into their biological implications. Naturally, as with many groundbreaking discoveries, the report of the first gapless human genome is accompanied by a list of follow-up questions.

        The T2T-CHM13 assembly harbours a compendium of 62 novel repeat classes (Hoyt et al., 2022), greatly expanding the atlas of repetitive sequences in the human genome (Logsdon et al., 2021; Miga et al., 2020; Nurk et al., 2022) (Figure 3). One novel class, termed composite repeats, consists of tandem arrays of three or more repetitive sequences (Hoyt et al., 2022). For instance, the TELO_Comp composite repeat comprises three subunits of about 3 kilobase pairs each (TELO-A, -B and -C), each containing multiple transposable elements (TEs) (Figure 4) (Hoyt et al., 2022). Notably, most composite repeats reside within a single chromosome, and about half of them (8 out of 19) overlap with protein-coding annotations (Hoyt et al., 2022), prompting questions about their evolutionary trajectories. Furthermore, the biological significance of the highly structured organization of TEs in composite repeats awaits characterization. As a starting point, integrating existing precision nuclear run-on sequencing (PRO-seq) and ONT sequencing data could shed light on the transcriptional regulation and epigenetic signatures of composite repeats.

Figure 3. T2T-CHM13 presents an expanded catalogue of repetitive elements in the human genome. ERV: endogenous retrovirus; LINE: long interspersed nuclear element; SINE: short interspersed nuclear element; SVA: SINE/VNTR/Alu element. Adapted from Hoyt et al., 2022.

Figure 4. TELO_Comp belongs to the newly defined class of composite repeats, comprising highly ordered tandem arrays of TELO-A, -B and -C subunits. Adapted from Hoyt et al., 2022.

        In a companion paper, Hoyt et al. took advantage of the newly assembled T2T-CHM13 to investigate transcriptional changes at centromeres during cell cycle progression (Hoyt et al., 2022). In humans, centromeres are defined by alpha satellites, an AT-rich repeat family composed of ~171 base pair monomers, which serve as the basic units of higher-order repeats (HORs) (Altemose et al., 2022) (Figure 5). HORs bound by the centromere-specific histone protein CENP-A are referred to as active HORs and mark the position of kinetochore assembly during cell division (Altemose et al., 2022). Using PRO-seq, Hoyt and colleagues observed transcription of the active HORs, but not the other HORs, across all phases of the cell cycle (Hoyt et al., 2022). Interestingly, the active HORs exhibit a hypomethylation pattern that is conserved from developing to terminally differentiated cells (Hoyt et al., 2022). The findings point to a possible interplay between transcription of active HORs, kinetochore assembly and proper distribution of genetic material to daughter cells during mitosis. Further mechanistic and functional studies are needed to disentangle the relationship, if any, between these three biological processes.

Figure 5. The human centromere is defined by alpha satellites, flanked by pericentromeric satellite repeat families including human satellites 1-3, beta and other satellites. α-Sat: alpha satellite; β-Sat: beta satellite; HSat1-3: human satellites 1-3. Adapted from Altemose et al., 2022.

Outlook

The T2T Consortium is an ambitious endeavor that has propelled a paradigm shift in genome assembly using long-read sequencing technologies. The celebrated release of a truly complete reference human genome has not only paved the way to study the biology of the once “dark matter” of the human genome, but has also extended the legacy of the Human Genome Project in transforming genomic research. After all this time, we are finally one step closer to the ultimate goal of the Human Genome Project: to understand how each base in the human genome influences health and disease.

Bibliography

Altemose, N., Logsdon, G.A., Bzikadze, A.V., Sidhwani, P., Langley, S.A., Caldas, G.V., Hoyt, S.J., Uralsky, L., Ryabov, F.D., and Shew, C.J. (2022). Complete genomic and epigenetic maps of human centromeres. Science 376, 6588.

Chaisson, M.J., Huddleston, J., Dennis, M.Y., Sudmant, P.H., Malig, M., Hormozdiari, F., Antonacci, F., Surti, U., Sandstrom, R., and Boitano, M. (2015). Resolving the complexity of the human genome using single-molecule sequencing. Nature 517, 608-611.

Gershman, A., Sauria, M.E., Hook, P.W., Hoyt, S.J., Razaghi, R., Koren, S., Altemose, N., Caldas, G.V., Vollger, M.R., and Logsdon, G.A. (2022). Epigenetic patterns in a complete human genome. Science 376, 6588.

Goerner-Potvin, P., and Bourque, G. (2018). Computational tools to unmask transposable elements. Nat Rev Genet 19, 688-704.

Hoyt, S.J., Storer, J.M., Hartley, G.A., Grady, P.G., Gershman, A., de Lima, L.G., Limouse, C., Halabian, R., Wojenski, L., and Rodriguez, M. (2022). From telomere to telomere: the transcriptional and epigenetic state of human repeat elements. Science 376, 6588.

Jain, M., Koren, S., Miga, K.H., Quick, J., Rand, A.C., Sasani, T.A., Tyson, J.R., Beggs, A.D., Dilthey, A.T., and Fiddes, I.T. (2018). Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol 36, 338-345.

Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., and FitzHugh, W. (2001). Initial sequencing and analysis of the human genome. Nature 409, 860-921.

Logsdon, G.A., Vollger, M.R., and Eichler, E.E. (2020). Long-read human genome sequencing and its applications. Nat Rev Genet 21, 597-614.

Logsdon, G.A., Vollger, M.R., Hsieh, P., Mao, Y., Liskovykh, M.A., Koren, S., Nurk, S., Mercuri, L., Dishuck, P.C., and Rhie, A. (2021). The structure, function and evolution of a complete human chromosome 8. Nature 593, 101-107.

Miga, K.H., Koren, S., Rhie, A., Vollger, M.R., Gershman, A., Bzikadze, A., Brooks, S., Howe, E., Porubsky, D., and Logsdon, G.A. (2020). Telomere-to-telomere assembly of a complete human X chromosome. Nature 585, 79-84.

Miga, K.H., and Wang, T. (2021). The need for a human pangenome reference sequence. Annu Rev Genom Hum Genet 22, 81-102.

Nurk, S., Koren, S., Rhie, A., Rautiainen, M., Bzikadze, A.V., Mikheenko, A., Vollger, M.R., Altemose, N., Uralsky, L., and Gershman, A. (2022). The complete sequence of a human genome. Science 376, 6588.

Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., and Holt, R.A. (2001). The sequence of the human genome. Science 291, 1304-1351.
