#4: Uncovering the human genome

This week: An award from the Royal Statistical Society, and major advances since the Human Genome Project.

Jul 15, 2022

This is the fourth issue of Scientific Discovery, a weekly newsletter where I’ll share great new scientific research that you may have missed.

An award

Yesterday, the Royal Statistical Society held their annual awards ceremony for statistical excellence in data visualisation, journalism & more.

At the ceremony, I received a commendation for this article I wrote in December on the safety of Covid vaccines during pregnancy. I am really honoured by this, and to have been at the ceremony among so many great people who have been doing so much incredible work during the pandemic.

My motivation for writing that article was the same as the motivation I have for all my science writing today. Here was an issue where the evidence was so abundantly clear, but the messaging had failed the public. Rather than hearing about the safety and importance of vaccination, people were under the impression that vaccines caused infertility and miscarriage.

It is a massive understatement to say that those are not true. As I explain in the article, the evidence shows the opposite: not only are the vaccines free of these risks, they actually reduce them – by protecting women from Covid, which is especially harmful during pregnancy. Since I wrote the article, the evidence has just grown and grown.

I cannot say this enough – please encourage and support the women in your life in getting vaccinated and boosted against Covid-19. It is very hard to see people suffer for something that is so avoidable.

In this week’s newsletter, I’ve chosen to focus on two major advances since the Human Genome Project. It is hard to grasp how much of human genetics research today relies upon a ‘reference’ human genome, which was first constructed during the project. These studies are a huge advance on that reference genome, and by improving it, they unlock whole new possibilities for research.

But no more teasers, let’s get into it!

#1: We now know what the entire human genome looks like

Paper: The complete sequence of a human genome (Nurk et al., 2022)

Most people have heard of the Human Genome Project, the great initiative that launched in 1990 and aimed to determine the entire sequence that makes us human – the position of our genes, and the code within and around them.

But few people know that the effort is still ongoing. Although the project was declared ‘essentially complete’ in April 2003, it had gaps which are still being resolved.

Here’s a little background to explain what was missing.

Our DNA is pretty long: across all 23 chromosomes, human DNA is composed of more than 3 billion bases (of A, T, C or G). But we don’t have the technology to read this code from start to finish.

Instead, you can break up the genome into small fragments and read them separately. When you do this with many copies of the DNA, some of those fragments will overlap, which means you can align them and put the whole picture back together.

Fig. 1 — A diagram that shows how fragments of DNA are read and assembled together to determine the code from start to end. Each fragment that is read by sequencing technology is called a ‘read’. Some reads overlap, which means you can align them to each other to determine a length of continuous code (a ‘contig’). Overlaps between contigs help to assemble a scaffold, and then a chromosome. *Guo et al. (2017)*
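If it helps to see the overlap idea in code, here is a minimal Python sketch. It is a toy 'greedy' approach that keeps merging the two reads with the longest overlap, and it is not the method used by real assemblers or by the papers below. The function names and the example reads are made up for illustration, and it assumes error-free reads from a single strand.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        # Join the two sequences, writing the shared part only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three made-up reads that overlap by four bases each.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))  # ['ATGGCGTACCTGA']
```

Each merged result plays the role of a 'contig' in the figure. Real assemblers also have to cope with sequencing errors, both DNA strands and billions of reads, which this sketch ignores.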

Unfortunately, this isn’t always easy.

In some parts of the genome, lengths of code repeat many times over, which makes it hard to know where each fragment belongs and how long the repeated stretch really is. If you wanted to determine their sequences, you’d need technology that can read very long fragments with high accuracy, and we didn’t have these until recently.
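Here is a small Python sketch of why repeats cause trouble. The 'genome' is a made-up string with five copies of the unit TTAGGG between two unique flanks, and `find_positions` is a hypothetical helper that only looks for exact matches. A short read from inside the repeat fits in several places, while a read long enough to span the whole repeat fits in only one.

```python
def find_positions(genome: str, read: str) -> list[int]:
    """All start positions where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


# Made-up sequence: unique flanks around five copies of "TTAGGG".
genome = "ACGTCA" + "TTAGGG" * 5 + "GATCCA"

short_read = "TTAGGGTTAGGG"                  # lies wholly inside the repeat
long_read = "GTCA" + "TTAGGG" * 5 + "GATC"   # spans the repeat and both flanks

print(find_positions(genome, short_read))  # [6, 12, 18, 24]
print(find_positions(genome, long_read))   # [2]
```

The short read gives four equally good placements, so it cannot tell you how many copies of the repeat there are. The long read anchors itself in the unique sequence on either side.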

Such repetitive lengths of code are more common at the ends of chromosomes (at the ‘telomeres’) and regions in the middle (at the ‘centromeres’). And although these regions don’t typically have any genes, they’re quite important.

The telomeres on the ends of chromosomes protect the rest from being degraded when cells divide. And the centromeres are where the two copies of a chromosome are held together, and where they are pulled apart during cell division. When parts of these regions are mutated, they can prevent chromosomes from functioning properly, which can lead to various diseases and cancers – so we really want to know what all of their code looks like.

Altogether, the problems with sequencing them meant that around 8% of the genome was hidden to us until recently.

Several years ago, a big group of scientists came together to form the ‘Telomere-to-Telomere Consortium’ to uncover all the missing parts. They used multiple new technologies that could read very long fragments of DNA, aligned those fragments, and published the results in their study earlier this year.

Visual showing each of the 23 chromosomes. The parts that were uncovered by this study are highlighted in red. *Science (2022)*

There’s another interesting thing about what they did: they got their DNA samples from cells in ‘hydatidiform moles’. But… what in the world are these?

They’re abnormal growths that develop in a rare form of pregnancy, after a sperm fuses with an egg that is missing its DNA. The sperm’s chromosomes are then duplicated, so the resulting cells have 46 chromosomes, all from the father.

The condition is usually benign and treatable, and incidentally cells from it have been pretty useful for research. This is because the two copies of each chromosome in these cells are identical, so fragments of DNA from them are easier to align: they must all have come from the same version of the chromosome.

And so, with their study, we have it: the whole genome! The entire code it takes to create a human. Well, half of the time. You’ll notice in the figure that there’s no Y chromosome – that has a lot more repeat sequences, which makes it especially difficult to sequence.

But let’s move on. While scientists are closing that last gap, there are also other missing pieces in the Human Genome Project.

#2: Multiple reference genomes – the ‘pangenome’

Paper: A Draft Human Pangenome Reference (Liao et al., 2022)

A lot of research in human genetics today relies on a ‘reference genome’, as I mentioned before.

Having a reference is incredibly valuable. When fragments of DNA are sequenced, researchers can quickly compare them to the reference to find out where those fragments lie among the 3 billion base pairs of DNA.
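To give a feel for how that lookup can be fast, here is a minimal Python sketch. It indexes every short 'word' of length k in the reference once, then uses the first word of each read to jump straight to candidate positions. The function names, the tiny reference and the choice of k are all made up for illustration. Real alignment tools are far more sophisticated and tolerate mismatches, which this exact-match version does not.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-letter word in the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def locate(read: str, reference: str, index: dict[str, list[int]], k: int) -> list[int]:
    """Positions where the whole read matches the reference exactly."""
    hits = []
    # The read's first k letters give candidate positions to check in full.
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


reference = "GATTACAGGCTTAACCGGTTAGCATCG"  # made-up toy reference
k = 4
index = build_index(reference, k)

print(locate("CTTAACC", reference, index, k))  # [9]
print(locate("AAAAAAA", reference, index, k))  # []
```

The index is built once and reused for every read, which is why comparing against a reference is so much quicker than assembling everything from scratch.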

But there is a problem: the larger structure of the genome varies between people.

Some people have long segments (potentially containing many genes) that are duplicated, rearranged or absent. This is common. In 2020, researchers found that almost 4% of people had structural changes that were longer than a million base pairs.

These large structural differences are more likely to have big effects. The researchers estimated that 0.13% of people had structural changes that we already know increase the risk of some important diseases. (These are mutations that doctors are advised to tell patients about, even if their DNA was tested for some other reason.)

Some structural changes are pretty complicated. They can be long segments that have been rearranged in multiple ways multiple times in human history. To help you picture this, I found this figure below of the different rearrangements people might have on just a small section of their chromosome 17, where lots of research has focused so far.

An example of common versions of a small section of chromosome 17 that different people may have. Below is a key/legend to the figure, showing that the coloured blocks represent the position of genes. Notice how their order may vary. Also, some people have several copies of the same genes, while others don’t. *Usher and McCarroll (2015)*

Despite these big structural differences between people, the reference most researchers use is a single version of the human genome.

What this means is that when researchers try to find structural differences by comparing fragments from a sample to the reference genome, they might mistake where those fragments actually are and what surrounds them. When the fragments are shorter, or have been sequenced fewer times, it is harder to align them together and infer what the genome looks like at a broader level.

One way to help them is to have multiple reference genomes they can compare sequences to. This new study provides a way for them to do that.

Rather than having many entire linear genome references, they created a graph with nodes and edges to represent the different possibilities. Below is an example from their paper, of what a very small section of that could look like:

And below is an example of what a larger and more complicated section looks like. The RHD and RHCE genes code for the Rh protein in blood cells. We commonly describe some differences in these genes as being + or - along with the ABO system (so, together, you might have O+ blood like me).
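Setting the paper's figures aside, the basic idea of a graph reference can be sketched in a few lines of Python. This is a toy illustration and not the data structure or software from the study: the node numbers, the sequences and the `spell_paths` helper are all made up. Each node holds a piece of sequence, each edge says which piece can follow which, and every route through the graph spells out one possible version of the region.

```python
# Made-up graph: nodes 2 and 3 are alternative single letters,
# and node 5 is an optional insertion that some paths skip.
nodes = {1: "ACGT", 2: "G", 3: "T", 4: "CCA", 5: "TTTT", 6: "GA"}
edges = {1: [2, 3], 2: [4], 3: [4], 4: [5, 6], 5: [6], 6: []}


def spell_paths(nodes, edges, start, end):
    """Yield the sequence spelled by every path from start to end.

    Assumes the graph has no cycles, which keeps the search finite.
    """
    stack = [(start, nodes[start])]
    while stack:
        node, seq = stack.pop()
        if node == end:
            yield seq
            continue
        for nxt in edges[node]:
            stack.append((nxt, seq + nodes[nxt]))


for version in sorted(spell_paths(nodes, edges, 1, 6)):
    print(version)
# ACGTGCCAGA
# ACGTGCCATTTTGA
# ACGTTCCAGA
# ACGTTCCATTTTGA
```

Six small nodes are enough to describe four different versions here, without storing four separate full-length sequences.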

Using graphs like these to represent multiple genomes (47 in this study), the authors show a big boost in accuracy when inferring smaller structural changes in new samples.

This unlocks lots of new possibilities. As sequencing and genotyping become cheaper, we’ll be seeing lots more research into structural changes in DNA and the effects they might have, and this will help scientists do that.
