A large international consortium of geneticists, bioinformaticians and medical researchers has announced the publication of the first ‘end-to-end’ human reference genome sequence.

The completion of this impressive task has broad implications for medicine, evolutionary biology and population genetics, among other fields.

After the initial announcement of the achievement back in 2020, the official new reference genome and a suite of related peer-reviewed papers have been published today in the journal Science.

But wait, wasn’t the human genome already sequenced back around the year 2000? What’s going on here?

What is a genome?

A genome is the complete set of genetic material for an individual or organism.

For humans and all other living things that we know of, that genetic material is deoxyribonucleic acid, or DNA. There are some viruses – which aren’t considered living by most biologists – whose genome is made of a similar molecule called ribonucleic acid, or RNA. SARS-CoV-2 is an example of a virus with an RNA genome.

At the molecular level, our DNA is made of two strands of sugars and phosphate groups, joined together with bases. The bases on one strand pair with their partners on the other strand to create the famous double helix, or twisted ladder, shape of DNA.

Overview of DNA structure. Credit: NHS National Genetics and Genomics Education Centre via Wikimedia Commons.

The order of the four bases – adenine (A), cytosine (C), thymine (T) and guanine (G) – forms the genetic code. In other words, the order of these molecules carries the specific instructions our cells need to make the proteins and other molecules that make our bodies work.

When we sequence DNA, it’s like we are ‘reading’ the order of the four bases.

Because the bases on the two strands pair together, we usually measure genome size in terms of base pairs. Altogether, there are about three billion total base pairs in the human genome. That’s a lot of letters to read.
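For readers who like to see things in code, here is a minimal Python sketch of the base-pairing idea. It relies on the standard pairing rules (A with T, C with G) and on the fact that the two strands run in opposite directions. The function names and the toy sequence are made up for illustration.

```python
# Standard pairing rules: A pairs with T, and C pairs with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_strand(strand: str) -> str:
    """Return the partner base for each base, in the same left-to-right order."""
    return "".join(PAIRING[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the partner strand as it would be read in its own direction."""
    return complement_strand(strand)[::-1]


toy_strand = "ATGC"  # hypothetical four-base sequence
print(complement_strand(toy_strand))   # TACG
print(reverse_complement(toy_strand))  # GCAT
print(len(toy_strand), "base pairs")   # one strand is enough to count pairs
```

Because each base fixes its partner, reading one strand tells you the other, which is why genome size is counted in pairs rather than in individual bases.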

But why should we care about DNA sequences?

A lot of our genome is made up of sequences called genes, which our cells read in order to make proteins. It’s estimated that there are about 20,000-25,000 protein-coding genes in humans.

Other parts of our genome encode the instructions for making special types of RNA, or form binding sites for proteins that attach to the DNA strand and control how our genes are expressed. Still other regions don’t yet have any known function.

So by understanding the DNA sequences in the genome, we can gain an incredible amount of knowledge about all kinds of biological processes.

DNA is representative of both the unity of all life, and the tiny differences that make us individuals. We humans are all genetically different, but we share enough in common that we can all be understood with reference to a single ‘reference genome’ that lists all the genes and other DNA features found in all humans – with the understanding that our individual DNA sequences will all contain small variations that are unique to us.

It’s kind of like how each cook might put their own spin on a common recipe – that recipe still contains the crucial information that we can compare the variations to. This reference genome is what scientists are usually referring to when they say that ‘the human genome’ has been sequenced.

Why couldn’t the complete human genome be sequenced until now?

We do already have a human reference genome – you weren’t dreaming that. It was produced by the Human Genome Project (HGP), an initiative funded by the US National Institutes of Health (NIH).

When the HGP was deemed complete in 2003, the human reference genome it had produced covered about 99% of the genes in the genome – a very impressive result. But remember, there’s more to a genome than just genes.

Certain regions of the genome couldn’t be accurately sequenced and incorporated into the previous human reference genome. About 8% of the genome was still a mystery, until now.

So, an international group of scientists teamed up to form the Telomere to Telomere (T2T) Consortium and set themselves the goal of sequencing the human genome – even the trickiest parts.

The HGP mostly relied on a technique called BAC-based sequencing, short for ‘bacterial artificial chromosome’. In this technique, samples of human DNA are broken up into smaller pieces and inserted into bacterial cells.

The bacterial colonies grow and create lots of copies of each DNA fragment, which can then be extracted for sequencing. With enough bacterial colonies, each carrying a different bit of human DNA, scientists could eventually obtain sequences for most of the genome.

But the DNA from the bacteria has to be chopped into even shorter pieces in order to be sequenced. You have to use a computer to stitch all the little DNA sequence fragments back together in the right order.

This technique worked well for most of the genome. But in certain regions, the DNA sequences are repetitive, making them challenging to sequence and map accurately using this approach.
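The stitching step described above can be sketched in a few lines of Python. This is a toy "greedy overlap" approach, chosen here only for illustration: it repeatedly joins the two fragments that share the longest overlap. It is not the software the HGP used, and the reads and function names are hypothetical.

```python
def overlap(left: str, right: str) -> int:
    """Length of the longest end of `left` that matches the start of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str]) -> str:
    """Toy assembler: keep merging the pair of fragments with the largest overlap."""
    fragments = list(reads)
    while len(fragments) > 1:
        best_size, best_i, best_j = -1, 0, 1
        for i, left in enumerate(fragments):
            for j, right in enumerate(fragments):
                if i != j:
                    size = overlap(left, right)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        # Join the two fragments, writing the shared bases only once.
        merged = fragments[best_i] + fragments[best_j][best_size:]
        fragments = [f for k, f in enumerate(fragments) if k not in (best_i, best_j)]
        fragments.append(merged)
    return fragments[0]


# Three hypothetical short reads taken from the toy sequence ATGCGTACCTGA.
toy_reads = ["ATGCGT", "CGTACC", "ACCTGA"]
print(greedy_assemble(toy_reads))
```

With distinctive overlaps like these, the pieces only fit together one way. When many reads look alike, as they do in repetitive DNA, several joins can look equally good and a simple approach like this one can pick the wrong one.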

“Let’s say you’re standing on the beach and you’re looking at the ocean on a very clear day,” says Bastien Llamas, an associate professor studying genomics and epigenetics at the University of Adelaide. He wasn’t involved in the work on the new reference genome, but he’s now working with some members of the T2T Consortium on related projects.

“The ocean is blue, the sky is blue. You take, like, 50 photographs of the landscape, and then when you’re at home, you try and assemble all these photos together, and all you have is patches of blue.”

Think of the bits of blue as an analogy for the repetitive regions of DNA – it all just looks too similar to be confident about how they fit together.

The key technological advance that helped the T2T Consortium achieve their goal was the advent of accurate long-read DNA sequencing.

While short-read sequencing can sequence a few hundred base pairs at a time, newer long-read techniques can sequence tens of thousands of base pairs in one go.

Having longer reads makes it much easier to figure out how the genome fits together, especially for repetitive regions of DNA.
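To make that concrete, here is a small Python sketch under simplified assumptions: a made-up 22-base "genome" with a repetitive stretch in the middle, and error-free reads taken from every position. It counts how many reads match more than one place in the sequence and so can't be placed with confidence. Real read lengths are far longer than the toy values used here.

```python
from collections import Counter


def count_ambiguous_reads(genome: str, read_length: int) -> int:
    """Count reads whose sequence appears at more than one position in the genome."""
    reads = [
        genome[start:start + read_length]
        for start in range(len(genome) - read_length + 1)
    ]
    occurrences = Counter(reads)
    return sum(1 for read in reads if occurrences[read] > 1)


# Hypothetical genome: unique flank + 12-base repeat + unique flank.
toy_genome = "GGCTC" + "AT" * 6 + "CCGGA"

short_reads_ambiguous = count_ambiguous_reads(toy_genome, read_length=4)
long_reads_ambiguous = count_ambiguous_reads(toy_genome, read_length=16)

print("ambiguous short reads:", short_reads_ambiguous)
print("ambiguous long reads:", long_reads_ambiguous)
```

The short reads that fall entirely inside the repeat are the "patches of blue". Reads longer than the repeat always include some of the distinctive sequence on either side, so each one has only one possible home.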

Let’s think back to the 50 different photographs of blue we were trying to fit together to create a picture of the sky and sea.

“If you have a wide lens, you could take one photo of the whole landscape, and that’s it – you have it,” Llamas explains.

Long-read sequencing gives us that more panoramic view of the genome. With the power of this new technique, the T2T Consortium’s new reference genome has added almost 200 million base pairs of previously unknown sequence and identified 99 newly described likely protein-coding genes.

What does the new human reference genome mean for science?

But why do we care about that relatively small percentage of previously inaccessible DNA sequence? Well, some of these repetitive regions have really important functions.

“These parts of the human genome that we haven’t been able to study for 20-plus years are important to our understanding of how the genome works, genetic diseases, and human diversity and evolution,” says Karen Miga, an assistant professor of biomolecular engineering at the University of California Santa Cruz and co-chair of the T2T Consortium.

In our cells, strands of DNA are curled around proteins and arranged into structures called chromosomes. Telomeres are repetitive sequences found at the ends of our chromosomes.

Every time our cells divide, each chromosome has to be copied, and a little bit of the end of the chromosome gets lost. The telomeres exist to be sacrificed in this cell division process, so that we don’t lose the genes that are more important for our cells to function.

As we age and cells divide again and again, our telomeres get shorter and shorter until the cells begin to die and our tissues deteriorate. Understanding the telomere sequences could give us deeper insight into the fundamental process of ageing.

Another example is the centromeres: regions of DNA where the two copies of the chromosome meet and eventually separate during cell division.

“The centromeres play a critical role in how chromosomes segregate properly during cell division, and we’ve known for some time now that they are misregulated in all kinds of human diseases,” explains Miga.

“For the first time, we can study ‘base-by-base’ the sequences that define the centromere and can start to understand how it works.”

The new genome has also corrected some artificial errors introduced in the assembly of the previous reference.

“It’s exciting,” says Llamas of the new reference genome’s publication. “It’s solving a lot of the issues that we had with the previous reference genome.”

But he points out that there is a lot of work still to be done.

“All the databases that have been created, which contain genomic variations for thousands or hundreds of thousands of individuals, all need a reference [genome],” he explains.

“A particular variable position in the genome has its coordinates that are set compared to the reference genome. And now if you create a new reference with slightly different coordinates, it’s going to change all the information in the databases.”

This is a particular challenge for clinical genomic databases, which focus on the links between DNA variations and disease.
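The coordinate problem Llamas describes can be illustrated with a deliberately simplified Python sketch. It assumes the only difference between the old and new references is newly added sequence, described as a list of (old position, length) pairs. That format, the numbers and the function name are all hypothetical, and real coordinate conversion has to handle many more kinds of change.

```python
def lift_position(old_position: int, insertions: list[tuple[int, int]]) -> int:
    """Shift an old coordinate by the total length of sequence inserted at or before it.

    Each insertion is (old_position_where_new_sequence_starts, number_of_bases_added).
    """
    shift = sum(length for start, length in insertions if start <= old_position)
    return old_position + shift


# Hypothetical example: the new reference adds 20 bases at old position 50
# and 5 bases at old position 200.
added_sequence = [(50, 20), (200, 5)]

for variant_position in (10, 100, 250):
    print(variant_position, "->", lift_position(variant_position, added_sequence))
```

Every variant that sits after a newly filled gap gets a new address, which is why each database entry tied to the old coordinates has to be revisited.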

Furthermore, scientists may need to have a second look at existing sequence data that had never been analysed because they were from a part of the genome, like the centromeres, that wasn’t included in the previous reference.

“It’s not impossible, it’s just something that is going to require a lot of time,” Llamas concludes.

Then again, this isn’t the first update to the reference genome since 2003, and it won’t be the last. Members of the T2T Consortium, along with the Human Pangenome Consortium, are already working on a project that will attempt to incorporate more genetic variation from different individuals and populations into a reference genome.

And technically, even though it’s end-to-end, the new reference genome isn’t quite “complete”. Because the sequences that went into the new genome came from a human cell line that only contained X chromosomes, there’s no Y chromosome in the new reference.

But the new reference genome is still a pretty monumental achievement.

“Now we have a Rosetta Stone for looking at complete variation in hundreds of thousands of other genomes going forward,” says T2T Consortium member Evan Eichler, a professor in genome science at the University of Washington and the Howard Hughes Medical Institute.