How DNA will solve our data storage problem

In December 1971 Frank Zappa was playing to a packed audience inside Switzerland’s Montreux Casino when a fan threw a flare and set the room ablaze. Zappa, wielding his Gibson guitar like an axe, broke the casino’s windows, and two thousand screaming teenagers poured out. Watching from their hotel on the shores of Lake Geneva, members of the band Deep Purple saw the flames. They captured the moment with the song ‘Smoke on the Water’, etching it forever into the annals of the Montreux Jazz Festival. In 2013 it also became part of the first audiovisual archive in the UNESCO Memory of the World Register.


Now ‘Smoke on the Water’ is making history again. This September, it was one of the first items from the Memory of the World archive to be stored in the form of DNA and then played back with 100% accuracy. The project was a joint effort between the University of Washington, Microsoft and Twist Bioscience, a San Francisco-based DNA manufacturing company.

The demonstration was billed as a ‘proof of principle’, which is shorthand for successful but too expensive to be practical. At least for now.

Many pundits predict it’s just a matter of time till DNA pips magnetic tape as the ultimate way to store data. It’s compact, efficient and resilient. After all, it has been tweaked over billions of years into the perfect repository for genetic information. It will never become obsolete, because as long as there is life on Earth, we will be interested in decoding DNA. “Nature has optimised the format,” says Twist Bioscience’s chief technology officer Bill Peck.

Players like Microsoft, IBM and Intel are showing signs of interest. In April, they joined other industry, academic and government experts at an invitation-only workshop, cosponsored by the U.S. Intelligence Advanced Research Projects Activity (IARPA), to discuss the practical potential for DNA to solve humanity’s looming data storage crisis.

It’s a big problem that’s getting bigger by the minute. According to a 2016 IBM Marketing Cloud report, 90% of the data that exists today was created in just the past two years. Every day, we generate another 2.5 quintillion (2.5 × 10¹⁸) bytes of information. It pours in from high-definition video and photos, Big Data from particle physics, genomic sequencing, space probes, satellites, and remote sensing; from think tanks, covert surveillance operations, and Internet tracking algorithms.

Right now all those bits and bytes flow into gigantic server farms, onto spinning hard drives or reels of state-of-the-art magnetic tape. These physical substrates occupy a lot of space.

Compare this to DNA. The entire human genome, a code of three billion DNA base pairs or, in data speak, 3,000 megabytes, fits into a package that is invisible to the naked eye: the cell’s nucleus. A gram of DNA, about the size of a drop of water on your fingertip, can store at least the equivalent of 233 computer hard drives weighing more than 150 kilograms. To store all the genetic information in a human body, some 150 zettabytes, on tape or hard drives, you’d need a facility covering thousands, if not millions, of square feet.

And then there’s durability. Of the current storage contenders, magnetic tape has the best lifespan, at about 10 to 20 years. Hard drives, CDs, DVDs and flash drives are less reliable, often failing within five to ten years. DNA has proven that, in the right conditions, it can survive for hundreds of thousands of years. In 2013, for example, the genome of an early horse relative was reconstructed from DNA from a 700,000-year-old bone fragment found in the Alaskan permafrost.

So if it’s kept reasonably cool and dry (say, stashed on a shelf in the Svalbard Global Seed Vault near the North Pole), a DNA data archive could last for tens of thousands of years with no need for maintenance.

So the DNA copy of ‘Smoke on the Water’ will last a long time, but how did the scientists turn a song into a molecule in the first place? First, the digital music file was translated from a series of 1s and 0s into the letters of the DNA alphabet, the bases A, C, T and G: for example, 00 for A, 01 for C, 10 for T and 11 for G. Then the sequences of letters were assembled into short DNA phrases, with indexing information added to keep them all in the right order. Using these coding sequences, the DNA was manufactured letter by letter with chemical reactions, and then stored in a test tube.
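
For readers who think in code, the translation step can be sketched in a few lines of Python. This is only a toy illustration built on the example mapping above, not the scheme the researchers actually used. The function names, the four-byte chunk size and the one-byte index are all assumptions made for the sketch, and real encodings add error correction on top.

```python
# Example mapping from the article: two bits per base (illustrative only).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "T", "11": "G"}


def bytes_to_bases(data: bytes) -> str:
    """Translate each byte into four bases, most significant bits first."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def split_with_index(data: bytes, payload_bytes: int = 4) -> list[str]:
    """Cut the data into short chunks and prefix each with its position.

    The one-byte index and four-byte payload are hypothetical choices;
    they limit this toy to 256 chunks.
    """
    strands = []
    for offset in range(0, len(data), payload_bytes):
        chunk = data[offset:offset + payload_bytes]
        index = bytes([offset // payload_bytes])
        strands.append(bytes_to_bases(index + chunk))
    return strands


strands = split_with_index(b"Smoke on the Water")
for strand in strands:
    print(strand)
```

Each printed line stands for one short DNA “phrase”: the first four letters carry the chunk number and the rest carry the data.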

To retrieve the information, the DNA was run through a sequencing machine to read the exact order of the DNA bases. It was then decoded to produce the original binary data. Finally, the music file was played back error-free to an audience of Montreux Jazz fans on September 29th in Lausanne, Switzerland.
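
A toy version of the retrieval step might look like the sketch below. It assumes the illustrative two-bit mapping mentioned earlier (00 for A, 01 for C, 10 for T, 11 for G), a hypothetical one-byte index at the start of every strand, and error-free reads, which real sequencers do not deliver. The shuffle stands in for the fact that strands come out of the test tube in no particular order.

```python
import random

# Illustrative two-bit mapping; the layout of each strand is an assumption.
BASE_TO_BITS = {"A": "00", "C": "01", "T": "10", "G": "11"}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def to_bases(data: bytes) -> str:
    """Write bytes as DNA letters, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def to_bytes(bases: str) -> bytes:
    """Turn a sequenced strand back into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in bases)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def decode(reads: list[str]) -> bytes:
    """Use each strand's index byte to put the payloads back in order."""
    chunks = {}
    for read in reads:
        raw = to_bytes(read)
        chunks[raw[0]] = raw[1:]  # first byte is the index, the rest is data
    return b"".join(chunks[i] for i in sorted(chunks))


message = b"Smoke on the Water"
reads = [to_bases(bytes([i // 4]) + message[i:i + 4])
         for i in range(0, len(message), 4)]
random.shuffle(reads)  # strands are read back in arbitrary order
recovered = decode(reads)
print(recovered == message)
```

The index is what lets the decoder rebuild the file however the strands are shuffled.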

‘Smoke on the Water’ is not the first piece of digital information stored as DNA. In 2012 and 2013, separate teams from Harvard, led by George Church, and the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI), led by Ewan Birney and Nick Goldman, independently stored digital data in DNA. The Harvard sample contained the draft of a 50,000-word book on synthetic biology. The European sample contained a colour image, Shakespeare’s 154 sonnets, an excerpt from Martin Luther King’s “I have a dream” speech and the classic 1953 paper on DNA structure by Watson and Crick. 

Since those two seminal studies, costs have come down significantly, particularly for DNA sequencing. Synthesis still has some catching up to do. Right now it costs 10 cents per letter to synthesise DNA (three cents if you’re buying in bulk). Twist Bioscience CEO Emily Leproust estimates that the price will have to fall to 0.001 cents per letter before DNA can realistically compete with magnetic tape for long-term storage. A big infusion of cash and a lucrative market outlook might provide the needed impetus.
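
To get a feel for what those per-letter prices mean, here is a rough back-of-the-envelope calculation. It assumes an idealised two bits per DNA letter, no indexing or redundancy, and a decimal megabyte of one million bytes. The function name is made up for this sketch, and the results are not the actual cost of any project.

```python
def synthesis_cost_dollars(num_bytes: int, cents_per_base: float,
                           bits_per_base: int = 2) -> float:
    """Estimate the cost of writing num_bytes as DNA (idealised, no overhead)."""
    bases = num_bytes * 8 / bits_per_base
    return bases * cents_per_base / 100


ONE_MEGABYTE = 1_000_000  # decimal megabyte, an assumption

# Per-letter prices quoted in the article, in cents.
for label, price in [("current", 10), ("bulk", 3), ("target", 0.001)]:
    cost = synthesis_cost_dollars(ONE_MEGABYTE, price)
    print(f"{label}: about ${cost:,.0f} per megabyte")
```

Under these assumptions a single megabyte needs four million letters, so it costs about $400,000 to write at 10 cents a letter and about $40 at the target price.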

The second barrier is technical: DNA synthesis and sequencing techniques can each introduce certain types of errors, and the code that translates the 1s and 0s into DNA letters needs to be crafted so that these errors can be caught and corrected.

Computer scientists have caught on and joined the fray. The annual IEEE International Symposium on Information Theory (a major coders’ convention) now has a session specifically dedicated to coding for DNA storage. 

In April 2016, a team of researchers at Microsoft and the University of Washington stored a record 200 megabytes of data on DNA synthesised at Twist Bioscience. The data included a music video of the band OK Go, the Universal Declaration of Human Rights in more than 100 languages, the top 100 books of Project Gutenberg and the Crop Trust’s seed database. Their encoding approach employed common error correction schemes used in computing. They also devised a way to identify and sequence specific pieces of information without having to sequence the entire record.

“We’re using something we know from computers – how to correct memory errors – and applying that back to nature,” said University of Washington professor Luis Ceze.

In March 2017, Yaniv Erlich from Columbia University and Dina Zielinski from the New York Genome Centre encoded six data files using a new algorithm that packed significantly more data into each nucleotide than previous methods, and still returned the original files with 100% accuracy. Their “DNA Fountain” technique adapted an algorithm designed for streaming video on smartphones and achieved a record density of 215 petabytes (215 million gigabytes) per gram of DNA. At that density, all the data ever recorded by humans would fit in a container about the size of two pickup trucks.

Because writing and reading DNA is still relatively slow, early applications will be archival. But there are plenty of candidates for that, including scientific Big Data, legal and regulatory records, and archives like the UNESCO Memory of the World. Microsoft Research says it is planning to build a proto-commercial DNA storage system within three years. Technicolor, the global media and entertainment tech company, is funding research in Church’s group at Harvard with archiving in mind; nearly half of all films made before 1951 have been lost because they were stored on celluloid.

It’s not far-fetched to imagine all-in-one DNA data systems, in which the binary data are fed in at one end, synthesised into DNA and stored, then extracted, sequenced and sent out the other end as binary data once again. “We are working on architectures that integrate the synthesiser, the actual ‘library’ and the reader/sequencer, with the goal of developing a complete system,” says Ceze.

Other researchers are finding ways to keep the DNA stable for as long as possible. Robert Grass, a scientist in ETH Zurich’s functional materials laboratory, is working on a method to encapsulate DNA in minuscule silica beads. “Similar to fossilised bones, we wanted to protect the information-bearing DNA with a synthetic ‘fossil’ shell,” he says. To test the durability of the beads, his team heated them to about 70 degrees Celsius for one week, the equivalent of keeping them for 2,000 years at about 10 degrees.

Which brings us back to the music. Keeping important archives like the UNESCO Memory of the World in a format that could be stashed away for a couple of thousand years or more, even if it is relatively expensive in the short term, sounds like a good idea. “The UNESCO archive provides the perfect use-case for testing our approach,” Ceze says.

When Deep Purple wrote “we’ll never forget / smoke on the water, fire in the sky”, they may have been more right than they knew.
