22 June 2021

How do we avert our impending data storage crisis?

 

Translation software enables efficient DNA data storage



In support of a major collaborative project to store massive amounts of data in DNA molecules, a Los Alamos National Laboratory–led team has developed a key enabling technology that translates digital binary files into the four-letter genetic alphabet needed for molecular storage. 

“Our software, the Adaptive DNA Storage Codec (ADS Codex), translates data files from what a computer understands into what biology understands,” said Latchesar Ionkov, a computer scientist at Los Alamos and principal investigator on the project. “It’s like translating from English to Chinese, only harder.”

DNA Data Storage - The Solution to Data Storage Shortage

The work is a key part of the Intelligence Advanced Research Projects Activity (IARPA) Molecular Information Storage (MIST) program to bring cheaper, bigger, longer-lasting storage to big-data operations in government and the private sector. The short-term goal of MIST is to write 1 terabyte (a trillion bytes) and read 10 terabytes within 24 hours for $1,000. Other teams are refining the writing (DNA synthesis) and retrieval (DNA sequencing) components of the initiative, while Los Alamos is working on coding and decoding.

“DNA offers a promising solution compared to tape, the prevailing method of cold storage, which is a technology dating to 1951,” said Bradley Settlemyer, a storage systems researcher and systems programmer specializing in high-performance computing at Los Alamos. “DNA storage could disrupt the way we think about archival storage, because the data retention is so long and the data density so high. You could store all of YouTube in your refrigerator, instead of in acres and acres of data centers. But researchers first have to clear a few daunting technological hurdles related to integrating different technologies.”

Data Storage in DNA techniques explained

Not Lost in Translation

Compared to the traditional long-term storage method that uses pizza-sized reels of magnetic tape, DNA storage is potentially less expensive, far more physically compact, more energy efficient, and longer lasting—DNA survives for hundreds of years and doesn’t require maintenance. Files stored in DNA also can be very easily copied for negligible cost. 

DNA’s storage density is staggering. Consider this: humanity will generate an estimated 33 zettabytes by 2025, which is 3.3 × 10^22 bytes. All that information would fit into a ping pong ball, with room to spare. The Library of Congress has about 74 terabytes, or 74 million million bytes, of information; 6,000 such libraries would fit in a DNA archive the size of a poppy seed. Facebook’s 300 petabytes (300,000 terabytes) could be stored in a half poppy seed.

DNA Data Storage is the Future!

Encoding a binary file into a molecule is done by DNA synthesis. A fairly well understood technology, synthesis organizes the building blocks of DNA into various arrangements, which are indicated by sequences of the letters A, C, G, and T. They are the basis of all DNA code, providing the instructions for building every living thing on earth. 

The Los Alamos team’s ADS Codex specifies exactly how to translate the binary data (all 0s and 1s) into sequences built from the four letters A, C, G, and T. The Codex also handles the decoding back into binary. DNA can be synthesized by several methods, and ADS Codex can accommodate them all. The Los Alamos team has completed a version 1.0 of ADS Codex and in November 2021 plans to use it to evaluate the storage and retrieval systems developed by the other MIST teams.
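To make the idea of translating bits into letters concrete, here is a minimal sketch in Python. The mapping of two bits per letter (00 to A, 01 to C, 10 to G, 11 to T) and the function names are assumptions chosen for illustration; the article does not describe the actual ADS Codex encoding, which also has to deal with error correction and synthesis constraints.

```python
# Hypothetical 2-bits-per-nucleotide mapping, for illustration only.
# This is NOT the ADS Codex scheme, which the article does not specify.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a string of A/C/G/T, two bits per letter."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Turn a string of A/C/G/T back into the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"Hi")
print(strand)          # four letters per byte
print(decode(strand))  # round-trips back to the input bytes
```

Each byte becomes four letters, so the strand is four times as long as the file in bytes. A real codec would add redundancy on top of this.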

Unfortunately, DNA synthesis sometimes makes mistakes in the coding, so ADS Codex addresses two big obstacles to creating DNA data files. 

First, compared to traditional digital systems, the error rates while writing to molecular storage are very high, so the team had to figure out new strategies for error correction. Second, errors in DNA storage arise from a different source than they do in the digital world, making the errors trickier to correct. 

“On a digital hard disk, binary errors occur when a 0 flips to a 1, or vice versa, but with DNA, you have more problems that come from insertion and deletion errors,” Ionkov said. “You’re writing A, C, G, and T, but sometimes you try to write A, and nothing appears, so the sequence of letters shifts to the left, or it types AAA. Normal error correction codes don’t work well with that.”

ADS Codex adds extra information, called error detection codes, that can be used to validate the data. When the software converts the data back to binary, it tests whether the codes match. If they don’t, ADS Codex tries removing or adding nucleotides until the verification succeeds.
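The "check the code, then try adding or removing a letter" loop can be sketched as follows. This is a simplified illustration, not the ADS Codex algorithm: it assumes a CRC-32 checksum as the error detection code, assumes the checksum and the expected strand length are known to the decoder, and only handles a single inserted or deleted nucleotide. All function names are hypothetical.

```python
import zlib

BASES = "ACGT"


def checksum(strand: str) -> int:
    """CRC-32 over the letters; a stand-in for a real error detection code."""
    return zlib.crc32(strand.encode("ascii"))


def candidates(strand: str, expected_len: int):
    """Yield strands that could be the original, assuming at most one indel."""
    if len(strand) == expected_len:
        yield strand
    elif len(strand) == expected_len + 1:
        # One extra nucleotide: try deleting each position in turn.
        for i in range(len(strand)):
            yield strand[:i] + strand[i + 1:]
    elif len(strand) == expected_len - 1:
        # One missing nucleotide: try inserting each base at each position.
        for i in range(len(strand) + 1):
            for base in BASES:
                yield strand[:i] + base + strand[i:]


def repair(strand: str, expected_len: int, expected_crc: int):
    """Return the first candidate whose checksum verifies, or None."""
    for candidate in candidates(strand, expected_len):
        if checksum(candidate) == expected_crc:
            return candidate
    return None


original = "ACGTTGCAAC"
crc = checksum(original)
print(repair("ACGTTGAAC", len(original), crc))    # a 'C' was dropped
print(repair("ACGTTTGCAAC", len(original), crc))  # an extra 'T' was written
```

The search space grows quickly once several errors are allowed in one strand, which is one reason insertion and deletion errors are harder to handle than simple bit flips.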

Microsoft and University of Washington DNA Storage Research Project

Smart Scale-up

Large warehouses contain today’s largest data centers, with storage at the exabyte scale (a trillion million bytes or more). Costing billions to build, power, and run, this type of digitally based data center may not be the best option as the need for data storage continues to grow exponentially.

Long-term storage with cheaper media is important for the national security mission of Los Alamos and others. “At Los Alamos, we have some of the oldest digital-only data and largest stores of data, starting from the 1940s,” Settlemyer said. “It still has tremendous value. Because we keep data forever, we’ve been at the tip of the spear for a long time when it comes to finding a cold-storage solution.”

Settlemyer said DNA storage has the potential to be a disruptive technology because it crosses between fields ripe with innovation. The MIST project is stimulating a new coalition among legacy storage vendors who make tape, DNA synthesis companies, DNA sequencing companies, and high-performance computing organizations like Los Alamos that are driving computers into ever-larger-scale regimes of science-based simulations that yield mind-boggling amounts of data that must be analyzed. 

Deeper Dive into DNA

When most people think of DNA, they think of life, not computers. But DNA is itself a four-letter code for passing along information about an organism. DNA molecules are made from four types of bases, or nucleotides, each identified by a letter: adenine (A), thymine (T), guanine (G), and cytosine (C). 

These bases wrap in a twisted chain around each other—the familiar double helix—to form the molecule. The arrangement of these letters into sequences creates a code that tells an organism how to form. The complete set of DNA molecules makes up the genome—the blueprint of your body.  

By synthesizing DNA molecules—making them from scratch—researchers have found they can specify, or write, long strings of the letters A, C, G, and T and then read those sequences back. The process is analogous to how a computer stores information using 0s and 1s. The method has been proven to work, but reading and writing the DNA-encoded files currently takes a long time, Ionkov said. 

“Appending a single nucleotide to DNA is very slow. It takes a minute,” Ionkov said. “Imagine writing a file to a hard drive taking more than a decade. So that problem is solved by going massively parallel. You write tens of millions of molecules simultaneously to speed it up.” 
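A quick back-of-the-envelope calculation shows why parallelism matters here. The sketch below takes the one-minute-per-nucleotide figure from the quote and assumes two bits per nucleotide and perfectly even splitting across strands; both are simplifying assumptions, and the function names are made up for this example.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60


def serial_write_years(num_bytes, bits_per_nt=2, minutes_per_nt=1.0):
    """Years needed to write a file as one strand, one nucleotide at a time."""
    nucleotides = num_bytes * 8 / bits_per_nt
    return nucleotides * minutes_per_nt / MINUTES_PER_YEAR


def parallel_write_minutes(num_bytes, strands, bits_per_nt=2, minutes_per_nt=1.0):
    """Minutes needed when the file is split evenly over many strands."""
    nucleotides = num_bytes * 8 / bits_per_nt
    return math.ceil(nucleotides / strands) * minutes_per_nt


# A 1.5 MB file written serially takes longer than a decade.
print(round(serial_write_years(1_500_000), 1))

# A 1 GB file split over 10 million strands needs 400 nucleotides per strand.
print(parallel_write_minutes(10**9, 10**7))
```

Under these assumptions, a gigabyte written across ten million strands takes hours rather than millennia.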

While various companies are working on different ways of synthesizing to address this problem, ADS Codex can be adapted to every approach. 

DNA Data storage

DNA Storage as a Solution to our Data Storage Crisis

There are those who argue that, over the last two decades, we’ve moved from an oil-centric economy to a data-focused one. While this may seem like something of an exaggeration, it’s impossible to deny that data is playing an increasing role in our day-to-day life.  And this is most apparent in the world of business and big data.

Unfortunately, with an increased need for data comes a greater need for data storage – which has created a very big problem!

The current situation

As we’ve discussed previously, data is now the driving force behind a multitude of business decisions. Additionally, an immeasurable number of valuable insights can be gleaned from the vast swathes of data that organizations are yet to analyze. In short, big data is big business. And this – coupled with increasingly more affordable technology – has seen us create data at a truly unprecedented rate.

Karin Strauss - DNA Storage

In April 2013, 90% of all of the world’s data had been created within the previous two years. Today, the total amount of data in existence is doubling every other year. Such growth is not sustainable, but with data having proven indispensable, simply creating less of it is not an option. In order to avert this crisis, what is needed is storage media that offers storage density that is vastly superior to what is currently available. Here are the most likely possibilities along with a summary of their pros and cons:

We improve existing media

Storage media stalwart Seagate has officially manufactured more than 40,000 hard drives featuring HAMR (heat-assisted magnetic recording) technology and they plan to begin shipping them later this year.


HAMR – whereby a disk’s platter is heated prior to the writing process – has been developed in order to significantly increase a hard drive’s storage capacity. This heating means that less space is needed to store data, so the resulting hard drives are capable of achieving higher storage densities.


Potential storage capacity

Seagate has stated that HAMR tech will allow it to produce drives with more than 20 TB of storage before the end of 2019 and 40 TB by 2023. No further estimates are provided, though the company does state that it has already begun developing HAMR’s successor, heated-dot magnetic recording (HDMR), suggesting that there is more to come.

Practicality

As noted above, Seagate plans to ship HAMR drives before the end of the year. These drives also use standard 3.5-inch carriages, meaning that they’ll easily insert into existing arrays. The drives also remain relatively affordable in spite of the inclusion of this new technology.


The cons

Whilst a 40TB HDD would represent a significant improvement on what’s currently available, it’s still unlikely to represent a long-term solution to the problem – unless HDMR proves capable of significantly boosting their capacity, that is.

We use our existing storage more efficiently

Sia is an example of what is, in our opinion, a unique and highly innovative potential answer to our current storage problems: it identifies unused space on various pieces of storage media throughout the globe, rents it from those users and then sells it to the general public as remote cloud storage.

Potential storage capacity

Sia’s website claims that its entire network of drives boasts 4.2 petabytes (4,200 TB) of storage. This may seem like a lot at first glance, but with the entire cloud currently storing just under 1,500 exabytes (that’s 1,500,000 petabytes), it doesn’t offer the kind of capacity needed to provide a real solution to our data storage crisis.

That said, 4.2 Petabytes is a considerable amount of storage that would otherwise have been wasted and Sia are also not the only company leveraging this technology. So, whilst decentralized cloud storage alone isn’t the answer we’re looking for, it’s far from ineffectual and is certainly an efficient way of utilizing existing storage space.

Practicality

As with most cloud storage, it’s easy to use and, thanks to the use of blockchain, is extremely cheap at just $2 per TB of storage.

Cons

We’ve already said that the decentralized cloud’s unlikely to offer the kind of storage capacity the world’s going to need to avert our impending storage crisis. We also know that trust in the cloud tends to diminish with each high-profile data breach so we’d expect the adoption rate to be somewhat slow.

What You Need to Know about DNA Data Storage Today

We use something ground-breaking

It may sound like pure fiction, but DNA has already been used to store and retrieve data. In fact, DNA data storage is something that has a lot of people very, very excited.

Potential storage capacity

This is what sets DNA data storage apart from its competitors: just one gram of DNA could store 215 Petabytes of data. With storage capacities like this, it’s clear to see why many believe DNA could be the answer to our storage conundrum.

Practicality

As I’m sure you can imagine, the process of storing and retrieving data from DNA is cumbersome and, whilst it was first achieved five years ago, it’s still far from an accessible and practical means of storing data.

Whilst turning DNA into a usable piece of storage media is proving to be problematic, it could produce a device that is not only capable of storing a data center in a 3.5-inch HDD cradle but also robust enough to last for a millennium.

Cons

As we’ve said previously, no uniform way of reading and writing data to and from DNA currently exists. It’s also been widely reported that the task of retrieving the data itself is both a slow and cumbersome one. These, however, are not the greatest hurdles scientists face in trying to make DNA the world’s de facto storage media: that honor belongs to cost.

Ultra-dense data storage and extreme parallelism with electronic-molecular systems

In 2017, data was successfully stored in and then retrieved from DNA but the cost of doing so was astonishingly high: synthesizing the data cost $7,000 and retrieving it a further $2,000. These are expected to drop significantly over the next few years but this could take as much as a decade.

“One of the challenges for us as a company, and us as an industry, is that many of the technologies we rely on are beginning to get to the point where either they are at the end, or they’re starting to get to the point where you can see the end. Moore’s Law is a well-publicized one and we hit it some time ago. And that’s a great opportunity, because whenever you get that rollover, you get an opportunity to be able to do things differently, to have new ways of doing things.”

– ANT ROWSTRON, DISTINGUISHED ENGINEER AND DEPUTY LAB DIRECTOR, MICROSOFT RESEARCH CAMBRIDGE

It is projected that around 125 zettabytes of data will be generated annually by 2024. Storing this data efficiently and cost-effectively will be a huge challenge. Growth in storage capabilities using SSDs, HDDs or magnetic tape has not kept up with the exponential growth in compute capacity, the surge in data being generated, or the novel economics and storage needs of cloud services.

Future demands from intelligent edge and Internet of Things (IoT) deployments, streaming audio, video, virtual and mixed reality, “digital twins” and use cases we haven’t yet predicted will generate lots of bits – but where will we keep them?

This requires more than incremental improvements – it demands disruptive innovation. For many years, Microsoft researchers and their collaborators have been exploring ways to make existing storage approaches more efficient and cost-effective, while also forging entirely new paths – including storing data in media such as glass, holograms and even DNA.

DNA Data Storage and Near-Molecule Processing for the Yottabyte Era

Re-Imagining Storage

Researchers have taken a holistic approach to making storage more efficient and cost-effective, using the emergence of the cloud as an opportunity to completely re-think storage in an end-to-end fashion. They are co-designing new approaches across layers that are traditionally thought of as independent – blurring the lines between storage, memory and network. At the same time, they’re re-thinking storage from the media up – including storing data in media such as glass, holograms and even DNA.

This work extends back more than two decades: in 1999, Microsoft researchers began work on Farsite, a secure and scalable file system that logically functions as a centralized file server, but is physically distributed among a set of untrusted computers. This approach would utilize the unused storage and network resources of desktop computers to provide a service that is reliable, available and secure despite running on machines that are unreliable, often unavailable and of limited security. In 2007, researchers published a paper that explored the conditions under which such a system could be scalable, the software engineering environment used to build the system, and the lessons learned in its development.

At Microsoft Research Cambridge, researchers began exploring optimizing enterprise storage using off-the-shelf hardware in the late 2000s. They explored off-loading data from overloaded volumes to virtual stores to reduce power consumption and to better accommodate peak I/O request rates. This was an early example of storage virtualization and software-defined storage, ideas that are now widely used in the cloud. As solid-state drives (SSDs) became more commonplace in PCs, researchers considered their application in the datacenter – and concluded that they were not yet cost-effective for most workloads at current prices. In 2012, they analyzed potential applications of non-volatile main memory (NVRAM) and proposed whole-system persistence (WSP), an approach to database and key-value store recovery in which memory rather than disk is used to recover an application’s state when it fails – blurring the lines between memory and storage.

In 2011, researchers established the Software-Defined Storage Architectures project, which carried the idea of separating control flow from data over from networking to storage, to provide predictable performance and reduced cost. IOFlow is a software-defined storage architecture that uses a logically centralized control plane to enable end-to-end policies and QoS guarantees, which required rethinking across the data center storage and network stack. This principle was extended to other cloud resources to create a virtual data center per tenant. In a 2017 article, the researchers describe the advantages of intentionally blurring the lines between virtual storage and virtual networking.

Established in 2012, the FaRM project explored new approaches to using main memory for storing data with a distributed computing platform that exploits remote direct memory access (RDMA) communication to improve both latency and throughput by an order of magnitude compared to main memory systems that use TCP/IP. By providing both strong consistency and high performance – challenging the conventional wisdom that high performance required weak consistency – FaRM allowed developers to focus on program logic rather than handling consistency violations. Initial developer experience highlighted the need for strong consistency for aborted transactions as well as committed ones – and this was then achieved using loosely synchronized clocks.

At the same time, Project Pelican addressed the storage needs of “cold” or infrequently accessed data. Pelican is a rack-scale, disk-based storage unit that trades latency for cost, using a unique data layout and IO scheduling scheme to constrain resource usage so that only 8% of its drives can spin concurrently. Pelican was an example of rack-scale co-design: rethinking the storage, compute, and networking hardware as well as the software stack at a rack scale to deliver value at cloud scale.

DNA Storage for Digital Preservation

To further challenge traditional ways of thinking about the storage media and controller stack, researchers began to consider whether a general-purpose CPU was even necessary for many operations. To this end, Project Honeycomb tackles the challenges of building complex abstractions using FPGAs in CPU-free custom hardware, leaving CPU-based units to focus on control-plane operations.

DNA synthesis and sequencing: writing and reading the code

DNA is the carrier of genetic information in nearly all living organisms. This information is stored as a code made up of four chemical bases: adenine (A), guanine (G), cytosine (C), and thymine (T). The order of these bases – called the sequence – then determines what information is available for building and maintaining an organism.

DNA bases pair up: A with T, and C with G. These base pairs, along with a couple of other components (sugar and phosphate), then arrange themselves along two long strands to form the ladder-like helix we’re so accustomed to seeing when we think about DNA.

What does this all have to do with data storage? As it turns out, binary code (all those 0’s and 1’s) can be translated into DNA base pairings.

DNA data storage is the process of encoding binary data into synthetic strands of DNA. Binary digits (bits) are converted from 0s and 1s to the four chemical bases (A, T, C, and G), such that the DNA sequence corresponds to the order of the bits in a digital file. In this way, the physical storage medium becomes a man-made chain of DNA.

Recovering the data is then a matter of sequencing that DNA. DNA sequencing determines the order of those four chemical building blocks, or bases, that make up the DNA molecule, and is generally used to determine the genetic information carried in a particular DNA strand.

By running digital data-containing synthetic DNA through a sequencer, the genetic code – or sequence of bases – can be obtained and translated back into the original binary bits to access that stored data.


Why we need DNA Storage

On Earth right now, there are about 10 trillion gigabytes of digital data, and every day, humans produce emails, photos, tweets, and other digital files that add up to another 2.5 million gigabytes of data. Much of this data is stored in enormous facilities known as exabyte data centers (an exabyte is 1 billion gigabytes), which can be the size of several football fields and cost around $1 billion to build and maintain.

Many scientists believe that an alternative solution lies in the molecule that contains our genetic information: DNA, which evolved to store massive quantities of information at very high density. A coffee mug full of DNA could theoretically store all of the world’s data, says Mark Bathe, an MIT professor of biological engineering.

“We need new solutions for storing these massive amounts of data that the world is accumulating, especially the archival data,” says Bathe, who is also an associate member of the Broad Institute of MIT and Harvard. “DNA is a thousandfold denser than even flash memory, and another property that’s interesting is that once you make the DNA polymer, it doesn’t consume any energy. You can write the DNA and then store it forever.”

Scientists have already demonstrated that they can encode images and pages of text as DNA. However, an easy way to pick out the desired file from a mixture of many pieces of DNA will also be needed. Bathe and his colleagues have now demonstrated one way to do that, by encapsulating each data file into a 6-micrometer particle of silica, which is labeled with short DNA sequences that reveal the contents.

Using this approach, the researchers demonstrated that they could accurately pull out individual images stored as DNA sequences from a set of 20 images. Given the number of possible labels that could be used, this approach could scale up to 10^20 files.

Bathe is the senior author of the study, which appears today in Nature Materials. The lead authors of the paper are MIT senior postdoc James Banal, former MIT research associate Tyson Shepherd, and MIT graduate student Joseph Berleant.


Stable storage

Digital storage systems encode text, photos, or any other kind of information as a series of 0s and 1s. This same information can be encoded in DNA using the four nucleotides that make up the genetic code: A, T, G, and C. For example, G and C could be used to represent 0 while A and T represent 1.

DNA has several other features that make it desirable as a storage medium: It is extremely stable, and it is fairly easy (but expensive) to synthesize and sequence. Also, because of its high density — each nucleotide, equivalent to up to two bits, is about 1 cubic nanometer — an exabyte of data stored as DNA could fit in the palm of your hand.
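The "palm of your hand" claim can be checked with simple arithmetic. The sketch below uses only the two figures given above (about 1 cubic nanometer per nucleotide and up to two bits per nucleotide) and ignores packaging, water, redundancy and addressing overhead, so it is an idealized lower bound rather than a real-world estimate. The function name is hypothetical.

```python
NM3_PER_CM3 = 10**21  # 1 cm = 10^7 nm, so 1 cm^3 = 10^21 nm^3


def dna_volume_cm3(num_bytes, bits_per_nt=2, nm3_per_nt=1):
    """Idealized volume of the nucleotides alone, in cubic centimeters."""
    nucleotides = num_bytes * 8 / bits_per_nt
    return nucleotides * nm3_per_nt / NM3_PER_CM3


exabyte = 10**18  # bytes
print(dna_volume_cm3(exabyte))  # roughly 0.004 cm^3, i.e. about 4 cubic millimeters
```

Even with several orders of magnitude of practical overhead on top of that figure, an exabyte would still be a small physical object.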

One obstacle to this kind of data storage is the cost of synthesizing such large amounts of DNA. Currently it would cost $1 trillion to write one petabyte of data (1 million gigabytes). To become competitive with magnetic tape, which is often used to store archival data, Bathe estimates that the cost of DNA synthesis would need to drop by about six orders of magnitude. Bathe says he anticipates that will happen within a decade or two, similar to how the cost of storing information on flash drives has dropped dramatically over the past couple of decades.

Aside from the cost, the other major bottleneck in using DNA to store data is the difficulty in picking out the file you want from all the others.

“Assuming that the technologies for writing DNA get to a point where it’s cost-effective to write an exabyte or zettabyte of data in DNA, then what? You're going to have a pile of DNA, which is a gazillion files, images or movies and other stuff, and you need to find the one picture or movie you’re looking for,” Bathe says. “It’s like trying to find a needle in a haystack.”

Currently, DNA files are conventionally retrieved using PCR (polymerase chain reaction). Each DNA data file includes a sequence that binds to a particular PCR primer. To pull out a specific file, that primer is added to the sample to find and amplify the desired sequence. However, one drawback to this approach is that there can be crosstalk between the primer and off-target DNA sequences, leading unwanted files to be pulled out. Also, the PCR retrieval process requires enzymes and ends up consuming most of the DNA that was in the pool.

“You’re kind of burning the haystack to find the needle, because all the other DNA is not getting amplified and you’re basically throwing it away,” Bathe says.


File retrieval

As an alternative approach, the MIT team developed a new retrieval technique that involves encapsulating each DNA file into a small silica particle. Each capsule is labeled with single-stranded DNA “barcodes” that correspond to the contents of the file. To demonstrate this approach in a cost-effective manner, the researchers encoded 20 different images into pieces of DNA about 3,000 nucleotides long, which is equivalent to about 100 bytes. (They also showed that the capsules could fit DNA files up to a gigabyte in size.)

Each file was labeled with barcodes corresponding to labels such as “cat” or “airplane.” When the researchers want to pull out a specific image, they remove a sample of the DNA and add primers that correspond to the labels they’re looking for — for example, “cat,” “orange,” and “wild” for an image of a tiger, or “cat,” “orange,” and “domestic” for a housecat.

The primers are labeled with fluorescent or magnetic particles, making it easy to pull out and identify any matches from the sample. This allows the desired file to be removed while leaving the rest of the DNA intact to be put back into storage. Their retrieval process allows Boolean logic statements such as “president AND 18th century” to generate George Washington as a result, similar to what is retrieved with a Google image search.
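The logic of label-based retrieval can be mimicked in software with sets. The sketch below is only an analogy for the AND-style selection described above: the file names, the label sets and the `select` function are invented for illustration, and the real system does this matching chemically, not in code.

```python
# Hypothetical capsules: each file carries a set of barcode labels.
capsules = {
    "tiger.jpg": {"cat", "orange", "wild"},
    "housecat.jpg": {"cat", "orange", "domestic"},
    "airplane.jpg": {"airplane"},
}


def select(capsules, required):
    """Return the files whose labels include every required label (logical AND)."""
    required = set(required)
    return sorted(name for name, labels in capsules.items() if required <= labels)


print(select(capsules, {"cat", "orange", "wild"}))  # only the tiger matches
print(select(capsules, {"cat"}))                    # both cats match
```

Adding more labels to a query narrows the result, which is the same behavior the probes give when several are required to bind to one capsule.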

“At the current state of our proof-of-concept, we’re at the 1 kilobyte per second search rate. Our file system’s search rate is determined by the data size per capsule, which is currently limited by the prohibitive cost to write even 100 megabytes worth of data on DNA, and the number of sorters we can use in parallel. If DNA synthesis becomes cheap enough, we would be able to maximize the data size we can store per file with our approach,” Banal says.

For their barcodes, the researchers used single-stranded DNA sequences from a library of 100,000 sequences, each about 25 nucleotides long, developed by Stephen Elledge, a professor of genetics and medicine at Harvard Medical School. If you put two of these labels on each file, you can uniquely label 10^10 (10 billion) different files, and with four labels on each, you can uniquely label 10^20 files.
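The file counts follow directly from the size of the barcode library. The short sketch below reproduces them under the simplest counting assumption, namely that each label slot on a file can independently hold any of the 100,000 sequences; the function name is made up for this example.

```python
def addressable_files(library_size, labels_per_file):
    """Number of distinct label combinations, treating each slot independently."""
    return library_size ** labels_per_file


print(addressable_files(100_000, 2) == 10**10)  # two labels per file
print(addressable_files(100_000, 4) == 10**20)  # four labels per file
```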

George Church, a professor of genetics at Harvard Medical School, describes the technique as “a giant leap for knowledge management and search tech.”

“The rapid progress in writing, copying, reading, and low-energy archival data storage in DNA form has left poorly explored opportunities for precise retrieval of data files from huge (10^21 byte, zetta-scale) databases,” says Church, who was not involved in the study. “The new study spectacularly addresses this using a completely independent outer layer of DNA and leveraging different properties of DNA (hybridization rather than sequencing), and moreover, using existing instruments and chemistries.”

Bathe envisions that this kind of DNA encapsulation could be useful for storing “cold” data, that is, data that is kept in an archive and not accessed very often. His lab is spinning out a startup, Cache DNA, that is now developing technology for long-term storage of DNA, both for DNA data storage in the long-term, and clinical and other preexisting DNA samples in the near-term.

“While it may be a while before DNA is viable as a data storage medium, there already exists a pressing need today for low-cost, massive storage solutions for preexisting DNA and RNA samples from Covid-19 testing, human genomic sequencing, and other areas of genomics,” Bathe says.


More Information

https://www.microsoft.com/en-us/research/project/dna-storage/#!publications

http://thewindowsupdate.com/2020/11/09/research-collection-re-inventing-storage-for-the-cloud-era/

https://www.microsoft.com/en-us/research/project/dna-storage/

https://community.arm.com/developer/research/b/articles/posts/research-in-a-post-moore-era-hpca-2019

https://www.newswise.com/articles/translation-software-enables-efficient-dna-data-storage

https://leciir.com/?blog_post=the-dna-based-solution-to-our-data-storage-crisis-where-its-at-in-2020

https://news.mit.edu/2021/dna-data-storage-0610

https://www.businesswire.com/news/home/20210610005250/en/DNA-Data-Storage-Alliance-Publishes-First-White-Paper-Launches-Website

https://researchr.org/publication/cidr-2019

https://news.microsoft.com/innovation-stories/hello-data-dna-storage/

https://www.nature.com/articles/s41598-019-41228-8

https://blog.dshr.org/2021/05/storage-update.html

https://www.ddn.com/products/ime-flash-native-data-cache/

