Long Term Data Storage

I was thinking about data archival the other day because David Hagan was telling me about one time when he spoke to a group of librarians and told them to expect a data gap starting in 1950 and extending until we get serious about data preservation. According to David, we stopped being able to save data in the 1950s because that’s when xerography (which is the same as laser-printer technology) came into existence. All other data archival techniques (magnetic storage on tape or disk, recordable CDs, etc.) are inferior to toner pressed onto paper, which is itself inferior to ink soaked into paper fibers. And, come to think of it, ink on paper is inferior to marks on clay tablets, though ink on paper has proven stable enough: we still use knowledge today gleaned from papyrus scrolls written thousands of years ago.

So this thread popped up on Slashdot, and made me think about it again.

Let’s go back to marks on clay for a second. Why are they the best data archival method we know of? Because they store information in a physical arrangement of a chemically stable material. The key to data longevity, I think, is to recognize these two things:

  • Data is stored by arranging some physical quantity in a measurable way
  • The longevity of the data is related to the stability of the physical quantity

A few observations:

  • The cost of data storage is related to the economics of the physical quantity used to represent the data, and the logistics of moving it past a reader. Magnetic spots on spinning platters = cheap; holes in paper = expensive.
  • The digital era has been dominated by people reducing the cost (and a related dimension, size) of the physical quantity being manipulated to store data, by trading off longevity of the data itself.

So taking all that into consideration, it seems to me the way to solve the data longevity problem is to put the necessary research into reducing the cost and increasing the storage density of a stable medium.

And where do you find funding for research like that? Three places come to mind: fundamental research done by universities with government funding, foundation funding for work that relates to their vision, and venture capital to productize something that’s theoretically possible but not yet a proven business.

People interested in this kind of thinking would also be interested in the Long Now Foundation. And I wonder if the Long Now Foundation would be interested in finding a cost-reduced, stable data store?

As for the vulture capitalists… what kind of business could be made from this opportunity? What if I made and operated the equipment to make an archival medium that took the form of “data artifacts”? I think I should operate it as a manufacturing plant, taking data in and returning to the user the artifact with the data stored on/in it in a stable form. That lets me amortize the cost of my equipment across lots of data artifacts, which is good for the end-user as well. With the factory approach, there would be a fundamental question of data security, which would be easy to manage with open source (and thus transparent) encryption techniques. But the nice thing about the factory is that it would leave the safe storage of the artifacts in the hands of the customer. I don’t want to be on the hook for storing their data, and they don’t want to be paying rent to me to do it. Though there are lots of companies that currently pay rent to Iron Mountain to store physical artifacts of questionable value (hard disks are about the last place you should put data you want to ever see again *).

Here’s how the software side of it would work. I’m doing it first because I’m a software guy, and an amateur encryption guy. Also, the software side is easy! You feed it a corpus of data to be stored, and it gives you several options and the current prices for implementing each of them. One option is RAID: it uses an N-of-M coding system and offers to spit out M files, each of which needs to be sent to be manufactured into an artifact. Then you need to store the M artifacts, and make sure that at least N of them survive and can be gathered into one place to be read. Another option the software can offer the user is encryption, so that the artifact manufacturer cannot see the data. In fact, to protect the manufacturer from liability, I think I’d recommend that the software refuse to make output without encrypting it.
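To make the N-of-M idea concrete, here’s a minimal sketch of the simplest case: M = N + 1, with one XOR parity share, so losing any single artifact is survivable. A real tool would use a general erasure code (Reed-Solomon, say) so N and M can be chosen freely; the function names and example data below are made up for illustration.

    # Toy N-of-M split for the simplest case, M = N + 1: N data shares plus one
    # XOR parity share. Any N of the N + 1 shares are enough to rebuild the corpus.
    # (A production tool would use a general erasure code such as Reed-Solomon.)

    def split(corpus: bytes, n: int) -> list[bytes]:
        """Split corpus into n data shares plus one parity share (m = n + 1)."""
        size = -(-len(corpus) // n)                      # ceiling division: bytes per share
        padded = corpus.ljust(n * size, b"\0")
        shares = [padded[i * size:(i + 1) * size] for i in range(n)]
        parity = shares[0]
        for s in shares[1:]:
            parity = bytes(a ^ b for a, b in zip(parity, s))
        return shares + [parity]                         # share index n is the parity

    def join(shares: dict[int, bytes], n: int, length: int) -> bytes:
        """Rebuild the corpus from any n of the n + 1 shares, keyed by share index."""
        missing = [i for i in range(n) if i not in shares]
        if missing:                                      # one data share lost: XOR it back
            lost = missing[0]
            recovered = shares[n]                        # start from the parity share
            for i in range(n):
                if i != lost:
                    recovered = bytes(a ^ b for a, b in zip(recovered, shares[i]))
            shares[lost] = recovered
        return b"".join(shares[i] for i in range(n))[:length]

    data = b"one corpus of archival data"
    pieces = split(data, n=3)                            # four artifacts, any three recover
    survivors = {0: pieces[0], 2: pieces[2], 3: pieces[3]}   # artifact 1 was lost
    assert join(survivors, n=3, length=len(data)) == data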

But encryption opens up the can of worms of how to keep the key next to the data forever. I propose epoxy. (No, I’m not kidding. KISS.)

And to protect the data owner and the manufacturer even more, there would be an independent foundation, operating with seed funding from the data archiving company, which manufactures and sells key artifacts at cost. That way, the key and the data are not sent to the same factory at the same time. The foundation would have a license to the patent covering the artifact creation equipment, but its license would permit it to make artifacts only up to 100 kilobytes in capacity, which is enough to save the key and the meta-data on how to use the key, so that any mathematician 1000 years from now will be able to re-implement the stream cipher that the key material initializes.
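I haven’t pinned down a cipher, so here’s a hedged sketch of what the encryption step and the key-artifact payload might look like. ChaCha20 stands in only because it has a short public specification (RFC 8439) that could be re-implemented from the written record; the sketch assumes the Python cryptography package, and everything else (names, the JSON layout) is invented for illustration.

    # Hypothetical sketch: encrypt the corpus before it goes to the artifact factory,
    # and emit a separate, tiny payload for the key artifact (key material plus
    # plain-language notes on how to use it). ChaCha20 stands in for "a stream cipher
    # with a short, published specification".
    import json
    import os
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms

    def encrypt_for_factory(corpus: bytes) -> tuple[bytes, bytes]:
        """Return (ciphertext for the data artifact, payload for the key artifact)."""
        key = os.urandom(32)                   # 256-bit key
        nonce = os.urandom(16)                 # this library's ChaCha20 takes a 16-byte nonce
        encryptor = Cipher(algorithms.ChaCha20(key, nonce), mode=None).encryptor()
        ciphertext = encryptor.update(corpus)

        # The key artifact: small enough for the 100-kilobyte limit, and wordy enough
        # that a future reader could re-implement the cipher from its specification.
        key_payload = json.dumps({
            "cipher": "ChaCha20 (RFC 8439)",
            "key_hex": key.hex(),
            "nonce_hex": nonce.hex(),
            "note": "ciphertext = plaintext XOR keystream(key, nonce)",
        }, indent=2).encode()
        assert len(key_payload) < 100_000
        return ciphertext, key_payload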

So what are these artifacts? Well, that will be the object of the research, won’t it? The research program will come down to exploring this problem space for the sweet spot.

  • material longevity and cost
  • density of recording marks on the given material and the cost of recording them
  • reliability of reading equipment and the cost of reading the data

I suspect the right choice for cheap long term material will be some kind of ceramic (polymers are too chemically unstable and precious metals are, well, too precious). And I suspect the cheapest way to arrange that physical material into data will be to use a laser to burn spots onto it. Thus the reader will be a digital microscope with simple edge detection and decoding software.
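As a toy illustration of that last step, here’s what the decoding half of a microscope reader might boil down to, assuming the grid geometry is already known and the image is in hand as a NumPy array. A real reader would first have to find registration marks, deskew, and apply error correction, and simple thresholding stands in here for the edge-detection step; the names are hypothetical.

    # Toy decoder for the "digital microscope" reader: given a grayscale image of a
    # regular grid of laser-burned marks (dark) on blank ceramic (light), recover the
    # bits by thresholding and sampling the centre of each grid cell.
    import numpy as np

    def decode_grid(image: np.ndarray, rows: int, cols: int) -> list[int]:
        threshold = image.mean()                 # crude global threshold
        cell_h = image.shape[0] / rows
        cell_w = image.shape[1] / cols
        bits = []
        for r in range(rows):
            for c in range(cols):
                y = int((r + 0.5) * cell_h)      # sample each cell at its centre
                x = int((c + 0.5) * cell_w)
                bits.append(1 if image[y, x] < threshold else 0)
        return bits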

No, I’m not making this up: I’m proposing to replace ink on paper with marks on clay.

Funny how the world turns, isn’t it?

* From Data Storage Technology Assessment – 2002 Projections through 2010:

High density HDD’s suffer more signal loss because the magnetic grain volumes are smaller. At high areal densities, the energy barrier between the thermal energy and the magnetic anisotropy of each magnetic particle is not high. Industry-established design criteria call for recorded data output loss of 1 dB or less per decade, or 2 dB loss over 20 years. A 2 dB data output loss is normally related to a bit error rate (BER) increase of X10 or less, which is well within the capability of the error correction system built into all data storage devices.

Decibels (dB) are a logarithmic measurement: every additional 10 dB of attenuation is another factor of ten in lost signal power. The loss a magnetic tape or hard drive will undergo over tens of decades will leave no hope of recovering the data from within the background noise, error correction or no.
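For a rough sense of scale, take the quoted 1 dB-per-decade figure at face value and simply extrapolate it (the report itself only claims it out to 20 years):

    # Back-of-the-envelope: at 1 dB of output loss per decade, how much of the
    # original signal power is left after longer spans of time?
    for years in (20, 100, 500, 1000):
        loss_db = years / 10.0                  # 1 dB per ten-year decade
        remaining = 10 ** (-loss_db / 10.0)     # convert dB to a linear power ratio
        print(f"{years:5d} years: {loss_db:6.1f} dB loss, "
              f"{remaining:.2e} of the signal power remains")

After a few centuries the surviving signal is many orders of magnitude down, which is the point about it vanishing into the background noise.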


Comments

2 responses to “Long Term Data Storage”

  1. Anyway, I think your idea is entirely reasonable. CDs and DVDs are just marks in thin sheets of aluminum held rigid by an acrylic casing. You just need to find a durable medium and an optical reader that doesn’t depend on specific materials. Sean Gugler and I actually had a brief discussion about this years ago, when we watched “When Worlds Collide” at WNM–in the movie, humans fleeing the doomed Earth preserve large quantities of text by photographing it onto microfiche. Sean thought it was reasonable, because absent any technology, you can still make the tools to grind a lens and magnify the text for reading. I think plastic film melts too easily in a crash or other accident.

  2. […] also further informs my day dreaming about clay tablets: a system for preserving data needs to be flat, flexible, and 2D. My “dots on ceramic” […]
