Some years ago my colleague Thom Hickey encountered the rubric "Quantity has a quality all its own" (nominally attributed to Josef Stalin), and it has become a time-worn inside joke for some of us in OCLC Research ever since. Largely, I suppose, for the variety of circumstances that plead for its invocation. (One does wonder if it is as snappy in Russian as in English?) Anyway, the most interesting tidbit about preservation I've encountered at this meeting fits nicely.
David Rosenthal, of the LOCKSS and CLOCKSS efforts, advances an interesting argument about confidence in digital preservation systems. It's a thought experiment about measuring the effectiveness of large data store preservation systems. If you are inclined to take for granted the stability of digital media and the systematic efforts appropriate to its reliable management, reading this post should disturb your sleep.
David calls it the Petabyte For a Century argument, and it goes something like this:
An organization wants to measure the effectiveness of a preservation system purported to be able to sustain a Petabyte store for 100 years with a 50% chance of integrity loss. How might one model such a question?
One way to think about it (David's way) is in terms of the half-life of bits. The requirement translates to 0.8 exabit-years of preservation with a 50% chance of success, or a bit half-life of 0.8 exa-years, which, as it turns out, is roughly 100,000,000 times the age of the universe. David goes on to make the point that the cost of measuring such a standard of performance turns out to be intractable (by at least 6 orders of magnitude). So... the problem is really hard, and assessing the effectiveness of possible solutions is pretty hard as well.
What kinda mileage does this car get?
Oh, I can't tell you!
Why?
Well, it would make the cost of the car 20 billion dollars!
OK... thanks anyway... How about the brakes?
Oh, I can't tell you!
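For anyone who wants to check the half-life arithmetic, here is a minimal sketch. It assumes a decimal petabyte (10^15 bytes), independent exponential decay of each bit, and an age of the universe of about 13.8 billion years; the function and constant names are mine, not David's. Under those assumptions the ratio comes out near 6 x 10^7, the same order of magnitude as the figure quoted above.

```python
import math

BITS_PER_BYTE = 8
PETABYTE_BYTES = 10**15            # assumption: decimal petabyte
AGE_OF_UNIVERSE_YEARS = 1.38e10    # assumption: ~13.8 billion years


def required_half_life_years(n_bits, years, p_success=0.5):
    """Bit half-life needed for all n_bits to survive `years`
    with probability p_success, assuming each bit decays
    independently and exponentially (hypothetical model)."""
    # P(one bit survives t)  = 2 ** (-t / half_life)
    # P(all n_bits survive)  = 2 ** (-n_bits * t / half_life)
    # Solving P(all) = p_success for half_life gives:
    return n_bits * years / -math.log2(p_success)


n_bits = PETABYTE_BYTES * BITS_PER_BYTE
half_life = required_half_life_years(n_bits, years=100)

print(f"bit-years to preserve: {n_bits * 100:.1e}")   # 0.8 exabit-years
print(f"required half-life:    {half_life:.1e} years")
print(f"ages of the universe:  {half_life / AGE_OF_UNIVERSE_YEARS:.1e}")
```

Tightening the success probability (say, `p_success=0.99`) only pushes the required half-life higher.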
David's argument is well worth reading, and while reasonable people may quibble about the measure of success, it illustrates the seriousness of the problem and has sparked a substantial debate in the digital preservation community. I find it a convincing argument that bit-for-bit reproducibility is not going to be the standard by which we will measure large-scale digital preservation efforts. Got a good alternative?
One participant in this working group (Richard, who I believe is from the BBC) raised the goal of simply having bit losses that are sub-catastrophic. That is, systems should have among their design criteria the ability to recover from errors without the loss of large chunks of otherwise uncorrupted data. A scratch on a vinyl disc will annoy, but you can still listen to those scratchy old LPs. A reading error on a CD or DVD, however, can make an entire album unusable. Returning to more graceful failure modes of earlier media can and should be part of the design of storage systems. And while we're talking about petabytes... the BBC is generating 4 of them each and every week. No secret why THEY are here.
A representative of the British Library in the same session offered another Quality of Quantity argument, this one making the case for multi-site, self-checking, self-healing systems. Some stats:
- 150 million items
- 50 million cataloging records
- 750 million newsprint pages
- 5 billion pages
- 650 km of shelving
- 1.5 million disks and tapes
All lovely, impressive numbers, music to our more-is-more ears. As they (we all) go digital, however, some simple observations will make evident the need for substantial improvements in system behaviors. File-error monitoring identified 'bit rot' errors at a rate of ~1 per thousand files over a 3-year period. Not bad, eh? Well, it translates to corruption in roughly 4,000 files per month in a 150-million-file collection. Not acceptable.
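The conversion from "1 per thousand in 3 years" to a monthly figure is quick to verify. This is only a back-of-the-envelope sketch: it assumes the observed rate carries over unchanged to a 150 million file collection and that errors are spread evenly across the period. The function name is hypothetical.

```python
def corrupted_files_per_month(collection_size, errors_per_file, period_months):
    """Average number of files corrupted per month, assuming a
    constant error rate spread evenly over the period."""
    total_corrupted = collection_size * errors_per_file
    return total_corrupted / period_months


per_month = corrupted_files_per_month(
    collection_size=150_000_000,   # files in the collection
    errors_per_file=1 / 1000,      # ~1 error per thousand files...
    period_months=36,              # ...observed over 3 years
)
print(f"about {per_month:,.0f} corrupted files per month")
```

That works out to a little over 4,000 files a month, or well over a hundred a day.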
I've been guilty of a certain glibness about digital preservation... Oh, the bucket-o-bits part is the easy part... things get really hard as you go up the ladder and have to keep the bits, and the applications, and the operating environments and the hardware all synchronized.... Well, that may still be true, but I won't gloss over the bit bit again.
-----
Moon over a church spire on Avenue George-V on a perfect Paris evening. My mother tried hard and largely without success to cure me of whining. Walking the streets of Paris with a camera may be the best possible remedy for jet-lag-induced self pity. (Hi Mom!)