Wednesday, 2 July 2008

Responses to RAW versus TIFF: compression, errors and costs

This is the second post summarising responses to the “RAW versus TIFF” post made originally by Dave Thompson of the Wellcome Library. The key question in Dave’s post was whether we should be archiving using RAW or TIFF (image-related responses to that question are summarised in a separate post). A subsidiary question, whether we should archive both, is greatly affected by cost, which in turn depends on issues such as compression and error rates. Responses on those topics are covered in this post. As I mentioned before, these responses came on the semi-closed DCC-Associates email list.

Part way through the email list discussion, Marc Fresko of Serco reminded us that there are two questions:
“One is the question addressed by respondents so far, namely what to do in principle. The answer is clear: best practice is to keep both original and preservation formats.

The second question is "what to do in a specific instance". It seems to me that this has not yet been addressed fully. This second question arises only if there is a problem (cost) associated with the best practice approach. If this is the case, then you have to look at the specifics of the situation, not just the general principles.”
Marc’s comment brings us back to costs, and the responses to this are summarised here.

Sean Martin of the British Library picked up Marc’s point on costs:
“The cost of storage has been getting cheaper and will continue to do so (typically 30% a year), but storage is not free. While most other examples here have cited RAW vs TIFF, I will use an example based on TIFF vs a compressed format, such as JP2K.

(1) While there will be considerable variations, lossless compression with JP2K requires ~1/3 the storage of a TIFF. (I have a specific example of a 135 Mbyte TIFF and a 42 Mbyte lossless JP2K.)

(2) The cost of ownership of bulk commodity storage is currently of the order of £1000 per terabyte. (However, let's treat this as a guide rather than a precise figure.)

(3) If you are dealing with only a few Tbytes then it will probably not matter much.

(4) However, if you envisage, say 100 Tbytes compressed or 300 Tbytes raw then it probably does matter, with indicative costs of £100K and £300K respectively.

(5) One way to approach this is to assess (in some sense) what value will be created using the cheaper method and what ADDITIONAL VALUE will be created using the more expensive method. This is a form of options appraisal.

(6) In my example, probably only a small amount of additional value is created for the additional expense approaching £200K. This leads to the question "if we had £200K on what would we spend it?" and probably the answer is "not in this way".

(7) However, this line of reasoning can only be applied in the context of a specific situation. Hence, for example, some might judge that considerable additional value is created and is affordable by storing RAW and TIFF, while others might choose to store only JP2K.

(8) This means that different people facing different challenges are likely to come to different conclusions. There is no one answer that applies to everyone.”
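To make Sean’s arithmetic concrete, here is a minimal sketch of the comparison in his points (1) to (6). The ~1/3 compression ratio, the ~£1000 per terabyte cost of ownership and the 300 Tbyte collection size are all his figures; everything else is just arithmetic.

```python
# Rough sketch of Sean's comparison: uncompressed TIFF vs lossless JP2K.
# The ~1/3 compression ratio, ~£1000/TB ownership cost and 300 TB collection
# size come from his message; treat the output as indicative, not precise.

COST_PER_TB_GBP = 1_000      # indicative total cost of ownership per terabyte
COMPRESSION_RATIO = 1 / 3    # lossless JP2K at roughly a third of TIFF size

def storage_cost_gbp(uncompressed_tb: float, compressed: bool) -> float:
    """Indicative storage cost for a collection of the given (uncompressed) size."""
    size_tb = uncompressed_tb * COMPRESSION_RATIO if compressed else uncompressed_tb
    return size_tb * COST_PER_TB_GBP

tiff_cost = storage_cost_gbp(300, compressed=False)   # ~£300K
jp2k_cost = storage_cost_gbp(300, compressed=True)    # ~£100K
print(f"TIFF: £{tiff_cost:,.0f}, JP2K: £{jp2k_cost:,.0f}, "
      f"additional cost of keeping TIFF: £{tiff_cost - jp2k_cost:,.0f}")  # ~£200K
```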
Chris Puttick, CIO of Oxford Archaeology, thought that Sean’s cost figures were too high:
“But the more nearly precise figure is what affects the rest of the calculation. For protected online storage (NAS/iSCSI, RAID 5 SATA drives with multiple hot spares) you can source 30TB of usable space for £10k, i.e. nearer to £50/TB [compared to Sean’s £1000 per TB; excludes power and cooling costs].

So using the boxes described above in a mesh array we can do 300TB for ~£120k. Anyone who needs 300TB of storage should find £120k of capital expenditure neither here nor there. After all, if we say the average HQ RAW is around 60MB and go with the 1/3 ratio for the additionally stored derivative image, we need 80MB per image, i.e. 300TB = 3.75m images... So a cost per stored image of 1p/year for the worst-case lifespan of the kit.”
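Chris’s penny-per-image figure can be reconstructed roughly as follows; the 60MB RAW, the 1/3-sized derivative and the ~£120k for 300TB are his numbers, while the roughly three-year kit lifespan is my assumption about what “worst-case lifespan of the kit” means.

```python
# Rough reconstruction of Chris's cost-per-image arithmetic. RAW size,
# derivative ratio and hardware cost are from his message; the ~3-year kit
# lifespan is an assumption needed to arrive at ~1p per image per year.

raw_mb = 60                            # average high-quality RAW image
derivative_mb = raw_mb / 3             # additionally stored derivative
per_image_mb = raw_mb + derivative_mb  # ~80 MB per image

capacity_mb = 300 * 1_000_000          # 300 TB expressed in MB (decimal units)
images = capacity_mb / per_image_mb    # ~3.75 million images

hardware_cost_gbp = 120_000
kit_lifespan_years = 3                 # assumed worst-case lifespan of the kit
pence_per_image_year = hardware_cost_gbp * 100 / images / kit_lifespan_years
print(f"{images / 1e6:.2f}m images, ~{pence_per_image_year:.1f}p per image per year")
```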
To Sean’s cost versus value question, Chris responds:
“And in the wider world we have to fight to get digital preservation accepted as an issue, let alone tell people we have to pay for it. Put the question in the way that leads to the favourable response i.e. are you willing to spend a penny per image?”
Responding to the question about whether storing RAW as well as a compressed form is worthwhile, Chris points out:
“[…] many, asked the question in the wrong way, would opt for the cheapest without actually understanding the issues: JP2K is lossless, but was information lost in the translation from the RAW?”
Richard Wright of BBC Future Media & Technology had an interesting point on the potential fragility risk introduced by compression:
“Regarding compression -- there was a very interesting paper yesterday at the Imaging Science & Technology conference in Bern, by Volker Heydegger of Cologne University. He looked at the consequences of bit-level errors. In a perfect world these don't occur, or are recovered by error-correction technology. But then there's the real world.

His finding was that bit and byte losses stay localised in uncompressed files, but have consequences far beyond their absolute size when dealing with compressed files. A one-byte error has a tiny effect on an uncompressed TIFF (as measured by 'bytes affected', because only one byte is affected). But in a lossless JP2 file, 17% of the data in the file is affected! This is because the data in the compressed file are used in calculations that 'restore' the uncompressed data -- and one bad number in a calculation is much worse than one bad number in a sequence of pixel data.

It goes on and on. A 0.1% error rate affects about 0.5% of a TIFF, and 75% of a lossless JP2.

The BBC is very interested in how to cope with non-zero error rates, because we're already making about 2 petabytes of digital video per year in an archive preservation project, and the BBC as a whole will produce significantly larger amounts of born-digital video. We're going for uncompressed storage -- for simplicity and robustness, and because "error-compensation" schemes are possible for uncompressed data but appear to be impossible for compressed data. By error-compensation I mean simple things like identifying where (in an image) an error has occurred, and then using standard technology like "repeating the previous line" to compensate. Analogue technology (video recorders) relied on error-compensation, because analogue video-tape was a messy world. We now need robust digital technology, and compression appears to be the opposite: brittle.”
I do think this is an extremely interesting point, and one that perhaps deserves a blog post of its own to emphasise it. In later private correspondence, Richard mentioned there was some controversy about this paper, with one person apparently suggesting that the factor-of-three compression might allow two other copies to be made, thus reducing the fragility. Hmmmm...
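For anyone who wants a feel for the mechanism Richard describes, here is a small sketch that flips a single byte in an uncompressed buffer and in a zlib-compressed copy of the same data, then counts how much of the decoded data changes. zlib is only a stand-in for compressed formats in general, not for JPEG 2000, so the percentages will not match Heydegger’s figures; the point is the qualitative difference.

```python
# Illustrative sketch: how far a single corrupted byte propagates in
# uncompressed vs compressed data. zlib stands in for "a compressed format";
# real JP2K behaviour (and Heydegger's 17% figure) will differ.
import zlib

original = bytes(range(256)) * 4096          # ~1 MB of sample "pixel" data

def corrupt(data: bytes, position: int) -> bytes:
    """Return a copy of the data with every bit of one byte flipped."""
    return data[:position] + bytes([data[position] ^ 0xFF]) + data[position + 1:]

# Uncompressed: one bad byte affects exactly one byte of the image data.
damaged_raw = corrupt(original, len(original) // 2)
raw_affected = sum(a != b for a, b in zip(original, damaged_raw))

# Compressed: one bad byte in the stream corrupts everything decoded after it,
# or makes the stream undecodable altogether.
compressed = zlib.compress(original)
damaged_stream = corrupt(compressed, len(compressed) // 2)
try:
    decoded = zlib.decompress(damaged_stream)
except zlib.error:
    decoded = b""                            # the decode often fails outright
affected = max(len(original) - len(decoded),
               sum(a != b for a, b in zip(original, decoded)))

print(f"uncompressed: {raw_affected} byte(s) affected")
print(f"compressed:   ~{100 * affected / len(original):.0f}% of the data affected or lost")
```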

David Rosenthal of Stanford and LOCKSS picked up on this, noting that the real world really does see significant levels of disk errors. He also responded to Sean’s comments but, unlike Chris, thinks that Sean’s costs are too low (and that Chris was way too optimistic):
“The cost for a single copy is not realistic for preservation. The San Diego Supercomputer Center reported last year that the cost of keeping one disk copy online and three tape copies in a robot was about $3K/TB/yr.

Amazon's S3 service in Europe costs $2160/TB/yr, but it is not clear how reliable the service is. Last time I checked the terms & conditions they had no liability whatsoever for loss of or damage to your data. Note also that moving data in or out of S3 costs too: $100/TB in, and a sliding scale per month for transfers out. Dynamic economic effects too complex to discuss here mean that it is very unlikely that anyone can significantly undercut Amazon's pricing unless they're operating at Amazon/Google/... scale.

On these figures [costs] would be something like $300K per year compressed or $900K per year [uncompressed].

And discussions of preservation with a one-year cost horizon are unrealistic too. At 30% per year decrease, deciding to store these images for 10 years is a commitment to spend $10M compressed or $30M uncompressed over those ten years. Now we're talking serious money.

Sean is right that there is no one answer, but there is one question everyone needs to ask in discussions like this - how much can you afford to pay?”
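David’s per-year figures follow directly from the SDSC rate he quotes, applied to the collection sizes in Sean’s earlier example; a quick check:

```python
# Quick check of David's annual figures: the SDSC rate he quotes ($3K/TB/yr
# for one disk copy plus three tape copies) applied to the 100 TB compressed /
# 300 TB uncompressed collection sizes from Sean's example.
SDSC_RATE_USD_PER_TB_YEAR = 3_000

for label, terabytes in [("compressed (100 TB)", 100), ("uncompressed (300 TB)", 300)]:
    print(f"{label}: ${terabytes * SDSC_RATE_USD_PER_TB_YEAR:,} per year")
# -> $300,000 and $900,000 per year
```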
Reagan Moore amplified:
“Actually the cost at SDSC for storing data is now:

~$420/Tbyte/year for a tape copy
~$1050/Tbyte/year for a disk copy. This cost is expected to come down with the availability of Sun Thumper technology to around $800/Tbyte/year.

As equipment becomes cheaper, the amortized cost for a service decreases. The above costs include labor, hardware and software maintenance, media replacement every 5 years, capital equipment replacement every 5 years, software licenses, electricity, and overhead.”
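Reagan’s figures bundle many components into a single rate per terabyte per year. Purely as an illustration of how such a rate is assembled (the component values below are placeholders, not SDSC’s actual numbers), the structure looks something like this:

```python
# Illustrative only: how an amortised $/TB/year figure might be assembled.
# Every component value below is a placeholder, not an SDSC figure; the point
# is the structure (capital amortised over its replacement cycle, plus
# recurring costs such as labour, maintenance, licences and electricity).

def amortised_cost_per_tb_year(capital_usd: float, lifetime_years: float,
                               annual_running_usd: float, capacity_tb: float) -> float:
    """Capital spread over its replacement cycle plus yearly running costs, per TB."""
    return (capital_usd / lifetime_years + annual_running_usd) / capacity_tb

# Hypothetical 1 PB disk system, capital replaced every 5 years.
rate = amortised_cost_per_tb_year(capital_usd=2_000_000, lifetime_years=5,
                                  annual_running_usd=600_000, capacity_tb=1_000)
print(f"~${rate:,.0f}/TB/year")   # ~$1,000/TB/year with these placeholder inputs
```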
David responded:
“[…] To oversimplify, what this means is that if you can afford to keep stuff for a decade, you can afford to keep it forever. But this all depends on disk and tape technology continuing to get better at about the rate it has been for at least the next decade. The omens look good enough to use as planning assumptions, but anything could happen. And we need to keep in mind that in digital preservation terms a decade is a blink of an eye.”
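David’s “a decade is close to forever” point is essentially a geometric series: if the annual cost of storing a fixed collection falls by roughly 30% a year, the first ten years account for nearly the whole infinite-horizon total. A small sketch, with a notional first-year cost:

```python
# Sketch of the "afford a decade, afford forever" argument: with storage costs
# falling ~30% a year, the infinite sum of annual costs converges, and the
# first ten years already cover most of it. The first-year cost is notional.
first_year_cost = 100_000     # notional annual storage cost in year 1
decline = 0.30                # assumed yearly fall in the cost of storage

ten_year_total = sum(first_year_cost * (1 - decline) ** year for year in range(10))
forever_total = first_year_cost / decline        # limit of the geometric series

print(f"10-year total:          {ten_year_total:,.0f}")   # ~324,000
print(f"infinite-horizon total: {forever_total:,.0f}")    # ~333,333
print(f"the decade covers {100 * ten_year_total / forever_total:.0f}% of 'forever'")
```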
David Barret-Hague of Plasmon believes that the cost of long-term storage “is also a function of the frequency of data migrations - these involve manual interactions which are never going to get cheaper even if the underlying hardware does.” So he was not sure how affording storage for 10 years means you can afford it forever. (His company also has a white paper on the cost of ownership of archival storage, available at http://www.plasmon.com/resources/whitepapers.html.)

Reagan responded, agreeing that a long-term archive cannot afford to support manual interactions with each record. He added:
“Our current technology enables automation of preservation tasks, including execution of administrative functions and validation of assessment criteria. We automate storage on tape through use of tape silos.

With rule-based data management systems, data migrations can also be automated (copy data from old media, write to new media, validate checksums, register metadata). Current tape capacities are 1 Terabyte, with 6,000 cartridges per silo. With six silos, we can manage 36 Petabytes of data.

We do have to manually load the new tape cartridges into the silo. Fortunately, tape cartridge capacities continue to increase, implying the number of files written per tape increases over time. As long as storage capacities increase by a factor of two each tape generation, and the effective cost of a tape cartridge remains the same, the media cost remains finite (factor of two times the original media cost).

The labor cost, the replacement of the capital equipment, the electricity, software licenses, floor space, etc. are ongoing costs.”
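Reagan’s closing point about media cost remaining finite (“factor of two times the original media cost”) is the same kind of converging series: if cartridge capacity doubles each generation while the cartridge price stays roughly flat, each migration needs half as many cartridges as the last. A sketch with notional starting figures:

```python
# Sketch of Reagan's media-cost argument: capacity doubling per tape generation
# at a flat cartridge price means each migration buys half as many cartridges
# as the previous one, so total media spend converges to ~2x the original
# outlay. The cartridge count and price below are notional.
cartridges_needed = 6_000             # e.g. one full silo's worth to start with
price_per_cartridge = 100             # notional flat price per cartridge

original_outlay = cartridges_needed * price_per_cartridge
total_media_cost = 0
for generation in range(10):                             # ten migration cycles
    total_media_cost += cartridges_needed * price_per_cartridge
    cartridges_needed = max(1, cartridges_needed // 2)   # capacity doubles each time

print(f"total media spend after 10 generations: {total_media_cost:,} "
      f"(~{total_media_cost / original_outlay:.2f}x the original outlay)")
```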
Well, this discussion could (and maybe will) go on. But with my head reeling at the thought of one location holding 36 petabytes, I think I’m going to call it a day!

1 comment:

  1. Thanks for the summation Chris, useful to have this discussion in one place. This is clearly an issue to which there are no right or wrong approaches.


Please note that this blog has a Creative Commons Attribution licence, and that by posting a comment you agree to your comment being published under this licence. You must be registered to comment, but I'm turning off moderation as an experiment.