Monday, 10 August 2020

It is nonsense to consider significant properties only at file level

As far as I can see, most archives determine significant properties at the file level. (By the way, they often mean technical properties, which is not the same thing, but that is a topic for another blog post.) This is insufficient, and I will give two examples to show why.

Example 1 - Retro-digitised material

If monographs are scanned, as we do in-house, in order to preserve the originals and make them accessible to users, image files are created. If you look at these image files, you can identify the following significant properties:

  • readable
  • accessible for OCR analysis
  • reproducible
  • maybe even true to color


These properties can then be used to derive technical parameters, which are collected in requirement profiles and can lead, for example, to the recommendation of the TIFF file format.

The list above is missing one significant property: "the order of the scans should correspond to the original" (pagination). This property could be implemented by combining all scanned pages into one file, e.g. as BigTIFF or PDF/A. However, there may be good reasons not to put all pages into one file. What next? The remaining option is to add a file that describes the structure of the digitized material alongside the TIFF files. This can be a METS XML file, for example. METS is a good choice because it was created for this very purpose. Hmmm, is METS not a metadata format? And doesn't metadata belong outside of the payload? And isn't METS used by several archival information systems to describe the AIPs? So can I not pack the structural data into it?


It is true that METS is a metadata format. And it is true that METS is often used to describe container structures in SIPs or AIPs. But we have to distinguish between metadata that describes the IE (i.e. the payload) and metadata that inherently belongs to the payload. This is not easy, but here the significant properties help us: if the METS is used, as in our example, to represent the significant property "pagination", then the METS is part of the IE; otherwise it is not.
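To make this less abstract, here is a minimal sketch of what such a structure file could look like, built with Python's standard xml.etree.ElementTree. The element and attribute names (fileSec, fileGrp, file, FLocat, structMap, div, fptr, ORDER, FILEID) come from METS, but the function names, the file IDs and the scan file names are my own assumptions for illustration. The result is deliberately stripped down and has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_pagination_mets(page_files):
    """Build a stripped-down METS element that records the order of page_files.

    Hypothetical helper: a real METS file needs more (header, schema
    locations, checksums), which is left out here on purpose.
    """
    mets = ET.Element(f"{{{METS_NS}}}mets")
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp")
    struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap", {"TYPE": "PHYSICAL"})
    book = ET.SubElement(struct_map, f"{{{METS_NS}}}div", {"TYPE": "book"})
    for order, name in enumerate(page_files, start=1):
        file_id = f"FILE_{order:04d}"  # assumed ID scheme
        file_el = ET.SubElement(file_grp, f"{{{METS_NS}}}file", {"ID": file_id})
        # Relative location: the structure file only points at files inside the IE.
        ET.SubElement(
            file_el,
            f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": name},
        )
        page = ET.SubElement(
            book, f"{{{METS_NS}}}div", {"TYPE": "page", "ORDER": str(order)}
        )
        ET.SubElement(page, f"{{{METS_NS}}}fptr", {"FILEID": file_id})
    return mets


def page_order(mets):
    """Read the page sequence back from the physical structMap."""
    ns = {"mets": METS_NS}
    locations = {
        f.get("ID"): f.find("mets:FLocat", ns).get(f"{{{XLINK_NS}}}href")
        for f in mets.iterfind(".//mets:file", ns)
    }
    pages = sorted(
        mets.iterfind(".//mets:div[@TYPE='page']", ns),
        key=lambda div: int(div.get("ORDER")),
    )
    return [locations[p.find("mets:fptr", ns).get("FILEID")] for p in pages]


if __name__ == "__main__":
    mets = build_pagination_mets(["scan_0001.tif", "scan_0002.tif", "scan_0003.tif"])
    print(ET.tostring(mets, encoding="unicode"))
    print(page_order(mets))
```

The point of the sketch is that the order lives in its own small file next to the TIFFs, so it travels with the payload regardless of how the surrounding AIP is described.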

Now you might be tempted to get sloppy and just put the "pagination" into the METS of the AIP. Is that a good idea? No, because the IE should be kept available and usable. The AIP should only contain the metadata necessary to ensure availability. But when users later access the payload via a DIP, they should have everything together, i.e. an intellectual unit as it was actually intended. This is the principle of independence.

I admit that sounds abstract and difficult. But let us try an analogy. If I have loose pages whose order is important, then the order is important whether the pages are archived or not. So I bind them into a book, for example, or use other techniques. This is my intellectual unit that I want to archive. I put the whole thing in a box and write on it what is in it and what happened to the box or the contents during archiving. This is then my AIP. If I later hand over the contents of this box to someone, they do not need to care about what happened to the box. They can take the contents, work with them, and know exactly in which order the pages follow each other.

Example 2 - Web page

I would like to present a second example to illustrate another aspect. Let us assume that we are to archive a very specific web page, which for the sake of simplicity consists of an HTML document, CSV files and graphic files. If you look at the web page, the text always links one of the CSV files to one graphic file. Such a pairing could be the visualization of an experiment. The only thing that matters to the department is that the values, the textual content and the assignment to the graphic are not lost. Together with the department we determined the significant properties, and after a lot of effort we transferred the website (IE) into the long-term archive. After some time we found out that the graphic files were subject to format obsolescence and had to be migrated to a new format. We decided on the new image archive format PNG/A and migrated the old files.

But is this sufficient? No. The HTML document still contains the file names of the old format. Should we change the file names or leave them as they are? The principle of least surprise speaks for "change". But if we change the file names during the migration, we inevitably have to change the file names in the HTML document as well.
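As a rough sketch of this follow-up step, the snippet below rewrites the src and href values in the HTML document for every file that was renamed during migration. The function name, the file names and the target extension are assumptions for illustration only, and the regular expression is a shortcut: a production workflow would more likely use a proper HTML parser and record the change in the provenance metadata.

```python
import re

# Matches src="..." or href='...' and captures attribute name, quote and value.
REFERENCE = re.compile(r"""\b(src|href)\s*=\s*(["'])(.*?)\2""", re.IGNORECASE)


def rewrite_references(html, renamed):
    """Return html with every src/href that names a migrated file updated.

    renamed maps old file names to new ones (hypothetical mapping produced
    by the migration step). References to other files are left untouched.
    """

    def swap(match):
        attr, quote, target = match.groups()
        return f"{attr}={quote}{renamed.get(target, target)}{quote}"

    return REFERENCE.sub(swap, html)


if __name__ == "__main__":
    page = '<p>See <a href="run1.csv">the values</a> and <img src="run1.gif"></p>'
    print(rewrite_references(page, {"run1.gif": "run1.png"}))
```

Note that the HTML document itself is not migrated to a new format here, yet its content still has to change.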

Let's summarize

  1. Significant properties should be recorded at the level of the IE. They are not file-dependent.
  2. Metadata that is essential to represent the relationships between objects within an IE is a mandatory part of that IE.
  3. Format migrations can result in changes to other parts of the IE, even if those parts are not migrated themselves.
  4. Metadata and data inside an IE must never refer to data or metadata outside of it.
  5. Metadata outside of an IE, however, may well reference metadata and data of the IE.
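Point 4 in particular lends itself to an automated check. The following is only a sketch under my own assumptions: the IE is given as a list of relative file paths, the references have already been extracted from its documents, and all references are interpreted relative to the IE root. The function name and the example paths are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def external_references(ie_files, references):
    """Return the references that do not resolve to a file inside the IE."""
    members = {posixpath.normpath(path) for path in ie_files}
    outside = []
    for ref in references:
        parts = urlsplit(ref)
        if parts.scheme or parts.netloc:
            # Absolute URL: points outside the IE by definition.
            outside.append(ref)
        elif not parts.path:
            # Pure fragment or query (e.g. "#top") stays within the same document.
            continue
        elif posixpath.normpath(parts.path) not in members:
            # Relative path that leaves the IE or names a file that is not part of it.
            outside.append(ref)
    return outside


if __name__ == "__main__":
    ie = ["index.html", "data/run1.csv", "img/run1.png"]
    refs = ["data/run1.csv", "img/run1.png", "../shared/logo.png",
            "https://example.org/style.css", "img/run1.gif", "#top"]
    print(external_references(ie, refs))
```

Run after a migration, such a check would also catch the stale file name from Example 2, because the old graphic file is no longer a member of the IE.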

Whew, that was a lot of thinking, but I hope it was worth it.
