Kulturreste – Was von uns übrig bleibt…

The value of digital objects - fragment of an inaction assessment

2024-10-25T08:54:00.000-07:00

As an employee of a digital long-term archive, you get used to grief and know that it often takes long, hard persuasion to convince institutions that certain digital objects should be preserved for the long term.

Things look particularly bad for so-called born-digitals, originally digital objects that are irretrievably gone once they are lost.

A fatalistic approach says: "if it has been lost to posterity, then so be it". The Darwinian extension of this is: "if it has been lost to posterity, it was not valuable enough".

So what does "this object is valuable" mean?

This question is interesting not only from an insurance perspective. An answer could also help to back up the case for digital long-term preservation and to put process decisions on a more objective footing.

How do you determine the value of something that you only learn is needed once it is no longer accessible?

Replacement

There are different approaches to determining value. One is to take the cost of a "replacement". For retro-digitized material, this corresponds to the cost of digitizing it again. For books missing from the shelf, it is the effort one would have to spend to procure them again.

For unknown file formats, it is the cost of analyzing the file format. In this context you sometimes hear the statement: "If in doubt, we'll put a doctoral student on it who investigates this for half a year".

In the case of the Siebeck estate, we were dealing with floppy disks for a Panasonic typewriter. Interpreting them requires at least getting hold of such a typewriter, which is currently not available on the usual marketplaces. Reverse engineering the data format would certainly take another half year on top.

Works that never came into being

Another approach would be to determine how many works about the object never came into being because of its loss. In other words, you look at how many articles, research papers and so on deal with a digital object. If this object did not exist, the works building on it would not exist either.

Value as a reciprocal function of frequency

From the previous considerations it follows that the rarer an object is, the more valuable it is. Here, "rare" is an expression of missing redundancy.

If a book once cost 10€ in a print run of 10,000 copies, and only 2 copies exist now, is the following calculation valid?

10€/2 * 10000 = 50000€
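
The calculation can be written down as a small function. This is only a sketch of the idea in the text: the function name `scarcity_value` and the decision to reject zero remaining copies are assumptions made for illustration, not an established valuation method.

```python
def scarcity_value(unit_price: float, print_run: int, remaining: int) -> float:
    """Original price scaled by the inverse of the surviving share of the print run."""
    if remaining <= 0:
        # With no surviving copy the reciprocal is undefined.
        raise ValueError("at least one surviving copy is needed")
    return unit_price * print_run / remaining


# The book example from the text: 10 EUR, 10,000 copies printed, 2 left.
print(scarcity_value(10.0, 10_000, 2))  # 50000.0
```

The value grows without bound as the number of copies approaches one, which is one reason to doubt that this simple reciprocal is a usable measure on its own.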

Outlook

Hmmm. Do these approaches make sense? Not really. In practice, one would perhaps pick typical representatives of certain groups of object types and determine a value for them.

Anyway, if anyone has an idea how to do this better, bring it on!

Some thoughts about a minimalistic Archival Information System, part 3

2023-03-16T15:11:00.005-07:00

In the last blog post we considered what the basic structure of the information packages should look like and how we will deal with versioning. In the following I would like to describe further cornerstones of a minimalistic archival information system. These will then form the basis for a first implementation, which would go beyond the scope of this blog. As soon as there is news worth reporting, I will announce it here.

Data management

Other archival information systems sometimes make it too easy for themselves and use a database to manage information about the AIPs in the archive. In principle there is nothing wrong with this, but it is often forgotten that a basic principle of information packages is the intellectual entity (IE), the unit of data and metadata. What does this mean? The idea is that an IE should be able to stand on its own at all times. Following this principle has two consequences. First, hierarchically nested IEs cannot exist unless self-contained IEs are encapsulated like a box of boxes. In other words: IEs that only contain references to other IEs are not possible because they would not be viable on their own.

The second consequence is that all metadata must always be in a consistent state, regardless of the state of the archive information system. In other words, there must be no contradictions between the information stored in the AIP and the information in the system's database.

Why is this important? The Archival Information Packages are ultimately the time capsules that will outlast the archive. If everything breaks, but a copy of an AIP is still found on tape, it still contains all the information needed to interpret the data to be preserved.

So for the management of the AIPs we define the following:

The AIP is the basis for everything. If there are inconsistencies in the archival information system, we consult the AIPs first.
In order to speed up the processing, we can use a database. But then there must be a way to generate the database from the AIPs.
If the AIP is the basis for everything, then we need a mechanism that ensures that if there are errors in the creation of an AIP or in the creation of a new AIP version, these can be rolled back.
The archive can and should only assume responsibility for the data entrusted to it if an AIP or a new AIP version could be successfully generated.
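
The second rule, that any database must be reproducible from the AIPs, could look roughly like the following sketch. It assumes that every AIP is a BagIt directory directly below a hypothetical archive root, that the producer's ID is stored under the BagIt key `External-Identifier`, and that the version link uses the key `MAIS-previous-AIP`. The table layout and function names are made up for illustration.

```python
import sqlite3
from pathlib import Path


def read_bag_info(path: Path) -> dict:
    """Parse 'Key: Value' lines; indented lines continue the previous value."""
    info = {}
    key = None
    for line in path.read_text(encoding="utf-8").splitlines():
        if line[:1] in (" ", "\t") and key is not None:
            info[key] += " " + line.strip()
        elif ":" in line:
            key, value = line.split(":", 1)
            key = key.strip()
            info[key] = value.strip()  # sketch: a repeated key overwrites
    return info


def rebuild_index(archive_root: Path, db: sqlite3.Connection) -> int:
    """Throw the index away and regenerate it from the AIPs alone."""
    db.execute("DROP TABLE IF EXISTS aip")
    db.execute(
        "CREATE TABLE aip (path TEXT PRIMARY KEY, external_id TEXT, previous_aip TEXT)"
    )
    count = 0
    for bag_info in sorted(archive_root.glob("*/bag-info.txt")):
        info = read_bag_info(bag_info)
        db.execute(
            "INSERT INTO aip VALUES (?, ?, ?)",
            (
                str(bag_info.parent),
                info.get("External-Identifier"),
                info.get("MAIS-previous-AIP"),
            ),
        )
        count += 1
    db.commit()
    return count
```

Because the index is dropped and rebuilt from scratch, the database can never hold information that is not also in an AIP.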

Things that make life easier

What has proven to be very helpful is the following:

We should only allow 1:1 mappings. This means that a SIP contains exactly one digital object.
We do without nested IEs.
We ignore for now that copy operations are expensive. It follows that AIP updates always consist of SIPs with complete data and metadata.

Architectural decisions

A minimalist archival information system (MAIS) should have the following properties:

Implemented as open source for study, improvement and reuse
Command line oriented so there is a clear interface.
Concentration on the essentials, therefore no routing, but suitable for parallel use.
Fast and small enough not to waste resources.
Avoiding XML to keep code simple and metadata human-readable
BagIt as base for SIPs, AIPs and DIPs
Preservation Planning and Action as an external operation on a set of AIPs
Implementation in a programming language that can be used for all common operating systems without contortions.

And now?

I plan to tackle the programming in the coming weeks and months. I will probably not go into detail about the individual steps of programming here. As soon as there is something presentable, I'll let you know. Otherwise let me know what your experiences are, which details are important to you with an AIS, especially if it should be particularly lightweight.

Some thoughts about a minimalistic Archival Information System, part 2

2023-03-07T10:17:00.001-08:00

In the last post I explained some basic terms. Now it's time for the real thing.

Choosing the right format

The first question is: what should the information packages (SIP, AIP, DIP) look like? It is important that they are easy to process, easy to understand and easy to expand. Fortunately, RFC 8493 has the solution ready for us: BagIt.

BagIt is simple: it defines a directory structure and a few files with fixed functions. The tag files hold the metadata, and the payload directory has room for our payload. This is very interesting for us: if we want to store digital objects, we can store them in the BagIt payload. We can take over this area completely unchanged when processing a SIP and creating the AIP. The same is possible later when creating the DIPs from the AIPs. BagIt gives a lot of freedom. To limit ourselves, we choose UTF-8 for all metadata and text files, and we don't use fetch bags. Since BagIt is now standardized, we use version 1.0.
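
To make the structure concrete, here is a minimal sketch that writes such a bag by hand. It follows the layout from RFC 8493 (`bagit.txt`, `bag-info.txt`, a payload manifest and the `data/` directory). The function name `write_bag` and the choice of SHA-512 are assumptions for illustration; a real implementation would more likely use an existing BagIt library.

```python
import hashlib
from pathlib import Path


def write_bag(bag_dir: Path, payload: dict, info: dict) -> None:
    """Write a minimal BagIt 1.0 bag: payload below data/, tag files beside it."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        target = data_dir / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha512(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}\n")
    # Bag declaration: version 1.0 and UTF-8 for all tag files.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag_dir / "manifest-sha512.txt").write_text(
        "".join(manifest_lines), encoding="utf-8"
    )
    (bag_dir / "bag-info.txt").write_text(
        "".join(f"{key}: {value}\n" for key, value in info.items()), encoding="utf-8"
    )
```

A call such as `write_bag(Path("sip-0001"), {"vase.txt": b"..."}, {"External-Identifier": "vase-001"})` would produce a directory that can be handed over as a SIP.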

Metadata and AIP update considerations

Many AIS are insufficiently prepared for metadata and AIP updates. In my experience, it is important to think about how and which data is updated and what the consequences are. For the producer to be able to submit supplements, these must be clearly assignable to an existing AIP. One option is to give the producer an ID back after the first ingest. This is not a good choice, because it couples the processes tightly and exposes internals to the outside world. In addition, there can be collisions if a producer wants to switch to another AIS. A better choice is to ask the producers to choose a unique ID for their data themselves and to transmit it in their SIPs. Internally, we then use this ID to look up the matching AIPs. The ID is called "ExternalID" and is the basis for our internal MAIS-AIP-ID. More on that later.

In the last post I already mentioned that we have to think about the topic of versioning of AIPs. Not only because of the metadata or AIP updates, but also in the case of a PP&A, i.e. format migration. A simple idea is to introduce linked lists.

This allows us to easily implement the functionality of rolling back an AIP version as well.

A new AIP points to its predecessor: the new version receives a reference entry in the "bag-info.txt":

'MAIS-previous-AIP' - contains the AIP-ID (MAIS-AIP-ID) of the predecessor in the current AIS
'MAIS-migrated-AIP' - contains the AIP-ID in the previous AIS, if the AIP was migrated from there
'MAIS-origin-AIS' - contains the identifier of the previous AIS from which the AIP was migrated

The last two keys are optional and only needed if AIP-AIP-Transfer is needed to move digital objects from one archival information system to another.
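
Walking such a linked list of versions is straightforward. The sketch below assumes that the `MAIS-previous-AIP` values have already been read into a dictionary mapping each AIP-ID to its predecessor (or `None` for the first version). The function name and the cycle check are illustrative additions.

```python
from typing import Dict, List, Optional


def version_chain(aip_id: str, previous_of: Dict[str, Optional[str]]) -> List[str]:
    """Follow MAIS-previous-AIP references from the newest AIP back to the first."""
    chain = []
    seen = set()
    current: Optional[str] = aip_id
    while current is not None:
        if current in seen:
            # A broken archive could contain a loop; fail instead of spinning.
            raise ValueError(f"cycle in version chain at {current}")
        seen.add(current)
        chain.append(current)
        current = previous_of.get(current)
    return chain


links = {"aip-v3": "aip-v2", "aip-v2": "aip-v1", "aip-v1": None}
print(version_chain("aip-v3", links))  # ['aip-v3', 'aip-v2', 'aip-v1']
```

A rollback then only needs the second element of the chain as the source for the next new version.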

Some thoughts about a minimalistic Archival Information System, part 1

2023-03-02T12:37:00.002-08:00

Many of those who are dealing with the digital preservation of objects for the first time and who work in small memory organizations are often helpless in the face of the vast range of functions and requirements of current archival information systems.
Students of library or archival science often appear to be similarly overwhelmed when they are supposed to learn what constitutes archival software.

This has motivated me to write down thoughts on a minimalist archive information system. Because it really doesn't need much.

The basic terms

An archive essentially has three roles: the submitter, called the producer, the user, also called the consumer, and the problem solver who maintains the archive, also called the technical analyst.

When digital objects are transferred to the archive, it is called the ingest process. When they are requested from the archive, then this is the access process.

The digital objects to be preserved are provided with all the necessary information for the archive ingest and are packaged in a predefined structure. This is called a Submission Information Package (SIP). You can actually imagine this just like in real life. For example, if you want to store a vase, you put it in a box, label it and put it on a shelf.

In the archive it is checked whether (allegorically) the vase is in the box and intact, and whether there is a stamp and signature saying that the content of the package is indeed a vase. A file number and a storage location are assigned, and the box goes sealed and neatly labeled on the shelf. The "box" is called Archival Information Package (AIP). With the seal, the archive takes responsibility.

At some point, when the user would like to see the vase from the archive again, the archive would process the request and send the vase and accompanying information to the user. This is then called a Dissemination Information Package (DIP).

In addition to this simple "I store something safely and retrieve it again at some point" approach, an archive fulfills another task that is not so obvious: it ensures that objects entrusted to it are kept usable.

What does that mean in the digital world?
Even if it is possible in principle to store a digital object securely with bit accuracy over a very long period of time (bitstream archival), it can still age, because the environment for using this object is no longer available.

There are essentially three concepts for keeping digital objects usable (content preservation): hardware museum, emulation or format migration.

Hardware museums (e.g. a slot machine museum) try to keep old equipment running in a controlled environment. To do that, they have to build up a stock in time and build up knowledge on how to maintain and repair these devices.

With emulation, I try to recreate the environment for the digital object so that it feels at home and doesn't notice any difference from the previous, real world. A very good example of an emulator is MAME, but there are various others that emulate, for example, retro computers like the Amiga or C64, so that old programs from their time can run on them. Here, too, I need knowledge about what the environment to be emulated looks like and how I can recreate it with today's means.

With format migration, I try to find, in good time, a new form that retains the essential properties (significant properties), and to transfer the files of a digital object to a newer data format.

From this point onwards, it is assumed that format migration is the preferred way of maintaining usability.

It follows that an Archival Information System (AIS) must be able to support this process of format migration. The process (also called Preservation Planning and Action) results in a new version of the Archival Information Package being created. The AIS must be able to manage this.

That would basically be all there is to Archival Information Systems if it hadn't been for the librarians.

Unlike archivists, for whom a record is complete and closed, librarians understand the concepts of supplements and metadata submissions. A page that has fallen out has turned up here, a letter has been discovered there in an estate, or it has dawned on some people that there is now money for costly in-depth indexing. Ergo, librarians expect people to think about how to handle metadata and data updates on existing AIPs (called metadata update and AIP update). This is not trivial, since some AIPs are also very large and you want to avoid pointless copying. For such an update, we also need a good way for producers to tell the archive which AIP needs to be added to or updated.

However, AIPs are already versioned in the case of format migration, and the same mechanism can be used here as well. Any change to the AIP creates a new version of the AIP. And so that you can't accidentally break anything, you should always be able to go back to an old version. And since that is also error-prone, the result of the rollback process will simply be a new version.

That's it. It's nothing more. Easy, isn't it?

Draft for a differential BagIt

2022-05-12T09:39:00.003-07:00

The problem

BagIt (RFC 8493) forms the basis for Submission Information Packages (SIP) and Archival Information Packages (AIP) in many digital archives.
Especially in the library environment, it is necessary to support supplemental submissions in the Archival Information System (AIS) software. Supplements may be limited to metadata or may add new files, remove existing files, or replace existing files.
Unfortunately, there is no way to implement a differential SIP cleanly and easily in the BagIt specification.

The constraints

A design of a differential BagIt (dBagIt) should meet the following conditions:

1. existing BagIt should not be touched
2. it should be based on the BagIt structure so that the conversion effort is minimal
3. it should be easy to implement
4. it should support the "add" and "delete" operations
5. the checksum protection should be guaranteed
6. the referenced bag should be specified explicitly

The proposal of dBagIt

The basis is the structure of BagIt. The following are the changes that are mandatory.

Bag Declaration: dbagit.txt

In contrast to 2.1.1 of RFC8493 the filename is dbagit.txt

Payload Manifest

In contrast to 2.1.3 of RFC8493 each line of a payload manifest file MUST be of the form

   sign checksum filepath

where sign is either + for adding a file or - for deleting a file.

The replacement of files is simulated by one entry each for deleting and adding.
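
A parser for this line format could look like the following sketch. The names `Operation` and `parse_dbagit_manifest` are hypothetical; the only things taken from the proposal are the three fields and the two allowed signs.

```python
from typing import List, NamedTuple


class Operation(NamedTuple):
    sign: str      # "+" adds a file, "-" deletes a file
    checksum: str
    filepath: str


def parse_dbagit_manifest(text: str) -> List[Operation]:
    """Parse 'sign checksum filepath' lines of a dBagIt payload manifest."""
    operations = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        # Split only twice so that file paths containing spaces stay intact.
        parts = line.split(None, 2)
        if len(parts) != 3 or parts[0] not in ("+", "-"):
            raise ValueError(f"line {number}: expected 'sign checksum filepath'")
        operations.append(Operation(*parts))
    return operations
```

A replacement then simply shows up as two operations with the same file path, one with `-` and one with `+`.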

Bag Metadata: bag-info.txt

In addition to RFC8493, the key Updates-External-Identifier becomes mandatory. It is used to reference the original data object that this dBagIt updates.

Optional Tag Manifest

The Tag Manifest is similar to RFC8493.

Although tag manifest files in BagIt could be used to describe additional proprietary subdirectories of a bag not specified in the RFC, dBagIt does not define add and delete operations for them as it does for the payload manifest in the previous section. This keeps the creation and processing of dBagIts simple.

Implementation of the behavior

The implementation must ensure that:

the target object referenced by the key Updates-External-Identifier exists
the dBagIt is valid
the add/delete operations are atomic and can be rolled back
the checksums of files to be added are correct and the files are part of the current payload
the checksums of files to be deleted match the checksums of the files in the referenced digital object
the files in tag manifests are handled correctly if proprietary extensions are used
the metadata in bag-info.txt completely replaces the previous version in the referenced object
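
The checks on add and delete operations can be sketched on an in-memory model, where the referenced object is just a mapping from file path to checksum. Everything here is an assumption for illustration (the function name, SHA-512 as algorithm, the tuple shape of an operation). Atomicity is achieved in the simplest possible way: all changes happen on a copy, and the caller only swaps in the result if no check failed.

```python
import hashlib
from typing import Dict, Iterable, Tuple


def apply_dbagit(
    target: Dict[str, str],
    operations: Iterable[Tuple[str, str, str]],
    new_files: Dict[str, bytes],
) -> Dict[str, str]:
    """Return the updated path -> checksum mapping; 'target' itself is never modified."""
    operations = list(operations)
    updated = dict(target)
    # Deletions first, so that a replacement ("-" and "+" on one path) works.
    for sign, checksum, path in operations:
        if sign == "-":
            if updated.get(path) != checksum:
                raise ValueError(f"cannot delete {path}: missing or checksum mismatch")
            del updated[path]
        elif sign != "+":
            raise ValueError(f"unknown sign {sign!r}")
    for sign, checksum, path in operations:
        if sign == "+":
            content = new_files.get(path)
            if content is None:
                raise ValueError(f"cannot add {path}: not part of the dBagIt payload")
            if hashlib.sha512(content).hexdigest() != checksum:
                raise ValueError(f"cannot add {path}: checksum mismatch")
            if path in updated:
                raise ValueError(f"cannot add {path}: file already exists")
            updated[path] = checksum
    return updated
```

On disk the same idea would mean building the new AIP version in a temporary location and only moving it into place once every check has passed.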

Future

If there is interest, I would be happy to receive feedback via art1pirat ATgmail.com. Maybe a new RFC can grow out of it.

Alternate consideration

A very simple solution could also be the use of unified 'diff'. This also allows partial changes in files, but would hardly bring any advantages with binary data and is not quite as intuitive for users who are not familiar with IT.

FAQ (Update 2022-05-18)

What if "delete" references a non-existing file? The complete set of operations in a differential BagIt should be atomic and consistent. In this case the operations are rolled back and aborted with an error. This ensures that no unintended updates are applied.
Wouldn't it be nice to allow a simple rename instead of a replace, to avoid transferring files? This would be worth considering. However, a secure rename requires the checksum, the old filename, and the new filename. That makes it complicated again. Since the case would probably not be too frequent, this could be specified later if needed.
How is it ensured that, of several files with the same checksum, the wrong one is not deleted or replaced? Since both the checksum and the path of the existing file must be specified for "delete", a mix-up is impossible.
Is it correct that when I pass metadata in bag-info.txt, it overwrites the metadata in the referenced object? If yes, why? Yes, that is so. It simplifies the design to focus only on the payload. By the way, the purpose of differential BagIt is to reduce the cost of a complete transfer of all files in case of supplement deliveries. And most of the costs are usually incurred in the transfer of the payload.

Detectorist - Part two "A crumb of knowledge"

2021-12-03T08:24:00.000-08:00

A crumb of knowledge

In the first part I described how I came to know how to read the floppy disks (using Kryoflux). Now I would like to give an interim report on the floppy disk format of the Panasonic typewriter - in the quiet hope that someone could uncover the last secret.

I found the most important clue while researching a successor model - the Panasonic KX-W1000. I stumbled across the following old blog post: https://surrey.lug.org.uk/panasonic-kx-w1000.

My findings

Even if it didn't lead to a full success, there were some interesting insights. The floppy image is strongly related to FAT12.

Here is my summary.

The filesystem is based on FAT12 with proprietary extensions.

Header / MBR

The first bytes are 0x00 00 00 4B 58 2D 57 31 35 31 30 20 31 2E 30 30 20. After three zero bytes, that is from offset 3 onwards, they spell the string "KX-W1510 1.00".

The first 256 bytes are very similar to an MBR of old DOS floppies:

0000:0000 | 00 00 00 4B  58 2D 57 31  35 31 30 20  31 2E 30 30 | ...KX-W1510 1.00
0000:0010 | 20 F9 00 00  00 00 00 00  00 00 00 00  00 00 00 00 |  ù..............
0000:0020 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:0030 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:0040 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:0050 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:0060 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:0070 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:0080 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:0090 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:00A0 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:00B0 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:00C0 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:00D0 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:00E0 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
0000:00F0 | 00 00 00 00  00 00 00 00  00 00 00 00  00 00 00 00 | ................
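
The header shown in the dump can be checked with a few lines of Python. The sketch assumes only what the dump shows: three zero bytes, a space-padded ASCII label, and the byte 0xF9 at offset 17. The function name `disk_label` is made up. As a side note, 0xF9 is also the media descriptor value that DOS uses for 720 KB 3.5" disks, and it appears again as the first byte of the FAT blocks further down.

```python
# First 18 bytes of the image, copied from the hex dump.
HEADER = bytes.fromhex("00 00 00 4B 58 2D 57 31 35 31 30 20 31 2E 30 30 20 F9")


def disk_label(boot_sector: bytes) -> str:
    """Return the ASCII label that follows the three leading zero bytes."""
    return boot_sector[3:17].decode("ascii").rstrip()


print(disk_label(HEADER))   # KX-W1510 1.00
print(hex(HEADER[17]))      # 0xf9
```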

FATs

There are two identical blocks which probably represent FATs, one at address 0x200:

0000:0200 | F9 FF FF 03  40 00 05 B0  00 07 80 00  09 A0 00 FF | ùÿÿ.@..°..... .ÿ
0000:0210 | FF FF 0D E0  00 0F 00 01  FF 8F 01 13  40 01 15 60 | ÿÿ.à....ÿ...@..`
0000:0220 | 01 17 F0 FF  19 90 02 1B  10 02 1D E0  01 1F 00 02 | ..ðÿ.......à....
0000:0230 | FF 2F 02 23  F0 FF 25 60  02 2D 80 02  2C A0 02 2B | ÿ/.#ðÿ%`.-.., .+
0000:0240 | F0 FF FF EF  02 35 00 03  31 20 03 33  F0 FF 36 F0 | ðÿÿï.5..1 .3ðÿ6ð
0000:0250 | FF 37 80 03  39 F0 FF 3B  C0 03 3D E0  03 FF 0F 04 | ÿ7..9ðÿ;À.=à.ÿ..
0000:0260 | 41 20 04 43  F0 FF 45 60  04 47 80 04  49 F0 FF 4B | A .CðÿE`.G..IðÿK
0000:0270 | C0 04 4D E0  04 4F F0 FF  51 20 05 53  40 05 55 F0 | À.Mà.OðÿQ .S@.Uð
0000:0280 | FF 57 80 05  59 A0 05 5B  F0 FF 5D E0  05 69 B0 07 | ÿW..Y .[ðÿ]à.i°.
0000:0290 | 7F 20 06 63  40 06 65 F0  FF 6E 80 06  6B A0 06 FF | . .c@.eðÿn..k .ÿ
0000:02A0 | CF 06 6D F0  FF 7E 00 07  71 20 07 73  40 07 FF 6F | Ï.mðÿ~..q .s@.ÿo
0000:02B0 | 07 77 80 07  79 A0 07 FF  CF 07 7D 60  08 80 30 08 | .w..y .ÿÏ.}`..0.
0000:02C0 | 81 20 08 FF  4F 08 85 80  08 87 F0 FF  FF AF 08 8B | . .ÿO.....ðÿÿ¯..
0000:02D0 | F0 08 8D E0  08 90 20 09  91 40 09 93  F0 FF FF 6F | ð..à.. ..@..ðÿÿo
0000:02E0 | 09 A2 F0 09  99 A0 09 9B  C0 09 9D F0  FF A1 00 0A | .¢ð.. ..À..ðÿ¡..
0000:02F0 | B2 60 0A A3  40 0A A5 F0  FF AC 80 0A  A9 A0 0A AB | ²`.£@.¥ðÿ¬..© .«
0000:0300 | F0 FF AD E0  0A AF 00 0B  B1 B0 0B BA  40 0B B5 C0 | ðÿ.à.¯..±°.º@.µÀ
0000:0310 | 0C B7 80 0B  B9 60 0C C3  C0 0B BD E0  0B BF 00 0C | .·..¹`.ÃÀ.½à.¿..
0000:0320 | C1 20 0C CA  40 0C C5 80  0C C7 90 0C  E8 00 0D CB | Á .Ê@.Å..Ç..è..Ë
0000:0330 | F0 FF CD E0  0C CF B0 0D  D1 50 0D D3  40 0D DE F0 | ðÿÍà.Ï°.ÑP.Ó@.Þð
0000:0340 | FF D7 E0 0E  D9 70 0E E2  C0 0D DD F0  FF DF 00 0E | ÿ×à.Ùp.âÀ.Ýðÿß..
0000:0350 | E1 50 0E E3  40 0E E6 90  0E FB C0 0E  FF AF 0E EB | áP.ã@.æ..ûÀ.ÿ¯.ë
0000:0360 | F0 FF ED 60  0F EF 00 0F  F1 20 0F F3  40 0F F5 F0 | ðÿí`.ï..ñ .ó@.õð
0000:0370 | FF F7 80 0F  F9 A0 0F FF  CF 0F FD E0  0F 08 01 00 | ÿ÷..ù .ÿÏ.ýà....

and one at 0x800:

0000:0800 | F9 FF FF 03  40 00 05 B0  00 07 80 00  09 A0 00 FF | ùÿÿ.@..°..... .ÿ
0000:0810 | FF FF 0D E0  00 0F 00 01  FF 8F 01 13  40 01 15 60 | ÿÿ.à....ÿ...@..`
0000:0820 | 01 17 F0 FF  19 90 02 1B  10 02 1D E0  01 1F 00 02 | ..ðÿ.......à....
0000:0830 | FF 2F 02 23  F0 FF 25 60  02 2D 80 02  2C A0 02 2B | ÿ/.#ðÿ%`.-.., .+
0000:0840 | F0 FF FF EF  02 35 00 03  31 20 03 33  F0 FF 36 F0 | ðÿÿï.5..1 .3ðÿ6ð
0000:0850 | FF 37 80 03  39 F0 FF 3B  C0 03 3D E0  03 FF 0F 04 | ÿ7..9ðÿ;À.=à.ÿ..
0000:0860 | 41 20 04 43  F0 FF 45 60  04 47 80 04  49 F0 FF 4B | A .CðÿE`.G..IðÿK
0000:0870 | C0 04 4D E0  04 4F F0 FF  51 20 05 53  40 05 55 F0 | À.Mà.OðÿQ .S@.Uð
0000:0880 | FF 57 80 05  59 A0 05 5B  F0 FF 5D E0  05 69 B0 07 | ÿW..Y .[ðÿ]à.i°.
0000:0890 | 7F 20 06 63  40 06 65 F0  FF 6E 80 06  6B A0 06 FF | . .c@.eðÿn..k .ÿ
0000:08A0 | CF 06 6D F0  FF 7E 00 07  71 20 07 73  40 07 FF 6F | Ï.mðÿ~..q .s@.ÿo
0000:08B0 | 07 77 80 07  79 A0 07 FF  CF 07 7D 60  08 80 30 08 | .w..y .ÿÏ.}`..0.
0000:08C0 | 81 20 08 FF  4F 08 85 80  08 87 F0 FF  FF AF 08 8B | . .ÿO.....ðÿÿ¯..
0000:08D0 | F0 08 8D E0  08 90 20 09  91 40 09 93  F0 FF FF 6F | ð..à.. ..@..ðÿÿo
0000:08E0 | 09 A2 F0 09  99 A0 09 9B  C0 09 9D F0  FF A1 00 0A | .¢ð.. ..À..ðÿ¡..
0000:08F0 | B2 60 0A A3  40 0A A5 F0  FF AC 80 0A  A9 A0 0A AB | ²`.£@.¥ðÿ¬..© .«
0000:0900 | F0 FF AD E0  0A AF 00 0B  B1 B0 0B BA  40 0B B5 C0 | ðÿ.à.¯..±°.º@.µÀ
0000:0910 | 0C B7 80 0B  B9 60 0C C3  C0 0B BD E0  0B BF 00 0C | .·..¹`.ÃÀ.½à.¿..
0000:0920 | C1 20 0C CA  40 0C C5 80  0C C7 90 0C  E8 00 0D CB | Á .Ê@.Å..Ç..è..Ë
0000:0930 | F0 FF CD E0  0C CF B0 0D  D1 50 0D D3  40 0D DE F0 | ðÿÍà.Ï°.ÑP.Ó@.Þð
0000:0940 | FF D7 E0 0E  D9 70 0E E2  C0 0D DD F0  FF DF 00 0E | ÿ×à.Ùp.âÀ.Ýðÿß..
0000:0950 | E1 50 0E E3  40 0E E6 90  0E FB C0 0E  FF AF 0E EB | áP.ã@.æ..ûÀ.ÿ¯.ë
0000:0960 | F0 FF ED 60  0F EF 00 0F  F1 20 0F F3  40 0F F5 F0 | ðÿí`.ï..ñ .ó@.õð
0000:0970 | FF F7 80 0F  F9 A0 0F FF  CF 0F FD E0  0F 08 01 00 | ÿ÷..ù .ÿÏ.ýà....
0000:0980 | 00 00 00 00  00 00 00 00  00 00 00 00  FF 0F 00 00 | ............ÿ...
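
For comparison, this is how the first bytes of the block would unpack if they were a standard DOS FAT12, where every three bytes hold two 12-bit entries. Whether the typewriter really follows this layout is exactly the open question, so the sketch is only an aid for experimenting and not a description of the Panasonic format.

```python
from typing import List


def fat12_entries(fat: bytes) -> List[int]:
    """Unpack 12-bit entries the way DOS FAT12 stores them (two per three bytes)."""
    entries = []
    for i in range(0, len(fat) - 2, 3):
        b0, b1, b2 = fat[i], fat[i + 1], fat[i + 2]
        entries.append(b0 | ((b1 & 0x0F) << 8))   # even entry: b0 plus low nibble of b1
        entries.append((b1 >> 4) | (b2 << 4))     # odd entry: high nibble of b1 plus b2
    return entries


# The first 18 bytes at 0x200, copied from the dump.
FAT_START = bytes.fromhex("F9 FF FF 03 40 00 05 B0 00 07 80 00 09 A0 00 FF FF FF")
print([f"{entry:03X}" for entry in fat12_entries(FAT_START)])
# ['FF9', 'FFF', '003', '004', '005', '00B', '007', '008', '009', '00A', 'FFF', 'FFF']
```

Read this way, entries 2 onwards look like ordinary cluster chains (2 → 3 → 4 → 5 → ...), which supports the FAT12 suspicion even though the mapping from clusters to offsets in the image is still unknown.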

Umlauts and special chars

Umlauts and Special chars are mapped as follows:

ä → 0x7b
ö → 0x7c
ü → 0x7d
Ä → 0x5b
Ö → 0x5c
Ü → 0x5d
ß → 0x85
hyphen → 0xbc
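
The mapping above translates directly into a lookup table. The sketch below assumes that all other bytes can be read as plain ASCII, which fits the readable text fragments but is not confirmed. The names are hypothetical, and mapping 0xbc to "-" is a guess at which kind of hyphen is meant. The six umlaut codes are the same positions that the German ISO 646 variant (DIN 66003) uses, whereas the code for ß is not.

```python
# Byte values observed in the image, mapped to the characters they stand for.
KX_TO_UNICODE = {
    0x7B: "ä", 0x7C: "ö", 0x7D: "ü",
    0x5B: "Ä", 0x5C: "Ö", 0x5D: "Ü",
    0x85: "ß",
    0xBC: "-",  # assumption: rendered as a plain hyphen
}


def decode_kx_text(raw: bytes) -> str:
    """Decode typewriter text; unmapped bytes are taken over unchanged."""
    return "".join(KX_TO_UNICODE.get(byte, chr(byte)) for byte in raw)


print(decode_kx_text(b"Gr\x7d\x85e aus \x5bgypten"))  # Grüße aus Ägypten
```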

Open Questions

What is still completely unclear is how the FATs are constructed. They do look like FAT12 entries: the first bytes 0xf9 0xff 0xff 0x03... and the frequently occurring 0xff suggest this. Yet there seems to be no connection between the addresses of the text fragments in the image and the FAT byte sequences.

In the directory entries everything points to byte 26 indicating the start cluster and bytes 28-29 the file size. I could not yet decipher the connection between the FAT and the actual offset (or cluster) of the data.

The meaning of offset 0x100 is unclear.

If you have any ideas how to read the FATs, or how to interpret the bytes 26, 28-29 of the directory entries, or what the cluster size should be, feel free to write me.

If you are the owner of such an old typewriter, it would be helpful to have a clean-room floppy copy, i.e. a freshly formatted floppy with a small test text, so that I can reverse engineer the data format even better.

Just contact me at art1piratatgoogledotcom

Supportive Links

https://archive.org/details/MSXTechnicalDataBook/page/n269/mode/2up

https://github.com/Konamiman/MSX2-Technical-Handbook/blob/master/md/Chapter3.md#3--structure-of-disk-files

https://manualsbrain.com/ja/products/panasonic-kx-w1510/

Thanks

My thanks go to

Panasonic Museum
David Murray, the 8-bit Guy
John Wash, for his analysis of kx-w1000 floppies, https://surrey.lug.org.uk/panasonic-kx-w1000
Jason Kleiner
to the Developers and Maintainers of Okteta, Kaitai, wxHexEditor, Debian, Kryoflux…

Detectorist - Part one "First indications"

2021-11-28T03:22:00.007-08:00

First indications

In an estate there are several CDROMs, DVDs and especially floppy disks. We were able to read most of them with Linux, including the floppy disks. Only on the last 8 floppy disks did we have a hard time.

On one of the floppy disks a small inscription peeked out, referring to a Panasonic electronic typewriter. There was none in the estate, no other information was available.

Eight floppy disks, 3.5", double density, not readable.

Time passed, constantly haunted by the voice in my mind: "There's something on the disks, only what?"

We managed to purchase a Kryoflux controller (see https://kryoflux.com/; there are also free open-source alternatives). This is a special disk controller that allows you to record the magnetic flux as the read heads move over the disk.

After the first attempts, I was able to create an image file with the following command:

./dtc -fIMAGEFILE -dd1 -g2 -i4

The options mean:

"-dd1" - double density
"-g2" - double sided
"-i4" - MFM sector image 40/80+ tracks

A look at the image in a hex editor showed that my hunch was right. After the first three zero bytes, the string "KX-W1510 1.00" followed (and to my happy surprise, a lot of readable text fragments).

Yep, there is exactly one electronic typewriter series from Panasonic.

Disillusionment

I was able to find a manual at https://manualsbrain.com/en/manuals/1814281/. And yes, the machine used 3.5" floppy disks, double sided, double density with a capacity of 713,000 characters, but unfortunately without an exact description of the disk format and the file system.

I then contacted Panasonic support - no success. I started researching patent databases in Japan, the USA and Germany - nothing. I wrote to the Panasonic museum in Japan, but unfortunately they could not help me.

A proprietary disk format, which was forgotten after 30 years.

In the next part I will report what I could find out about the disk system of the Panasonic typewriter KX-W1510, and where I am (still) failing...

Backup is digital long-term preservation!

2021-04-01T00:16:00.006-07:00

Exponential growth

https://www.statista.com/chart/17727/global-data-creation-forecasts/

An important observation is that the number of files produced each year continues to increase worldwide (see https://en.wikipedia.org/wiki/Information_explosion). The number of digital objects for which we must decide "keep or throw away?" grows to the same extent.

The truth is, the discard scenario becomes the more likely one with each passing year.

Magnificent diversity

Another observation is that about 90 new file formats are added every year.
And the file formats that fall out of use are still around.

The truth is, no one can build up format knowledge for all of this anymore.

A fuzzy concept

When talking to colleagues, the topic of validation does not play a role. For one thing, no one is clear about what "valid" means. Valid against a specification? Valid against a profile? Valid because it can be opened by programs? On the other hand, nothing happens after that. If a file is broken, it is still archived. If it is not broken, fine.

The truth is, validation is useless.

Success factors

Do you know how the success of digital preservation is measured? I'll tell you, in terabytes per year. If the numbers go up, that's a good thing to sell to politicians. Whether it was difficult to prepare digital objects for long-term availability doesn't matter. Whether born-digitals are more at risk, never mind.

Is that the truth?

Overrated

It used to be said that long-term digital archiving could only be handled by organizations with a minimum of resources. Look around and you'll find dozens of one-man orchestras and part-time archives. And do you think that as the amount of data increases, so do the human resources? Oh, come on!

You know the truth!

That's too exhausting

If you've ever heard of format migration as a principle of long-term preservation, you've read in textbooks phrases like

To ensure format migration, the significant properties of groups of objects that must be preserved must be determined.

Have you ever seen an archive that has actually determined and documented significant properties?

The truth is, significant properties are determined after the fact from technical metadata.

Summary

So what is digital long-term preservation? Only an expensive backup.

Impossible - or how I learned to read data storage media at the speed of light and what it's good for

2021-01-27T06:59:00.011-08:00

When I receive data carriers from an estate, I want to get a quick overview of what is on the floppy disk, the CDROM, the USB stick or the hard disk drive so that I can look at the interesting things first.

But I only know what is there when I read the media, right? A typical chicken and egg problem.

https://openclipart.org/detail/212857/sci-fi-scanner-device

I discovered the crucial clue to the solution in a 2014 talk by Simson Garfinkel, "Digital Forensics Innovation: Searching A Terabyte of Data in 10 minutes" (http://simson.net/ref/2014/2014-02-21_RPI_Forensics_Innovation.pdf).

What is Random Sampling?

Random sampling is nothing more than looking at only a small, randomly chosen part of a total set and inferring the big picture from it.

To find out what is on a medium, it would be sufficient to look at random blocks and determine for them, based on their byte structure, whether they fall into the categories "empty", "random", "text", "video" or "undef".

Exactly this approach is implemented in the Perl module File::FormatIdentification::RandomSampling, which can be found on CPAN under https://metacpan.org/pod/File::FormatIdentification::RandomSampling.

The category "empty" is dominated by sequences of zero bytes, in the category "random" the byte values are almost equally distributed, in the category "text" values for the characters "a-z" from the ASCII character set appear frequently, "video" contains frequent byte sequences resulting from the basic structure of MPEG. And under "undef" everything else is subsumed.
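The category rules above can be sketched in a few lines of Python. This is not the code of File::FormatIdentification::RandomSampling: the thresholds, the 512-byte block size and the function names are assumptions chosen for illustration, and the "video" category is left out because it needs format-specific byte patterns.

```python
import math
import os
import random
from collections import Counter

BLOCK_SIZE = 512  # assumed sector size


def classify_block(block: bytes) -> str:
    """Assign a block to a coarse category based on its byte histogram."""
    if not block:
        return "undef"
    counts = Counter(block)
    total = len(block)
    # "empty": dominated by zero bytes (threshold is an assumption)
    if counts[0] / total > 0.9:
        return "empty"
    # "text": ASCII letters a-z appear frequently (threshold is an assumption)
    lowercase = sum(counts[b] for b in range(ord("a"), ord("z") + 1))
    if lowercase / total > 0.5:
        return "text"
    # "random": byte values almost equally distributed, i.e. high entropy
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
    if entropy > 7.0:
        return "random"
    return "undef"


def estimate_content(path: str, samples: int = 1000, seed=None) -> dict:
    """Classify randomly chosen blocks and return the share of each category in percent."""
    rng = random.Random(seed)
    tally = Counter()
    with open(path, "rb") as image:
        size = image.seek(0, os.SEEK_END)  # also works for block devices on Linux
        blocks = size // BLOCK_SIZE
        if blocks == 0:
            return {}
        for _ in range(samples):
            image.seek(rng.randrange(blocks) * BLOCK_SIZE)
            tally[classify_block(image.read(BLOCK_SIZE))] += 1
    return {category: 100.0 * hits / samples for category, hits in tally.items()}
```

Because only the sampled blocks are read, the running time depends on the number of samples and not on the size of the medium.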

Example

The above Perl module contains the program crazy_fast_image_scan.pl. The following simple call:

perl -I lib bin/crazy_fast_image_scan.pl --percent=0.000001 --image=/dev/mapper/laptop--vg-home

provides the following output:

Scanning Image /dev/mapper/laptop--vg-home with size 728982618112, checking 1423 sectors
scanning [...]
Estimate, that the image '/dev/mapper/laptop--vg-home'
has percent of following data types:
    44.6% random/encrypted/compressed
    35.6% undef
    11.0% empty
    5.4% video/audio
    3.5% text

The complete output is even more extensive. Note that the examined partition is about 679 GiB (728982618112 bytes) in size and was scanned in just 15 seconds.

Limits

Importantly, the output provides only a rough estimate of what might be on the media. The choice of the sample size (here: via the --percent parameter) determines how informative the estimate is, as well as how long it takes until a result is delivered.
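How rough the estimate is can be approximated with the standard error of a proportion. The sketch below assumes that sectors are drawn independently and uniformly at random and uses the normal approximation; the function name is hypothetical, and the module itself may not report such a margin. The figures are taken from the example output above: 728982618112 bytes correspond to 1,423,794,176 sectors of 512 bytes, of which 1423 were checked.

```python
import math


def margin_of_error(share: float, samples: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a sampled share."""
    return z * math.sqrt(share * (1.0 - share) / samples)


# Shares reported in the example output, each based on 1423 sampled sectors
for label, share in [("random/encrypted/compressed", 0.446), ("text", 0.035)]:
    margin = margin_of_error(share, 1423)
    print(f"{label}: {share:.1%} +/- {margin:.1%}")
```

Under these assumptions the 44.6% estimate is good to roughly 2.6 percentage points either way. Since the margin shrinks with the square root of the sample size, four times as many samples are needed to halve it.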

More ideas

In the above module, I have implemented an experimental output of the MIME-Types potentially present on the media. This is not very stable yet and needs more work, but it can help to estimate even better whether the files on a disk are interesting enough to prioritize it. Here is an example output:

The next mimetype estimation is experimental and needs further work:
    87.9% unknown
    3.5% application/pdf
    1.1% video/quicktime
    0.8% image/gif
    0.8% text/java
    0.7% application/msword
    0.6% text/markdown
    0.6% application/vnd.openxmlformats-officedocument.wordprocessingml.document
    0.6% application/xml
    0.4% application/msaccess
    0.4% application/navimap
    0.4% application/rtf
    0.3% image/png
    0.2% application/arj
    0.1% application/vnd.ms-powerpoint
    0.1% text/html

The approach is to determine the MIME-Type of the files for a test corpus using other tools, determine typical bytegram values and pass the whole thing to a decision tree learner. If you are interested, you are welcome to contribute to the module.

Happy scanning!

It is nonsense to consider significant properties only at file level

2020-08-10T11:01:00.001-07:00

It appears that most archives record significant properties at the file level. (By the way, they often mean technical properties, which is not the same thing, but that is a topic for another blog post.) This is insufficient, as two examples will show.

Example 1 - Retro-digitised material

If monographs are scanned, as we do in-house, in order to preserve the originals and make them accessible to users, images are created. If you look at these image files, you can identify the following significant properties:

readable
accessible for OCR analysis
reproducible
maybe even true to color

These properties can then be used to define technical parameters that can be found in certain requirement profiles and can lead, for example, to the recommendation of the TIFF file format.

The list above is missing the significant property "the order of the scans should correspond to the original" (pagination). This property could be implemented by combining all scan pages into one file, e.g. as BigTIFF or PDF/A. However, there may be good reasons not to put all pages into one file. What next? The remaining option is to add a file describing the structure of the digitized material alongside the TIFF files. This can be a METS XML file, for example. METS is a good choice because it was created for this very purpose. Hmmm, is METS not a metadata format? And doesn't metadata belong outside of the payload? And isn't METS used by several archive information systems to describe their AIPs? So can I not pack the structuring data in there?

Stop!

It is true, METS is a metadata format. And it is true that METS is often used to describe container structures in SIPs or AIPs. But we have to distinguish between metadata describing the IE (i.e. the payload) and metadata inherently belonging to the payload. This is not easy, but here the significant properties help us: If the METS is used, as in our example, to represent the significant property "pagination", then the METS is part of the IE, otherwise it is not.

Now you might be tempted to get sloppy and just put the "pagination" into the METS of the AIP. Is that a good idea? No, because the IE should be kept available and usable on its own. The AIP should only contain the metadata necessary to ensure availability. When a user later accesses the payload via a DIP, they should have everything together, i.e. an intellectual entity as it was actually intended. This is the principle of independence.

I admit that sounds abstract and difficult. But let us try an analogy. If I have loose pages whose order matters, then the order matters whether the pages are archived or not. So I bind them into a book, for example, or use other techniques. This is the intellectual entity that I want to archive. I put the whole thing in a box and write on it what is in it and what happened to the box or its content during archiving. This is then my AIP. If I later hand over the contents of this box to someone, they do not necessarily have to be interested in what happened to the box; they can take the contents, work with them and know exactly in which order the pages follow each other.

Example 2 - Web page

I would like to present a second example to illustrate another aspect. Let us assume that we are to archive a very specific web page, which for the sake of simplicity consists of an HTML document, CSV files and graphic files. On the web page, the text always links one of the CSV files to one graphic file; the pairing could be the visualization of an experiment. The only thing that matters to the department is that the values, the textual content and the assignment to the graphics are not lost. Together with the department we determined the significant properties, and after a lot of effort we transferred the website (IE) into the long-term archive. After some time we found out that the graphic files were subject to format obsolescence and had to be migrated to a new format. We decided on the new image archive format PNG/A and migrated the old files.

But is this sufficient? No. The HTML document still contains the file names of the old format. Should we change the file names or leave them as they are? The principle of least surprise speaks for "change". But if we change the file names during the migration, we inevitably have to change the file names in the HTML document as well.
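What such a follow-up change could look like is sketched below in Python. It is only an illustration: the file names are hypothetical, and a regular expression over src and href attributes is a simplification that a real migration would replace with a proper HTML parser.

```python
import re

# Matches src="..." and href="..." attributes (single or double quoted)
ATTRIBUTE = re.compile(r"""\b(src|href)\s*=\s*(["'])(.*?)\2""", re.IGNORECASE)


def update_references(html: str, renamed: dict) -> str:
    """Rewrite references to migrated files; all other references stay untouched."""
    def swap(match):
        attribute, quote, target = match.groups()
        return f"{attribute}={quote}{renamed.get(target, target)}{quote}"
    return ATTRIBUTE.sub(swap, html)


def dangling_references(html: str, files_in_ie: set) -> list:
    """List referenced names that are not part of the IE (any more)."""
    return [m.group(3) for m in ATTRIBUTE.finditer(html) if m.group(3) not in files_in_ie]


# Hypothetical IE after the image migration: plot1.gif was replaced by plot1.png
page = '<p><img src="plot1.gif"> see <a href="run1.csv">run1.csv</a></p>'
files = {"index.html", "run1.csv", "plot1.png"}

print(dangling_references(page, files))
migrated_page = update_references(page, {"plot1.gif": "plot1.png"})
print(dangling_references(migrated_page, files))
```

Running the dangling-reference check before and after the migration makes the side effect visible: the HTML document has to change although it was not itself subject to format obsolescence.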

Let's summarize

Significant properties must be recorded at the level of the IE. They do not depend on individual files.
Metadata that is essential to represent the relationships between objects within an IE is a mandatory part of that IE.
Format migrations can require changes to other parts of the IE, even if those parts are not migrated themselves.
Metadata and data inside an IE must never refer to data or metadata outside of it.
Metadata outside of an IE, however, may well reference metadata and data inside an IE.

Whew, that was a lot of thinking, but I hope it was worth thinking about it.

Format recognition, new analysis options?

2020-07-22T08:57:00.002-07:00

Previous work

In an older article (see https://kulturreste.blogspot.com/2018/10/heres-tool-make-it-work.html) I already presented an analysis of PRONOM signatures. As of today, the module for this is available on CPAN, see https://metacpan.org/pod/File::FormatIdentification::Pronom for details.

In addition to the statistics on PRONOM signatures, the Perl package comes with two more helper scripts that can make the work of a long-term archivist easier.

Format identification

First, there is classic format identification. The script reports all hits and indicates the quality of each regular expression. This value does not say how well the PRONOM signature matches the file, but how specific the signature itself is.

Here is an example output for a TIFF file that was wrongly identified as GeoTIFF by DROID:

perl -I lib bin/pronomidentify.pl -s DROID_SignatureFile_V96.xml -b /tmp/00000007.tif

/tmp/00000007.tif identified as Tagged Image File Format with PUID fmt/353 (regex quality 1)
/tmp/00000007.tif identified as Geographic Tagged Image File Format (GeoTIFF) with PUID fmt/155 (regex quality 2)

Colorized output of possible signature hits in the hexeditor wxHexEditor

Under Linux you can use the editor wxHexEditor to analyze files. It allows you to create tag-files, in which you can define sections that are marked with colors and annotated.

The script pronom2wxhexeditor creates such a file. In the following you can see the call and a screenshot.

perl -I lib bin/pronom2wxhexeditor.pl -s DROID_SignatureFile_V96.xml -b /tmp/00000007.tif

What next?

Well, it's up to us as a community to use the existing tools and their possibilities to improve our daily work. Anyone who has suggestions for improvement or ideas is welcome to share them with us.

I would be especially happy if some helpful souls took the PRONOM statistics to heart and helped improve the PRONOM signatures.

It makes sense to start with the orphaned signatures and to re-check the signatures that are used multiple times.

Why it is a stupid idea to consider CSV as a valid long-term preservation file format

2020-07-13T05:05:00.001-07:00

Take CSV!

It's so nice and quick and easy to say. Take CSV!

For simple cases that may be true. CSV files look so simple, so innocent, so sweet. Yet by their very nature they are insidious and vicious, and working with them resembles a bloody descent into the deepest dungeons of a classic role-playing game.

Let us begin our journey.

Innocent simplicity

You take a separator, e.g. the comma, and use it to separate your values. Write both out in readable form. Done.

Okay. We need a second separator to mark the start of the next line. But then, done! It's a CSV.

Hmm. There was something. The line separator. Now, is that line feed, carriage return, or carriage return and line feed? It depends, for example, on which operating system you are running.

The monster is growing

It is not a bad idea to separate values of a list by commas. Especially for Americans, this feels quite natural.

In other parts of the world, the decimal places of fractional numbers are separated by commas. Good, then we'll give the spreadsheets the opportunity to define the separator freely. Problem solved.

Well, not quite. In other contexts the separator itself might appear inside individual values of a list. Good, then we'll introduce quoting. We define a character that allows us to recognize whether a separator is a separator or just a text component of a list value. Quotation marks would fit. That was easy, wasn't it?

Short break

So, to sum up. CSV files are easy. You need a separator, which can be a comma or anything else. We have a second separator that separates the lines, usually in one of three variations. And we need quoting so that a separator inside a value is not mistaken for a real separator.

Yeah, it may have been a little more complex than it looked at first. But what is there to make it worse?

Little toothy pegs!

Hmm, what if I want to store a text like this as a value after the raw value 1:

And he said "Oh, no!"

In the text we have a comma, which would be protected by quoting. But we also have quotation marks, which we need for our quoting. No problem: we double each quotation mark inside the value to indicate that the text is not finished. So in the CSV it now looks like this:

1,"And he said ""Oh, no!"""

I got it.

But, wait, what happens if my text consists of a single quotation mark?

1,""""

You're lucky. It seems to be working.

Wait, so what if I have a lot of quotation marks? As in

""""""

This is translated to

1,""""""""""""""

It works, too.

The problem is in the details

Now, a nasty little devil might get the idea to construct a text as value that contains line breaks, for example this one:

Evil Text
",
",

That would then become:

1,"Evil Text
"",
"","

Oops! If I now stubbornly read this line by line, I would end up with strange records.

Good thing there is real software out there that reads and parses CSV files cleanly from the beginning. Not that anyone here still uses 'grep' and co.
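To make the difference concrete, here is a minimal sketch with Python's standard csv module standing in for such "real software". It writes the evil value from above with the module's default dialect (comma, double quotes, doubled quotes as escape) and then compares a naive line-by-line view with a proper parse.

```python
import csv
import io

# The evil value from above: three lines of text containing quotes and commas
value = 'Evil Text\n",\n",'

# Write one record (1, value) with the default "excel" dialect
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow([1, value])
encoded = buffer.getvalue()

# Naive approach: treat every physical line as one record
naive_lines = encoded.splitlines()

# Proper approach: let a CSV parser decide where records end
rows = list(csv.reader(io.StringIO(encoded, newline="")))

print(len(naive_lines), "physical lines:", naive_lines)
print(len(rows), "record:", rows)
```

The naive split yields three fragments, none of which is a valid record on its own, while the parser returns a single record with two fields. Note that the reader hands back every field as a string, so the leading 1 comes back as '1'.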

The Abyss

Have we actually talked about character encoding yet? ASCII, Latin-1, UTF32? UTF8? With or without byte-order mark? No. Let's turn back. We still have a chance.

Later, at the pub.

I admit it was a terrible trip. Now, over a cold beer, we can laugh about it. But our hearts were in our mouths. We had no idea what to expect.

If only there had been a sign that said what character encoding, what line end encoding, what separators for lines and columns we could expect, yes, then we would have been able to understand CSV and we would have been spared the horror. But the horror comes from the darkness, from the premonitions of the unknown.
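Such a sign could be a small description file stored next to the CSV. The following Python sketch shows the idea; the JSON field names are made up for this example and do not follow any particular standard, and the sample data is invented.

```python
import csv
import io
import json

# Hypothetical sidecar describing how the CSV file is to be read
sidecar = json.dumps({
    "encoding": "utf-8",
    "line_terminator": "\r\n",  # documented for humans and other tools
    "delimiter": ";",
    "quote_char": '"',
    "double_quote": True,
    "header": True,
})


def read_with_sidecar(raw: bytes, sidecar_json: str):
    """Decode and parse CSV bytes exactly as the sidecar describes them."""
    description = json.loads(sidecar_json)
    text = raw.decode(description["encoding"])
    # Python's csv reader recognises \r, \n and \r\n on its own,
    # so the line terminator is not passed on here.
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=description["delimiter"],
        quotechar=description["quote_char"],
        doublequote=description["double_quote"],
    )
    rows = list(reader)
    if description["header"]:
        return rows[0], rows[1:]
    return None, rows


raw = 'name;value\r\n"Müller; A.";3,14\r\n'.encode("utf-8")
header, records = read_with_sidecar(raw, sidecar)
print(header, records)
```

With the description at hand, the semicolon inside the quoted name and the decimal comma in the value no longer cause surprises; without it, a reader would have to guess all of these settings.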

Therefore, be warned!

Don't use CSV, it could get you!

format zoo for videos - a bad idea in digital preservation

2020-02-18T02:26:00.002-08:00

Background

An article at https://axfelix.github.io/ffv1 gives reasons not to normalize born-digital videos to FFV1, but to convert them to lossy codecs instead. Elsewhere I have even heard that normalization is not applied at all because it requires so many resources.

Why is normalization a good idea after all?

Normalization ensures that a manageable set of file formats remains from the huge format zoo, which can be handled well in the future. Normalization therefore reduces the organizational complexity above all.

And why should you use Matroska/FFV1?

FFV1 has the disadvantage of imposing higher storage requirements on its users, but in my opinion the following points outweigh this:

FFV1 is much less complex than h264 (read "reduced technical complexity")
FFV1 (like other lossless codecs) allows automatic format migration (see also RAWcooked) — this reduces organizational complexity
FFV1 is freely available, widely used, well documented and standardized

The point that FFV1 is also more resistant to bit rot is just the icing on the cake.

Summary

Incidentally, personnel cost is the cost driver in digital preservation, as opposed to the pure storage cost.

Hence, the ultimate question is: how expensive is storage capacity in relation to the reduced technical and organizational complexity?

Legacy media

2019-05-29T01:43:00.000-07:00

This is the reason why you have to pay special attention to legacy digital media. To deal with defective tracks on a floppy disk, for example, special hardware (and knowledge) is necessary.

Beware of bitfish - collection care in the digital age

2019-04-01T00:08:00.000-07:00

Pest control is a perennial problem in libraries and archives. Silverfish, paperfish and other culprits feast on the holdings and cause considerable damage.

Because pest control is not listed as an explicit task in the OAIS reference model, some digital long-term archives have so far shown clear deficits here. By now, however, these institutions too are feeling ever more clearly that pest control must not be neglected.

Attracted by extensive digital holdings, bitfish and beetles (known in technical jargon as "bugs") nest in piles of cables and multiply there undisturbed. The abundant cable spaghetti provides a good food supply, so the populations grow quickly. Leftover junk and crumbs of binary garbage make the problem even worse.

Not only the number of the little fish but also their long lifespan is a problem. Many of them live eight to ten years, microfiche even considerably longer.

Magnetic tapeworms also feel at home in the musty milieu of many digital archives, feasting above all on the data on WORM tapes. Data that is not destroyed by the little pests decays in the putrid environment through bit rot into unreadable data compost, which clogs the data lines and thus disrupts processing.

The new plague does have one good side, though: resourceful computer scientists have discovered that bitfish are excellently suited to the production of bit grease. They use it to lubricate line connections and thus reduce friction during data transmission, which in turn has a positive effect on throughput.

Here's a tool, make it work!

2018-10-05T02:06:00.003-07:00

You may have noticed it in the last post already: to analyze the hits of DROID signatures, I wrote a small Perl script that converts DROID signatures into Perl regular expressions and writes the matches into tag files for the hex editor wxHexEditor, so that you can see which signatures matched where in a file.
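The conversion step can be illustrated with a small Python sketch. It is not the code of the Perl script and covers only a simplified subset of the signature syntax (hex byte pairs, the ?? wildcard, * and {n}, {n-m}, {n-*} gaps); byte ranges and alternatives are deliberately rejected. The function names are hypothetical.

```python
import re

# Tokens of the supported subset: ??, *, {n}, {n-m}, {n-*} and hex byte pairs
TOKEN = re.compile(r"\?\?|\*|\{(\d+)(?:-(\d+|\*))?\}|[0-9A-Fa-f]{2}")


def signature_to_regex(sequence: str) -> bytes:
    """Translate a byte sequence pattern into a bytes regular expression."""
    parts = []
    position = 0
    while position < len(sequence):
        match = TOKEN.match(sequence, position)
        if match is None:
            raise ValueError(f"unsupported syntax at offset {position}: {sequence[position:]!r}")
        token, low, high = match.group(0), match.group(1), match.group(2)
        if token == "??":
            parts.append(b".")            # exactly one arbitrary byte
        elif token == "*":
            parts.append(b".*")           # any number of arbitrary bytes
        elif low is not None:             # gap of fixed or variable length
            if high is None:
                quantifier = f"{{{low}}}"
            elif high == "*":
                quantifier = f"{{{low},}}"
            else:
                quantifier = f"{{{low},{high}}}"
            parts.append(b"." + quantifier.encode("ascii"))
        else:
            parts.append(re.escape(bytes.fromhex(token)))  # literal byte
        position = match.end()
    return b"".join(parts)


def matches_at_start(sequence: str, data: bytes) -> bool:
    """Check whether the pattern matches at the beginning of the data."""
    regex = re.compile(rb"\A" + signature_to_regex(sequence), re.DOTALL)
    return regex.match(data) is not None


# The little-endian TIFF magic bytes 49 49 2A 00 as a pattern
print(matches_at_start("49492A00", b"II*\x00\x08\x00\x00\x00"))
```

Once a signature is available as a regular expression, the match offsets can be reported or visualized, which is what the tag files for the hex editor are used for.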

This small script grew into a bigger Perl module called "File::FormatIdentification::Pronom". It is not meant to replace DROID, FIDO or Siegfried. It only serves to analyze which patterns can be optimized and provides statistics that can help improve the PRONOM database in the future.
The following statistics for the current DROID signature file give an impression of what is possible.

perl -I lib/ bin/pronom_statistics.pl ../DROID_SignatureFile_V94.xml
Statistics of file ../DROID_SignatureFile_V94.xml
=======================================

Countings
---------------------------------------
Count of PUIDs:                        1670
         internal IDs:                 1441
         regular expressions:          1730
         file endings:                 1167
         PUIDs with file endings only: 503
         (56,76,167,168,169,194,195,212,594,681,682,683,684,691,717,760,780,879,996,1435)
         orphaned internal IDs:        20
         (56,76,167,168,169,194,195,212,594,681,682,683,684,691,717,760,780,879,996,1435)

Quality of internal IDs
---------------------------------------
1-best quality internal ID (PUID, name):       110 (fmt/75, Drawing Interchange File Format (ASCII)) -> 4.882;3.135
        combined regex: (?=((\x0A)|(\x0D\x0A)(0))SECTION((\x0A)|(\x0D\x0A)(\x20\x202)((\x0A)|(\x0D\x0A)(HEADER)((\x0A)|(\x0D\x0A))))((\x0A)|(\x0D\x0A)(9))\$ACADVER((\x0A)|(\x0D\x0A)(\x20\x201)((\x0A)|(\x0D\x0A)(AC1009)((\x0A)|(\x0D\x0A))))((\x0A)|(\x0D\x0A)(0))ENDSEC((\x0A)|(\x0D\x0A)))(?=(((\x0A)|(\x0D\x0A)(0))EOF((\x0A)|(\x0D\x0A)))\Z)
2-best quality internal ID (PUID, name):       105 (fmt/70, Drawing Interchange File Format (ASCII)) -> 4.736;2.833
        combined regex: (?=0\x0D\x0ASECTION\x0D\x0A\x20\x202\x0D\x0AHEADER\x0D\x0A9\x0D\x0A\x24ACADVER\x0D\x0A\x20\x201\x0D\x0AAC((1001)|(2\x2E21)|(2\x2E22)(\x0D\x0A))0
ENDSEC
)(?=(0\x0D\x0AEOF\x0D\x0A)\Z)
3-best quality internal ID (PUID, name):       104 (fmt/69, Drawing Interchange File Format (ASCII)) -> 4.644;2.833
        combined regex: (?=0\x0D\x0ASECTION\x0D\x0A\x20\x202\x0D\x0AHEADER\x0D\x0A9\x0D\x0A\x24ACADVER\x0D\x0A\x20\x201\x0D\x0AAC2\x2E10\x0D\x0A0\x0D\x0AENDSEC\x0D\x0A)(?=(0\x0D\x0AEOF\x0D\x0A)\Z)
4-best quality internal ID (PUID, name):       103 (fmt/68, Drawing Interchange File Format (ASCII)) -> 4.644;2.833
        combined regex: (?=0\x0D\x0ASECTION\x0D\x0A\x20\x202\x0D\x0AHEADER\x0D\x0A9\x0D\x0A\x24ACADVER\x0D\x0A\x20\x201\x0D\x0AAC1\x2E50\x0D\x0A0\x0D\x0AENDSEC\x0D\x0A)(?=(0\x0D\x0AEOF\x0D\x0A)\Z)
5-best quality internal ID (PUID, name):       102 (fmt/67, Drawing Interchange File Format (ASCII)) -> 4.644;2.833
        combined regex: (?=0\x0D\x0ASECTION\x0D\x0A\x20\x202\x0D\x0AHEADER\x0D\x0A9\x0D\x0A\x24ACADVER\x0D\x0A\x20\x201\x0D\x0AAC1\x2E40\x0D\x0A0\x0D\x0AENDSEC\x0D\x0A)(?=(0\x0D\x0AEOF\x0D\x0A)\Z)

1-worst quality internal ID (PUID, name):       1299 (fmt/950, MIME Email) -> -1.993;-2.91;-2.776;-2.776;-2.29
        combined regex: (?=\A.{0,16384}(((V)|(v)(\x2D)((IME)|(ime)(M)))ersion: 1\.0))(?=\A.{0,16384}(To\x3A\x20))(?=\A.{0,16384}(From\x3A\x20))(?=\A.{0,16384}(Date\x3A\x20))(?=\A.{0,16384}(Content\x2DType\x3A\x20))
2-worst quality internal ID (PUID, name):       527 (fmt/358, Internet Data Query File) -> -2.806;-2.743;-2.629;-2.981
        combined regex: (?=\A.{0,3424}(\x5BQuery\x5D).*(((S)|(s)(i)((C)|(c)))cope=))(?=\A.{0,3424}(\x5BQuery\x5D).*(((C)|(c)(i)((C)|(c)))olumns=))(?=\A.{0,3424}(\x5BQuery\x5D).*(((T)|(t)(i)((C)|(c)))emplate=\/))(?=\A.{0,3424}(\x5BQuery\x5D).*(((R)|(r)(i)((C)|(c)))estriction=.?(\x25)))
3-worst quality internal ID (PUID, name):       532 (fmt/363, SEG Y Data Exchange Format) -> -3.351;-4.196
        combined regex: (?=\A.{0,320}(\x40{22}))(?:(?=\A.{3200}(\x00\x00.{15}([^\x00])|(?=\A.{3600}(\x00\x00.{15}([^\x00])).{3}([^\x00])(.{2}(\x00[\x01-\x08])|.{2}(\x01\x00))))
4-worst quality internal ID (PUID, name):       533 (fmt/363, SEG Y Data Exchange Format) -> -3.351;-4.196
        combined regex: (?=\A.{0,320}(\x40{22}))(?:(?=\A.{3200}(\x00\x00.{15}([^\x00])|(?=\A.{3600}(\x00\x00.{15}([^\x00])).{3}([^\x00])(.{2}(\x00[\x01-\x08])|.{2}(\x01\x00))))
5-worst quality internal ID (PUID, name):       835 (fmt/532, Drawing Interchange File Format (ASCII)) -> -3.614;-3.842
        combined regex: (?=\A.{1,3}((0).{1,2}SECTION.{1,2}(\x20\x202).{1,2}(HEADER)).+((9).{1,2}\$ACADVER.{1,2}(\x20\x201).{1,2}(AC1027)).+((0).{1,2}ENDSEC))(?=((0).{1,2}EOF).{1,3}\Z)


Regular expressions
---------------------------------------
Count of multiple used regular expressions: 67
         common regex group no 0:
            regex='(((\x0A)|(\x0D)|(\x0D\x0A)(0))EOF).{0,2}\Z'
            internal IDs: 111,112,113

[…]

I would appreciate feedback. The code is available at http://andreas-romeyke.de/software.html#_file_formatidentification_pronom .

Have fun!

A file is a TIFF is a MP3 is a…

2018-09-17T07:43:00.001-07:00

In the last few days we noticed some files that got stuck in format identification. DROID identified them both as TIFF (fmt/353) and as MP3 (fmt/134).

The question we faced: was this an error, or are these really files that, based on the PRONOM signatures, can be interpreted both as TIFF and as MP3?

To investigate this more closely, we wrote a Perl script¹ that uses the patterns from the DROID signature file and makes the corresponding hits visible in the hex editor. Here is a screenshot:


wxHexEditor, screenshot with a special tags file

As you can see, several patterns match. One is the pattern for TIFF files, because the magic byte string "0x4949" occurs at the beginning. The other is one of the recipes that describe an MP3 data stream.

Wikipedia provides the following diagram of an MP3 frame. The pattern in the DROID signature matches because 8 frames occur in succession:


MP3 structure, source: Wikipedia, see https://commons.wikimedia.org/wiki/File:Mp3filestructure.svg (CC-BY/GFDL)

This file is a good example of a case where the patterns in the PRONOM database are not the problem. Rather, format-specific properties make it necessary to design the ingest process so that it can cope with multiple hits in format identification.

See also our post "Formatidentifikation vs. Formatvalidierung - Wem glauben wir eigentlich?" (format identification vs. format validation: whom do we actually believe?) at https://kulturreste.blogspot.com/2016/06/formatidentifikation-vs.html

--
¹ We will make the Perl script available soon

Wie verwirrend! How confusing! Defaults in TIFF

2018-04-16T04:57:00.004-07:00

Hint: english version below :)

Erste Überlegung: Hä?

Ernsthaft? Was soll denn an den Defaults von TIFF so problematisch sein? Steht doch alles in der Spezifikation. Es gilt:

Enthält ein TIFF ein Tag nicht, für das ein Default definiert ist, gilt der Default.
Wenn ein TIFF ein Tag enthält, gilt der Wert des Tags.
Sonst gilt, der Wert ist nicht definiert und demnach nicht vorhanden.

Der zweite Blick

Leider ist es in der Praxis komplizierter. Ich bekam die Frage, wenn jhove bei der Prüfung der von checkit_tiff mitgelieferten Beispiel-TIFFs für das Thresholding-Tag 263 den Wert "1" ausgibt:

$> jhove tiffs_should_pass/minimal_valid_baseline.tiff
Jhove (Rel. 1.6, 2011-01-04)
Date: 2018-04-16 12:41:25 MESZ
RepresentationInformation: tiffs_should_pass/minimal_valid_baseline.tiff
ReportingModule: TIFF-hul, Rel. 1.5 (2007-10-02)
LastModified: 2017-07-14 11:28:57 MESZ
Size: 323
Format: TIFF
Version: 5.0
Status: Well-Formed and valid
SignatureMatches:
   TIFF-hul
MIMEtype: image/tiff
Profile: Baseline bilevel (Class B), TIFF/IT-BP (ISO 12639:1998), TIFF/IT-BP/P1 (ISO 12639:1998), TIFF/IT-BP/P2 (ISO 12639:1998), TIFF/IT-MP (ISO 12639:1998)
TIFFMetadata:
   ByteOrder: little-endian
   IFDs:
    Number: 1
    IFD:
     Offset: 38
     Type: TIFF
     Entries:
      NisoImageMetadata:
       ByteOrder: little_endian
       CompressionScheme: uncompressed
       ImageWidth: 20
       ImageHeight: 10
       ColorSpace: white is zero
       Orientation: normal
       SamplingFrequencyUnit: inch
       XSamplingFrequency: 376,193
       YSamplingFrequency: 376,193
       BitsPerSample: 1
       BitsPerSampleUnit: integer
       SamplesPerPixel: 1
      NewSubfileType: 0
      SampleFormat: 1
      MinSampleValue: 0
      MaxSampleValue: 1
      Threshholding: 1
      TIFFITProperties:
       BackgroundColorIndicator: background not defined
       ImageColorIndicator: image not defined
       TransparencyIndicator: no transparency
       PixelIntensityRange: 0, 1
       RasterPadding: 1 byte
       BitsPerRunLength: 8
       BitsPerExtendedRunLength: 16

aber checkit_tiff mit dem beigefügten Beispiel keinen Fehler wirft, obwohl doch keine Positiv-Regel in der Konfigurationsdatei hinterlegt ist:

$> checkit_tiff example_configs/cit_tiff6_baseline_SLUB.cfg tiffs_should_pass/minimal_valid_baseline.tiff
'./build/checkit_tiff' version: development_v0.4.0
    revision: 408
licensed under conditions of libtiff (see http://libtiff.maptools.org/misc.html)
cfg_file=example_configs/cit_tiff6_baseline_SLUB.cfg
tiff file/dir=tiffs_should_pass/minimal_valid_baseline.tiff
file: tiffs_should_pass/minimal_valid_baseline.tiff
(./)    general    --> TIFF should have just one IFD, (lineno: 12)
(./)    general    --> All tag offsets should be word aligned, (lineno: 14)
(./)    general    --> All offsets may only be used once, (lineno: 14)
(./)    general    --> All tag offsets should be greater than zero, (lineno: 14)
(./)    general    --> All IFDs should be word aligned, (lineno: 15)
(./)    general    --> Tags should be sorted in ascending order, (lineno: 15)
(./)    tag 256 (ImageWidth)    --> Tag should have a value in a range of (lineno: 23)
(./)    tag 257 (ImageLength)    --> Tag should have a value in a range of (lineno: 25)
(./)    tag 258 (BitsPerSample)    --> One or more conditions needs to be combined in a logical_or operation (open) (lineno: 30)
(./)    tag 259 (Compression)    --> Tag should have one exact value. (lineno: 36)
(./)    tag 262 (Photometric)    --> Tag should have a value in a range of (lineno: 40)
(./)    tag 273 (StripOffsets)    --> TIFF should contain this tag. (lineno: 45)
(./)    tag 277 (SamplesPerPixel)    --> Tag should have one exact value. (lineno: 52)
(./)    tag 278 (RowsPerStrip)    --> Tag should have a value in a range of (lineno: 55)
(./)    tag 279 (StripByteCounts)    --> TIFF should contain this tag. (lineno: 60)
(./)    tag 282 (XResolution)    --> Tag should have a value in a range of (lineno: 63)
(./)    tag 283 (YResolution)    --> Tag should have a value in a range of (lineno: 66)
(./)    tag 296 (ResolutionUnit)    --> Tag should have one exact value. (lineno: 69)
(./)    tag 254 (SubFileType)    --> One or more conditions needs to be combined in a logical_or operation (open) (lineno: 77)
(./)    tag 274 (Orientation)    --> Tag should have one exact value. (lineno: 113)
(./)    tag 284 (PlanarConfig)    --> Tag should have one exact value. (lineno: 122)
(./)
(./)Yes, the given tif is valid :)

Zuerst war ich etwas erschrocken, war ich mir doch sicher, dass checkit_tiff funktioniert und ich alles sorgfältig geprüft hatte. Zur Sicherheit habe ich die Ausgabe mit tiffdump der libtiff geprüft:

$> tiffdump tiffs_should_pass/minimal_valid_baseline.tiff
tiffs_should_pass/minimal_valid_baseline.tiff:
Magic: 0x4949 <little-endian> Version: 0x2a <ClassicTIFF>
Directory 0: offset 38 (0x26) next 0 (0)
SubFileType (254) LONG (4) 1<0>
ImageWidth (256) SHORT (3) 1<20>
ImageLength (257) SHORT (3) 1<10>
BitsPerSample (258) SHORT (3) 1<1>
Compression (259) SHORT (3) 1<1>
Photometric (262) SHORT (3) 1<0>
StripOffsets (273) LONG (4) 1<8>
Orientation (274) SHORT (3) 1<1>
SamplesPerPixel (277) SHORT (3) 1<1>
RowsPerStrip (278) SHORT (3) 1<64>
StripByteCounts (279) LONG (4) 1<30>
XResolution (282) RATIONAL (5) 1<376.193>
YResolution (283) RATIONAL (5) 1<376.193>
PlanarConfig (284) SHORT (3) 1<1>
ResolutionUnit (296) SHORT (3) 1<2>

Gut, tiffdump war auf meiner Seite. Was ist also der Grund für diese Diskrepanz? Schauen wir zuerst in die TIFF-6.0 Spezifikation, dort steht auf Seite 41:

For black and white TIFF files that represent shades of gray, the technique used to
convert from gray to black and white pixels.
Tag = 263 (107.H)
Type = SHORT
N = 1
1 = No dithering or halftoning has been applied to the image data.
2 = An ordered dither or halftone technique has been applied to the image data.
3 = A randomized process such as error diffusion has been applied to the image data.
Default is Threshholding = 1. See also CellWidth, CellLength.

Okay. Für das oben benutzte TIFF trifft zu, dass es schwarz-weiß ist und kein Tag 263 enthält. Daher wird der Default = 1 angenommen.

Jhove präsentiert die Metadaten der TIFF-Dateien also so, wie ein TIFF-Reader sie interpretieren würde. Die Tools checkit_tiff und tiffdump zeigen dagegen, welche TIFF-Tags mit welchen Werten tatsächlich in den TIFF-Dateien explizit kodiert sind.

Fazit

Kenne Deine Tools! Statt Default-Werte zu interpretieren, sollten solche Annahmen explizit gekennzeichnet werden. Für den Durchschnittsanwender ist sonst nicht ersichtlich, wie die Ergebnisse zustande kommen. Als Lektion für checkit_tiff nehme ich diese Frage mit in die FAQ auf.

First thought: WTF?

Seriously? What's supposed to be so problematic about TIFF's defaults? After all, the specification says it all. The rules are:

If a TIFF does not contain a tag that has a well-defined default value, then that default value is used.
If a TIFF does contain a tag, then that tag's value is used.
In all other cases, the value is undefined and hence nonexistent.
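The three rules can be written down as a small lookup. The Python sketch below is only an illustration: the defaults table contains a handful of baseline tags from the TIFF 6.0 specification, the explicit tags are a subset of what tiffdump lists for the sample file in this post, and the function name is hypothetical.

```python
# Defaults defined by the TIFF 6.0 specification for a few baseline tags
TIFF6_DEFAULTS = {
    254: 0,  # NewSubfileType
    258: 1,  # BitsPerSample
    259: 1,  # Compression (uncompressed)
    263: 1,  # Threshholding
    274: 1,  # Orientation
    277: 1,  # SamplesPerPixel
    284: 1,  # PlanarConfiguration
    296: 2,  # ResolutionUnit (inch)
}


def effective_value(explicit_tags: dict, tag: int):
    """Return the value a reader would use and where that value comes from."""
    if tag in explicit_tags:
        return explicit_tags[tag], "explicit"
    if tag in TIFF6_DEFAULTS:
        return TIFF6_DEFAULTS[tag], "default"
    return None, "undefined"


# Tags explicitly encoded in the sample file (integer-valued ones only)
explicit = {254: 0, 256: 20, 257: 10, 258: 1, 259: 1, 262: 0, 273: 8,
            274: 1, 277: 1, 278: 64, 279: 30, 284: 1, 296: 2}

for tag in (256, 263, 270):
    print(tag, effective_value(explicit, tag))
```

Returning the origin alongside the value keeps the distinction visible that matters here: a tool that reports only the value cannot show whether it was read from the file or assumed from the specification.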

Second look

Unfortunately, the real world is a little more complicated. I was asked why jhove would give a value of "1" for the Thresholding tag 263 when validating TIFF-examples that are delivered with checkit_tiff as shown below:

$> jhove tiffs_should_pass/minimal_valid_baseline.tiff
Jhove (Rel. 1.6, 2011-01-04)
Date: 2018-04-16 12:41:25 MESZ
RepresentationInformation: tiffs_should_pass/minimal_valid_baseline.tiff
ReportingModule: TIFF-hul, Rel. 1.5 (2007-10-02)
LastModified: 2017-07-14 11:28:57 MESZ
Size: 323
Format: TIFF
Version: 5.0
Status: Well-Formed and valid
SignatureMatches:
   TIFF-hul
MIMEtype: image/tiff
Profile: Baseline bilevel (Class B), TIFF/IT-BP (ISO 12639:1998), TIFF/IT-BP/P1 (ISO 12639:1998), TIFF/IT-BP/P2 (ISO 12639:1998), TIFF/IT-MP (ISO 12639:1998)
TIFFMetadata:
   ByteOrder: little-endian
   IFDs:
    Number: 1
    IFD:
     Offset: 38
     Type: TIFF
     Entries:
      NisoImageMetadata:
       ByteOrder: little_endian
       CompressionScheme: uncompressed
       ImageWidth: 20
       ImageHeight: 10
       ColorSpace: white is zero
       Orientation: normal
       SamplingFrequencyUnit: inch
       XSamplingFrequency: 376,193
       YSamplingFrequency: 376,193
       BitsPerSample: 1
       BitsPerSampleUnit: integer
       SamplesPerPixel: 1
      NewSubfileType: 0
      SampleFormat: 1
      MinSampleValue: 0
      MaxSampleValue: 1
      Threshholding: 1
      TIFFITProperties:
       BackgroundColorIndicator: background not defined
       ImageColorIndicator: image not defined
       TransparencyIndicator: no transparency
       PixelIntensityRange: 0, 1
       RasterPadding: 1 byte
       BitsPerRunLength: 8
       BitsPerExtendedRunLength: 16

However, checkit_tiff does not throw an error while validating the same sample file, even though there's no whitelist rule for that tag in the config file:

$> checkit_tiff example_configs/cit_tiff6_baseline_SLUB.cfg tiffs_should_pass/minimal_valid_baseline.tiff
'./build/checkit_tiff' version: development_v0.4.0
    revision: 408
licensed under conditions of libtiff (see http://libtiff.maptools.org/misc.html)
cfg_file=example_configs/cit_tiff6_baseline_SLUB.cfg
tiff file/dir=tiffs_should_pass/minimal_valid_baseline.tiff
file: tiffs_should_pass/minimal_valid_baseline.tiff
(./)    general    --> TIFF should have just one IFD, (lineno: 12)
(./)    general    --> All tag offsets should be word aligned, (lineno: 14)
(./)    general    --> All offsets may only be used once, (lineno: 14)
(./)    general    --> All tag offsets should be greater than zero, (lineno: 14)
(./)    general    --> All IFDs should be word aligned, (lineno: 15)
(./)    general    --> Tags should be sorted in ascending order, (lineno: 15)
(./)    tag 256 (ImageWidth)    --> Tag should have a value in a range of (lineno: 23)
(./)    tag 257 (ImageLength)    --> Tag should have a value in a range of (lineno: 25)
(./)    tag 258 (BitsPerSample)    --> One or more conditions needs to be combined in a logical_or operation (open) (lineno: 30)
(./)    tag 259 (Compression)    --> Tag should have one exact value. (lineno: 36)
(./)    tag 262 (Photometric)    --> Tag should have a value in a range of (lineno: 40)
(./)    tag 273 (StripOffsets)    --> TIFF should contain this tag. (lineno: 45)
(./)    tag 277 (SamplesPerPixel)    --> Tag should have one exact value. (lineno: 52)
(./)    tag 278 (RowsPerStrip)    --> Tag should have a value in a range of (lineno: 55)
(./)    tag 279 (StripByteCounts)    --> TIFF should contain this tag. (lineno: 60)
(./)    tag 282 (XResolution)    --> Tag should have a value in a range of (lineno: 63)
(./)    tag 283 (YResolution)    --> Tag should have a value in a range of (lineno: 66)
(./)    tag 296 (ResolutionUnit)    --> Tag should have one exact value. (lineno: 69)
(./)    tag 254 (SubFileType)    --> One or more conditions needs to be combined in a logical_or operation (open) (lineno: 77)
(./)    tag 274 (Orientation)    --> Tag should have one exact value. (lineno: 113)
(./)    tag 284 (PlanarConfig)    --> Tag should have one exact value. (lineno: 122)
(./)
(./)Yes, the given tif is valid :)

Being sure that checkit_tiff works as expected and that I had checked everything, I was shocked at first. To be on the safe side, I cross-checked checkit_tiff's output against the output of libtiff's tiffdump tool:

$> tiffdump tiffs_should_pass/minimal_valid_baseline.tiff
tiffs_should_pass/minimal_valid_baseline.tiff:
Magic: 0x4949 <little-endian> Version: 0x2a <ClassicTIFF>
Directory 0: offset 38 (0x26) next 0 (0)
SubFileType (254) LONG (4) 1<0>
ImageWidth (256) SHORT (3) 1<20>
ImageLength (257) SHORT (3) 1<10>
BitsPerSample (258) SHORT (3) 1<1>
Compression (259) SHORT (3) 1<1>
Photometric (262) SHORT (3) 1<0>
StripOffsets (273) LONG (4) 1<8>
Orientation (274) SHORT (3) 1<1>
SamplesPerPixel (277) SHORT (3) 1<1>
RowsPerStrip (278) SHORT (3) 1<64>
StripByteCounts (279) LONG (4) 1<30>
XResolution (282) RATIONAL (5) 1<376.193>
YResolution (283) RATIONAL (5) 1<376.193>
PlanarConfig (284) SHORT (3) 1<1>
ResolutionUnit (296) SHORT (3) 1<2>

Well, tiffdump was on my side there. So what's the reason for this discrepancy? First, let's have a look at the TIFF 6.0 specification. On page 41, it states:

For black and white TIFF files that represent shades of gray, the technique used to
convert from gray to black and white pixels.
Tag = 263 (107.H)
Type = SHORT
N = 1
1 = No dithering or halftoning has been applied to the image data.
2 = An ordered dither or halftone technique has been applied to the image data.
3 = A randomized process such as error diffusion has been applied to the image data.
Default is Threshholding = 1. See also CellWidth, CellLength.

Okay. Looking at the sample TIFF we used above, it's true that it's a black-and-white image and does not contain tag 263. Hence, the default value of 1 is assumed.

Apparently, Jhove presents the metadata of a TIFF file the way a TIFF reader would interpret it. The tools checkit_tiff and tiffdump, however, show which TIFF tags are actually explicitly encoded in the file and what values they have.
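To see which tags are explicitly stored in a file, it is enough to walk over the entries of the first IFD. The following is a minimal sketch for classic (non-BigTIFF) files; the function name explicit_tags is an assumption for illustration and is unrelated to the real implementation of tiffdump or checkit_tiff.

```python
import struct


def explicit_tags(data):
    """Return {tag_id: (field_type, count, raw_value_or_offset)} for IFD0.

    Only tags that are physically present in the IFD are listed;
    no default values are filled in.
    """
    order = {b"II": "<", b"MM": ">"}[bytes(data[:2])]
    magic, ifd_offset = struct.unpack_from(order + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF")
    (count,) = struct.unpack_from(order + "H", data, ifd_offset)
    tags = {}
    for i in range(count):
        # Each IFD entry is 12 bytes: tag, field type, count, value/offset
        tag, ftype, n, raw = struct.unpack_from(
            order + "HHI4s", data, ifd_offset + 2 + 12 * i
        )
        tags[tag] = (ftype, n, raw)
    return tags
```

For a file like the sample above, tag 263 would simply be missing from the returned dictionary, which is exactly what tiffdump shows.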

Wrap-up

Know your tools! Rather than silently interpreting default values, a tool should clearly mark such assumptions. Otherwise, the average user cannot tell how the results came about.
I have learned my lesson and will add this question to the checkit_tiff FAQ.

Valid TIFFs need love, too.

2018-02-26T23:41:00.002-08:00

(english version below)

Über einen Kollegen haben wir ein interessantes TIFF erhalten. Es hatte alle Validierungen bestanden und zeigte keine strukturellen Fehler in tiffinfo/tiffdump, ließ sich aber trotzdem im Vorschaubetrachter des Workflowtools nicht anzeigen. Außerdem war es ca. dreimal so groß wie alle anderen Scans aus dem gleichen Vorgang. Er bat uns, das TIFF zu untersuchen.

Im Gegensatz zu ihm habe ich keine Probleme damit gehabt, das TIFF überhaupt zu öffnen; der Windows-Bildbetrachter, IrfanView, MS Paint, Paint.NET und XnViewMP stellten alle das Bild dar. Allerdings war es in der Horizontalen stark gestreckt, d.h. deutlich breiter als erwartet. Große Teile des Bildinhaltes (eine gescannte Zeitschriftenseite) fehlten, und der rechte Rand war nicht sichtbar.

kaputte Anzeige des TIFFs

In tiffinfo sahen wir, dass das TIFF ein Grayscale-Image ist:

Bits/Sample: 8

Samples/Pixel: 1

Auffällig war, dass die Listeneinträge für StripByteCounts genau um Faktor 3 größer als die ImageWidth waren (4302 * 3 = 12906); das erklärte die Streckung des Bildes in X-Richtung. Man sah außerdem, dass die StripOffsets in Schritten von 12906 Bytes anwuchsen; vermutlich war der Viewer deswegen überhaupt in der Lage, irgendein Bild anzuzeigen. Die ImageLength stimmte mit der Anzahl der Einträge in StripByteCount überein (6020), deshalb gab es hier keine Verzerrung.

Image Width: 4302

Image Length: 6020

StripByteCounts (279) LONG (4) 6020<12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 12906 ...> StripOffsets (273) LONG (4) 6020<8 12914 25820 38726 ...>

In Okteta konnten wir sehen, dass die Bilddaten für jedes Pixel dreimal identisch gespeichert waren. Das deckt sich der Aussage des Kollegen, dass das Bild ca. dreimal größer war als alle anderen Scans im gleichen Vorgang. Außerdem haben wir gesehen, dass das IFD0 am Dateiende stand und Hinweise auf Bearbeitungen mit IrfanView enthielt.

normales RGB-TIFF

defektes TIFF mit zwei Bytes redundanten Grayscale-Daten je Pixel

Nachdem wir das Problem verstanden hatten, haben wir Reparaturmöglichkeiten diskutiert:

- Man könnte die Redundanz der Pixel entfernen und die StripOffsets (und wahrscheinlich noch andere Offsets) anpassen. Das wäre wahrscheinlich die sauberere Lösung, müsste aber definitiv mit Softwareunterstützung getan werden.

- Man könnte die SamplesPerPixel auf "3" setzen, um die drei duplizierten Bytes je Pixel als RGB-Kanäle zu interpretieren und damit drei Bytes zu einem Pixel im Bild zusammenzufassen. Das haben wir getan, und es hat funktioniert; zumindest war das Bild anzeigbar, nicht gestaucht und nicht in ausgefallene Farben getaucht.

Zur Ursache des Fehlers gab es nun zwei Theorien:

- Es könnte einen Bitflip gegeben haben, bei dem SamplesPerPixel beschädigt wurde: der Weg von "00 11"B ("0 3" D) zu "00 01"B ("0 1" D) ist nicht weit und würde das Fehlerbild erklären.

- Es könnte einen Fehler bei der Konvertierung eines RGB-Scans von einer Grayscale-Vorlage gegeben haben, bei dem die überzähligen Bytes pro Pixel nicht entfernt wurden. Das SamplesPerPixel Tag wäre dabei korrekt und absichtlich gesetzt worden.

Als erstes haben wir nun also SamplesPerPixel im Hex-Editor auf "3" gesetzt, um den TIFF-Viewer anzuweisen, die Bilddaten als RGB-Bild zu interpretieren. Schon diese kleine Änderung bewirkte, dass sich das Bild fehlerfrei anzeigen ließ. Der Umstand, dass das Bild ungewöhnlich groß war (wir hatten erwartet, dass es ähnlich groß wäre wie die anderen Scans aus der gleichen Zeitschrift), blieb aber vorerst ungeklärt.

defektes Grayscale-TIFF, als RGB interpretiert

korrekte Anzeige des TIFFs

Wir erwägen, eine Plausibilitätsprüfung für diesen Fehlertyp in checkit_tiff zu implementieren, sofern man davon ausgeht, dass innerhalb eines Bildes alle Strips gleich lang sind. Dazu verwendet man die Formel: "StripByteCounts / SamplesPerPixel / RowsPerStrip = ImageWidth". Am einfachsten funktioniert das mit TIFFs, bei denen RowsPerStrip = 1 ist; andernfalls müssen zusätzlich komplexere Prüfungen durchgeführt werden, weil bei mehrzeiligen Strips, deren Bytelänge nicht ohne Rest ganzzahlig durch die Zeilenanzahl teilbar ist, kein Padding angefügt wird. Dadurch können Rows entstehen, die kürzer sind als die vorderen Rows eines Strips.

Zusätzlich denkbare Plausibilitätsprüfungen wären:

- Die Höhe des Bildes ist genau so lang wie das Produkt aus RowsPerStrip und Anzahl der Strips: ImageLength = RowsPerStrip * StripOffsets.Count

- Jeder StripByteCount muss so groß sein wie die Differenz der dazugehörigen StripByteOffsets: StripByteCounts[0] = StripOffsets[1] - StripOffsets[0] (bzw. allgemeiner StripByteCounts[n] = StripOffsets[n+1] - StripOffsets[n])

- Jeder Strip muss gleich lang sein: StripByteCounts[0] = StripByteCounts[1] = StripByteCounts[2] = ... = StripByteCounts[n]

Diese Möglichkeiten haben wir im größeren Kreis diskutiert, was Andreas neugierig gemacht hat. Er hat also sein neues Tool zum Finden möglicher ehemaliger IFDs in TIFFs um einige weiche Suchkritierien erweitert und es genutzt, um IFDs aus früheren Dateiversionen zu finden. Außerdem hat er ein ganz neues Tool geschrieben, das eine TIFF-Datei und eine Adresse in Hex-Notation einliest und den Inhalt an dieser Adresse so interpretiert, als wäre dort ein IFD gespeichert. Auf diese Weise konnten wir insgesamt sechs frühere IFDs ermitteln, die auf ältere Versionen der Datei hinweisen, und den Inhalt dieser IFDs in Augenschein nehmen. Die Tools sind unter https://github.com/SLUB-digitalpreservation/fixit_tiff/tree/master/src/archeological_tools im Quellcode verfügbar; sie sind Teil des bekannten Tools fixit_tiff.

Pointer zum ursprünglichen IFD0, wie er in der ersten Version der Datei stand

Die Ausgabe möglicher IFD-Adressen sieht so aus:
# adress,weight,is_sorted,has_required_baseline
0x4a184b0,2,y,y
0x4a241aa,2,y,y
0x4a2fea4,2,y,y
0x4a3bbb0,2,y,y
0x4a478d0,2,y,y
0x4a535ea,2,y,y

Diese Adressen der IFDs haben wir mittels Hex-Editor als IFD0-Offset in die TIFF-Datei eingetragen und so in einer Art TIFF-Archäologie schrittweise die alten Versionen der Datei wieder hergestellt. Dabei bestätigte sich die Annahme, dass der Scan ursprünglich in RGB abgespeichert worden war. Danach wurde wohl eine fehlerhafte Grayscale-Konvertierung durchgeführt, bei der nur die Tags PhotometricInterpretation (min-is-black) und BitsPerSample (1) verändert wurden. Ob dabei auch die Bilddaten selbst verändert wurden, lässt sich nicht mehr genau rekonstruieren.

In der vermutlich ersten Version des IFD0 sieht man mit tiffinfo noch die Angaben zum RGB-Bild:

Photometric Interpretation: RGB color

Samples/Pixel: 3

Die späteren Fassungen dagegen enthalten die Werte:

Photometric Interpretation: min-is-black

Samples/Pixel: 1

Außerdem wurden noch einige weitere Versionen des TIFFs erzeugt, bei denen einige andere Tags verändert, hinzugefügt oder entfernt wurden (Make, Model und Software).

Der Fehler war überhaupt nur aufgefallen, weil es eine intellektuelle Prüfung gab und der Bearbeiterin der Anzeigefehler auffiel (und sie ihn dann auch gemeldet hat!). Weil außerdem die MD5-Summen erst am Ende der Bearbeitung generiert werden und damit zum Fehlerzeitpunkt noch keine Prüfsumme existierte, wäre der Fehler nicht durch einen Fixity-Mismatch aufgefallen. Die einzig saubere Lösung wird nun wohl sein, die Seite neu zu scannen. Trotzdem ist es aber sehr eindrucksvoll zu sehen, welche Möglichkeiten das TIF Format bietet, kaputte Dateien wiederherzustellen.

frühere Artikel zu diesem Thema (also available in English):

-------------------------------------------------------------------------------------------------------------------

english version

A few days ago, a colleague gave us an interesting TIFF. It had passed all validations and didn't show any structural issues in tiffinfo/tiffdump. However, the preview of the workflow tool could not display the image. Also, it was about three times the size of the other scans in the same intellectual entity. Our colleague asked us to have a closer look at that TIFF, so we went at it.

Unlike our colleague, I had no trouble opening the TIFF at all; the Windows image viewer, IrfanView, MS Paint, Paint.NET and XnViewMP all displayed an image. However, it was significantly stretched horizontally, i.e. a lot wider than expected. Large parts of the scanned newspaper page were missing, and the right edge was not visible.

broken display of the TIFF

In tiffinfo, we saw that the TIFF is a grayscale image:

Bits/Sample: 8

Samples/Pixel: 1

Particularly striking was the fact that the entries in StripByteCounts were exactly three times as large as the ImageWidth (4302 * 3 = 12906), which explained the horizontal stretch we saw in the image. You could also see that the StripOffsets grew in steps of 12906 bytes; presumably that is why the viewers were able to display a picture in the first place, regardless of its quality. The ImageLength matched the number of entries in StripByteCounts (6020), which is why there was no distortion in the vertical direction.

Image Width: 4302

Image Length: 6020

We could see in Okteta that the image data for each pixel was stored three times in a row, identically. That matches our colleague's observation that the file was about three times larger than the other files in that IE. We also noticed that IFD0 was located at the end of the file and contained hints of an editing step in IrfanView.

normal RGB-TIFF

defective TIFF with two Bytes of redundant grayscale data per pixel

After having understood the problem, we discussed possible ways to repair the file:

- We could remove the redundant bytes and adapt the StripOffsets (and quite possibly other offsets in that file as well). While this would be the cleaner solution, it would definitely require software support.

- We could set SamplesPerPixel to "3" so that the three duplicated bytes of each pixel are interpreted as RGB channels, thus combining three bytes into one pixel. We actually did that, and it worked like a charm; at least we could display the image without any stretching or funky colors.

Now we had two theories about the origin of this error:

- There might have been a bit flip that damaged SamplesPerPixel. It's not a long way from "00 11"B ("0 3" D) to "00 01"B ("0 1" D), and it would explain the error we're seeing.

- There could have been an error during the conversion of an RGB scan that was made from an analog grayscale original, during which the surplus bytes per pixel were not removed. In that case, the SamplesPerPixel tag would have been set to its new value correctly and on purpose.
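How close the two SamplesPerPixel values are can be checked by counting the differing bits. This is just a small illustration of the bit-flip theory; the helper name flipped_bits is made up.

```python
def flipped_bits(a, b):
    """Number of bit positions in which two integers differ (Hamming distance)."""
    return bin(a ^ b).count("1")


# SamplesPerPixel = 3 (binary 11) versus SamplesPerPixel = 1 (binary 01)
print(flipped_bits(3, 1))  # 1
```

A single flipped bit is enough to turn a "3" into a "1", which is why this theory could not be ruled out from the tag value alone.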

As a first test, we set SamplesPerPixel to "3" using a hex editor in order to make the TIFF viewer interpret the image data as RGB. This small change alone caused the image to be displayed without any errors. The puzzle of why the image was unusually large (we had expected it to be about as big as the other scans from the same newspaper) remained unsolved for the time being.

defective grayscale TIFF, interpreted as RGB

TIFF displayed correctly

We contemplated implementing a plausibility check for this type of error in checkit_tiff, which is feasible as long as we assume that all strips in an image have the same length. The following formula could be used: "StripByteCounts / SamplesPerPixel / RowsPerStrip = ImageWidth". This works best for TIFFs with RowsPerStrip = 1; other TIFFs would need more complex checks, because no padding is added to multi-row strips whose byte count is not evenly divisible by the number of rows. As a result, there may be rows that are shorter than the preceding rows of the same strip.
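The formula can be tried out on the numbers of the defective file. The sketch below is not taken from checkit_tiff; the function name implied_image_width is made up, and it assumes uncompressed data with 8 bits per sample. RowsPerStrip = 1 is inferred from the fact that the file has 6020 strips for 6020 image rows.

```python
def implied_image_width(strip_byte_count, samples_per_pixel, rows_per_strip):
    """Pixel width implied by one strip, or None if the bytes do not divide evenly.

    Assumes uncompressed image data with 8 bits per sample.
    """
    width, remainder = divmod(strip_byte_count, samples_per_pixel * rows_per_strip)
    return width if remainder == 0 else None


# StripByteCounts = 12906, ImageWidth = 4302
print(implied_image_width(12906, 1, 1))  # 12906, contradicts ImageWidth
print(implied_image_width(12906, 3, 1))  # 4302, consistent with ImageWidth
```

With SamplesPerPixel = 1 the strip length implies a width three times the stored ImageWidth; with SamplesPerPixel = 3 everything fits.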

Other possible plausibility checks include:

- The image height is exactly the product of RowsPerStrip and the number of strips: ImageLength = RowsPerStrip * StripOffsets.Count

- Each StripByteCount must be as large as the difference between the neighboring StripOffsets: StripByteCounts[0] = StripOffsets[1] - StripOffsets[0] (or, more generally, StripByteCounts[n] = StripOffsets[n+1] - StripOffsets[n])

- Each Strip needs to be equally long: StripByteCounts[0] = StripByteCounts[1] = StripByteCounts[2] = ... = StripByteCounts[n]
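These three checks could look roughly like the sketch below. It is not part of checkit_tiff; the function name strip_layout_problems is made up. The checks are heuristics: a perfectly valid TIFF may have a shorter last strip or strips that are not stored back to back, so a finding is a hint rather than proof of a defect.

```python
def strip_layout_problems(image_length, rows_per_strip, strip_offsets, strip_byte_counts):
    """Return a list of human-readable findings; an empty list means no finding."""
    problems = []
    if len(strip_offsets) != len(strip_byte_counts):
        problems.append("StripOffsets and StripByteCounts differ in length")
    if image_length != rows_per_strip * len(strip_offsets):
        problems.append("ImageLength != RowsPerStrip * number of strips")
    # Compare each byte count with the distance to the next strip
    for n in range(min(len(strip_offsets), len(strip_byte_counts)) - 1):
        if strip_byte_counts[n] != strip_offsets[n + 1] - strip_offsets[n]:
            problems.append(f"strip {n}: byte count does not match offset difference")
    if len(set(strip_byte_counts)) > 1:
        problems.append("strips differ in length")
    return problems


# First three strips of the defective file, treated as a 3-row image
print(strip_layout_problems(3, 1, [8, 12914, 25820], [12906, 12906, 12906]))  # []
```

Note that the defective file passes all three checks; only the width formula with SamplesPerPixel reveals the inconsistency.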

We discussed these possibilities in a larger group, which made Andreas curious. He extended his new tool for finding candidates for former IFDs in TIFFs with some soft search criteria and used it to find IFDs from earlier versions of the file. Furthermore, he wrote an entirely new tool that reads a TIFF file and an address in hex notation and interprets the contents at that address as if an IFD were stored there. This way, we were able to identify six former IFDs that point to older versions of the file and to inspect their contents. The source code of the tools is available at https://github.com/SLUB-digitalpreservation/fixit_tiff/tree/master/src/archeological_tools; they are part of the established tool fixit_tiff.

Pointer to the original IFD0, just like it was stored in the 1st file version

The list of possible IFD addresses as given by our tools looks like this:
# adress,weight,is_sorted,has_required_baseline
0x4a184b0,2,y,y
0x4a241aa,2,y,y
0x4a2fea4,2,y,y
0x4a3bbb0,2,y,y
0x4a478d0,2,y,y
0x4a535ea,2,y,y

We entered these IFD addresses as the IFD0 offset in the TIFF file using a hex editor. Step by step, in a kind of TIFF archaeology, we were able to restore the older versions of the file. In doing so, we could confirm that the scan had originally been saved as RGB. Later, a faulty grayscale conversion must have taken place in which only the tags PhotometricInterpretation (min-is-black) and SamplesPerPixel (1) were changed. We could not determine whether the image data itself had been altered as well.
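What we did by hand in the hex editor amounts to overwriting the four bytes at position 4 of the TIFF header. A minimal sketch for classic TIFF files, working on a copy; the function name with_ifd0_offset is made up and is not one of the fixit_tiff tools.

```python
import struct


def with_ifd0_offset(data, new_offset):
    """Return a copy of a classic TIFF whose header points to new_offset as IFD0."""
    order = {b"II": "<", b"MM": ">"}[bytes(data[:2])]
    # The specification requires IFDs to start on a word boundary
    if new_offset % 2 or not 8 <= new_offset < len(data):
        raise ValueError("offset must be even and inside the file")
    patched = bytearray(data)
    # Bytes 4..7 of the header hold the offset of the first IFD
    struct.pack_into(order + "I", patched, 4, new_offset)
    return bytes(patched)
```

Calling it once per candidate address, for example with_ifd0_offset(data, 0x4a184b0), would yield one file version per former IFD while leaving the original untouched.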

For what is presumably the first version of IFD0, tiffinfo still shows the information for the RGB image:

Photometric Interpretation: RGB color

Samples/Pixel: 3

Later versions, however, contain the values:

Photometric Interpretation: min-is-black

Samples/Pixel: 1

Also, there were further versions of the file in which some other tags had been added, altered or deleted (Make, Model and Software).

The error was only discovered at all because intellectual checks were in place and the human operator noticed the display error (and because she reported it!). Also, because the MD5 checksums are only generated at the end of the processing workflow, no checksum existed yet at the time of the error, so a fixity mismatch would not have revealed it. In the end, the only proper solution will probably be to rescan that newspaper page. Still, it is impressive to see what possibilities the TIFF format offers for restoring broken files.

former articles on this subject (also available in English):

Restaurierung von kaputten TIFF-Dateien

2018-02-02T04:17:00.003-08:00

(English version below)

Kaputtes TIFF, erste Analyse

Ein Kollege schickte uns dieser Tage eine TIFF-Datei, die sich nicht öffnen liess. ImageMagick meldete:

display-im6.q16: Can not read TIFF directory count. `TIFFFetchDirectory' @ error/tiff.c/TIFFErrors/564.
display-im6.q16: Failed to read directory at offset 27934990. `TIFFReadDirectory' @ error/tiff.c/TIFFErrors/564.

Das Tool tiffinfo gab diese Fehlermeldung zurück:

TIFFFetchDirectory: Can not read TIFF directory count.
TIFFReadDirectory: Failed to read directory at offset 27934990.

Ein Blick mit dem Hexeditor Okteta und aktiviertem TIFF-Profil (welches im Übrigen unter https://github.com/art1pirat/okteta_tiff zu finden ist) zeigt, dass das der Offset-Zeiger, der auf das erste ImageFileDirectory (IFD) verweisen sollte, eine Adresse außerhalb der Datei enthält:


Screenshot Okteta, TIFF mit defektem Verweis auf erstes IFD

Faktisch ist das TIFF damit kaputt. Doch bestimmte Eigenschaften dieses Dateiformates erlauben es, eine Restaurierung zu versuchen.

Nebeneinschub

Für eine gut lesbare Einführung in den Aufbau von TIFF-Dateien sei auf den Blogeintrag "baseline TIFF" verwiesen. In "baseline TIFF - Versuch einer Rekonstruktion" wird auf einige manuelle Plausibilitätsprüfungen eingegangen.

Einen kurzen Überblick liefert auch "nestor Thema: Das Dateiformat TIFF" (zu finden auf http://www.langzeitarchivierung.de/Subsites/nestor/DE/Publikationen/Thema/thema.html)

Finden von IFDs

TIFF bringt ein paar Eigenschaften mit, die den Versuch einer Restaurierung erleichtern. So müssen laut Spezifikation Offsets immer auf gerade Adressen verweisen. Damit halbiert sich schon einmal der Suchraum.

Desweiteren können wir annehmen, dass ein IFD mindestens 4 Tags (oft deutlich mehr) enthält, in der Regel Subfiletype (0x00fe), ImageWidth (0x0100), ImageLength (0x0101) und BitsPerSample (0x0102).

Da ein IFD nach den Tags als letzten Eintrag ein NextIFD Feld enthält, welches entweder auf 0 gesetzt ist oder auf ein weiteres IFD verweist, haben wir bereits einiges an wertvollen Hinweisen zusammen.

Auch die Tageinträge innerhalb des IFD selber folgen einer Struktur. Jeder Eintrag besteht aus 2 Bytes TagId, 2 Bytes FieldType, sowie 4 Bytes Count und 4 Bytes ValueOrOffset (sh. Tag-Aufbau, Artikel "baseline TIFF" auf http://art1pirat.blogspot.de).

In der TIFF-Spezifikation sind für FieldType 12 mögliche Werte definiert, die libtiff kennt 18 Werte. Wir können also für jedes angenommene Tag prüfen, ob die Werte im Bereich 1-18 liegen.

Neben diesen harten Kriterien könnten wir, falls die Notwendigkeit besteht, noch weitere hinzuziehen, zum Beispiel:

Prüfe, ob bestimmte Pflicht-Tags vorhanden sind
Prüfe, ob alle Tags, wie von der Spezifikation gefordert, aufsteigend sortiert sind und keine Dubletten enthalten
Prüfe, ob ValueOrOffset ein Offset sein könnte und damit auf eine gerade Adresse verweist

Sicherlich ließen sich noch weitere Kriterien finden, doch in der Praxis zeigt sich, dass die og. harten Kriterien in der Regel schon ausreichen.

Um die Suche nach diesen nicht händisch vornehmen zu müssen, besitzt das Tool fixit_tiff seit kurzem das Programm "find_potential_IFD_offsets".

Wenn man es mit:

$> ./find_potential_IFD_offsets test.tiff test.out.txt

aufruft, spuckt es in der Datei "test.out.txt" eine Liste von Adressen aus, die potentiell ein IFD sein könnten. Für unsere Datei lieferte es den Wert "0x0008", sprich: das IFD müsste an Adresse 8 anfangen.

Mit Okteta die Datei geladen und geändert, voila!, es sieht gut aus:

Screenshot Okteta, TIFF mit repariertem Verweis auf erstes IFD

Auch tiffinfo ist jetzt etwas glücklicher:

TIFFReadDirectory: Warning, Bogus "StripByteCounts" field, ignoring and calculating from imagelength.
TIFF Directory at offset 0x8 (8)
Subfile Type: (0 = 0x0)
Image Width: 4506 Image Length: 6101
Resolution: 300, 300 pixels/inch
Bits/Sample: 8
Compression Scheme: None
Photometric Interpretation: min-is-black
FillOrder: msb-to-lsb
Orientation: row 0 top, col 0 lhs
Samples/Pixel: 1
Rows/Strip: 6101
Planar Configuration: single image plane
Color Map: (present)
Software: Quantum Process V 1.04.73

Und ImageMagick zeigt sich nun gnädiger:

Ansicht des TIFFs mit repariertem Offset auf IFD

Wie man sieht, ist noch nicht alles repariert, schliesslich meldet auch ImageMagick noch Probleme:

display-im6.q16: Bogus "StripByteCounts" field, ignoring and calculating from imagelength. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/912.
display-im6.q16: Read error on strip 4075; got 2706 bytes, expected 4506. `TIFFFillStrip' @ error/tiff.c/TIFFErrors/564.

Doch sollte vorliegend gezeigt werden, dass eine Restaurierung von kaputten TIFF-Dateien durchaus möglich ist.

---------------------------------------------------------------------

Broken TIFF, a first analysis

A colleague recently sent us a TIFF file that he couldn't open. ImageMagick reported:

display-im6.q16: Can not read TIFF directory count. `TIFFFetchDirectory' @ error/tiff.c/TIFFErrors/564.
display-im6.q16: Failed to read directory at offset 27934990. `TIFFReadDirectory' @ error/tiff.c/TIFFErrors/564.

The tool tiffinfo returned the following error:

TIFFFetchDirectory: Can not read TIFF directory count.
TIFFReadDirectory: Failed to read directory at offset 27934990.

A quick investigation in the Hex editor Okteta with the TIFF profile activated (to be found at https://github.com/art1pirat/okteta_tiff) revealed that the offset pointer, which should be pointing to the first ImageFileDirectory (IFD), points to an address that is beyond the end of the file:


screenshot Okteta, TIFF with defective pointer to the 1st IFD

Given that, the TIFF is de facto broken. However, we can leverage certain properties of this file format to try a restoration.
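The defect itself is easy to detect programmatically: the IFD0 pointer in the header must be even and must lie inside the file. A minimal sketch for classic TIFF files; the function name ifd0_offset_problem is made up for illustration.

```python
import struct


def ifd0_offset_problem(data):
    """Return a description of what is wrong with the IFD0 pointer, or None."""
    if len(data) < 8 or data[:2] not in (b"II", b"MM"):
        return "no TIFF header"
    order = "<" if data[:2] == b"II" else ">"
    (offset,) = struct.unpack_from(order + "I", data, 4)
    if offset % 2:
        return f"IFD0 offset {offset} is odd"
    # At least the 2-byte entry count of the IFD must fit into the file
    if offset < 8 or offset + 2 > len(data):
        return f"IFD0 offset {offset} lies outside the file ({len(data)} bytes)"
    return None
```

For a file like ours, where the pointer is 27934990 but the file is shorter than that, the function would report the offset as lying outside the file.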

Side note

For a readable introduction to the structure of TIFF files, please refer to the blog post "baseline TIFF". The article "baseline TIFF - Versuch einer Rekonstruktion" describes some manual plausibility checks.

Another short overview is provided by "nestor Thema: Das Dateiformat TIFF" (to be found at http://www.langzeitarchivierung.de/Subsites/nestor/DE/Publikationen/Thema/thema.html)

Finding IFDs

TIFF comes with a few properties that facilitate restoration attempts. According to the specification, offsets must point to even addresses, which already cuts the search space in half.

Also, we can assume that an IFD contains at least four tags (often significantly more), usually Subfiletype (0x00fe), ImageWidth (0x0100), ImageLength (0x0101) and BitsPerSample (0x0102).

As an IFD's last entry after all the tags is a pointer to the NextIFD, which is either set to 0 or points to another IFD, we already have some useful hints to work with.

The tag entries inside the IFD follow a strict structure as well. Each entry consists of 2 bytes TagId, 2 bytes FieldType, 4 bytes Count and 4 bytes ValueOrOffset (see also the tag layout in the article "baseline TIFF" at http://art1pirat.blogspot.de).

The TIFF specification defines 12 possible values for FieldType; libtiff knows 18. We can therefore check, for each chunk of bytes that might be a tag, whether its FieldType lies between 1 and 18.

In addition to these hard criteria, we could add some soft criteria if necessary, for example:

check if certain mandatory tags are present
check if all tags are sorted in ascending order and contain no duplicates, as required by the specification
check if ValueOrOffset could be an offset, i.e. whether it points to an even address

We could think up even more criteria, but practical experience shows that the hard criteria are already sufficient for most of the cases.
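To make the hard criteria concrete, here is a minimal sketch of such a search for classic TIFF files. It is not the implementation of find_potential_IFD_offsets; the function name potential_ifd_offsets and its parameters are made up for illustration.

```python
import struct


def potential_ifd_offsets(data, min_tags=4, max_field_type=18):
    """Yield even offsets in a classic TIFF that pass the hard IFD criteria."""
    order = "<" if data[:2] == b"II" else ">"
    # Offsets must be even, so only every second address is a candidate
    for offset in range(8, len(data) - 1, 2):
        (count,) = struct.unpack_from(order + "H", data, offset)
        # An IFD is: 2 bytes count, count * 12 bytes entries, 4 bytes NextIFD
        end = offset + 2 + 12 * count + 4
        if count < min_tags or end > len(data):
            continue
        entries = [
            struct.unpack_from(order + "HHII", data, offset + 2 + 12 * i)
            for i in range(count)
        ]
        if any(not 1 <= ftype <= max_field_type for _, ftype, _, _ in entries):
            continue
        (next_ifd,) = struct.unpack_from(order + "I", data, end - 4)
        # NextIFD is either 0 or points to another even address in the file
        if next_ifd == 0 or (next_ifd % 2 == 0 and next_ifd < len(data)):
            yield offset
```

The soft criteria (mandatory tags, ascending order, no duplicates) could be added as further filters or as a weight per candidate if the hard criteria produce too many hits.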

In order to avoid having to search for potential IFDs in the files manually, the tool fixit_tiff now comes with the program "find_potential_IFD_offsets".

If it is invoked like:

$> ./find_potential_IFD_offsets test.tiff test.out.txt

it writes a list of addresses that might mark the beginning of an IFD to the file "test.out.txt". For the file from our colleague, it gave us only one value, which was "0x0008". In other words, the IFD should start at address 8.

Now load the file in Okteta, change the pointer to the first IFD (located right after the TIFF header) to the correct address, et voila!, it looks good:

screenshot Okteta, TIFF with repaired pointer to 1st IFD

tiffinfo is now a little happier as well:

TIFFReadDirectory: Warning, Bogus "StripByteCounts" field, ignoring and calculating from imagelength.
TIFF Directory at offset 0x8 (8)
Subfile Type: (0 = 0x0)
Image Width: 4506 Image Length: 6101
Resolution: 300, 300 pixels/inch
Bits/Sample: 8
Compression Scheme: None
Photometric Interpretation: min-is-black
FillOrder: msb-to-lsb
Orientation: row 0 top, col 0 lhs
Samples/Pixel: 1
Rows/Strip: 6101
Planar Configuration: single image plane
Color Map: (present)
Software: Quantum Process V 1.04.73

And even ImageMagick is now a little more gracious:

view of the TIFF with repaired offset to the IFD

As you can see, not everything has been repaired yet, and ImageMagick is still reporting some problems:

display-im6.q16: Bogus "StripByteCounts" field, ignoring and calculating from imagelength. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/912.
display-im6.q16: Read error on strip 4075; got 2706 bytes, expected 4506. `TIFFFillStrip' @ error/tiff.c/TIFFErrors/564.

However, we were able to show that a restoration of broken TIFFs is indeed feasible, and even though some of the data is lost, we still can see a part of what has been a magazine scan.

Pointer to an interesting interview about FFV1

2017-09-06T08:10:00.001-07:00

A highly interesting interview by Jürgen Keiper with Peter Bubestinger on the origins of and the motivation behind Matroska/FFV1 as a data format for audiovisual media that is suitable for long-term preservation.

It is especially worthwhile for archivists who want to know why FFV1/Matroska can solve their problems. Peter manages to explain things simply and vividly and gets by (almost) without technical vocabulary.

Verdict: well worth watching!

Here is the link to the video:

https://www.memento-movie.de/2017/08/die-geschichte-eines-codecs-ffv1-in-der-archivwelt/

Bibtag - and learned a little something

2017-05-30T13:41:00.002-07:00

Today I made a side trip to the Bibliothekartag 2017 in Frankfurt am Main. For one thing, I wanted to meet quite a few former fellow students; for another, I was interested in the workshop on format identification given by Yvonne Tunnat of the ZBW.

Yvonne has a wonderful, pragmatic way of explaining complicated matters. For anyone who would like to meet her: the nestor-Praktikertag 2017 on format validation still has places available.

There are two things I am taking away. First, I did not know the tool peepdf yet. It is a CLI program for dissecting PDF files and originally comes from the forensics corner.

Second, there is Bad Peggy, a validation tool for analyzing JPEGs.

A discussion that keeps coming up is how to deal with unknown file formats. IMHO these are not fit for archiving and should be regarded as binary garbage. This, however, deserves a longer post and a closer analysis of whether and under which conditions such files are negligible, or whether the long tail strikes.

BTW, anyone who is still at the Bibtag on Wednesday should drop by the talk by our colleague Sabine on the results of the PDF/A validation.

On the idea of wanting to measure a long-term archive

2017-05-16T02:13:00.000-07:00

OpenClipart by yves_guillou, see link in the image

At some point in an organization you reach the point where you meet people who have devoted themselves to numbers. People who work as mathematicians, as financial accountants or as controllers. That is okay, because invoices need to be paid, resources planned and funds provided.

Omnimetry

The encounter with numbers people becomes problematic when they determine how the organization is steered. When everything is only about key figures, about throughput, about measurable performance, about omnimetry.

As Gunter Dueck already wrote in Wild Duck¹: "In our knowledge and service society there are more and more activities that so far cannot be measured in meters, kilograms or megabytes, because they have a 'higher', in the broadest sense an artistic touch. The working world has so far failed at standardizing higher principles."

Numbers don't lie

Let us look specifically at a digital long-term archive. Demands to collect key figures such as:

the number of files that go into the archive per month,
or the number of Submission Information Packages (SIPs) that originate from particular workflows,

demotivate a committed archive team.

Because these numbers say nothing. Even with automated workflows, digital long-term archives sit at the end of the exploitation chain. It would be roughly like trying to measure sausage sales by the number of visitors to the customer toilet.

In practice, intellectual entities (IEs) that are to be archived for the long term are sorted according to their degree of archivability and their conformance with the archive's own format policies.

Those IEs that are considered valid go into long-term storage, packed in Archival Information Packages (AIPs). The IEs that are not archivable end up in quarantine, and a Technical Analyst (TA) works on a solution or rejects the transfer packages (SIPs) containing these IEs.

If we look at a largely homogeneous workflow, such as the long-term archiving of retrodigitized material, most of the IEs should be able to reach long-term storage without problems. In that case it is easy to hit on the idea of simply measuring the number of IEs and the number and size of the associated files in order to make a statement about the throughput of the long-term archive and the performance of the long-term archiving (LZA) team.

Exception: the standard case

But this view ignores that it is not the standard case, in which IEs enter the archive system in a homogenized and automated way, that is time-consuming, but the individual case, in which the TA has to work out why the IE is structured differently and how to find a suitable solution.

Format knowledge

What the simple throughput view also ignores is that the archive team has to build up format knowledge for data and metadata formats that were previously unknown or only known in general terms. This learning process depends heavily on how well the formats are already documented and how complex their internal structures are.

Organizational process

A third point that management by the omnimetry method ignores is the insight, already formulated in the Nestor Handbuch², that digital long-term archiving has to be an organizational process.

If, as in many memory institutions that produce retrodigitized material, digitization was done for the stockpile and the long-term archiving team only receives the resulting digital images one to two years later, then in case of errors the team can hardly exert any influence on the producer of the digitized material anymore. The frequent project-by-project handling of digitization tasks by external service providers aggravates the problem further. What one would measure in that case would in truth not be underperformance of the LZA team, but an expression of the organizational failure to consider the long-term digital availability of the digitized material from the start.

Of course it makes sense to accompany the development of the archive with key figures as well. Storage has to be procured in time, bandwidth provided. Here too, a sense of proportion and common sense apply.

² Nestor Handbuch: Eine kleine Enzyklopädie der digitalen Langzeitarchivierung, Dr. Heike Neuroth et al., chapter 8, Vertrauenswürdigkeit von digitalen Langzeitarchiven, by Susanne Dobratz and Astrid Schoger, http://nestor.sub.uni-goettingen.de/handbuch/artikel/text_84.pdf, p. 3

FFV1 - some compression results

2017-04-29T02:13:00.000-07:00

In a pilot we got some retrodigitized films and videos in Matroska/FFV1 format. In the following table I summarized the results:


film/video	1	2	3	4	5
description	8mm, positive, b/w	8mm, positive, b/w	16mm, positive, b/w	35mm, combined, color	35mm, combined, color
width	2500	2500	2048	4096	4096
height	1524	1524	1520	3460	2976
bits per pixel	48	48	48	48	48
pxfmt	gbrp16le	gbrp16le	gbrp16le	gbrp16le	gbrp16le
duration in s	12	12	11,459	2,5	2,5
fps	24	24	24	24	24
frames	288	288	275	60	60
original size	6583680000	6583680000	5136682844,16	5101977600	4388290560
compressed size	3861943880	3790690517	3680779719	3908475344	3576745774
compression ratio	1,704	1,736	1,395	1,305	1,226
(DPX size)	6584159232	6584159232	5136841600	5102077440	4388390400
(h264 lossless)	n/a	n/a	n/a	n/a	n/a
(h265 lossless)	3573420309	3559442475	2756504247	3015053822	2992764833
(jp2k lossless)	4589886341	4534014321	3732555539	3869665916	3514687046
with audio	n	n	n	y	n


film/video	6	7	8	9	10
description	35mm, combined, color	vhs, color	betacam, color	betacam, color	Digi-beta, color
width	4096	720	720	720	720
height	3200	576	576	576	576
bits per pixel	48	20	20	20	20
pxfmt	gbrp16le	yuv422p10le	yuv422p10le	yuv422p10le	yuv422p10le
duration in s	1088,042	280	280	280	280
fps	24	25	25	25	25
frames	26113	7000	7000	7000	7000
original size	2053610510746	7257600000	7257600000	7257600000	7257600000
compressed size	1575415175611	3565437155	3838500934	3449372280	4451325952
compression ratio	1,303	2,035	1,890	2,104	1,630
(DPX size)	2053653333632	17472217728	17429888000	17429888000	17429888000
(h264 lossless)	n/a	n/a	n/a	n/a	n/a
(h265 lossless)	1248031292634	3659828688	3772522257	3442739259	4323623225
(jp2k lossless)	1517117560575	3300899483	3470434177	3150727081	4022908822
with audio	n	y	y	y	y

All files are encoded with FFV1 version 3 with slices, slice CRC and GOP=1. If audio exists (linear PCM, 48 kHz, 16 bit), it is included in the compressed size but not in the original size, because the original size is calculated as width*height*bits_per_pixel*frames/8 (in bytes), while the compressed size is simply the file size. The frame count is calculated from the duration value of the MKV files. Files 1 to 5 and 7 to 10 are the first parts of the movies (4 GB splits each).
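As a rough cross-check of the size calculation described here, the following Python sketch recomputes the original size and the compression ratio for film 1 and video 7 from the table values. The function names are made up for illustration, and the division by 8 assumes that the sizes in the table are bytes. The ratios in the table appear to be truncated to three decimals rather than rounded, so the sketch does the same.

```python
import math


def raw_size_bytes(width, height, bits_per_pixel, frames):
    """Uncompressed picture payload in bytes (no container, no audio)."""
    return width * height * bits_per_pixel * frames // 8


def compression_ratio(original_bytes, compressed_bytes):
    """Original size divided by compressed size."""
    return original_bytes / compressed_bytes


def truncate3(value):
    """Cut a value off after three decimals (assumption: the table does this)."""
    return math.floor(value * 1000) / 1000


# Values copied from the table: film 1 (gbrp16le) and video 7 (yuv422p10le).
film1 = raw_size_bytes(2500, 1524, 48, 288)
video7 = raw_size_bytes(720, 576, 20, 7000)

print(film1, truncate3(compression_ratio(film1, 3861943880)))
print(video7, truncate3(compression_ratio(video7, 3565437155)))
```

For files with audio the ratio is slightly pessimistic, because the audio track counts toward the compressed size only.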

Hint: Once the project is completed, rights must be clarified. If possible, I will publish the sources.

Update 2017-06-09

added file size for DPX after using "ffmpeg -i input.mkv DPX/frame_%06d.dpx"
added file size for h264 after using "ffmpeg -i input.mkv -c:v libx264 -g 1 -qp 0 -crf 0 output.mkv" (RGB without lossy conversion to YUV not supported yet)
added file size for h265 after using "ffmpeg -i input.mkv -c:v libx265 -preset veryslow -x265-params lossless=1 output.mkv"
added file size for openjpeg2000 after using "ffmpeg -i input.mkv -c:v libopenjpeg output.mkv"

Update 2017-06-29

added sizes for film no 6
in general, the processing time of h265 and jp2k is about an order of magnitude longer than for ffv1

Interpretation

Files 1-3 are all originally b/w. It seems that the codec does not decorrelate the color channels. Also, the material in files 1-6 is retrodigitized from film and is noisy. File 1 is a special case: decoding the FFV1 stream produces a very high CPU load (eight cores at 100%). Most of the decoding time is spent in the function get_rac(). The original film has the highest noise level compared with the other files.

I think the difference in compression ratio between video and film files comes from the different pixel formats. A ratio between 1,5 and 2 was expected, but 1,3 is a surprise.

Update 2017-06-09

The reason for the high CPU load was that the digitization service provider had created a file with a frame rate of 1000 fps, while the scanner had delivered 24 or 25 fps. Therefore, 40 to 42 identical frames were encoded in a row.
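The figure of 40 to 42 identical frames follows from dividing the container frame rate by the scanner frame rate. A minimal Python sketch of that arithmetic (the function name is hypothetical and only serves the illustration):

```python
def repeats_per_source_frame(container_fps, source_fps):
    """How many container frames one scanned frame occupies on average."""
    return container_fps / source_fps


# 1000 fps in the file versus 25 or 24 fps from the scanner.
print(repeats_per_source_frame(1000, 25))
print(repeats_per_source_frame(1000, 24))
```

At 24 fps the quotient is not an integer, so the number of repeats per scanned frame would have to vary between 41 and 42.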

Nestor - DIN - Workshop "Digitale Langzeitarchivierung", a look back

2017-03-30T07:37:00.001-07:00

Yesterday a workshop of nestor, the competence network for digital long-term archiving, and DIN took place on the premises of DIN e.V. This is only meant as a short summary for those who stayed at home and makes no claim to be an objective, let alone complete, set of minutes :)
If there are errors, please send us an email with corrections ;)

Work of the NID 15 committee

At its core, the workshop was about the question of which standard we want to have in digital long-term archiving in the next 5-8 years, and how we get there.

Prof. Keitel opened the workshop with this question and then outlined the starting position in 2005.

abstract topic "digital archiving"
DIN 31646/31644/31645 from the nestor "orbit"
DIN 31647 "Preservation of evidence of cryptographically signed documents"
feedback on whether a standard is used in practice is hard to discern
they refer to OAIS (ISO 14721)
they show whether one is still within the scope of digital long-term archiving.

Currently, practical experience is supplementing these early theoretical considerations. The question therefore is whether there are areas in which the initial theses have since become outdated.

According to Prof. Keitel, the task is

to crystallize focal areas that are suitable for standardization
to find staff who want to get involved in standardization work in the new fields

Whether one is suited to standardization work can be determined, tongue in cheek, by the following criteria (quote):

Sitting on a chair for a long time
I like to improve what other people have written
when it comes to precise terminological definitions I take no jokes and make no compromises
I like reading documents with titles such as...

Afterwards, the difficulty of obtaining feedback on existing DIN standards was raised.

PDF standardization

Olaf Drümmer of callas software GmbH opened by outlining the history of PDF and pointed to the new version 2:

1993-2006 Adobe PDF 1.0 -> 1.7
2008 ISO: PDF 1.7 as ISO 32000-1
2017 ISO: PDF 2.0 as ISO 32000-2 (in the next quarter, >1000 pages)

new cryptographic methods
tagging revised
colors were a problem area in the standardization process
namespaces were introduced, e.g. to include tags from HTML 5

He then went into the PDF specializations:

2001 PDF/X transfer of print masters
2005 PDF/A archiving, ISO 19005 series

arose from the needs of the US Courts and the Library of Congress

2008 PDF/E ISO 24517, engineering (CAD), not yet widespread, 3D models too by the end of the year
2010 PDF/VT ISO 16612-2 + PDF/VCR ISO 16612-3, variable data printing (high-volume invoices, form letters)
2012 PDF/UA ISO 14289 series, accessibility

According to Drümmer, the importance of standardization follows simply from the prevalence of PDF documents:

number of PDF documents worldwide: at least trillions (10¹²), 6 million of them at the US Court alone
life expectancy per PDF: hours to years

He went on to address the challenge of the variety of variants:

PDF/X, 8 parts of the standard, 12 conformance levels in total
PDF/A series of standards, 3 parts, 8 conformance levels in total
confusing, lack of clear distinctions?
flexibility and power
open character
broad coverage

He then outlined how standardization is to continue from 2017 onward:

PDF 2.0 largely backward compatible, no validation planned at publication
the "Camelot2" project is meant to bring together the classic PDF document world and the Open Web Platform; more information at PDF Days Europe 2017, Berlin, 15-16 May 2017
PDF/A-4 as the goal: no conformance levels
PDF/E allows interactive elements (JS); PDF/E-2 is to become more of an archival flavor and less of a working-document flavor
XMP can be attached at *all* places in a PDF, so that sources or, e.g., UUIDs for them can be stored there
PDF/A-3 can also store an alternative link to the content of arbitrary files; problem: this is not mandatory and has to be regulated by policy

nestor

Prof. Keitel briefly outlined the work of nestor:

…is in any case a cooperation network
presents the working groups (AGs)

Trustworthy archives

* 2004-2008 nestor catalogue of criteria
* 2008-2012 DIN 31644
* 2013-… nestor seal

Submission Information Packages - revising the ingest standards

In a short kick-off talk, Dr. Sina Westphal and Dr. Sebastian Gleixner (German Federal Archives) suggested standardizing the ingest process and the SIPs.

Bundesarchiv: growth of 4 PB/year
incentive for gradual alignment of the systems
unified metadata
improved data exchange
unified interfaces

Consequences:

unification of existing SIPs (possibly also AIPs/DIPs)
unification of existing digital archive systems

Two subareas:

Standardization of the SIP (concrete)

structure
metadata
primary data
cf. E-ARK, e-CH, EMEA

Standardization of the ingest process (abstract)

connection to the archival description tool
validation
ingest
handling of primary data

Questions:

Is unification possible?
Is standardization of AIPs/DIPs and the associated processes necessary?

This was followed by a discussion on delimitation and concrete exchange procedures, with the following result:

the trend is toward an abstract description of modules
a conceptual framework is desired
definition of which modules are mandatory and which are optional
recommended entry point for automation

Video archiving as a new challenge: long-term preservation of audiovisual media beyond film and television

This kick-off talk by Alfred Werner of HUK Coburg outlined the problems of long-term archiving of videos.

bandwidth at branch offices 5-15 Mbit/s
conversion into monochrome multipage TIFF (small files) and into JPG,
videos are wanted,

2011: 5 videos/day
2016: 20 videos/day (compared with 10,000 claims per day)
2021: 100? 1000? videos/day

dashcam videos permitted since this year

Problem: a wide variety of formats, and rising; it is not getting better (3D, HDR, 4k, 2 lenses, special sensors)

possible solution: conversion into a long-term archival format for videos

Requirements:

a standard for the next 50 years
license-free
best possible quality
low storage requirements
good response times even at low bandwidth

plus functions for claims handlers, such as:
zooming, setting jump marks, extracting single frames, redacting, extracting scenes.

The subsequent discussion made clear the problem that one is caught between robustness and faithful reproduction on the one hand and resource requirements (storage, bandwidth, processing time) on the other.

Note: a supplementary post on this was written on the nestor mailing list.

Digital Curation

Here too, Prof. Keitel gave a kick-off talk. I hope I can reproduce the content correctly:

Difference between data curation and long-term archiving according to OAIS: we are no longer talking about institutions/organizations, but about techniques. That is, organizational responsibility is missing.

OAIS goes records management, i.e. how can the requirements of digital long-term archiving be brought to producers (through digital curation); the AIP effectively sits with the producer.
How do the preservation functions named by OAIS/PREMIS fit with the rules of records management? Which elements/groups do we have to distinguish for preservation reasons?

Keitel: "Until now we always assumed a caretaker who preserves things permanently. Digital curation starts earlier, with the producer."

Summary

From our point of view, an attempt should be made to standardize ingest better. Only then would it be possible to give producers tools that are not archive-specific. The road there is steep, especially since the paths taken by archives and libraries alone already differ widely.

PDF is and unfortunately remains a minefield. PDF 2.0 neither resolved existing ambiguities nor simplified the standard. The lack of official validation is likely to prove particularly detrimental. In addition, the format zoo around PDF keeps growing and hybrid documents are possible, i.e. a PDF can be both PDF/E and PDF/A.

The need for video formats suitable for long-term use exists. Standardization could help to push vendor support. The topic of video made clear that digital long-term archiving incurs costs that are not easy to communicate. Data compression, especially lossy compression, leads to a higher risk of damage from bit errors. The discussion about the tension between robustness/quality and cost has to be held in the community, but belongs outside of standardization efforts.

Data curation is a matter of its own. There are gaps that arise when documents have life cycles of several decades. My gut feeling tells me that this can likewise be subsumed under long-term availability, since in long-term archiving we want to keep documents usable for an indefinite time. Data curation therefore seems to me to be nothing other than the special case in which producer and archive coincide as a role.